Words and Intelligence II
Text, Speech and Language Technology VOLUME 36
Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autonoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France
Words and Intelligence II Essays in Honor of Yorick Wilks Edited by
Khurshid Ahmad Trinity College, Dublin, Ireland
Christopher Brewster University of Sheffield, UK
Mark Stevenson University of Sheffield, UK
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-1-4020-5832-5 (HB) ISBN 978-1-4020-5833-2 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2007 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Contents
Biographies of the Editors ... vii
List of Contributors ... ix
Introduction ... xi
1. Yorick Alexander Wilks: A Meaningful Journey (Mark Maybury) ... 1
2. Metaphor, Semantic Preferences and Context-Sensitivity (John A. Barnden) ... 39
3. Towards a New Generation of Language Resources in the Semantic Web Vision (Nicoletta Calzolari) ... 63
4. Information Access and Natural Language Processing: A Stimulating Dialogue (Robert Gaizauskas, Horacio Saggion and Emma Barker) ... 85
5. Three Steps in Wilks Work: From Theory to Resources to Practice (Gregory Grefenstette) ... 107
6. Preference Syntagmatics (Patrick Hanks) ... 119
7. Historical Ontologies (Nancy Ide and David Woolner) ... 137
8. An Amorphous Object Must be Cut by a Blunt Tool (Makoto Nagao) ... 153
9. Homer, the Author of The Iliad and the Computational-Linguistic Turn (Sergei Nirenburg) ... 159
10. Philosophical Engineering (Nigel Shadbolt) ... 195
11. Machine Translation and the World Wide Web (Harold Somers) ... 209
12. Semantic Primitives: The Tip of the Iceberg (Karen Spärck Jones) ... 235
13. Molecules, Meaning and Post-Modernist Semantics (John Tait and Michael Oakes) ... 255
Biographies of the Editors
Khurshid Ahmad holds the Chair of Computer Science at Trinity College, Dublin, Ireland; he was the founding Head of the Department of Computing, University of Surrey, England, and held the Chair of Artificial Intelligence there. He is interested in cross-modal interaction in human cognition and the simulation of such interaction in information systems. His research interests are in the areas of information extraction, neural networks, ontology and terminology systems, knowledge management and financial engineering. He has been working with major EU IT organizations and universities on problems related to the terminology of specialist domains and the ontological commitment of the domain. His current projects include automatic summarisation of video streams for surveillance, which involves studies of attention and distraction amongst humans; he has been working on the automatic identification of 'sentiments' in financial news reports that influence the behaviour of financial markets; and he has published on multi-modal information fusion in child language and numerosity development, and on the simulation of linguistic deficit amongst aphasic patients, using multi-net neural computing systems. In knowledge management he has been dealing with the diffusion of knowledge from research laboratories into the first stage of industrialisation, patents, using a corpus-based study of lexical preferences amongst scientists. He has published over 150 papers on various topics, edited two books and written one. He obtained his doctorate in theoretical nuclear physics in 1975 and has enjoyed being curious about complex systems, language and the evolution of ideas ever since. His work has been supported by the EU programmes of R&D in IT and by the UK Research Councils, including the EPSRC, ESRC and AHRC. He is a Fellow of the British Computer Society.

Christopher Brewster is a Research Fellow in the Department of Computer Science at the University of Sheffield. He has worked on a number of language technology projects, with experience especially in computational corpus linguistics and lexicography. He worked on the problem of knowledge acquisition and maintenance in the EPSRC Advanced Knowledge Technologies project (www.aktors.org), contributing especially to research on ontology learning, the appropriacy of ontologies for knowledge representation, and ontology evaluation. He was lead scientist on the Abraxas project (http://nlp.shef.ac.uk/abraxas/), which focussed on ontology learning techniques. He is currently Project Manager of the Companions Project (www.companions-project.org). He has published a number of papers on the subject of ontologies, and organised several workshops.
Mark Stevenson is a lecturer and EPSRC Advanced Research Fellow at Sheffield University, where he is a member of the Natural Language Processing group. He previously worked for Reuters Ltd. in London, where he led projects on the application of language technology to business problems and also acted as the industrial contact on European projects. In 2001/2002 he was an inaugural Reuters Foundation visiting Fellow at the Center for the Study of Language and Information (CSLI), Stanford University. His research interests include word sense disambiguation, semantic similarity, and information extraction and retrieval. His PhD was supervised by Yorick Wilks and explored the application of a diverse set of knowledge sources to the word sense disambiguation problem. His thesis was published by CSLI Publications. In addition, he has published around 50 papers in journals, collected volumes and international conferences.
List of Contributors
Khurshid Ahmad, Department of Computer Science, Trinity College Dublin, Dublin, Eire
Emma Barker, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
John A. Barnden, School of Computer Science, University of Birmingham, Birmingham, United Kingdom
Christopher Brewster, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
Nicoletta Calzolari, Istituto di Linguistica Computazionale del CNR, Pisa, Italy
Robert Gaizauskas, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
Gregory Grefenstette, CEA LIST, Fontenay aux Roses, France
Patrick Hanks, Masaryk University, Brno, Czech Republic
Nancy Ide, Department of Computer Science, Vassar College, Poughkeepsie, NY, United States of America
Mark Maybury, MITRE Corp., Bedford, MA, United States of America
Makoto Nagao, National Institute of Information and Communications Technology, Tokyo, Japan
Sergei Nirenburg, National Institute for Language and Information Technologies, University of Maryland, Baltimore County, MD, United States of America
Michael Oakes, Department of Computer Science, University of Sunderland, Sunderland, United Kingdom
Horacio Saggion, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
Nigel Shadbolt, School of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
Harold Somers, School of Informatics, University of Manchester, Manchester, United Kingdom
Karen Spärck Jones, Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
Mark Stevenson, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
John Tait, Department of Computer Science, University of Sunderland, Sunderland, United Kingdom
David Woolner, Marist College, Poughkeepsie, NY, United States of America
Introduction
It has been said of the brothers Wilhelm and Alexander von Humboldt that between them they were the last people to have known all that there was to know, to have had a mastery of the best that contemporary science knew and to have made significant contributions, to be that rare thing, Renaissance men. In a world of ever-greater specialisation, especially in academia, the ability to cross intellectual boundaries, bring together ideas beyond the confines of one's narrow discipline and yet make significant intellectual contributions has become ever rarer. In bringing together this celebration of Professor Yorick Wilks, it has been the ambition of the editors to provide the reader with a taste, an inkling, of that which cannot be conveyed on the written page but only in the person of Yorick. He is a Renaissance man in an age where such concepts have been forgotten. He is a bridge across a bewildering variety of contemporary research, and simultaneously a link between some of the most advanced thought in the broadly interpreted field of Artificial Intelligence (AI) and the long tradition of philosophy, literature, and general intellectual creativity that has fundamentally informed his academic research. This comes across in part when one considers his career, more so when one reads his writings, but is most apparent in person.

Modern scientists have become specialised, experts in only one specific domain. In contrast, Yorick Wilks has remained a universalist, actively contributing to a wide range of topics, from the details of machine translation or information extraction to the philosophical implications of certain AI positions or the current political situation in the world. A long history of widely cited publications in a great variety of academic organs bears testament to a highly productive and influential career, which is honoured in this Festschrift and manifested in the accompanying volume of selected papers by Yorick Wilks.

Artificial Intelligence and Natural Language Processing (NLP) have been the primary areas of concern for Yorick over the years of his career, and yet this has not detracted from his capacity to show competence and make contributions across a large range of topics. His academic passions have included AI, its philosophical foundations, architectures for NLP, computational syntax and semantics, lexica and lexical resources, word sense disambiguation, machine translation, knowledge representation and acquisition, belief systems and agents, human-computer dialogue and information extraction. Some, but not all, of his passions are conveyed and celebrated in the contributions contained in this volume.
During his career Yorick has led a succession of successful research groups (detailed in Mark Maybury's biographical paper). In this capacity he has a particular talent for obtaining research grants and has successfully funded large research groups in the UK and USA. At the University of Sheffield he built one of the most successful research groups in the world in the field of NLP. However, there is much more to leading such a group than writing good grant proposals. Yorick has always had a vision for his groups, a broad concept of what each was trying to achieve, while concurrently allowing individual researchers freedom to pursue their own interests, to be creative. He always has confidence that if you put good people together and give them an appropriate degree of freedom, academic creativity and innovation will thrive.

What cannot be conveyed by the written word, and what we can only convey superficially in this introduction, are the human qualities that have accompanied the research and learning. Yorick has a breadth both in his humanity and in his culture, a tolerance and understanding of fellow human beings, a good humour and generosity of spirit. This is apparent in the freedom he has given his students and his ability to create a fertile, productive environment in order to allow research to flower. Furthermore, he embodies a sense of vision and a depth of knowledge, a deep insight into human qualities and a tolerance for human frailties, all of which are combinations both unusual and refreshing.

Yorick has always conveyed passion in both his work and leisure. He has a natural ability to make people feel at ease and is a famed raconteur. Part of this comes out of Yorick's immense breadth of interests. He has always led a double life, having over the years a very successful amateur acting career (which nearly became professional, according to some anecdotes). Furthermore, he is someone who in a previous generation would have been approvingly described as "well-read", and this broad culture informs both his scientific output and interests and the daily interaction he has with his colleagues. Life in AI and NLP would have been the poorer without the person of Yorick to bring his joie de vivre.

It has always been a privilege to be a research student supervised by Professor Wilks. Yorick is both visionary and practical, encouraging the student to read a text, whether from last year's conference or a hundred years ago, and place it in the context of their current concerns. It is the content of the ideas in a student's work that interests him, without any concern for formality or procedure. And this is another area where Yorick's capacity for seeing the potential of people is most apparent, a potential of which they are usually not aware themselves. Above all, Yorick has been able to create an environment in which one is encouraged to publish, attend conferences and carve out an independent research career. As much as he can, Yorick has always sought to support students through their studies, financially and intellectually. It was in this spirit of developing students that Yorick was instrumental in founding the CLUK (Computational Linguistics UK) series of conferences focussed on the needs of graduate students in the UK.

One of Yorick's strengths is his ability to collaborate with academics and end-users with different interests, to synthesise complex ideas across different disciplines, and finally to articulate such ideas. There are many instances of such
collaborations across the world, many reflected in the contributions in this Festschrift. Collaborating with Yorick has always been an extremely stimulating experience, right from the conception of a project, through the writing of proposal texts (in which Yorick excels), to the realisation of the project and its outcomes in collaboratively written papers or software. His capacity to bring together collaborators from different countries, different cultures, let alone entirely different academic communities, is one much celebrated and recognised.

The collection begins with a biographical essay on Yorick written by Mark Maybury. Maybury has collected details and anecdotes from a wide circle of Yorick's friends and acquaintances to present an amusing and insightful account of a life full of meaning in all senses. In his paper entitled "Metaphor, Semantic Preferences and Context-Sensitivity," John Barnden discusses Yorick's work on metaphor, which he interprets as being "utterance-based", while arguing for a context-based approach. Nicoletta Calzolari, in her paper "Towards a new generation of Language Resources in the Semantic Web vision", notes Yorick's early and prophetic understanding of the importance of natural language corpora, and while reviewing a number of language resources related projects in which Yorick has been involved, makes the case for a continuing need for infrastructure-focussed HLT research. Robert Gaizauskas, a long-time colleague of Yorick's at the University of Sheffield, has contributed a paper co-authored with Emma Barker and Horacio Saggion on "Information Access and Natural Language Processing: A Stimulating Dialogue." This considers the role of NLP in relation to IR and information access in general, with specific reference to a project, "Cub Reporter", undertaken at the Sheffield NLP group.

Gregory Grefenstette's paper on "Three steps in Wilks work: From theory to resources to practice" is a celebration of what the author sees as three important components in Yorick's work, viz. his "flights of brilliance", his "reasoned response to difficulties", and his unrelenting engineering effort and serious science. Grefenstette considers in detail three papers of Yorick's from what could be called the early, middle and late work, placing the work in the context of contemporary research. In a paper entitled "Preference Syntagmatics" Patrick Hanks discusses an ongoing project of his to create a "Pattern Dictionary" which is fundamentally influenced by Yorick's early work on Preference Semantics. A central claim is that the problem of word sense disambiguation needs to be reformulated before it can successfully be resolved. On a somewhat different tack, "Historical Ontologies", a paper by Nancy Ide in collaboration with David Woolner, discusses the challenge of creating knowledge representations that can handle diachronic events. A major interest of Yorick's has been machine translation, and Makoto Nagao in his paper "An Amorphous Object Must Be Cut By A Blunt Tool" gives an account of the creation of example-based machine translation, an approach he claims is typically Japanese. Addressing another ongoing concern of Yorick's is Sergei Nirenburg's paper "Homer, the Author of The Iliad and the Computational-Linguistic Turn." He sets out in detail the disagreement, first, between Yorick's views on knowledge representation and the tradition of Fodor, and then between Yorick's views on knowledge representation resources such as ontologies and the views of Guarino.
Reflecting the wider impact Yorick has had as a philosopher concerned with AI and computer science, Nigel Shadbolt's paper "Philosophical Engineering" discusses the fundamental philosophical issues which arise in undertaking modern computer engineering. He notes that while formal models have immense power, in a continuously changing world our capacity to construct models is under constant challenge. Returning to machine translation, Harold Somers gives a brief account of Yorick's long-term involvement and impact on the field, and goes on to consider the machine translation currently freely available on the web, and its success and impact. Yorick's near contemporary at Cambridge, Karen Spärck Jones, has contributed a paper on "Semantic primitives: The tip of the iceberg", which discusses how semantic primitives, a long-time concern of Yorick's, are considered today. The collection concludes with a paper by John Tait in collaboration with Michael Oakes, "Molecules, Meaning and Post-Modernist Semantics", which again returns to preference semantics, but from the perspective of the need for machine learning of lexical resources.

The editors have put together this Festschrift, and the accompanying volume of Selected Papers, in order to celebrate Yorick as an individual and bring into focus his work and its impact across a range of research topics. We would like to thank all the contributors to this volume for taking time out of their schedules to write these papers and thus make the Festschrift possible. We have known Yorick in various capacities, as students, colleagues, collaborators and friends, and we sincerely hope that these volumes will bring pleasure to him, his colleagues and friends.

Khurshid Ahmad
Christopher Brewster
Mark Stevenson
Acknowledgements

The editors would like to thank Springer Verlag for producing these two volumes, and also Nancy Ide, whose dinner invitation at LREC in Lisbon and conversations with Jolanda Voogd led to these volumes.
1 Yorick Alexander Wilks: A Meaningful Journey

Mark Maybury
MITRE Corp., Bedford, MA, United States of America

Outgoing, happy, generous to a fault and always fun to be with. As soon as he joins a conversation it lightens. - Prof. Derek Partridge, University of Essex
Abstract: This chapter attempts to summarize Yorick's long and rich professional career and to introduce some of his main areas of contribution, which are elaborated in other chapters. It pays tribute to Yorick as teacher, researcher, mentor, and actor, and to his many contributions, professional service and honors.
1.1 Time Line

Yorick Wilks' career, still very much in full force, already spans nearly four decades. Figure 1.1 is a very brief time line of some of Yorick's significant life events, sampling his many roles as scientific author, editor, researcher, professor, department chair, director, and fellow. Only a professional actor could have played so many roles so effectively. But then again the timeline and this history fail to capture his important roles as actor, fiction writer, chef, movie critic, social observer, successful businessman, globe-trotter, and of course husband and father. Certainly this brief history chapter can in no way hope to capture such diversity. Accordingly we have prepared a complementary video festschrift to accompany this volume, wherein Yorick's contemporaries reminisce about what it was like to learn from, work with, and simply enjoy life with this passionate Renaissance man. Finally, the chapter will point to other chapters in this collection which dive more deeply into technical topics to which Yorick contributed, and which in some cases he invented.
Fig. 1.1. Yorick Wilks time line (a timeline figure, 1960–2010, charting Yorick's publications, professional roles, and honors)
1.2 In the Beginning

Yorick was born in London and first attended Latymer School in Edmonton, London. He moved to Torquay when he was a boy and attended Torquay Grammar School until he was 18; he was a prefect and eventually head boy for some of his time there. Yorick then went to Pembroke College, Cambridge, to study Philosophy and Mathematics. While at Cambridge, he was very active in the Pembroke Players drama group, where he acted as well as directed. Professor Michael Rowan-Robinson, now Head of the Astrophysics Group at the Blackett Laboratory, Imperial College London, recalls "We acted in plays together while he was an undergraduate. Later we shared a houseboat together when he was a postgraduate." Yorick also did quite a bit of singing in Cambridge choirs.

Yorick's friend Hugh Brogan remembers how impressive Yorick was as an actor:

I'm not a colleague or a linguist or a computer person, but we were both members of the Labour party at Cambridge. He is a remarkable person. Yorick was the Labour party agent in Cambridge and my most vivid memory of him was around 1964 or 1965 when the Cambridge Labour Party put on a pantomime of Labour/Tory politicians. Written by
the theologian Phillips Abrams, the story centered around Dick Whittington, the Lord Mayor of London, who traveled to China with his cat to eradicate rats. Yorick played the Emperor of China and never turned up until the dress rehearsal, where he simply dominated the proceedings. He was a born performer. Years later I can recall driving from Colchester to Cambridge with Yorick and having such an intense argument that we missed our cutoff. He was great company.
1.3 Cambridge: Dissertation and Career Foundations

After Yorick took his BA from the University of Cambridge in 1962, he joined the Cambridge Language Research Unit (CLRU) as a research associate (assistant in British parlance) and worked from 1963–1966 on natural language processing projects focused on semantics and machine translation. Yorick recalls "doing parsing with Hollerith card sorting machines" at the CLRU. Professor Karen Sparck Jones reminisces about the CLRU as "originally a lively discussion group interested in language and translation, subsequently funded to do research on automatic translation." The CLRU was a freestanding unit, not part of the University, although individuals in it had university connections. CLRU founder Margaret Masterman was influential in many ways in Yorick's professional career. Karen reflects:

Yorick almost certainly became connected with the CLRU through the fact that his BA was in Philosophy (I assume following an earlier part on Maths), that he was in Pembroke College and that Margaret directed studies in Philosophy for the college. There is no doubt that the CLRU was the foundation of Yorick's work in NLP and indeed the line that Yorick started with was very directly a development and outgrowth of what the CLRU was doing. Margaret also had a connection with Stanford which I am sure also accounts for Yorick's initial involvement there (the CLRU had some US funding for quite a long period from the mid fifties).

Karen's chapter "Semantic primitives: the tip of the iceberg", in this collection, begins with Yorick's CLRU work.

At the same time as he joined the CLRU, Yorick began work on his PhD. Yorick's PhD advisor at Cambridge was Richard Braithwaite (see http://www.britannica.com/eb/article-9016188), husband of CLRU head Margaret Masterman. Yorick was named by Margaret in her will to be her literary executor. His edited collection of her selected papers (Wilks, 2005) captures her belief that "meaning, not grammar, was the key to understanding languages, and that machines could determine the meaning of sentences", and her early, sophisticated experiments in machine translation and use of semantic codings and thesauri to determine the meaning structure of texts even on simple card sorting machines.

Yorick's doctoral thesis (Wilks, 1968) "Argument and Proof" investigated automated sense selection. It formed the basis of his first book, Grammar, Meaning and the Machine Analysis of Language (Wilks, 1972).
In a reflective passage years later, Yorick (2000, p. 1) articulates the essence of his contributions:

I want to make clear right away that I am not writing as a sceptic about word-sense disambiguation (WSD) let alone as a recent convert: on the contrary, since my PhD thesis was on the topic thirty years ago. That (Wilks, 1968) was what we would now call a classic AI toy system approach, one that used techniques later called Preference Semantics, but applied to real newspaper texts, as controls on the philosophical texts that were my real interest at the time. But it did attach single sense representations to words drawn from a polysemous lexicon of 800 or so. If Boguraev was right, in his informal survey twelve years ago, that the average NLP lexicon was under fifty words, then that work was ahead of its time and I do therefore have a longer commitment to, and perspective on, the topic than most, for whatever that may be worth!

Karen Sparck Jones notes that "in his thesis, as generally in his publications, Yorick always sought to contextualise his work on automated word sense disambiguation by reference to theories of meaning in linguistics and philosophy." Gregory Grefenstette's chapter in this collection considers Yorick's work related to linguistic resources and linguistic theory. Birmingham Professor John Barnden's chapter in this collection considers Yorick's idea that preference-violation is common in metaphor, and that it can often be used as a hint that the utterance is metaphorical (e.g., "a car drinks gasoline", even though the subject is inanimate and the liquid is non-potable).

A scientist as well as social observer, Yorick writes in one of his many word sense disambiguation papers: "None of this is a surprise to those with AI memories more than a few weeks long: in our field people read little outside their own notational clique, and constantly 'rediscover' old work with a new notation." In a discussion on vagueness he similarly notes: "This issue owes something to the systematic ignorance of its own history so often noted in AI."
1.4 California Dreaming (1966–1974)

Following his work at CLRU, Yorick joined the System Development Corporation in Santa Monica from 1966–1967, then was a Lecturer at the University of Nairobi in Kenya in 1969, and then a consultant on an ONR contract at Stanford University from 1969–1971. For a 2-year period he actually lived in Hollywood and moonlighted for the Hugh Hefner TV show in Los Angeles, hoping to get a break as an actor (a friend of his was the scriptwriter). But this was not to be.

Given the strong Masterman connections to Stanford, it was perhaps natural for Yorick to subsequently become a Research Associate and Lecturer at the Artificial Intelligence Laboratory, Stanford University, California, from 1970–1974. When Yorick arrived at Stanford, he took up an office next to Roger Schank, then Associate Professor of Linguistics and Computer Science at Stanford (when he went to Yale,
Schank went entirely into CS). Roger had come to Stanford because of psychiatrist Ken Colby's research with Parry, a conversational computer simulator of paranoid process. Roger recalls:

Yorick arrived at Stanford about two years after me. We had offices next to one another, and found we had much in common. We had a similar point of view and position, arguing against the then dominant Chomskian view of the world. At that time we felt like we were two voices in the wind. We were rabble-rousing for years and years in many countries. We both had developed semantic representation systems. Mine had fewer semantic primitives than his had but we both believed that creating an interlingua was the key to the computer processing of natural language. We were fighting a difficult battle against formalists and the Chomskians but eventually the Chomskians stopped caring about computers as AI became a more dominant field. We could have wound up as competitors but instead became good friends.

Sergei Nirenburg recalls this relationship:

[I recall] A couple of rounds of playing Diplomacy at the Guthries' in El Paso, TX. During one of our negotiations, Yorick told me about another Diplomacy game where Roger Schank and his wife were among the participants. Apparently, after the game Professor Schank became very upset and chastised his wife for not having shown (real-world) loyalty in her (in-game) actions. With his next move, Yorick promptly proceeded to occupy Belgium, leaving my forces in France in a lurch. To borrow Yorick's typical comment in similar situations: "Brilliant!"

Yorick collaborated with Roger Schank to capture their views in their joint article "The Goals of Linguistic Theory Revisited", in Lingua 34 (1974). From the beginning Yorick's interests bridged the disciplines of artificial intelligence and language. During this period Yorick published in philosophy of science as well as theoretical linguistics. Roger recalls:

We had this funny moment when Yorick, Winograd, and I were all together at the Stanford AI lab. We had the nucleus of what could have been the most powerful and influential group, but John McCarthy, the head of the lab and a strong believer in the powers of predicate calculus, said he wished none of us were there.

Roger recalls how at that time, as a young postgraduate, Yorick didn't have lots of funding (ironic, given later years) and didn't yet have a regular faculty position:

So by 1971 at IJCAI in London we were pretty good friends. Yorick was always trying to get away with something. He didn't like playing the game as it was written. He hadn't bought a registration for IJCAI but found a way to sneak in. Yorick and I were sitting together when my turn to give
my talk came up. I wasn't wearing a badge because I don't like badges, and of course Yorick didn't have one. As I was announced as the next speaker, one of the security guards, who had been alerted that Yorick had snuck in, tried to prevent me from going to the podium. The head of the conference, someone who should have known what we both looked like, had thought that I was Yorick. I told the guard he would have to arrest me on the podium.

Throughout the years Roger and Yorick maintained a friendship, getting together at meetings all across the world in places such as Sweden, Hungary, and France. Roger recalls one such conference in France years later:

We got bored so we left the meeting and walked up to the top of the Alps and were exhausted as we reached the snow line. We found some bed-like platform on skis with two controls, which we found we could ski down, but we went extremely fast and nearly killed ourselves by going off a cliff! We were sure we would be dead soon, but we got down the mountain real fast.

Roger Schank moved on to Lugano, where Yorick would later transfer into his position.
1.5 Lugano (1974–1975)

Yorick became a Senior Research Fellow at the Institute for Semantic and Cognitive Studies (ISSCO) in Lugano-Castagnola, Switzerland, from 1974–1975. He came to ISSCO to continue to explore his preference semantics theory on an MT system, working to extend the inferencing component. Yorick brought along Margaret King from UMIST in Manchester, England, on a year's leave of absence as his programmer. She eventually became the director of the institute. Margaret King recalls that during this period one of the most significant things Yorick did was to organize a tutorial:

There was a small group of 8 of us in ISSCO, and I think pretty well everybody was involved. The tutorial itself was a roaring success. Taking place in March of 1975, this was the first in Europe and was quite daring really. About 100 people came to it, mostly postgraduates. It made a significant impact on getting CL and AI started in Europe. It was a full week long and helped build a community.

Yorick co-edited a resulting book with Professor Eugene Charniak, Computational Semantics: An Introduction to Artificial Intelligence and Natural Language Understanding, published in 1976 by North-Holland in Amsterdam. Gene recalls:

The original idea of the book was we were each going to pick an area and give lectures on it. After we did it we decided we would put our notes together in this book.
Professor Margaret King remembers:

We met once every week or every couple of weeks and tore each other's manuscripts up. It certainly sold rather well.

Five years later, in 1981, the book was reprinted in Russian in Moscow, in the series Progress in Linguistics. Gene Charniak had come to Lugano the year prior to Yorick and recalls his first interaction with Yorick:

I received a letter about some of my graduate work when I was still at MIT and he questioned some of my deductions. I realized I had simply used the word "deduce" when I meant to use "infer". I of course at the time ignored this as simply semantics. Looking back he was completely right.

In spite of his academic successes, however, the mischievous and human side of Yorick was ever present. For example, Margaret King recalls Yorick's influence on her kids:

Yorick lived in one of the villages near Lugano surrounded by vineyards. He played this great adventure game with my kids. He used to dress up with a hood over his face and take my two 9-year-old daughters to the vineyards to learn to "steal grapes". My forty-year-old daughters still tell stories of this wonderful experience. Yorick's daughter Octavia was born in Geneva and the fact that my grandchild is named Octavia is no coincidence. Our families have stayed in close contact over the years.

Yorick selflessly invested himself in teaching the next generation. Nicoletta Calzolari (CNR, Italy) recalls "extraordinary" summer schools on computational linguistics that the late Antonio Zampolli organized in Pisa in the early seventies:

They were extraordinary events, where Antonio was able to gather really the most prominent people of the time. I still remember [Yorick's] appearance at the first lecture. He impressed us (or me at least) because he appeared with a very elegant suit, very formal (probably also a hat, Borsalino style) and in particular all completely white! It was rather strange among students, but also other teachers, all very casual. And as soon as he started to speak, he shocked us with the speed with which he spoke, all of us (students) most of the time unable to follow. I discovered only later that he could speak rather good Italian. Years later it was Yorick who made me discover a wonderful piece of Italy. He loves all that is beautiful and he told me about Gargonza, a very special medieval village, really just a little more than a castle, completely surrounded by walls, circular on top of a hill, completely restored and transformed almost all of it into a very nice hotel, and here in Tuscany (neither Antonio nor I knew it before!).
1.6 University of Edinburgh (1975–1976)

From 1975 to 1976 Yorick served as a Senior Visiting Fellow at the Department of Artificial Intelligence at the University of Edinburgh. Dr. Graeme Ritchie, now at Aberdeen, was then in the third year of his PhD at Edinburgh. He recalls:

Yorick's presence was felt in the department. I recall Martha Stone (Palmer) organized a reading group, and Yorick was a prominent participant – he would liven up the discussions. He also gave a series of public lectures on computational linguistics, natural language, and artificial intelligence which brought in other departments. The department had quite a thriving coffee room culture, where Yorick contributed his strong opinions to the debates.

In 1976 Yorick published the article "Frames, Scripts, Stories, and Fantasies" in the Proceedings of the International Conference on the Psychology of Language. Graeme Ritchie recalls that frames had been the hot topic of that period:

Yorick was very much a strong proponent of semantically based approaches to NLP. The roles of syntax and semantics were hotly disputed – five years prior, Terry Winograd had published his work on SHRDLU, Roger Schank had his Conceptual Dependency Theory and Yorick had his own preference semantics formalism. He was antagonistic to traditional (syntax-only) approaches.

During this same period, lexicographer Dr. Patrick Hanks, responsible for the first editions of Collins English Dictionary, Cobuild, and the New Oxford Dictionary of English, recalls how Yorick changed his intellectual life:

I first met Yorick in Edinburgh in about 1975. I was there for the Festival, and he was singing in the Mikado – was he Ko-ko (Lord High Executioner) or Pooh-Bah (Lord High Everything Else)? I forget. After the show, over the second or third bottle of wine, I found myself talking about words and meanings and confessing to intellectual bewilderment. An Eng. lit. graduate and a lexicographer, I would read the fashionable papers on linguistics and semantics of the time and wonder why they had so little to say to me – and that little so often implausible – even though my job involved describing central aspects of language and meaning. Yorick explained. "You've been reading the wrong sort of semantics – mere symbol pushing! And the wrong sort of linguistics!" He put things into a new perspective for me, expatiating on Wittgenstein, Grice, and other philosophers of language, and he subsequently sent me two of his own papers on Preference Semantics. I was hooked. When my Collins English Dictionary was published in 1979, I resigned from my job, put lexicography behind me for ever (as I thought), and registered for a Ph.D. with Yorick at Essex. The next two and a half years provided the best part of my real education. It is ironic, but true, that my best educational experience was never formally crowned with a
certificate. (After a couple of years, the Cobuild project at the University of Birmingham made me an offer I couldn't refuse. I became embroiled again and never submitted my dissertation in the Department of Language and Linguistics at Essex, which by this time was Yorick-less and overrun with Chomskyans.)

Also at the University of Edinburgh at that time was Professor Alan Bundy, who recalls three anecdotes from the year Yorick spent there:

1. He joined the local operatic society, who put on a show consisting of Gilbert and Sullivan highlights, which I attended. I forget now exactly which roles Yorick played, but he played a prominent part, and enthusiastically engaged in the spirit of the event.

2. During his time in Edinburgh, Yorick and his wife hosted several parties. These were pretty lavish affairs – orders of magnitude more grand than the usual student parties we were all used to. I recall strawberries in February, roasted turkeys, etc. As an example of the law of unintended consequences, these events discouraged others from holding parties for a while, since they felt they could not compete.

3. At that time there was a controversial modern sculpture erected in the middle of the roundabout at the head of Leith Walk. It was a scaffolding-like structure with randomly flashing lights all over it – or at least, it was supposed to have randomly flashing lights – these never really worked. At a dinner party, Yorick very vocally slagged off this sculpture, only to discover that the artist was sitting next to him.

In addition to his impressive academic, acting, and social contributions, Yorick was blessed with the birth of his son Seth in Edinburgh.
1.7 University of Essex (1976–1985)

Yorick joined Essex as a Reader in Linguistics in 1976, two years later becoming Professor and two more years later Head of the Department of Linguistics. After serving in this role for three years, he became Professor of Computer Science from 1984–1985. Professor Derek Partridge recalls:

He persuaded the vice chancellor to switch its chair from Linguistics to Computer Science. They thought they'd give him an easy ride by having him teach introduction to CS. Yorick responded "I couldn't possibly do that as I know nothing about the subject." How could he? He was a computational linguist.

Patrick Hanks recalls a related story:

Derek Partridge recalls Yorick claiming to know nothing about Computer Science. This can be matched by the linguists at Essex, at the time when
he was Professor of Linguistics, who claimed that he knew nothing about linguistics – by which they apparently meant that he was not an orthodox subscriber to their own received dogma. Yorick has always been very clear about what he is and what he does. He is an AIer, a Cambridge philosopher who became fascinated by AI. His range is considerable. Many of us have benefited, in many different ways, from his profound insights into the nature of language and meaning and the problems of making these machine-tractable.

While at Essex, Yorick lived in Clacton, as Derek Partridge recalls:

The other funny thing I recall is that at that time Yorick lived in Clacton, a cheap seaside resort close to the university. He was an invited speaker at an international conference and a poster went up and listed him as coming from the non-existent University of Clacton. This oxymoron was enjoyed by the British attendees.

While Clacton may have been thought of as cheap, Dr. Doug Arnold of the University of Essex recalls Yorick's home as a palace: "a very fine and very beautiful old – maybe seventeenth century – black and white half-timbered, and the site of many legendary parties." Dr. Pat Hayes (Institute for Human and Machine Cognition, Florida) recalls "a lovely old place, converted from two cottages, with a splendid rear garden". He remembers the ice cream machine at one weekend party:

The house that I recall was in the countryside near Clacton, not in the town itself. I vividly recall their wonderful dog, Hugo, who was about the size of a small horse, and the parties. One in particular was held on a Sunday afternoon, with many small children present, and the hosts had provided an ice-cream machine which delivered soft-cone ices free, to the kids' general delight (a typically splendid Wilksean gesture). At one point late in the afternoon one of my sons, aged perhaps five, came to me sadly and said that the dog wouldn't eat an ice-cream. We found Hugo in a far corner of the garden trying to hide himself in a hedge. Inquiries revealed that he had obediently eaten the first thirteen ice-cream cones quite happily.

John Tait also recalls Yorick's dog:

While living in Clacton Yorick acquired a large, highly intelligent, Newfoundland dog. This dog was beloved of everyone (it could be trusted to babysit a group of children at the side of a busy road) except Karen [Sparck Jones]. Karen particularly disliked its tendency to slaver over everyone and everything: but the more she tried to avoid it, the more determined the dog became to win her around: by slavering over her!

Karen remembers Yorick's "tiresome dog":

it was huge and when it waved its tail tended to sweep things off low tables, as I observed in Y's house when it broke an antique glass object.
Fig. 1.2. Yorick at Eurotra Workshop, Bangor, 1980
During this time Yorick (see Figure 1.2) was one of the founding figures of the Eurotra Project (1982–1993), the European Community's Machine Translation program aimed at developing a machine translation prototype for the European Community languages (at that time Danish, Dutch, English, French, German, and Italian – Portuguese, Spanish and Greek were added later). While generally agreed to have been unsuccessful on a technical and scientific level, it achieved remarkable success in fostering computational linguistics research in the EC, and in restoring MT as a worthwhile and respectable area of computational linguistics research. A subsequent DARPA MT initiative (1990–1995) led to the demise of rule-based MT and ushered in the IBM statistical approaches to MT and, more generally, computational linguistics.

Dr. Doug Arnold came to the Department of Language and Linguistics at Essex in the late 1970s as a graduate student and recalls Yorick's presence:

Yorick's dynamism and energy were impressive from the very first. He was a brilliant teacher – hugely entertaining and inspirational. Of course, he looked fantastic – he would come in to teach sometimes in jeans and a ratty tee-shirt, sometimes in a silk shirt and three-piece suit, or a white linen jacket, and sometimes, spectacularly, in lime-green skiing dungarees. Listening to Yorick you always knew you were in the presence of someone very, very clever, razor-sharp, with huge knowledge and absolutely at the forefront of the field, but every now and then he would come out with something truly astonishing – some insight that would leave you almost gasping, and with a clear realization that he really did operate on another level to the rest of us. But maybe more impressive than this was his openness, and complete lack of condescension: you always had the sense that he was open to ideas from any source – and not only if you were going to agree with him. I seriously doubt whether Yorick and I have ever agreed about any theoretical issue, but this was never a problem. He was hugely generous in his support – fierce and profound disagreement did not stop him giving me enormous encouragement and professional support.

He had a big impact on the department. At the time Essex was quite a good centre for AI, with people like Pat Hayes and Mike Brady, as well as Yorick, making it one of the best places in Europe, but the Department of Language and Linguistics did not have the same profile. It was moving
from a mainly language teaching department with a strong line in Applied Linguistics to having a full range of linguistics research and teaching. Yorick's appointment was part of this process, I suppose, but he really set about changing the whole culture of the Department from being somewhat insular and reserved to being more dynamic and outgoing, attracting grants (like the money from the Eurotra project, which really established Essex as a centre for Computational Linguistics) and organizing workshops and conferences which brought really international figures here.

Pat Hayes recalls the contribution Yorick made in very lean times:

There was no research funding available for AI after the disastrous 'Lighthill Report', and (to give an example of the flavor of the times) for three years running, only two or three senior lines were available for promotion over the entire university. Many younger faculty, including Mike Brady and me, decided to leave the country at that time. Yorick had only recently arrived and was then the Chair of the linguistics department: his staying, I think, is what kept Essex on the map.

Doug remarks that the transformation effect Yorick had at Essex was extraordinary:

But in some ways his main impact was outside the Department, on the University as a whole. In the early 80s the UK government decided to put some serious money into a broad programme of research called the "Alvey Programme" (it was some sort of response to the Japanese Fifth Generation Project). Yorick organized a joint, University-wide response involving people from Computer Science, Philosophy, Electronic Systems Engineering, and Language and Linguistics. This resulted in the setting up of the Cognitive Science Centre (and later the establishment of a Department of Psychology), the creation of several jobs (one of which I was lucky enough to get), and the arrival of the first Unix machines on campus. Of course, it won't be a surprise to anyone who's ever met him that Yorick is a hugely entertaining speaker, a great teacher, and a dynamic organizer and leader. What I think is less obvious is his modesty – there can be very few people of his standing who are so prone to downplaying the importance of their own work, right from his earliest work, which though nominally about MT is also a landmark in work on robustness in NLP, and an important precursor of much more recent work on lexical semantics, metaphor, and default reasoning.

In spite of all the individual attention and influence on his students, Yorick found time during this period to write and record two Open University television programs in Cognitive Psychology in 1977, together with a course book. He also co-edited the collection Automatic Natural Language Processing in 1983 with Karen Sparck Jones. John Tait recalls:
The book edited with Karen Sparck Jones was the result of a workshop on Parsing sponsored by the British Science and Engineering Research Council. The list of speakers included many extremely famous and soon-to-be-famous names, including Martin Kay, Chris Riesbeck, Gene Charniak, Ted Briscoe (then a humble Ph.D. student) and so on. The party from Cambridge arrived safely at Wivenhoe House [Pat Hayes notes "It is famous: There is a beautiful Constable painting of it in the National Gallery in Washington, DC (Gallery 57, painted 1816, oil on canvas)"] at Essex University, largely because of my local knowledge and rather in spite of Yorick's wildly misleading directions. That wound Karen up no end. [Although Pat Hayes claims "Wivenhoe House is only a few miles from a major rail commuter station, less than an hour from London", Karen recalls the place as being "pretty deeply buried in the country".] Yorick's comment was "Don't worry, they'll get here: they're all intelligent people: we wouldn't have invited them if they weren't."

Patrick Hanks recalls Yorick's love of Socratic discourse and his deep reservoir of philosophy pouring into conversations:

Memories of studying with Yorick come flooding back. Conversations during "supervisions" ranged over everything and anything – the philosophy of language and lexical semantics, certainly, but also innumerable other aspects of philosophy, culture, life, religion, politics... He took a particular delight in defending the supposedly indefensible – from Ian Paisley to Geoffrey Sampson. He demolished over-general and soggy-minded assertions with what he described as "the standard Cambridge philosophy question: Yes, but what would it be like for it to be otherwise?"

Patrick recalls that Yorick's dialectic challenges were not limited to conversations with students:

Visiting speakers were sometimes savaged, usually preceded by the electrifying signal "Forgive me, but ..." On one occasion, the visiting speaker was Steve Pulman (now Professor of Linguistics at Oxford). Yorick laid into him as usual, all guns blazing. "But, Yorick," protested Steve mildly, "You supervised the dissertation in which these ideas were developed." Yorick shot back, "Yes, but that doesn't mean that I have to believe them."

Pat Hayes notes Yorick's treatment of visiting speakers was second to none:

This was, like [his house] parties, an Essex tradition. Mike Brady, Richard Bornat and myself would also challenge visiting speakers in this way, at times rather too harshly, as we discovered when we reduced one poor guy to a state where he was unable to speak. And of course we all would also
do it to one another, mercilessly. But I think that it is fair to say that both in throwing splendid parties and in skill at skewering intellectual weaknesses (and still more, pretensions), Yorick had no equal.

Pat also recalls how Yorick distinguished himself in the performing arts:

One fond memory is of a rare Yorickian stage performance, as Caliban in an Essex University amateur production of 'The Tempest'. Yorick's voice was several decibels more powerful than everyone else's and, perhaps because he was also a head taller, he spent all his time on stage scampering about bent over like a huge spider and roaring out his lines. The rest of the cast just stood around and looked at him, apparently paralyzed with fear. I will never forget this, and neither would Shakespeare if he could have seen it.

Finally, Pat Hayes recalls how the world almost lost this great scholar:

Yorick fell through a glass partition which sliced his Achilles tendon and almost killed him (from blood loss, because of the time the ambulance took). It was a major trauma; he needed walking aids for several months, and I know that many years later, when I was visiting him in New Mexico, the old injury was still giving him some leg trouble. Of course it did not stop Yorick being Yorick.
1.8 New Mexico State University (1985–1993) The transitions of a globe trotter are not always smooth, as Yorick experienced upon his arrival at NMSU. Professor Derek Partridge admits “I sent him the advert that got him to New Mexico and so I ‘know’ the early years of that saga well”. Derek goes on: The first year we had extraordinarily bad weather. Yorick went out in the snow and slush to buy wellies3 but such things were not available in New Mexico as they never had weather like that! Then on lighting a fire in their rental house the chimney caught alight and they were living in the snow with a tarp draped over their roof. Derek recalls when Yorick sold their house in Clacton to purchase and move to their new home in New Mexico that exchange rates were very poor at the time. To make matters worse, they put their home money in the local credit union and then because of some local rumor people rushed to withdraw their savings leaving all of Yorick’s assets frozen. Fortunately the situation eventually was resolved and Yorick found a home. Yorick was hired by New Mexico as the first director of the Computing Research Laboratory (CRL). In 1985 New Mexico funded six state centers of excellence. 3
NMSU was selected in computer science and provided funding for 4 years and then needed to become self-supporting. Gene Charniak, who served on the Board of Advisors while Yorick was director, recalls "He was a very effective administrator. He also had exquisite taste in restaurants, and took us to an outstanding one just south of the border in Mexico." Dr. Louise Guthrie recalls:

Yorick built up the lab well beyond the original robotics and vision sections to include a robust natural language program that included cognitive psychology, linguistics, and modern language, with a self-supporting budget of about $4M a year. Yorick built a staff of about 65 people. He put New Mexico on the map. He was able to get one of the few Tipster contracts, which was unusual since only BBN and UMass had a contract at that time. Yorick effectively negotiated with the Deans to ensure the lab was well supported. The administration was proud to have it and supported it financially.

If you were to take a most crude measure, convert to US dollars, and sum up the grants and contracts awarded to Yorick and colleagues over his career, it would amount to over $13 million in US, UK, and European investments. While this investment in Yorick was impressive, his intellectual contribution was priceless. What was priceless to Yorick, however, was his pink Cadillac. Ted Dunning recalls:

When Yorick came to NMSU he bought this pink (inside and out) Cadillac for $800. It was old in the 80s and was a hideous machine, a great behemoth. One of Yorick's true regrets when he went back to England was giving up his pink Cadillac.

Ted also recalls Yorick's management style:

He had an unusual management style. He would find good people. He would let them contradict what he thought. For example he brought Xiuming Huang and Dan Fass who were in logic programming. Yorick didn't initially think this was plausible but it led to new ideas. The same was true with me and my work in statistical NLP and genomics. The genomics area was set apart but Yorick was open to new ideas. He initially didn't think a statistical approach was the way to do things but he allowed me to pursue this and it became a major area of research. His ability to foster research he disagreed with was unusual. It was a very cooperative group. This is why he had such good students.

Ted recalls when his wife first met Yorick later in England:

Ellen first met Yorick at the department. We walked to the parking garage to get into Yorick's monster Range Rover. My wife Ellen had just arrived for the first time in England and was still jet lagged. Being an American, she walked to the right side of the car [the driver's side in the UK].
The very tall Yorick stepped next to the 5'3" tall Ellen and reached forward to open the driver side door, noting matter-of-factly "Did you want to drive?" Ellen Dunning recalls how impressed she was first meeting Yorick:

He was a mentor for my husband so I was naturally nervous when I first met Yorick. He conveyed tremendous energy and had real substance, content, and authority when he spoke. He was intelligent, impressive, and effective. He had a wonderful sense of humor; he would put his head back and laugh.

University of Pittsburgh Professor Jan Wiebe, who joined the computer science faculty at NMSU, recalls that "professional interactions were inspiring, and he was so much fun to chat with." Sergei Nirenburg recalls Yorick's performances:

A night-time competitive swim in the pool in Hidden Valley, PA, in 1989 and a resounding rendition of Tom Lehrer's The Irish Ballad and I Hold Your Hand in Mine, in which Yorick's baritone was joined by Phil Hayes' tenor, while the host, Masaru "Tommy" Tomita, perched on a high barstool, kept dishing out free drink tickets. Well, I haven't stopped singing and playing Tom Lehrer ever since. It was also in Hidden Valley that I first saw Yorick as a stage director, willing "his" actor to excellence – gesturing exuberantly from the back of the room to Brian Slator at the lectern trying to inject more pep into the delivery.

While at NMSU, Yorick explored belief representations. Afzal Ballim implemented one of Yorick's key belief systems and they co-published Artificial Believers: The Ascription of Belief (Erlbaum, 1991). Afzal recalls those unexpected days:

When I started my Bachelor's degree at university in Ireland, if someone had said to me that I would end up in the middle of the New Mexican desert some four years later, I would undoubtedly have fixed them with an incredulous stare before calling for the men with straitjackets. Mind you, I'd probably have had to ask them first where New Mexico was and whether I'd be able to visit the US from there. Yet, four years later I found myself touching down at El Paso airport with a Research Fellowship under Yorick Wilks awaiting me in the Computing Research Lab of NMSU. I was met by another of Yorick's students, Jerry Ball, who instantly recognized me (I guess the lack of cowboy boots and Stetson was a give away). I had no idea what I was actually going to do, but I was excited by the prospect of working with a pioneer of Artificial Intelligence (as my undergraduate textbook called him) and the old Wild West seemed an appropriate place to work with a pioneer. My second surprise on meeting Yorick was not to meet a crusty old fogey, but the sort of character who would not seem out of place
playing Othello on stage, writing a skit for Stephen Fry, or welcoming you to his stately manor. In short: Noel Coward in academia. My first surprise? Finding out that a pioneer didn't have to be dead or senile.

Yorick quickly introduced me to the works of a great group of people I'd never been exposed to before (some dead, and many most likely senile). Philosophy of language soon became a favorite reading for me. As I made my way through Yorick's own voluminous writing, one subject in particular caught my attention: the notion of belief. Yorick had interested himself in how a machine could possibly model the beliefs of an individual and use those beliefs in its own reasoning or in interaction with that person. The notion of belief, however, is not the common usage of the word belief as it is not limited to religious or esoteric subject matter, but covers anything about which there can be a doubt. Indeed, pushed to the limit the difference between this notion of belief and that of knowledge blurs. One man's knowledge is just another man's belief. Long accepted truths can be challenged and found wanting (think of scientific progress in the 19th and 20th centuries where science has amused itself in proving that nothing seems absolute). Yorick based himself on Davidson's Principle of Charity, which you could sum up as saying that we're all more or less the same. So we charitably assume that everyone else believes the same things as we do. I decided to make this the subject of my work with Yorick. If I characterize my own contribution it would be to add a principle of grumpiness – it's ok to be charitable towards people's understanding of a subject, but sometimes you just know better. In my Masters and Doctoral dissertations, and in my work with Yorick that culminated in our book "Artificial Believers," the goal was to account for both typical belief (the principle of charity) and atypical belief. The latter describes situations where we know that the belief or knowledge we hold is not generally held by other people. The most obvious case is that of expert knowledge – i.e. only following training could you possibly hold beliefs/knowledge about the subject matter.

This period was a very exciting one for me, and working with Yorick was a challenging and enriching experience. As we were writing our book, both Yorick and I had, for different reasons, decided to return to Europe. The Institute Dalle Molle (ISSCO) in Geneva offered to put us up for a few months while we finished our book. My memories of this period are laden with anecdotes. Such as the time I went into the ISSCO library and found Yorick lying on the floor. I panicked, thinking he had suffered a heart attack, but it turned out that he had difficulty finding a quiet place for a lie-down! One of my fondest memories, however, happened during the writing of the book when we attended a very interesting meeting in San Marino, a small independent state in Italy. This turned out to be an occasion for me to meet many of the people I thought were either dead or senile. San Marino, you see, is the home of a small university with a professor of semiotics by the name of Umberto Eco.
Eco had decided to hold a conference to honor the life-time achievements of Willard Van Orman Quine. The list of attendees was a who's who of philosophy of language, and Quine himself commented on every article presented. The kingdom of San Marino is a small walled mountain town, with dramatic views abounding from the bastions and battlements. The setting and people made this an incredible meeting. I'll never forget myself and Yorick sitting in a little restaurant handing draft pages back and forth between us. Equally strong in my memory is one evening when all the attendees went to a night club in a nearby town. Yorick, Umberto Eco and Quine on the dance floor at 4 in the morning were quite a sight.

Earlier, I likened Yorick to Noel Coward, and they have many things in common. Yorick, however, is far more modest than Coward ever was. So, to conclude, let me sum up my impression of Yorick by borrowing from an interview that Dick Cavett did with Noel Coward in the 1970s and imagine, if you will, that the person being interviewed, and talked about, is Yorick.

Dick Cavett – You're, you… what is the word when one has such terrific, prolific qualities?
Coward – Talent.

Birmingham Professor of AI John Barnden came to the Computing Research Lab (and the Computer Science Department) at NMSU in 1987 because of the attraction of working with Yorick and others on belief representation and metaphor. He recalls his interview:

The colourful experiences I was to have there were presaged right from the beginning, during my interview trip. A component of this trip was for my wife and me to have dinner with Yorick and his wife at their elegant desert house just outside Las Cruces. After the consumption of much wine and some cognac, the suggestion was made at about 2 am that we might all four of us take a hot tub together, on the patio under the bright stars. Yorick recited Rilke poems in German, from memory of course, something matched only by the activities of Wilks' large dog, which seemed interested in, first, eating my watch, and then, no doubt on discovering its timeless merits, exploring the inner recesses of my ear with its prodigious tongue. It was at that point that I thought, New Mexico, here I come.

Indeed Yorick always took pause to enjoy life with his colleagues. Derek Partridge recalls in the 80s a New Mexico-Australian event called the "bangtail muster", a picture from which is shown in Figure 1.3. Derek explains:

It was introduced by the Australian contingent in town and seemed to involve dressing up, driving down to the river bank (the Rio Grande) and drinking large quantities of (New Mexico) champagne.

Jordan Pollack was at NMSU with Yorick from 1986 to 1988 and recalls fondly his experience:
Fig. 1.3. (left to right): Yorick, Roger Schvaneveldt and Derek Partridge by the Rio Grande
I owe him a lot since I was able to finally finish my Ph.D. while earning a living as a Postdoc. I have many fond memories of Yorick as a patient and jovial fellow, with one exception: during a game of RISK, he turned into a positive maniac, a kind of world-conquering Napoleon. My wife Carrie and I drove to Juarez with Yorick really fast one day, to have lunch, and to find pigskin chairs for his new dining room set. It was the first time we ate Pollo con Mole, and learned that you had to pay random people off so your car wouldn't get stolen from a parking lot. His house was the only adobe structure with a dumbwaiter, made of stainless steel. The dining room was on the upper level, and during parties, we might be regaled from above by a string quartet.

Yorick continued to pursue his earlier ideas of deep semantic representations that transcend languages and his belief in the value of exploiting lexicographers' knowledge as captured in (machine-readable) dictionaries. Louise Guthrie recalls:

In the early days there was a big rivalry between traditional rule-driven and statistical machine translation. Yorick was supporting an interlingual approach to machine translation (English, Spanish and Chinese) before we got involved in Tipster. Together with Ed Hovy at ISI and Sergei Nirenburg at CMU we proposed to do an interlingua-based MT (Pangloss). At the same time IBM proposed a statistical approach (Candide). The government also funded Jane and Jim Baker at Dragon. We tried to get some of the information automatically from machine-readable dictionaries. We also tried to use some dictionary-based word sense disambiguation leveraging lexicographer knowledge.
Louise Guthrie would later collaborate with Yorick and Brian Slator to co-author the book capturing their findings, Electric Words: Dictionaries, Computers and Meanings (Cambridge, MA: MIT Press, 1996). Makoto Nagao considers example-based machine translation in his chapter in this volume. Nick Nicholas (Linguistics & Applied Linguistics, University of Melbourne) recalls a favorite quote from Yorick, "the bovver boy of Artificial Intelligence" (Wilks 1992, p. 279): There is no theory of language structure so ill-founded that it cannot be the basis for some successful Machine Translation.

Following a suggestion first made at the February 1989 DARPA Speech and Natural Language Workshop in Philadelphia by Roy Byrd to form a Lexical Consortium, Yorick established the Center for Lexical Resources (CLR) at New Mexico State University in Las Cruces, New Mexico in collaboration with Roy Byrd (IBM), Ralph Grishman (NYU), Mark Liberman (UPenn), and the late Don Walker (Bellcore). The CLR was DARPA-funded for its first 3 years (self-supporting thereafter) and helped broaden access to resources for the NL research community, acting as a "broker" between lexical information providers (e.g., dictionary companies) and researchers. As part of the lexicographers' group, Judith Klavans was invited to speak at CLR on several occasions, and she recalls her trips:

First, the group of researchers was at the cutting edge of research on disambiguation, extracting information from machine-readable dictionaries, and enhancing dictionary data with information from large corpora. When I gave my talks, I remember deep questions, ones that provoked new thoughts about the field. Second, I remember visiting Yorick's home, and especially very early morning walks in the desert. Yorick was (after he adjusted to it!) proud of living in the middle of a desert; he found this to be adventuresome and unusual. He pointed out flora and fauna, which at 6:00 am glowed in the early morning sun. Third in my recollections are the meals, but of course others have commented on this aspect of expansiveness in Yorick's welcoming repertoire. But finally, I recall pre-dawn mountain hikes with Ted Dunning at the lead and others from Yorick's lab following his path. All of this together made for an experience which was tops in all aspects: intellectually, gastronomically, and aesthetically.

Yorick's research and laboratory leadership impacted the field in many ways, as measured by his publications, the publications of his disciples, and by the culinary traditions he has encouraged. One could say Yorick is a computational linguistics geek with an admirably highly-developed palate. With support from ACL, ACM SIGART, AAAI, and NSF, in January 1987, Yorick hosted Theoretical Issues in Natural Language Processing (TINLAP-3) at NMSU, bringing together the very best in NLP. About the same time he brought together leaders in artificial intelligence for a workshop at NMSU which resulted
in the 1990 book co-edited with Derek Partridge and published by Cambridge University Press called The Foundations of Artificial Intelligence: A Sourcebook. Derek recalls:

We got almost all of the people we wanted at the workshop and later published a summary in AI Magazine including a funny photograph [See Figure 1.4] of Roger Schank and Alan Bundy with the caption: Schank: "Are those the foundations of AI?" Bundy: "When's lunch?"

Contributors to the text included luminaries such as John McCarthy, Alan Bundy, Donald Michie, Roger Schank, Jerry Fodor, Terry Winograd, David Marr, Daniel Dennett, B. Chandrasekaran, Karen Sparck Jones, and Roger Needham.

Yorick was blessed with the birth of Claire while in New Mexico. But this joy of new life was to be balanced with a near-death experience for Yorick. Brian Slator tells the story:

Yorick and his family lived in a big house up the hill from the university in Las Cruces, the biggest city in southern New Mexico. The most famous figure from that area is Billy the Kid (William Bonney), and if you have seen "Pat Garrett and Billy the Kid (1973)" or even "Young Guns (1988)" then you have an image of the terrain. These are the High Plains at the butt end of the Rocky Mountains, a region of rock and sand. The air is thin (the altitude is almost 4,000 feet), and dry (the humidity is 8% much of the year). The land is mostly grey and brown. There are plants here, and it is beautiful in its own way, but it is not very green.
Fig. 1.4. Roger Schank and Alan Bundy (Partridge, D. Workshop on the Foundations of AI: An on-the-spot report. AI Magazine 7(2): 16–19. Copyright © 1986, American Association for Artificial Intelligence.)
As a graduate student, I used to run the hilly trails above Yorick’s neighborhood in the morning. These routes were beaten by hikers over the years, and passed through patches of scrub Mesquite and Prickly Pear Cactus. On a still morning it was not unusual to pass Burrowing Owls or Roadrunners on the trails. If you stopped and looked for them, it was usually possible to find a Horny Toad or a small lizard scuttling across the path. Sometimes a Jack Rabbit would be taken by surprise, and would freeze on the trail, believing it could not be seen if it just stayed still enough. There was also danger in the desert. On these early morning runs I would see the occasional scorpion. One time, a Wolf Spider (also known as a Tarantula) ventured into married student housing, which was captured into a glass jar. Another time, also in married student housing, we found a nest of Black Widow Spiders (famous for their painful poisonous bite, and because the female kills and eats the male after mating) that we also captured in a glass jar, and fed grasshoppers for a while. Nature could be dangerous, but it usually was not. Once, at a party at Yorick’s house, a small child was stung by a scorpion. It was such a rare event, nobody knew what to do. After a short moment of near-hysteria, the child was whisked to the Emergency Room for treatment. The technicians in the ER were dismissive. A scorpion sting is no more serious than a bee sting, we were told. However, a rattlesnake bite, that would be a different thing – much more serious. Encounters with really dangerous species were not unheard of, but they were not common either. Indeed, I lived in New Mexico for five years, and only knew of one rattlesnake bite the whole time. The time Yorick kissed his asp goodbye. Yorick and his family had a nice piece of property. There was a big house with a large dining area for entertaining, and a swimming pool with a diving board. There were also a couple of sheds for tools and to enclose the machinery needed to run the estate – pool filtration, air conditioning, and so forth. I wasn’t there, but apparently Yorick was working in a shed when he met a rattlesnake. These are not actually aggressive animals, but they have a short fuse, and in one account Yorick actually stepped on the beast. In any case, fangs were introduced to a lower leg in such a way that venom was transmitted from reptile to scholar. One can only try to imagine the scene. The viper desperately tries to escape while the scholar hops one-legged into the house to announce his life might be over. Following that is the trip to the hospital where the technicians take this as a serious snake bite. The standard procedures are employed, which means the anti-venom is injected to fight the effects of the rattlesnake bite. A rattlesnake’s venom is necrotic, which means it kills the tissue besides being poisonous. Most of the American rattlesnakes have been investigated to the point where there is an anti-venom for their bite. However, the anti-venom is usually poisonous itself, which makes the
patient very sick, and they are usually only effective once, so a second bite will have no anti-venom – that’s the way it works. Yorick was taken to the hospital, and given the anti-venom. Yorick had an allergic reaction to the anti-venom, and this was bad. On one hand, the snake was killing him and, on the other hand, the cure was killing him too. This is usually a scenario that leads to bad outcome. There are 20,000 snake bite deaths each year, world-wide. This was a unique case, so the doctors desperately tried another drug treatment. “I laid in bed,” Yorick told me, “going in and out of consciousness, feeling like my arms and legs were on fire. I slept most of this time, thankfully, because being awake was a special kind of agony.” “It seemed like there were days of this,” Yorick continued, as I visited him in hospital, “and I was gone more than I was here. I was delirious.” “But finally,” he added, “I woke up, and I wasn’t sweating anymore, and the first thing I see, I looked up, and there was a television special featuring Roger Schank, talking about his work. You can imagine how that made me feel.” Eric Iverson, a graduate student of Yorick’s at the time recalls: the rattlesnake had an inch and a half fangs and Yorick was rushed to the hospital. They had like a half dozen experts teleconferencing in from Arizona giving advice on what to do. After a day or so of this, Yorick got tired of it and announced that he was checking himself out. The hospital was very alarmed by this, but finally relented, provided that Yorick promised to go straight home and get plenty of bed rest. A few days later he was on a plane to a conference in Finland. Yorick whisked off to COLING in Helsinki to present the paper (Guthrie et al., 1990) co-authored with Louise Guthrie, Brian Slator, and Rebecca Bruce titled “Is There Content in Empty Heads?” Louise Guthrie notes Yorick’s attendance also was fulfilling his commitment as a permanent member of the COLING organizing and program committee. She notes the reaction of other attendees: when he showed up at COLING 90 in Helsinki with his rattlesnake story he confirmed every suspicion the Europeans ever had about the wild west! Robert Gaizauskas comments that “it illustrates the passionate commitment of scholar to vocation, overcoming the fangs of fate and miserable autocracy of hospitals.” But Roberta Catizone notes this behavior was not an isolated incident: Yes, Yorick was on a plane within days after discharging himself from hospital after the snake incident. Yorick has a history of discharging himself from hospital. On one occasion he was still in a hospital gown and on crutches trying to hitch a ride outside the hospital. Amazingly someone picked him up – I think it was a truck driver.
Brian Slator provides an epilogue to the incident:

Yorick, a world class scholar, was "famous" while at New Mexico for inviting his eminent colleagues to visit our remote outpost and speak. One of these was heard to complain that Yorick had made "everyone and his brother" make the trip at one time or another. Yorick was also famous for his own travel habits. Indeed, at one point Paul McKevitt moved into Yorick's office because he knew it would not be used for long periods. Hence, shortly after being felled by the rattlesnake, Yorick checked himself out of the hospital against doctors' wishes, in order to fly to Helsinki and give a talk at COLING-90. Thus he personified the "old adage" that "You can pump a scholar full of poison, but you can't shut him down, especially if he has already paid the airfare."

Professor Sergei Nirenburg recalls Yorick's "wistful comments about Seth coming of age at 14 in Helsinki at COLING-90":

Yorick, fresh from a rattlesnake bite that led to his going AWOL from the Las Cruces hospital, chaperoned Seth, with intermittent success, after having taken a ferry in from East Germany (having precisely calculated that such an opportunity would not present itself any more on account of the impending German unification). It was in Helsinki too that Harold Somers, Yorick and I came up with the idea of putting together Readings in MT, which promptly came out in 2004, making this probably the slowest-gestating project of this kind in our field.

Sergei Nirenburg recalls Yorick's managerial prowess:

When Yorick was about to leave New Mexico for Sheffield, he suggested that I apply for his old position of Director of NMSU CRL. During the review process we had many conversations with Yorick about CRL and NMSU, about the nature of the job and its peculiar challenges. These conversations taught me a lot. Yorick impressed me with his ability as a planner and manager to analyze real-world situations, to separate the essential from minutiae, and his readiness not to be bogged down by the latter, even at the risk of losing popularity in some circles. That was the time when I got first evidence of his remarkable prowess in assessing the abilities, preferences, motivations and, therefore, future actions, of people – from college presidents to beginning graduate students. Fortunately for me, I had many opportunities to benefit from Yorick's opinions since then too.
1.9 University of Sheffield (1993–Present)

Yorick moved in 1993 to the University of Sheffield and began to build the natural language processing group, collaborating with the speech group at Sheffield, which supported a strong colloquium series. Yorick became Head of Department
of Computer Science in 1998. He set up Sheffield's Institute for Language, Speech & Hearing (ILASH). On the personal side, Yorick again added to his family with the birth of Zoe while at Sheffield. As was by now his pattern, he was instrumental in bringing folks in to give invited talks, both to keep in touch with the world and as an opportunity for them to see what his group was doing. While he was chairman, Yorick took Sheffield from a middle-rated computer science department to a top-rated one (from a 3 to a 5) in the Research Assessment Exercise (RAE; see http://www.rae.ac.uk/panels). Yorick was honored to be elected in 2005 to the panel for Computer Science and Informatics for 2008 (active now), the national committee that evaluates departments on a scale of one to five. Louise Guthrie elaborates:

Yorick in both Sheffield and New Mexico has been able to build up these massively large groups. People never leave him. The people he hired at CRL are still in touch with him and still work with him. He's the best leader I've ever seen. He's inspirational but not stifling. He's very confident and not threatened.

Yorick's inspired a group that has explored a range of areas of natural language processing. For example, Yorick continued his research in machine translation. He collaborated with Sergei Nirenburg and Harold Somers to pull together an MIT Press Readings in Machine Translation in 2003. Harold Somers explores machine translation and the world wide web in his chapter in this collection. Another research interest of Yorick's is the derivation of dialogue and text grammars from corpora and their use for practical human-machine dialogue systems. Using this method, Yorick designed the CONVERSE system, which won the annual Loebner Competition for "most humanlike system" in New York City in 1997. Further details can be found in Machine Conversations (1999). A related current research focus of Yorick's is the creation of companion agents for shopping, learning, and socializing.

Finally, Yorick was instrumental in pursuing general architectures for natural language processing. Conceived by Yorick, Hamish Cunningham and Rob Gaizauskas, the General Architecture for Text Engineering (GATE)/JAPE architecture (http://gate.ac.uk) is installed at over 500 sites world-wide and is being distributed by US Government agencies. When the Tipster architecture was specified this was perhaps the first and most robust implementation. It incorporates a plug-and-play structure, without linguistic commitment, and enables a more intelligent distribution and use of large-scale linguistic resources. Yorick has been active in algorithms and applications of information extraction (IE), question answering (QA), and summarization. Gaizauskas, Barker and Saggion consider "Approaches to Information Access in News Archives" in their chapter in this collection. Sergei Nirenburg's chapter considers an integration of
knowledge-based and empirical methods to semantic analysis of text, including “features from Yorick Wilks’ seminal work on preference semantics”. Finally, Yorick has been active in the exploration of the use of the semantic web as a new resource for language processing, addressed in Nicoletta Calzolari’s chapter “Towards a new generation of Language Resources in the Semantic Web vision” in this collection. In his chapter, Nigel Shadbolt considers the role of preference and procedural semantics and belief spaces in the Semantic Web and Nancy Ide considers dynamic ontologies. Sergei Nirenburg recalls a performance related to language processing and the web: [I recall] An evening in 1994 on the island of Santorini in the Aegean when Yorick and I stood on the stage of a small outdoor amphitheater and delivered a dialog, in which we both called each other Socrates, on the topic of “Language processing on the information superhighway” as it then was. I must admit that being able to keep the audience laughing non-stop for almost an hour gave me more satisfaction than many other professional achievements. I am sure that Yorick also remembers that day with a smile. Another dialog, this time at the 1997 TMI conference in Santa Fe, NM. The title was “What’s in a Symbol?” It was later published in the Journal of Experimental and Theoretical AI. In this dialog, Yorick propounded the position that AI metalanguage is, in the end, the language itself, while I persisted in finding differences between the two languages. Both work on the article and the presentation itself (actually, we delivered this talk more than one time) were conducted in a totally collaborative and friendly atmosphere. But our positions still differed; I remember the mischievous delight with which Yorick reported to me that George Miller, the eminent psychologist, preferred his, Yorick’s, position over mine; so many of our colleagues are competitive; so few, Yorick prominent among them, can be so stylish about it.
1.10 Research Interests

As is evident from the above, Yorick's career spans multiple disciplines including language processing, machine translation, and artificial intelligence. Today Yorick characterizes his primary research interests as threefold: computational pragmatics, the computational lexicon, and information extraction. In the first case his primary focus is on embodied, persistent conversational personality, belief representation and manipulation, and dialogue systems using machine learning. One of his emphases has been on methods for accessing and distributing resources through an architecture like GATE. Finally he has pursued information extraction, in particular comparative evaluation of methods. Over the years, Yorick's writings address such practical applications as report generators, adaptive email categorizers, tailoring hypertext, semantic indexing of crime scene photographs, semantics of language for an operating system consultant, and information harvesting from the semantic web. A builder of
practical systems, he contributed to efforts such as software infrastructure for natural language processing and GATE. While always with an eye toward solving real-world problems, Yorick addressed these in a deep and principled manner, tackling core language processing tasks such as word sense disambiguation, dialogue processing, multimedia indexing, (multilingual) information extraction, lexicons, proper name classification, mixed-initiative annotation, language analysis, language generation, adaptive machine translation, speech acts, and conversational implicature. University of Colorado Professor Martha Palmer notes Yorick's broad impact:

The field of computational linguistics has been enriched by Yorick's presence in many different ways; by his wit and humor as well as by his multitudinous technical contributions and his early insights into the mysteries of language understanding. He made effective use of "chunking" and "probabilities" (as in preference semantics) decades before they were embraced by the field as a whole, just one of the many ways in which he led the way. His theoretical contributions have been more than matched by his penchant for developing large-scale systems that actually work, sometimes to the embarrassment of government-funded peers. His knowledge and expertise, as well as his passion and enthusiasm, have attracted groups of outstanding natural language processing researchers in places as diverse as New Mexico State University in the US and Sheffield University in the UK, perhaps his most enduring legacy.

In addition to challenges in computational linguistics, Yorick also explored theoretical foundations of AI such as knowledge representation, knowledge management, machine learning, knowledge acquisition, vision-language integration, vision and metaphor, and beliefs and agent modeling. While he pursued visions of the future such as artificial companions, he has been equally concerned about social issues such as creating responsible computers.
1.11 Publication

While of course quantity of scientific output is no measure of quality, Yorick's raw publication volume over four decades, as shown in Figure 1.5, is nothing short of impressive, reflecting his collaborative and indefatigable character. The polynomial shown in Figure 1.5 approximates Yorick's publication rate – the straight line approximating the publication rate has an upward slope of .36! Not only has he appeared in print in over three hundred publications, but Yorick has in fact accelerated his publication rate over time. This is illustrated in Figure 1.6, which shows publication over 5-year intervals over the same 40 years. Of Yorick's over three hundred publications during this period, 36% were conference papers, 18% journal articles, 9% technical reports, 18% book chapters, and 3.6% books, as illustrated in Figure 1.7.
Fig. 1.5. 40 years of Yorick's publications, 1964–2004 (from Yorick's on-line CV, http://www.dcs.shef.ac.uk/~yorick/cv.html)

Fig. 1.6. 40 years of Yorick's publications in 5-year increments (Yorick's CV): 1960–1964: 1; 1965–1969: 14; 1970–1974: 23; 1975–1979: 23; 1980–1984: 30; 1985–1989: 34; 1990–1994: 51; 1995–1999: 71; 2000–2004: 78

Fig. 1.7. Yorick's publication types from 1964 to 2004 (number of publications): Conference Paper 120; Journal Article 62; Chapter in Book 59; Other Technical 39; Technical Reports 30; Book 12; Misc 8; PhD Thesis 1
Fig. 1.8. Yorick Wilks co-author index (number of joint publications): 248 wilks; 24 gaizauskas; 24 cunningham; 22 guthrie; 15 catizone; 13 stevenson; 13 bontcheva; 11 ciravegna; 10 humphreys; 10 farwell; 10 ballim; 9 saggion; 8 slator; 8 pastra; 8 dingli; 7 fass; 6 peters; 6 nirenburg; 6 azzam; 5 setzer; 5 mckevitt; 5 maynard; 4 webb; 4 tablan; 4 petrelli; 4 krotov; 4 hamza; 4 cowie; 4 brewster; 3 wakao; 3 ursu; 3 plate; 3 partridge; 3 macdonald; 3 lee; 3 hepple; 3 guo; 3 dalli; 3 bruce; 3 barnden; 2 xia; 2 pustejovsky; 2 piao; 2 pazienza; 2 mccauley; 2 levy; 2 kit; 2 dasmahapatra; 2 clough; 2 bien; 2 basili; 2 alani; 1 yamada; 1 wilen; 1 watts; 1 wang; 1 vindigni; 1 velardi; 1 takemoto; 1 tait; 1 sugimoto; 1 steele; 1 sparck-jones; 1 somers; 1 schank; 1 rodgers; 1 rigau; 1 perkins; 1 padro; 1 mitchell; 1 liske; 1 lehnert; 1 koizumi; 1 khosravi; 1 huyck; 1 humphries; 1 huang; 1 hovy; 1 herskovits; 1 helmreich; 1 hartley; 1 harada; 1 frederking; 1 foster; 1 dunning; 1 dorfman; 1 doran; 1 dietrich; 1 charniak; 1 chapman; 1 carbonell; 1 candelaria de ram; 1 battacharia; 1 arundel; 1 arioka; 1 aidenejad
number of Yorick’s publications reveal his literary talent beyond scientific and technical publication, to include articles addressing computing and social responsibility, “The poet as anthropologist”, and 13 BBC Broadcasts on “Ideas and Ideals”. Just as words can be understood by the company they keep, co-authorship analysis reveals Yorick’s associations to include such well known names as Roger Schank, Eugene Charniak, Terry Winograd, Bill Woods, Sergei Nirenburg, James Pustejovsky, Robert Gaizauskas, Paul McKevitt, and Mark Stevenson. Yorick has co-authored five or more articles with each of (in alphabetical order) Afzal Ballim, Roberta Catizone, Fabio Ciravegna, Hamish Cunningham, Alexiei Dingli, Robert J. Gaizauskas, Louise Guthrie, Kevin Humphreys, Katerina Pastra, Horacio Saggion, and Mark Stevenson. To get a quantitative perspective on Yorick’s publication corpus, Brian Slator (North Dakota State University Computer Science Department)
computed a YW co-author index, the number of articles co-authored or co-edited with Yorick based on publications listed in Yorick's on-line Curriculum Vitae as of September 16, 2005. Figure 1.8 shows the YW co-author index ordered by most frequent indexes. Brian notes "The YW index is along the same lines of the famous 'E number' counting 'publication distance' to/from Paul Erdos. The YW Co-Author Index is related to, but conceptually distinct from, the more famous 'YW-Number' that counts 'publication distance' from YW. For example, everyone in this table has a YW-Number of one. Those who have co-authored with people on this list (but not YW) have a YW-Number of two. Those who have co-authored with THEM have a YW-Number of three, and so on." See http://www.cs.ndsu.nodak.edu/~slator/html/private/yw-co-author-index.html.
1.12 Advisor and Mentor

Yorick influenced the lives of multiple generations of students. Table 1.1 lists graduate students or researchers who worked with or were supervised by Yorick at Essex, NMSU, and Sheffield. In the late seventies Dr. Doug Arnold was supervised by Yorick at the University of Essex. Doug recalls his extraordinary experience: we never agreed on any major ideas, but yet that never seemed to get in the way of progress. Yorick was always positive and helpful, which was one of his great strengths. Doug recalls that Yorick was always good with a turn of phrase. Some of Doug's own favorite memories of Yorick at Essex can be summed up in quotations:

– If you owe the bank a hundred thousand pounds you've got a problem. If you owe them a million, they've got a problem. [This seemed to sum up his approach to personal finance, but I think he may have applied similar principles to the Department. Later, when I was dealing with the Research Grants, I often saw the University Finance Officer go pale at the mention of his name.]

– It's no longer a charming eccentricity not to be able to operate a photocopier. [To a Departmental Meeting where people were complaining about some problem with the photocopier.]

– No, he's just Oxford Common Room in a dirty tee-shirt. [On an Oxford Philosopher, now deceased, in response to the question of whether being in America had really changed his way of behaving.]

– If you really write a vituperative memo to a colleague, you should never send it at once. You should put it in a filing cabinet on ice overnight, and read it over again coolly and carefully the next morning, and then add a much harsher final paragraph and send it. [On the occasion of sending such a memo.]

– Look, it's only a PhD, for chrissake. [To a PhD student suffering writer's block.]
Brian notes “The YW index is along the same lines of the famous ‘E number’ counting ‘publication distance’ to/from Paul Erdos. The YW Co-Author Index is related to, but conceptually distinct from, the more famous “YW-Number” that counts “publication distance” from YW. For example, everyone in this table has a YW-Number of one. Those who have co-authored with people on this list (but not YW) have a YW-Number of two. Those who have co-authored with THEM, have a YW-Number of three, and so on”. See http://www.cs.ndsu.nodak.edu/∼slator/html/private/yw-co-author-index.html.
Yorick Alexander Wilks: A Meaningful Journey
31
Table 1.1. Graduate students/researchers who studied with or worked with Yorick

Essex: Doug Arnold, Dan Fass, Xiu-Ming Huang, Bob Marsden

NMSU: Homa Aidinejad, Matt Anderson, Roman Antosik, Gregg (Skip) Bailey, Jerry T. Ball, Afzal Ballim, Imre Balogh, Tim Breen, Rebecca Bruce, Clint Burleson, Sylvia Candelaria de Ram, Roberta Catizone, Li Chen, Alamgir Choudhury, Richard Davis, Sunil Desai, Carol Dunning, Chris Esposito, Richard Fowler, Takahiro Fukussima, Lois Gerber, Rebecca Gomez, Niall Graham, Steven Graves, Cheng-ming Guo, Eric Iverson, Mohammed Khan, Hemant Kirpekar, Wendy Lawrence, Min Liu, Patty Lopez, Debbie McDonald, Paul McKevitt, Mark Molander, Lisa Onorato, Kelly Perryman, Heather Pfeiffer, Tony Plate, Edward Plumer, Martin Rajman, Barry Rappaport, Alice Sandstrom, Uttam Sengupta, Brian M. Slator, Carol Soderlund, Gees Stein, Paul Vandenberg, Fuliang Weng, Zhi-Yong Zhao

Sheffield: Kalina Bontcheva, Christopher A. Brewster, Tomas By, Samuel John Chapman, Paul Clough, Michael Conway, Hamish Cunningham, Angelo Dalli, Marin I. Dimitrov, Alexiei Dingli, Ted Dunning, David Guthrie, Fang Huang, Hammid Khosravi, Chun Yu Kit, Alexander Krotov, Mark G. Lee, Wei Liu, Hrafn Loftsson, Ekaterini Pastra, Wim Peters, Mark Stevenson, Valentin Tablan, Diego Uribe, Marius Cristian Ursu, Nick Webb, Chern Nam Yap
Yorick influenced not only his own students but those of others as well. Professor Jim Hendler (University of Maryland) recalls Yorick’s visit to Yale in the late 1970s: Yorick visited Yale to talk with Schank’s students (I was one) in the late 70s. We had used his book “Computational Semantics” in a course that year, and were all excited to have this eminent British AI researcher coming to visit. Yorick gave several talks, attended seminars, and did a number of informal interactions with the students. As during that period Yorick was smoking cigarettes using a long cigarette holder, had his, err, not quite so boyish figure, and lectured with a strong British accent, I do remember that we were all very fond of watching him talk – he reminded us so much of Burgess Meredith’s character “The Penguin” in the then-popular Batman show.
Yorick also influenced many researchers throughout his career. Tony Plate, who was a fellow of the CRL while Yorick was director, wasn't supervised by Yorick but did interact with him frequently:

I always tell people that my 2 years at CRL at NMSU were the most intellectually fun and stimulating of my academic career. That was largely because of the great atmosphere Yorick created there, and also the eclectic bunch of people he brought there.

Mike Rosner, Head of the Department of Computer Science and AI at the University of Malta, recalls first hearing about Yorick when he was studying for the Postgraduate Diploma in CS at the University of Cambridge in 1973. Mike recalls:

Yorick was at Stanford at the time. Karen Sparck Jones, at the time offering a special course on NL techniques, described him as quite a character who wore, as I recall she put it, "interestingly flamboyant shirts". I met him a couple of years later at the University of Essex, where, in my mind, he represented the slippery side of AI, refusing to be pinned down with first order logic in the way envisioned by my then supervisor, Pat Hayes. When Pat left to work in the USA, Yorick, unprompted, stepped in quite unofficially as a kind of surrogate supervisor, an act for which I remain grateful.

Eric Iverson similarly recalls Yorick's generosity:

When I came to Sheffield in 1994, Yorick offered to drive me to Manchester to pick up my air cargo. We rented a transit van, and the trip out was fairly uneventful. On the way back Yorick insisted on taking a short cut. He explained to me that we were on a road that the authorities had deemed so dangerous, they took it off the map. But he assured me it was perfectly fine. So we were driving through the Peak District with all my earthly possessions in the back, on this road that was so narrow, I could practically reach out and pet the passing flocks of sheep.

As a student, Eric Iverson remembers meeting Marvin Minsky, Doug Lenat, Pat Hayes, Geoffrey Nunberg, and other leading scientists via Yorick:

Usually when Yorick took a visiting speaker out to dinner, he'd bring a grad student along selected from a revolving list. So I managed to get in on several free meals over the years. In grad school that's practically the currency of the realm.

Patrick Hanks similarly recalls Yorick's kindness in what has the hallmarks of a classic Monty Python skit:

His generosity is legendary, as is his desperately controlled impetuosity. Occasionally they clash. A prospective research student phoned him one day. I heard one side of the conversation.
“Give me your number and I’ll call you back.” “Yes, I understand you’re in a hotel lobby. I’ll call you back. It will be less expensive” “ No, I insist. Just tell me your number.” “No, it’s no trouble. Just tell me the number you are speaking from.” Yorick put the phone down and dialed. “Forgive me, this won’t take a moment,” he said, optimistically. “Damn. Busy signal.” Re-dial. “Damn.” Re-dial. “Damn, damn, damn.” Dials Inquiries. “Please give me the number for the X hotel in Manchester.” Dials. “Hello, Reception Desk please. Yes Yes Look, there is a Mr. X at a public telephone in your lobby and I need to speak to him rather urgently. No, it’s very urgent Yes, would you? I’ll hold.” Silence. Some minutes later, the departmental secretary appeared. “Yorick, a Mr. Y is on the phone. He says he’s at a phone in a hotel lobby and apparently it won’t take incoming calls.”
1.13 Professional Service and Honors

In addition to his mentorship of next-generation researchers, Yorick has been an active member of several professional associations including the Association for Computational Linguistics, the Society for the Study of AI and Simulation of Behaviour, the Association for Computing Machinery, the American Association for Artificial Intelligence, the Cognitive Science Society, the Mind Association, the Aristotelian Society, and the British Society for the Philosophy of Science. Yorick has been formally recognized on numerous occasions throughout his professional career. Some of his more significant accolades include election to the UK Computing Research Council (2004), Fellow of the American (1991) and European (1998) Associations for Artificial Intelligence, visiting Fellow of the Oxford Internet Institute, fellow of the EPSRC College of Computing (1997), Visiting Fellow, Trinity Hall, Cambridge (1991), twice visiting Sloan Fellow (Yale in 1979 and Berkeley in 1981), and invited participant in the Nobel Symposium on Language, Stockholm (1980). He also has served on advisory committees for the National Science Foundation and the Carnegie Mellon University Computer Science Department, and on the boards of some fifteen AI-related journals.
1.14 Yorick the Person

Multimodal, multilingual, multicultural, and multidimensional. No, this is not a new intelligent system Yorick has just designed but rather what he himself manifests. He loves languages, as a speaker and writer of English, French, German, Italian, and Spanish. A man of imposing physical stature, Yorick has a zest for life. He has experienced it to the fullest as an actor, fiction and non-fiction author, social observer, mentor, teacher, businessman, leader. He is blessed with daughters Claire, Octavia and Zoe and son Seth. His deep and loud voice and full-of-life
laugh is unmistakable: a serious intellect but never too serious for a witty remark or a good joke. Derek Partridge from the University of Exeter characterizes Yorick as “outgoing, happy, generous to a fault and always fun to be with. His talks are always thoroughly entertaining.” Sergei Nirenburg recalls Yorick’s boylike mischievousness:
[I recall] An afternoon in Kawasaki, Japan where Yorick and I went to give a joint talk at NEC Labs. We were walking past an elementary school at which classes ended for the day and were suddenly surrounded by a gaggle of some two dozen little Japanese girls, in uniform and with identical red backpacks, laughing and pointing at these two substantial and unusual creatures in suits. While I just weakly smiled back, Yorick made a mockmenacing face and pretended to lunge at them, to the utter delight of the little anthropologists who shrieked in mock horror and let us through.
Sergei further comments that this man of rich culture and cognition lamented the emergence of "international English," the language used at most, if not all, international conferences nowadays. This development, said Yorick, forces him to express his thoughts using a deplorably impoverished subset of his rhetorical capacity. Citing Yorick's "use of debate to learn not compete", "delivery and timing of a good actor", and "large inventory of rhetorical 'exit strategies' to smooth any possible wrinkles in a talk or a conversation", Sergei notes "our field should feel lucky that Yorick decided to be a part of it." Oxford Professor Stephen Pulman, FBA, comments on the great fortune of having Yorick in town:
It’s been very lucky for us that Yorick now lives in Oxford. He always has something good to say in seminars, and is very encouraging to students. It should also be said that Yorick’s social contributions to the Oxford scene are outstanding: lots of parties, fine food (Yorick is a talented cook), amateur dramatics, and conversation on the widest range of topics of anyone I know. You always leave Yorick’s company feeling good about life – this is a rare gift to have.
Yorick is a talented actor, having performed in many amateur Gilbert and Sullivan productions. His friends characterize him as "still very dramatic". Dr. Ted Dunning recalls Yorick's voice and stage talents:
When he was in Santa Monica he had a good enough voice to read editorials on the Evening News. This was impressive given he was picked up in the LA market. While at Essex Yorick played the half-man, half-fish creature Caliban in The Tempest in a garbage bag. At NMSU he was going to play Watson to Mark Medoff's Holmes but he wasn't able to do it in the end. At Sheffield he played the Cardinal and scared the bejesus out of people. He was amazingly good.

Ted's wife Dr. Ellen Dunning recalls he was "absolutely terrifying" as the cardinal (see Figure 1.9):

His voice was amazing, incredible. He had tremendous flexibility and a stunning manner. He had so much power – not so much in volume but in the terrifying mood he conveyed. The play won an award for Best Drama and Best Stagecraft.

Fig. 1.9. Yorick as the Cardinal in the play Women Beware Women by Middleton. Sheffield University Drama Society, 1996

Gene Charniak recalls after a meeting in London going to the theatre with Yorick. Gene's wife, Lynette, asked Yorick what the main actor was thinking about as he did his performance. Yorick responded matter-of-factly "He's probably thinking about what he's going to have for dinner."

Yorick has been a non-stop adventurer, from climbing Kilimanjaro to racing down the Alps in a makeshift sled. Adventure seems to find Yorick everywhere. Dr. Tomek Strzalkowski recalls celebrating the kickoff of the AMITIES project with
Jim Bass and Mats Ljungqvist by smoking chocolate cigars in a Paris hotel and setting off the fire alarm in the lobby! Then there is the story Ted Dunning tells:

On the same trip, we stayed in the Algonquin Hotel, purely for literary reasons. The place was run down, but the rooms were large and obviously had once been very nice. The kicker was that Yorick was wearing his t-shirt with the first triplet of the Inferno on the front:

Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.

After we had checked in, the night clerk recited the next triplet without a prompt.

Ah me! how hard a thing it is to say
What was this forest savage, rough, and stern,
Which in the very thought renews the fear.

Definitely a hotel with a literary history if the desk clerk knows what happens after the poet finds himself in a dark wood in the middle of the road of his life.

And what a journey Yorick has had. May we all enjoy the adventure as it continues!
Acknowledgements

I thank Doug Arnold, Afzal Ballim, John Barnden, Hugh Brogan, Alan Bundy, Nicoletta Calzolari, Roberta Catizone, Gene Charniak, Ted and Ellen Dunning, David Farwell, Helen Flechsenhaar, Rob Gaizauskas, Louise Guthrie, Patrick Hanks, Jim Hendler, Eric Iverson, Margaret King, Paul McKevitt, Sergei Nirenburg, Andrew Ortony, Derek Partridge, Stephen Pulman, Tony Plate, Jordan Pollack, Graeme Ritchie, Mike Rosner, Michael Rowan-Robinson, Roger Schank, Roger Schvaneveldt, Brian Slator, Tomek Strzalkowski, Karen Sparck Jones, John Tait and Jan Wiebe for their contributions to and corrections on earlier drafts. Thanks to Barbara Reismann and Paula MacDonald at MITRE for tracking down Yorick's publications. I thank Carole Hamilton of AAAI for permission to reprint the Schank/Bundy photo. And finally, we are all indebted to Khurshid Ahmad, Christopher Brewster, and Mark Stevenson for their idea and efforts to bring together this historic collection.
References

Ballim, A. and Wilks, Y. 1991. Artificial Believers. Norwood, NJ: Erlbaum.
Charniak, E. and Wilks, Y. (eds and principal authors). 1976. Computational Semantics – an Introduction to Artificial Intelligence and Natural Language Understanding. Amsterdam: North-Holland. Reprinted in Russian, in the series Progress in Linguistics, Moscow, 1981.
Guthrie, L., Slator, B., Wilks, Y. and Bruce, R. 1990. Is There Content in Empty Heads? Proceedings of the 13th International Conference on Computational Linguistics (COLING-90). Helsinki, Finland, Aug. 20–25.
Mani, I. 2005. Yorick Wilks. Elsevier Encyclopedia of Language and Linguistics. 2nd Edition, edited by Keith Brown, Elsevier.
Nirenburg, S., Somers, H. and Wilks, Y. (eds.). 2003. Readings in Machine Translation. Cambridge, MA: MIT Press.
Partridge, D. and Wilks, Y. (eds. plus three YW chapters and an introduction). 1990. The Foundations of Artificial Intelligence: A Sourcebook. Cambridge: Cambridge University Press.
Sparck Jones, K. and Wilks, Y. (eds.). 1983. Automatic Natural Language Processing. Chichester: Ellis Horwood. (Republished in 1984 by New York: Wiley.)
Wilks, Y. 1968. Argument and Proof. Cambridge University PhD thesis.
Wilks, Y. 1972. Grammar, Meaning and the Machine Analysis of Language. London and Boston: Routledge.
Wilks, Y. 1976. Frames, Scripts, Stories, and Fantasies. In the Proceedings of the International Conference on the Psychology of Language, Stirling, 1976, and in Pragmatics Microfiche 1977. Reprinted in H. Stegentritt (ed.). Regenburg Romanistentag. De Gruyter: Berlin.
Wilks, Y. (ed.). 1990. Theoretical Issues in Natural Language Processing. Norwood, NJ: Erlbaum.
Wilks, Y. 1992. Form and content in semantics. In Rosner, Michael & Johnson, Roderick (eds.) Computational Linguistics and Formal Semantics. Cambridge: Cambridge University Press. 257–281.
Wilks, Y., Slator, B. and Guthrie, L. 1996. Electric Words: Dictionaries, Computers and Meanings. Cambridge, MA: MIT Press.
Wilks, Y. (ed.). 1999. Machine Conversations. Kluwer: New York.
Wilks, Y. 2000. Is Word Sense Disambiguation Just One More NLP Task? Computers and the Humanities, 34(1): 235–243.
Wilks, Y. (ed.). 2005. Language, Cohesion and Form: Selected Papers of Margaret Masterman. Cambridge: Cambridge University Press. http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521454891
Wilks, Y. (in press). Machine Translation: Its Scope and Limits. New York: Cambridge University Press.
Publication and co-author index from the DBLP Bibliography Server, http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/w/Wilks:Yorick.html.
2 Metaphor, Semantic Preferences and Context-Sensitivity

John A. Barnden

School of Computer Science, University of Birmingham, Birmingham, United Kingdom
2.1 Introduction

In this chapter, the main reference point in Yorick Wilks’s work is Wilks (1978). This extends his preference-based semantics (Wilks, 1975) to handling metaphor. Wilks (1978) covers a number of issues that deserve fresh comment and that are central to problems about metaphor that are still unresolved – theoretically, let alone computationally. Also, the legacy of the 1978 work stretched through to, and beyond, the work of Fass (1997) on an approach to metaphor and metonymy based on “Collative Semantics.” This was a development from Wilks’s Preference Semantics. Fass’s system meta5 implemented an approach to metaphor (and metonymy) derived from Wilks (1978). Wilks himself also continued work on metaphor in conjunction with various researchers, including myself (see, e.g., Ballim, Wilks & Barnden, 1990, 1991; Fass & Wilks, 1983; Wilks, Barnden & Wang, 1991, 1996).

The Wilks (1978) approach to metaphor is very roughly as follows. Various types of words – notably verbs and prepositions – have “preferences” for the semantic types of words that they semantically liaise with in an utterance. E.g., a verb has preferences for the semantic types of its case-role fillers, such as the agent. These preferences are like the selection restrictions previously suggested by other authors (Katz & Fodor, 1963), but a key difference is that violations of preferences are tolerated, indeed expected, and are used to cause further processing to come up with metaphorical utterance interpretations (as one main possibility). The main example used is the famous “My car drinks gasoline.” The initial interpretation of the sentence is a complex internal data structure that can be grossly summarized as [my-car drink gasoline]. The verb drink prefers an animate agent, but this is violated by my-car. Therefore, an encyclopaedic knowledge structure for car is searched in an attempt to find something about cars that matches [my-car drink gasoline]. This could lead, say, to the item [a-car-engine use liquid] being found. As a result, the final interpretation of the sentence is [my-car use gasoline]. That is, because of the match found, drink has been replaced by use.
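The following minimal Python sketch is purely illustrative of the style of processing just described; it is not Wilks’s actual system, and the type table, preference list and knowledge triples are invented for exposition:

```python
# Illustrative sketch of preference-violation-driven reinterpretation:
# detect a violated agent preference, then search an encyclopaedic structure
# for a matching triple and substitute its verb. All data below are invented.

TYPES = {
    "my-car": {"physical-object", "machine"},
    "gasoline": {"liquid", "substance"},
    "a-car-engine": {"physical-object", "machine-part"},
}

# Verb -> preferred semantic type of its agent (violations are tolerated,
# not treated as blocking interpretation).
AGENT_PREFERENCE = {"drink": "animate", "use": "machine"}

# Crude encyclopaedic knowledge about cars, as [agent, action, patient] triples.
CAR_KNOWLEDGE = [
    ["a-car-engine", "use", "liquid"],
    ["a-car", "have", "wheels"],
]

def prefers(verb, agent):
    """True if the agent satisfies the verb's agent preference."""
    wanted = AGENT_PREFERENCE.get(verb)
    return wanted is None or wanted in TYPES.get(agent, set())

def interpret(triple):
    agent, verb, patient = triple
    if prefers(verb, agent):
        return triple                      # no violation: accept as-is
    # Preference violated: look for a knowledge triple whose patient type
    # matches the utterance's patient, and adopt that triple's verb instead.
    for k_agent, k_verb, k_patient in CAR_KNOWLEDGE:
        if k_patient in TYPES.get(patient, set()):
            return [agent, k_verb, patient]
    return triple                          # fall back to the ostensible reading

print(interpret(["my-car", "drink", "gasoline"]))   # -> ['my-car', 'use', 'gasoline']
```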
Wilks (1978) himself said that a person reading the sentence would surmise that the car uses a great deal of gasoline, not just uses it. However, he said that this is an “idiomatic” part of our interpretation and is not accounted for by his process – indeed, he said that no reasoned basis could be provided for it. We will take up this matter below. Our concern will not be with the details of the process or with what it is able to achieve on the basis of discovered preference violations. Rather, our concern is to cast the approach as a special case of metaphor interpretation approaches that are largely or wholly utterance-based. Fully utterance-based approaches operate by coming up with a metaphorical interpretation based just on the utterance in isolation, without taking context into account. The present paper reacts against this, in the sense of arguing that in many cases the process of metaphor interpretation should in large measure be contextual-issue-based, not just utterance-based. That is, interpretation should as far as possible be guided by specific issues raised by the surrounding discourse (or by context in a broader sense). We will see by way of the “ATT-Meta” approach and implemented computer program for metaphor processing (Barnden, 1998, 2001a, b; Barnden et al., 2002, 2003, 2004; Lee & Barnden, 2001a, b) how specifically this can happen, once the specific issues have been extracted from context. Many metaphor researchers have affirmed the importance of context in metaphor interpretation (e.g., Cameron, 1999; Gibbs & Tendahl, 2006; Giora, 1997; Hobbs, 1990; Leezenberg, 1995; Peleg, Giora & Fein, 2001; Stern, 2000). Recently, Carston & Wilson (2005) have suggested the need for a context-based interpretation process roughly on the lines of ATT-Meta’s, as a way of dealing with metaphor within Relevance Theory (Sperber & Wilson, 1995). Indeed, Wilks (1978) himself talked about context, pointing out that context could affect how the interpretation process proceeds, thus making it less than purely utterance-based. However, context-aware work on metaphor has been short on proposals for detailed mechanisms whereby context can affect interpretation. The main exception is probably Hobbs (1990), and indeed the ATT-Meta account of metaphor interpretation is strongly related to his. (For differences, see Barnden, to appear.) In particular, the contextual guidance that the ATT-Meta approach proposes is, as stated above, based on the idea that context raises very specific issues, rather than just providing, say, information about the general subject matter of the discourse. Examples will be given below. Naturally, interpretation normally has to be based partly on the utterance itself, otherwise it hardly counts as interpretation. It is possible to imagine situations in which an utterance is totally incomprehensible in its own right (whether through, say, corruption or unfamiliar vocabulary) but nevertheless its context is so definite that the meaning of the utterance can be guessed anyway. This fringe possibility aside, interpretation rests in some way both on information from the utterance and information from context. Although advocating contextual-issue-based approaches, this chapter still affirms the potential for a strong heuristic role for Wilksian semantic preference violations. Violations can still be an important heuristic both in guiding interpretation and in
suggesting that the utterance is metaphorical in the first place, even though there are many cases of metaphor that do not violate preferences.

The plan of the rest of the chapter is as follows.
• First there are some observations about the limitations of using preference-violation as a guide to the presence of metaphor (drawing upon other authors but also adding observations of my own). Nevertheless, the discussion preserves the point that metaphorical expressions often do violate preferences.
• We then look at the main general problem with purely utterance-based metaphor processing – the indeterminacy of interpretation – and go on to see how making it partly based on specific, contextually-raised issues can be beneficial.
• In discussing the use of contextual issues, we see in outline the way the author’s ATT-Meta approach works. The current implementation of this approach takes a particularly strong line on the usage of contextual issues, namely by using them to generate backwards reasoning that finally connects up with the utterance itself. However, we discuss the possible desirability of having a mix of this backwards, contextual-issue-driven style of processing and an utterance-driven style that goes forwards from the utterance to finally meet up with the contextual issues.
• We see how the ATT-Meta approach provides an alternative approach to the car-drinking-gasoline example, but also see how preference-violation could still be a useful guide. In addition, we see that the use-a-great-deal interpretation mentioned above could in principle be obtained.
• The chapter culminates by sketching an overall framework for metaphor interpretation stretching from the case of stock phraseology, through cases of minor variation of stock phraseology, then through the more open-ended type of metaphor ATT-Meta is mainly addressed at, and on finally to completely novel metaphor. We see that preference violations could be one useful heuristic guide, mainly in the last two cases.
A notable simplification in this chapter, for the sake of brevity, is that we do not attend to a particularly important complication. This is that preference violations can signal phenomena other than metaphor – particularly metonymy, as studied in Fass (1997).
2.2 Limitations of Preference Violations

Wilks (1978) gives the following expressions as examples of preference-violating metaphorical utterances. The presumed preference-bearing items are italicized and some preference-violating elements underlined. For example, in (a) the verb “taken” bears a preference for its patient being a movable physical object, whereas “line” does not fall in this category.
(a) the line taken by the Shadow Cabinet
(b) a Scottish Assembly should be given no executive powers
(c) lead to the break-up
(d) break-up of the United Kingdom
(e) Britain tries to escape Common Market
(f) my car drinks gasoline
(g) my car drinks mud
(h) my car chews gasoline
(i) John’s new car runs on diesel
(j) [John’s car] does 100 m.p.h.
(k) John grasped the idea
(l) I see what you mean
(m) An ambulance driver went through red traffic lights.
For the sake of argument let us agree with Wilks that these examples do plausibly break preferences. The trouble is that variant expressions of many of them, at least, can be constructed that are still metaphorical but for which it is much more difficult to argue that they contain preference-breaking. In the following variants of some of (a–m), the underlining indicates items that correspond to items in (a–m) but that now fail to break the preferences of the italicized verbs.
(a′) the line drawn by the Shadow Cabinet ([members of] the Cabinet could literally draw a line, but that is not what’s meant here¹)
(b′) a Scottish Assembly should be given no explosive device ([members of] the Assembly could literally be given an explosive device, but that is not necessarily what the exhortation is about; rather, “explosive device” could be used metaphorically to refer to something abstract such as a special, powerful law)
(e′) Britain tries to escape the prison warders of the Common Market (using “prison warders” to refer metaphorically to functionaries of the Common Market, but the escaping is not physical escaping as one might do with real prison warders)
(f′) my car floated down the street (a car could literally float on water, and a street could be flooded; but the sentence could also mean that the car motored smoothly and effortlessly down the dry street)
(l′) I see the picture you’re painting (could be literal or metaphorical).
The central point here – that a sentence that is clearly metaphorical in context need not show internal signs of being metaphorical, let alone any internal sign as simple as a preference violation – has been pointed out in many ways by numerous authors.
¹ We do not dwell on the complication that “the Shadow Cabinet” may break an agent preference for “draw” and should be interpreted metonymically to refer to its members. This point applies also to example (b′).
However, the point is made especially vividly by doing simple manipulations on utterances that do break preferences or show other internal signs of metaphoricity. A further example is as follows. It arose in “e-drama”, i.e. dramatic improvisation or role-play conducted over computer terminals. Zhang, Barnden, Hendley & Wallington (2006) have studied e-drama in a research project that is centrally concerned with affect-laden metaphor. In an improvised dramatic session concerning school bullying, a character named Mayid has already insulted another character named Lisa by calling her a “pizza,” developing a previous “pizza-face” insult. Mayid then says: “I’ll knock your topping off, Lisa” – a theoretically intriguing spontaneous creative extension of the “pizza” metaphor (itself possibly a metonymic extension of the earlier “pizza-face” metaphor). The noun “topping” does not violate any reasonable preference of “knock off.” In this example, we can also consider what would happen if the understander could attach preferences to a word such as “your.” In the bodily-part-of sense of “your,” the word “topping” would violate a preference for a body-part. The trouble is there is no clue internal to the sentence that “your” should have this particular sense as opposed to a general possession sense, so that “your topping” could in principle amount to, for example, “the pizza topping you are eating.” So there would still be no preference-violation overall. On the stance adopted in the ATT-Meta project (see Barnden et al., 2004, and other references above), metaphor always involves viewing something as something else that it is not, e.g. viewing a car as an animal, a mind as a physical space, a person as a pizza, Iraq as Vietnam, or whatever. So, there is always a violation of reality in some sense. However, the difference between the semantic types (e.g., inanimate object versus animate being) of the two sides of the metaphorical view can not only be narrower than the semantic type-separations envisaged in proposals for such things as semantic preferences, but in fact they can be arbitrarily narrow. For example, take “Tuesday is honorary Monday this week” [heard in conversation], uttered because the real Monday was a holiday. Clearly, it would be difficult to strike a semantic type difference between Monday and Tuesday. Other examples of metaphor that involve narrow semantic-type differences are “Iraq is today’s Vietnam” “Purple is this year’s black” and “Jules Verne is France’s H.G. Wells.” (Closeness of semantic type is not peculiar to examples of the syntactic form “A is B’s C.” A variant of the Iraq/Vietnam example could be “We’re in Vietnam again” spoken by someone caught up in the Iraq situation.) Having said all this, metaphorical utterances often do violate semantic preferences that can plausibly be postulated. For instance, in most conceptual metaphors (metaphorical views) of Lakovian style (Lakoff, 1993) a target-domain item and its corresponding source-domain item are of widely different semantic types, making it reasonably likely that the difference will be manifested in a sentence based on the conceptual metaphor. For example, consider the conceptual metaphor of LOVE AS JOURNEY (Lakoff, 1993), where the source domain is physical journeys and the target domain is love relationships. A crossroads in the source domain could then correspond to a difficult decision point in the love relationship. This could come up in a sentence such as “Jack and Jill were at a crossroads in their marriage.”
A semantic preference of “in” for the two things it relates to be of suitably related types (e.g., if X is in Y and X is a physical object then Y should be a physical region or container) could then signal the presence of metaphor. Of course, in a context that is clearly about a love relationship the sentence could also have been simply “Jack and Jill were at a crossroads,” with no preference violation.²
² In this paper, our term “metaphorical view” and the widely-used term “conceptual metaphor” can be taken to be synonymous, though there are theoretical differences that need not detain us.

2.3 Indeterminacy in Purely Utterance-Based Interpretation

Most work on metaphor that attempts to show how, algorithmically, particular information can be extracted from metaphorical utterances takes a largely or wholly utterance-based approach, with little scope for context to guide the process. Within AI this is true not only of Wilks (1978) but also of the related system of Fass (1997) and the system of Martin (1990), though not of Hobbs’s (1990) approach. Approaches such as those of Wilks (1978) and Fass (1997) are utterance-based in the especially strong sense of seeking also to detect signs of metaphoricity early in the interpretation process and on the basis of the utterance itself. In their approaches, an attempt at metaphorical interpretation is only made if signs of metaphoricity, in the particular form of semantic preference violations, are detected. But it should be noted that an utterance-based approach need not have this early-detection characteristic: Martin’s system proceeds by investigating the literal interpretation and a range of possible metaphorical interpretations on a par with each other, without prior detection of signs of metaphoricity, finally making a choice on the developed representations by means of a scoring mechanism. Thus, the system can only be said to decide on metaphoricity late in the process, and is able to deal, to an interesting extent at least, with the issue of metaphorical utterances that also have a semantically plausible literal interpretation (and/or that have several competing metaphorical interpretations).

But any largely or fully utterance-based approach faces the problem that a given utterance can have a large set of possible metaphorical interpretations. This has been a concern of many metaphor researchers (e.g., Stern, 2000). Naturally, this is the more true the more unconventional the metaphorical phraseology is. And it is perhaps especially a problem for systems that do not rely on previously known source-to-target mappings (such as, for example, those in Lakoff’s conceptual metaphors), but rather, as in the systems of Wilks and Fass, find a source/target analogy completely from scratch by comparing structures.

But even Martin’s system, which is based on having a stock of known source-target mappings (e.g., they could include the mappings needed for the LOVE AS JOURNEY conceptual metaphor), faces the multiple-interpretation problem even when the utterance does not involve any novel analogy between two domains. This is for several reasons:
1. The nature of the target domain may not be evident from the utterance itself, as in “Jack and Jill were at a crossroads” above. There may be several possible mappings to several different target domains from the source subject matter contained in the utterance. For example, the physical world is used as source domain for metaphorically talking about many different things – relationships, time, money, mental states, etc. An attempt could therefore be made to use context to select a target domain, but several possibilities might be supported by context.

2. Much metaphor is concerned with conveying evaluations, emotions and other affective attitudes about the target subject matter, rather than making cold propositional points about it (see, e.g., Mio, 1997; Musolff, 2004; Vervaeke & Kennedy, 2004). This happens by the transfer to the target of evaluations, emotions, etc. that are somehow related to source-domain items in the sentence. There are easy cases of this, where the “somehow related” is a very direct relationship and there is no real competition between different possible affective qualities. This would arise in, perhaps, “My son’s room is a cess-pit” where negative feelings about cess-pits are highly salient. However, it is trickier to infer the relevant affective quality in sentences like “Lisa is a pizza” (example from the previous section) and “Veronica is being chased by publishers.” In the Lisa example, the appropriate affective connotation in context is negative, even though pizzas have strong positive affective qualities for many people. In the publishers example, the appropriate affective connotation in context could well be positive even though physical chasing, which is presumably the source of the metaphor, might by default be taken to have a negative quality. Even with “cess-pit” the matter is not completely clear-cut, as cess-pits would be a major technological improvement for people who did not have sanitation at all. Such is the affective indeterminacy of metaphor that different contexts could make opposing, not just different, types of affect inferrable from one and the same metaphorical utterance.

3. The proliferation of possible interpretations is exacerbated by many linguistic factors, but most relevantly here by the fact that metaphorical views (conceptual metaphors) can be mixed together in the same sentence (Lakoff and Turner, 1989; Lee & Barnden, 2001a; Wilks, Barnden & Wang, 1991), and that metonymy and metaphor can be mixed together (Fass, 1997). An example of the mixing of metaphorical views is “This idea crystallized the nebulous mental meanderings that had plagued me”³ where mental processes are viewed as animate, gas-like and disease/germ-like. An example of the mixing of metaphor and metonymy is “The whole of my childhood rushed through my head like an electric train”⁴ where (arguably) it is not the childhood as such that metaphorically rushed through her mind, but rather the memories of her childhood, so that there is also a metonymic step from the childhood to the memories. This example could in principle be re-analysed as pure metaphor, with the mind viewed as a physical region that can contain not just ideas but also the entities those ideas are about. But the availability of this possibility just underscores the point about interpretations proliferating.

³ From Sheila Dyan, Love Bites, London: Hodder & Stoughton, 1992, p. 48.
⁴ Heard on the BBC Radio Four programme Desert Island Discs on 20 June 2003.

As we have already noted, many authors have suggested that contextual information is important in metaphor interpretation or have demonstrated by psychological experiment that context affects interpretation, and the underlying assumption is that context can guide metaphor interpretation in the appropriate direction. However, the mechanistic details of how context helps in processing have hardly been explored. We turn to these matters in the next two sections.
2.4 A Way in Which Context Can Help

The ATT-Meta approach supposes that the context of any utterance, including a metaphorical one, often creates queries in the mind of the understander, or, what comes to the same thing, puts certain specific issues in focus. The understander uses such queries or issues to control the interpretation of the utterance at hand. In this view, utterances are placed in discourse for specific purposes related to how they connect to other utterances, and the understander benefits by trying to divine what those purposes are. The matter is best conveyed by example.⁵ Suppose discourse contains the sentence
(1) “John is a tank”
and the understander does not know any metaphorical sense of the word “tank” that could apply to people, and therefore has to consider another sense of the word to guide the process of metaphorical understanding. For the sake of brevity, let us assume that that other sense is the military-tank sense. Even with this restriction, the sentence could be getting at a variety of different things, such as: John is square and heavy; John destroys things; John is tough; John tramples over things – where, moreover, the destruction, toughness or trampling could be physical or abstract.

A purely utterance-based approach to understanding (1) would blindly have to take qualities of military tanks, such as size, heaviness, inexorableness, ability to withstand attack, and powerfulness, and transfer them in some form to apply potentially to John. The hope would be to find that one or more of these transferred qualities could plausibly apply to him. But it is hardly likely that (1) would be uttered in a context that gave no independent clue as to what the sentence was trying to get at. More plausibly, (1) would appear in a specific context. A possibility is:
(1′) “Most of my colleagues get dispirited when they’re criticized, but John’s a tank.”
⁵ The material in the rest of this section is based on a section of Barnden et al. (2004).
The first clause here raises the issue of the ability to tolerate criticism. The word “but” suggests a contrast between the two clauses. The understander can therefore pose an internal query to him/her/itself such as:
(2a) Is John able to tolerate criticism well?
For definiteness, let’s assume that the understander knows an ARGUMENT AS WAR metaphorical view (Lakoff & Johnson, 1980), and has ready access to a mapping link between military attack-withstanding and criticism-tolerating. So, on encountering (2a) one thing that the understander can do is to use this mapping to translate (2a) into military terms:
(2b) Is John able to withstand military attack well?
The source-domain query (2b) can then be answered in the affirmative within the source domain using the datum that John is a tank.⁶ The information from the sentence that John is a tank, which is a military object, can conceivably be used proactively to boost the relevance of the above mapping link in preference to other mapping links, if any, that do address criticism-tolerating but that do not involve the military domain. In other work (Barnden, 2006) we cast doubt on the utility of domains in defining the very notions of metaphor and metonymy, and we therefore depart from the concentration on domains in approaches such as Lakoff’s. However, this does not preclude the possibility that pieces of information are tagged as being about particular rough domains such as warfare and that this information is used heuristically to facilitate metaphorical processing.

For our argument to carry through, it is not necessary for the context to raise the issue of toleration of criticism in quite such an explicit or precise fashion as the first clause of (1′) does. Rather, we make the general assumption that context provides relatively determinate information about what issue the utterer is addressing in the metaphorical utterance at hand, at least in those cases where that utterance itself is indeterminate about that. But of course it could happen that context does not raise the issue at all or only very implicitly raises one, or raises many issues, so that it is not clear which query or queries the understander should address. However, the situation even in these cases is no worse for the ATT-Meta account here than it is for metaphor theories which do not have any account at all of how context could help with metaphor understanding. Furthermore, when context fails to raise clear issues, a human understander would presumably be unsure about what a metaphorical utterance such as (1) was conveying. And there is no need to require context to suggest a definite answer to any query raised. Although context may suggest that one answer is more to be expected than another, the ATT-Meta approach does not assume this happens.
⁶ Use of the word “well” in (2a) is a symptom of a general phenomenon whereby metaphor is often about matters of degree rather than yes/no issues. We will see more of this in the discussion of example (3) below.
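As an illustration of how such a context-raised query can be translated across a known mapping link and then settled against the utterance’s ostensible content, here is a small, hypothetical Python sketch; the predicate names, the single mapping link and the miniature rule base are inventions of this illustration rather than ATT-Meta’s actual representations:

```python
# Illustrative only: a context-raised target-domain query such as (2a) is translated,
# via a mapping link of ARGUMENT AS WAR, into a source-domain query such as (2b),
# which the ostensible content of "John is a tank" can then answer.

# Mapping links: target-domain predicate -> source-domain predicate.
MAPPING_LINKS = {
    "tolerates-criticism-well": "withstands-military-attack-well",
}

# "Pretence" facts derived from the utterance's ostensible meaning.
PRETENCE_FACTS = {("is-a-military-tank", "John")}

def source_rules(query):
    """Source-domain rule: a military tank withstands attack well (backwards use)."""
    pred, arg = query
    if pred == "withstands-military-attack-well":
        yield ("is-a-military-tank", arg)      # sub-query generated by the rule's IF part

def prove_in_pretence(query):
    if query in PRETENCE_FACTS:
        return True
    return any(prove_in_pretence(sub) for sub in source_rules(query))

def answer_contextual_query(query):
    pred, arg = query
    source_pred = MAPPING_LINKS.get(pred)
    if source_pred is None:
        return None                            # no relevant mapping link known
    # Translate the query across the mapping link and reason within the pretence.
    return prove_in_pretence((source_pred, arg))

# Query (2a), raised by the first clause of (1'):
print(answer_contextual_query(("tolerates-criticism-well", "John")))   # -> True
```

Real ATT-Meta reasoning is of course defeasible and often degree-based (as the discussion of query (4a) below illustrates), rather than the simple true/false proof shown in this sketch.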
The interpretation process should in any case generally seek for evidence both for and against any particular hypothesis that arises. (This is because most information and reasoning that crops up in discourse interpretation is uncertain.) Therefore, even if context were to provide some evidence for a particular hypothesis, this could be overridden by contrary evidence coming from the sentence. Also, we do not need to assume that the issue-raising context arises before an utterance such as (1) – it could occur afterwards instead. It would be reasonable for an understander to adopt a strategy whereby if preceding context does not raise any discernible, specific issue the understander postpones full interpretation of the sentence until after examining succeeding context.

We have used invented tank examples in order to isolate the issues of interest, but appropriate examples can readily be found in real discourse. Real-discourse examples using the noun “rock” are analysed in detail in Barnden et al. (2004). One of these examples is: “Okay. My husband has always been very involved with the children, although he works a lot of hours. He spent more time than he usually does with them. Obviously, I wasn’t around, or I was sick, but he was a rock.” [italics added] Notice that in this example, just as in our tank example, the context raises a specific issue partly through the conjunction “but” (in “but he was a rock”). This appears to be a common phenomenon in metaphorical discourse. Some of Hobbs’s (1990) examples also rely on a “but.” And “but” occurs also in the following example from a magazine article,⁷ part of which we will proceed to analyze in detail:
(3) In the far reaches of her mind, Anne knew Kyle was having an affair, but “to acknowledge the betrayal would mean I’d have to take a stand. I’d never be able to go back to what I was familiar with,” she says. Not until eight months had passed and she finally checked the phone bill did Anne confront the reality of her husband’s deception.
⁷ In Linden Gross, “Facing up to the Dreadful Dangers of Denial,” Cosmopolitan, 216(3), USA ed., March 1994. Italics in original.
Let us assume that IDEAS AS PHYSICAL OBJECTS includes a correspondence between the physical operation on ideas, by the agent’s conscious self, with conscious mental operation on them by the agent. Thus (4a) can be transformed to create the query (4b) To what degree was Anne’s conscious self able to operate physically on the idea of Kyle having an affair? This query then controls the understanding, within the source domain, of the significance of the qualifier (4c) “In the far reaches of her mind” in (3). This qualifier indicates indirectly a very low level of ability by Anne’s conscious self to operate physically on the idea. This answer to (4b) is transferred, by the above-mentioned correspondence, to the target domain to become the conclusion that Anne only had a very low degree of ability to operate in a conscious mental way on the idea. This conclusion is the answer to query (4a). Qualifier (4c) indicates a very low level of ability by Anne’s conscious self to operate physically on the idea because, in applications of MIND AS PHYSICAL SPACE, the conscious self of the person is implicitly viewed as being a person located in a main part of the physical space, presumably distant from “far reaches.” This distance in turn implies a very low degree of ability of the conscious self to physically interact with the idea. This interpretation process is of the sort advocated in the ATT-Meta theoretical approach. In fact, given the original query (4a), the process is fully implemented in the ATT-Meta system. An overall picture of the major reasoning steps taken is given in Figure 2.1 (queries themselves are not shown, except for the top query, 4a). The process is explained in considerable detail in Barnden & Lee (2001). The “pretence space” in Figure 2.1 is a special computational environment in which inferential consequences of the ostensible meaning of the utterance (e.g., the proposition that Anne’s state of knowing really does have a physical location within her mind, which really is a physical space) can be teased out without risk of contaminating or being contaminated by reality. For purposes of the present chapter, the inference within the pretence space can be taken to be within the terms of the source subject-matter, and the reasoning within the reality space to be within the terms of the target subject-matter. However, there is in fact no restriction on what subject matter arises in each space. An important point to notice from the Figure, for the purposes of the next section, is that the reasoning can be split into three broad aspects: 1. the reasoning within the pretence space (i.e., that joins up the utterance’s ostensible meaning with the source side of one or more metaphorical mapping links); 2. the reasoning consisting the actual use of one or more mapping links; 3. and the reasoning in the reality space that connects the target sides of the utilized mapping links with the specific context-derived queries. This list implies no specific temporal ordering, however, for reasons that we do not go into here.
[Figure 2.1 appears here: a diagram showing the ostensible meaning of the utterance and the reasoning steps within the metaphorical pretence cocoon, the mapping step into the reality context, and the answers returned to the top query (4a).]

Fig. 2.1. Overall shape of processing of an initial part of the Anne/Kyle example, (3) in text. The thick arrow shows the action of mapping. The statements within the diagram are English glosses of expressions in ATT-Meta’s internal representation scheme
2.5 Context-Drivenness in ATT-Meta

The ATT-Meta system embodies a particularly strong way of using contextual issues that arise. The issues take the form of reasoning queries like the TOP QUERY in Figure 2.1 (expressed of course in a formal internal representation language), and these are used to generate a backwards-chaining reasoning process that eventually meets up with the ostensible meaning of the utterance in the pretence space. That is, given a query such as (4a), ATT-Meta looks for facts or IF-THEN rules in its knowledge base that could provide a value for the variables in the query (the degree variable in the case of 4a), or, if there are no variables, support or refute the hypothesis.
If it finds such rules, then their IF parts are used to create sub-queries, and so forth. Since metaphorical mapping links are themselves cast as IF-THEN rules, the creation of, say, query (4b) from (4a) fits naturally and flexibly into the backwards chaining process. We can call the overall process context-driven rather than merely context-based because of the backwards, query-directed nature of the reasoning. However, the ATT-Meta approach has always recognized that some measure of forward chaining from the utterance’s ostensible meaning may be needed or desirable (see, e.g., Barnden & Lee, 2001). Such processing could be called utterance-driven rather than just utterance-based. What mix of utterance-drivenness and contextual-issue-drivenness is desirable is partly a practical matter of how to connect the ostensible meaning to the contextual issues in the most effective and efficient way. Only much further empirical work on testing the approach on specific examples, with realistically sized knowledge bases, will reveal the best mix or suggest how the mix should be adjusted to suit different circumstances. The system and theoretical approach have been tested on many examples (see Barnden, 2001b, c; Barnden & Lee, 2001; Lee & Barnden, 2001b) but the knowledge bases used have not been large – for example, just under 60 rules or so were used in the experiments on the Anne/Kyle example (“In the far reaches of her mind, Anne knew Kyle was having an affair” from (3)).

Thus, it is worth emphasizing that there are two possible theses that should be distinguished:
1. Metaphor interpretation should, when possible, not just be utterance-based but also contextual-issue-based.
2. When metaphor interpretation has a contextual-issue-based aspect, this aspect should be handled through contextual-issue-driven processing rather than utterance-driven processing.
Much suggestive evidence can be provided that the context of metaphor does often raise specific issues and that these issues can effectively be used to guide interpretation, thereby supporting thesis 1. There is less evidence that thesis 2 is true even though the current version of the ATT-Meta system has adopted it as a working hypothesis. The overall ATT-Meta theoretical approach, on the other hand, is not committed to relying wholly on contextual-issue-driven processing.

There is nevertheless a general argument that suggests that contextual-issue-drivenness is often, and perhaps normally, the method of choice. The argument appeals to the conjecture that metaphor generally casts a relatively abstract and ill- and/or sparsely-understood subject matter (the target) in terms of a relatively concrete and well/richly-understood subject matter (the source) (see, e.g., Lakoff, 1993). We will call this the conjecture of source superiority on concrete richness. When this superiority exists, there will typically be more choices to make for any given reasoning step about the source subject matter than is the case for the target subject matter. The remainder of the argument is as follows, appealing to the three reasoning types listed at the end of the last section.
Because of the extra richness in the source subject-matter, it is more difficult to go forward in an utterance-driven way from the ostensible meaning of a metaphorical utterance towards the source side of mapping links (reasoning of type 1) than it is to go backward from the contextual queries to the target sides of mapping links (reasoning of type 3). In addition, in the contextual-issue-driven case, once a query has been transformed by going backwards over a mapping link to land within the source subject-matter, the sub-queries that thereby arise are from only a subset of all known mapping links, and furthermore these queries arise from specific applications of the links. By contrast, utterance-driven reasoning that is aimed at connecting up with source sides of mapping links does not know which mapping links will turn out to be useful, and has no specific applications of those mapping links to work with.

Having said this, it is still the case that, because source superiority on concrete richness is at best only a general tendency, there may be occasions on which it is practical to go forwards from ostensible meanings to mapping links without (yet) having any guidance from context. This is especially so to the extent that reasoning within sources may often be constrained by quite specific stereotypical “scenarios” of the type that Musolff (2004) has discussed (cf. scripts, etc. in AI). For instance, he argues that in political metaphors casting the EU as a marriage, there are standard scenarios of getting married, being unfaithful, or getting divorced etc. that tend to be used, rather than the full panoply of possible knowledge about marriage in general being deployed.

A further consideration is as follows. As we noted above, one purpose of many metaphors is to convey affect, e.g. evaluations of or emotions about the target situation described. It appears that such affective elements are often carried over identically in metaphor, irrespective of the particular metaphor-specific mapping links involved (such as a mapping link between lovers and travellers). Thus, strong disgust towards cess-pits carries over in “My son’s room is a cess-pit.” Now, given that affect often does carry over in this way and may indeed be part or all of the point of the metaphorical utterance, it makes some sense for an interpretation process to try to reason forwards from the utterance to find affective consequences, irrespective of the specific context at hand. We saw above, though, that conflicting affective consequences may arise from a given utterance, and used this as one argument about the dangers of purely utterance-based processing. Thus, there is a balancing act needed between the dangers of proliferating interpretations by not using contextual guidance in inferring affective qualities and the benefits of having affect as a useful source of guidance.

Affect is just one of a sizable set of properties that the ATT-Meta project takes to be carried over identically in metaphor. A provisional set of properties the ATT-Meta project has been working with is listed in Barnden & Lee (2001) and Barnden et al. (2003), and includes, for example, temporal structure, causation relationships, dis/enablement relationships, and purposes. Since the mapping principles involved are neutral with respect to any particular metaphorical view, they are dubbed as being view-neutral mapping adjuncts (VNMAs). What we will call here the “Affect VNMA” is the principle that, first of all, if the understander judges a source-domain
item S to have affective quality Q, and there is a target-domain item T corresponding to S, then T also has quality Q, by default. (All effects of VNMAs are defeasible.) A second aspect of the Affect VNMA is that if an agent A in the source-domain scenario judges source-domain item S to have affective quality Q, and both A and S have corresponding target-domain items B and T, then (by default) B judges T to have quality Q. VNMAs are loosely related to a hierarchy of invariant characteristics that Carbonell (1982) claimed tended to be carried over in metaphor, though VNMAs are not regimented in a hierarchy and are different in detail. The non-affect VNMAs, such as one that carries over causation relationships from source to target, could provide an end-point to within-pretence reasoning, much as the Affect VNMA could. Indeed, analysis of some examples of metaphor (Barnden, 2001b) suggests that in many cases most of the useful information coming from a metaphor is carried by VNMAs as opposed to view-specific mapping links. View-specific links can often be largely or wholly confined to providing a scaffold that allows the relevant applications of VNMAs. This phenomenon occurs in the next section.
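To make the idea concrete, here is a deliberately simplified sketch of the first aspect of the Affect VNMA as a defeasible default; the correspondence table, affect labels and “defeated” set are illustrative assumptions, not the ATT-Meta implementation:

```python
# Hedged illustration of the first aspect of the Affect VNMA: an affective quality Q
# judged of a source-domain item S carries over, by default, to the target-domain
# item T corresponding to S. All data structures here are invented for exposition.

# Correspondences set up by view-specific mapping links: source item -> target item.
correspondences = {"the-cess-pit": "sons-room"}

# Affective judgements made within the pretence (source) reasoning.
source_affect = {"the-cess-pit": "disgusting"}

# Explicit contrary evidence in the reality space can defeat the default transfer.
defeated = set()            # e.g. {"sons-room"} if context argues against the transfer

def apply_affect_vnma():
    """Default (defeasible) transfer of affect from source items to their targets."""
    target_affect = {}
    for source_item, quality in source_affect.items():
        target_item = correspondences.get(source_item)
        if target_item is not None and target_item not in defeated:
            target_affect[target_item] = quality
    return target_affect

print(apply_affect_vnma())   # -> {'sons-room': 'disgusting'}
```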
2.6 Gas Guzzler Tamed

In order further to illustrate the ATT-Meta approach, we address the central example in Wilks (1978), namely
(5) My car drinks gasoline.
It is also a prominent example in Fass (1997), and we use the treatment there, in the meta5 system, as a contrast to the way the ATT-Meta approach would proceed on the example. The meta5 system can interpret the sentence as meaning “My car uses gasoline” essentially by finding an analogical match, from scratch, between the following two knowledge items: animals drink liquids; cars use gasoline. The meta5 system has no prior knowledge of any particular metaphorical views.

However, we can arguably reanalyze the example more naturally within the ATT-Meta framework. Plausibly, ordinary English users possess a metaphorical view (conceptual metaphor) of MACHINES AS CREATURES (as indeed Fass, 1997:p.318, points out). Utterances such as “my radio is dead,” “my car has life in it still,” “he killed the engine,” “a middle-aged toaster,” and “an aggressive lawn-mower” are mundane and easily understandable. We assume that as part of the view, a machine’s running corresponds to a creature’s biological activity. This is the only view-specific mapping we need in order to be able to deal with (5). The proposed reasoning process is sketched in Figure 2.2. It has not been implemented in the ATT-Meta system, because it relies heavily on some VNMAs which have not yet been implemented.
[Figure 2.2 appears here: a diagram of the pretence cocoon containing the ostensible meaning of “My car (C) drinks gasoline” and the source-domain inferences drawn from it, together with the mapping and VNMA steps that transfer conclusions such as “gasoline helps C to run” and “C uses gasoline” into reality.]

Fig. 2.2. Showing how the approach could deal with the car-drinking-gasoline example, (5) in text. A thick arrow labelled VNMA or VNMAs shows the action of one or more VNMAs. A thick arrow marked with a circle shows the action of a mapping relationship specific to the particular metaphorical view, MACHINES AS CREATURES. The statements within the diagram are English glosses of expressions in ATT-Meta’s internal representation scheme

From the ostensible meaning of the utterance and source-domain general knowledge, it can be (defeasibly) inferred in source-domain terms that gasoline helps the car to be alive (biologically). But,
by default, being alive enables the creature to be biologically active. Therefore, gasoline helps the car to be biologically active. Since being-biologically-active maps to machine-running as mentioned above, and as helping is mapped across identically by a VNMA (the one that also deals with causation, enablement, ability, etc.), the pretence-space hypothesis that the gasoline helps the car to be biologically active can be transformed into the reality-space hypothesis that the
gasoline helps the car to run. Further reasoning sketched in Figure 2.2 could now produce the conclusion that the car uses [up] gasoline. The production of these target-domain conclusions does not require any mapping of the drinking itself to be created. Of course, an understander could go on to do the extra work of mapping drinking itself to, say, the process of a car having gasoline put in it or of the engine using the gasoline. Furthermore, to say that the drinking maps to using, as opposed to something else such as having gasoline put in, seems an unwarranted extra commitment. The ATT-Meta approach does not make that commitment since the inference to the car-uses-gasoline proposition is defeasible (because the approach is based on uncertain reasoning), and defeat of it would still leave us with the useful proposition that gasoline helps the car to run.

As pointed out above, Wilks (1978) himself says that a person reading the sentence would get a connotation that the car uses a great deal of gasoline, not just uses it, and says that this connotation is not accounted for by the process he outlines. Fass (1997:p.192) himself makes a similar brief observation. For purposes of applying the ATT-Meta approach, we can frame the desired connotation as being that the car uses gasoline relatively quickly. Compare “the blotting paper drank up the ink.” Also the entries in Webster’s Third New International Dictionary suggest at least moderate rapidity of ingestion. But, because an act of drinking is (normally) moderately fast, on a rate-of-change scale concerning ordinary human activities, a use of a VNMA that deals with temporal rate (Barnden & Lee, 2001; Barnden et al., 2003) would allow the ATT-Meta approach to conclude that the car’s use of gasoline is (probably) moderately fast, relative to the normal speed of consumption.

In order for meta5 to be able to come up with the connotation that the car’s consumption is relatively quick, it seems that it would need to already have in the target domain a representation of cars using gasoline relatively quickly, because otherwise there would be nothing in the target domain to be found to be analogous to the source-domain situation. But surely the point of the connotation in question is the exceptional circumstance that the particular car mentioned in the sentence uses gasoline relatively quickly.

The example, as treated by ATT-Meta, provides an instance of the point made at the end of the previous section about the importance of VNMAs as opposed to view-specific mapping links. In Figure 2.2, most of the pretence-to-reality transfer (i.e., source to target transfer, in effect) is done by VNMA applications. The only view-specific transfer is that of being-biologically-active to mechanically-running, and no special, extra correspondence apart from this needs to be discovered in order to interpret the sentence.

This section has not yet explained the way contextual-issue-basedness or contextual-issue-drivenness could come in. If someone said to you “My car drinks gasoline” in the absence of any prior context, you could well be forgiven for not being sure what the person was getting at. Potentially, the world-knowledge that cars use gasoline could by itself generate a contextual-issue-driven process conforming to Figure 2.2. However, it is surely more likely that there would be prior linguistic context, as perhaps in “I had to fill up this morning again. My car drinks gasoline.”
(And it would be reasonable to expect there to be intonational emphasis on the “drinks,” or for the sentence to be varied to “My car really/just drinks gasoline.”) The first sentence has the implication that the car has used a lot of gasoline recently. This could be framed as an issue to guide the interpretation of the second, metaphorical, sentence. As “using a lot recently” is tantamount to “using up fast,” we have a suggestion here about how the interpretation could be guided.

Another possible scenario, involving a change in the example, is if person A says “My car uses leaded gasoline” and B replies “Oh, mine drinks unleaded [gasoline].” Assuming that a contrast is hinted at here, the issue of whether B’s car uses leaded gasoline or not could be raised to guide the interpretation of the second sentence. Potentially, this could lead to investigation of the possibility that B’s car uses unleaded gasoline, as this would imply that it probably does not use leaded. The issue of speed of consumption could well not be raised.

Finally, although the overall ATT-Meta approach to the car-drinks-gasoline example is very different from the Wilks and Fass accounts, preference violations could still give a clue as to what mappings are involved. Thus, the fact that “drinks” prefers an animate creature as agent suggests that the car is being viewed as a creature, in turn suggesting the involvement of the metaphorical view of MACHINES AS CREATURES.
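For concreteness, the chain of reasoning glossed above and in Figure 2.2 can be caricatured as follows; the rule contents, fact tuples and the single mapping function are invented glosses of the chapter’s prose rather than ATT-Meta code:

```python
# Illustrative trace of the Figure 2.2 reasoning: source-domain inference inside the
# pretence, one view-specific mapping (biologically-active -> running), and VNMAs
# carrying "helps" and the rate judgement across unchanged.

pretence = {("drinks", "C", "gasoline")}   # ostensible meaning: the car C drinks gasoline

def source_inference(facts):
    """Defeasible source-domain (creature) rules, applied forwards inside the pretence."""
    derived = set(facts)
    if ("drinks", "C", "gasoline") in derived:
        derived.add(("helps", "gasoline", ("alive", "C")))
        derived.add(("rate", ("ingest", "C", "gasoline"), "moderately-fast"))
    if ("helps", "gasoline", ("alive", "C")) in derived:
        derived.add(("helps", "gasoline", ("biologically-active", "C")))
    return derived

def map_state(state):
    """The single view-specific mapping link of MACHINES AS CREATURES used here."""
    return ("runs", "C") if state == ("biologically-active", "C") else None

def transfer_to_reality(pretence_facts):
    reality = set()
    for fact in pretence_facts:
        if fact[0] == "helps" and map_state(fact[2]):
            # "helps" itself is carried over identically by a VNMA.
            reality.add(("helps", fact[1], map_state(fact[2])))
        if fact[0] == "rate":
            # Temporal-rate VNMA: the rate judgement carries over to gasoline use.
            reality.add(("rate", ("use", "C", "gasoline"), fact[2]))
    return reality

print(transfer_to_reality(source_inference(pretence)))
# -> a set (order may vary) containing:
#    ('helps', 'gasoline', ('runs', 'C'))
#    ('rate', ('use', 'C', 'gasoline'), 'moderately-fast')
```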
2.7 An Overall Picture

By way of a conclusion, the overall picture of how to deal with different types of metaphorical utterances that preceding parts of this chapter suggest is as follows.

(A) Stock metaphorical phraseology (completely fixed conventional metaphor). Such phraseology can have its target-domain meanings listed in a lexicon, as has often been observed in the literature. Included here are not only particular words and phrases (potential examples are “see” in the sense of understand, “probe into” in the sense of abstractly investigate, and “build castles in the air,” with of course inflection of the verbs allowed) but also templates that have internal gaps that need to be filled (e.g., “in the recesses of [someone’s] mind” and “at the back of [someone’s] mind”). Assuming that such stock items are listed in the understander’s lexicon, there is no need to detect any metaphoricity, or to detect preference-breaking or other anomalies, although a phrase may in fact contain some anomaly such as violating semantic preferences of an included word or phrase. The recesses/back-of-mind cases are examples of this.

(B) Minor, open-ended variants of stock metaphorical phraseology obtained, for example, by (i) replacement of words by synonyms and (ii) adding modifiers such as adjectival/adverbial words or phrases. With reference to the examples in category (A), an illustration of (i) would be the various possibilities in “construct/erect/elevate castles in the air,” and illustrations of (ii) would be “see dimly,” “probe deeply into,” “in the dark, murky recesses of [someone’s] mind” and “at the very back of [someone’s] mind.”
Moon (1998) provides an extensive corpus-based study and discussion of such and other forms of variation. Something to be stressed here is the open-ended quality of the variation: any way of, for example, conveying the activity of building (not only by a single verb, but perhaps by a creatively constructed phrase) could replace the verb “build” in the castles example, and any way of conveying visibility problems could be used in place of the “dark, murky” modifier in the recesses example to get a similar effect. Such variation would not tax the understanding powers of a competent speaker of English. Naturally, a particular variant of a stock item might itself happen to be a stock item in its own right, but not all possible variants can be.

Given the open-endedness, which prevents the lexicon-listing approach used for category (A) being applied to the variants in general, it is typically necessary to reason about the effect of the variations in terms of the source domain. That is, special metaphorical processing is needed, perhaps of ATT-Meta style, to some appreciable extent. The modifiers in variants include ones like “very” that are neutral as to which metaphorical view is involved, whereas modifiers like “dark” are specific to certain source subject-matters. But even the use of “very” in “the idea was at the very back of [someone’s] mind” requires metaphorical processing that involves the source domain. Certainly, the adjective “very” introduces a general intensifying atmosphere, but it is important to work out exactly what it is that is intensified: namely, the subsidiarity of the mentioned idea. The intensification is not of the other important aspect of the meaning of the phrase, namely that the idea is still present to conscious thought in some way. It is difficult to see how to get the correct intensification without reasoning in terms of the source domain.

In order for the understander to realize that the special metaphorical processing for category (B) is needed, it would be advantageous for the understander to detect the metaphoricity when this can readily be done. And indeed, given that the discussed variations of subtype (B)(ii) (modifier addition) are syntactically minor, even when added modifiers are in themselves syntactically complex, metaphoricity detection is likely not to be a major problem for subtype (B)(ii), as modifiers can be stripped out for purposes of comparison to stock items. It is less clear how to deal with (B)(i), however, especially when a word in the stock item has been replaced by a complex phrase. But, assume that to some useful extent it is possible to determine underlying stock items from encountered variants. Then, if stock items in the lexicon have their metaphorical senses tagged as being metaphorical, the metaphoricity of the variant is a default inference from the metaphoricity of the underlying stock item. So, processes of stripping out modifiers, looking for synonyms, etc. could uncover metaphoricity for variation types (B)(ii) and some cases of (B)(i). Clearly, though, preference violations would be a useful adjunct for determining metaphoricity, especially when the underlying stock item cannot readily be determined.
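A toy sketch of that stripping-and-lookup idea, with an invented one-entry lexicon and a crude modifier list, might look like this:

```python
# Toy illustration: recover an underlying stock template from a type (B)(ii) variant
# by dropping adjectival/adverbial modifiers, then inherit the template's
# metaphoricity tag and mapping-view annotations by default. The lexicon entry and
# modifier list are invented for exposition.

STOCK_TEMPLATES = {
    ("in", "the", "recesses", "of", "X's", "mind"): {
        "metaphorical": True,
        "views": ["MIND AS PHYSICAL SPACE"],
    },
}

MODIFIERS = {"dark", "murky", "far", "very", "deep"}   # crude, illustrative list

def strip_modifiers(tokens):
    return tuple(t for t in tokens if t not in MODIFIERS)

def underlying_stock_item(tokens):
    """Return the lexicon entry of the recovered stock template, if any."""
    return STOCK_TEMPLATES.get(strip_modifiers(tokens))

variant = ("in", "the", "dark", "murky", "recesses", "of", "X's", "mind")
print(underlying_stock_item(variant))
# -> {'metaphorical': True, 'views': ['MIND AS PHYSICAL SPACE']}
```

A real understander would of course need much richer matching than this, for example synonym replacement to handle type (B)(i) variants and some syntactic normalization, as the surrounding discussion indicates.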
In addition, stock items in the lexicon could have their metaphorical senses annotated with information about what mapping links are involved. Then, when it is noticed that an expression in discourse is a variant of a stock item, the mapping links can be accessed from the lexicon by the metaphorical processing, without their needing to be worked out afresh.

Type (A) itself allows some variability by means of holes in templates. However, the variability in (A) is a matter of obligatory choice of fillers for specific holes, with the non-hole parts absolutely fixed aside from morphological inflection, whereas the variability in (B) is optional, relatively unregimented syntactically, and generally applicable to all elements of the base phrase.

(C) Metaphorical utterances that do not fit in (A) or (B) but nevertheless can be analysed as relying on familiar metaphorical views, i.e. familiar sets of mappings. A central form of this phenomenon can be called map-transcendence. Map-transcendence arises when the source-domain scenario involves an element that is not itself directly mappable by any mapping the understander knows, even though some metaphorical view the understander knows is involved in the utterance. Map-transcendence is similar to the notion of metaphor that exploits “unused” parts of the source domain (Lakoff & Johnson, 1980; and see discussion in Grady, 1997), and is sometimes referred to as “extension” of metaphor. Relatively minor forms of map-transcendence are exhibited by the cases of category (B) that involve, for example, source-domain synonyms of words in the stock item or source-domain-specific modifiers, where the synonyms or modifiers are not mappable by a known mapping. But category (C) includes also a more thorough-going open-endedness of metaphorical phraseology, which is able to exploit the resources of the source domain more fully. For example, “in the far distant reaches of [someone’s] mind” may not be a stock template or a variant of one, and the “reaches” of a mind may not be mappable by any known mapping. However, the utterance rests on a very familiar view of a mind as a physical region. Another example is “Company A gobbled up company B and spat all its managers out,” resting on a familiar view of companies as voracious creatures but creatively transcending it by bringing in the spitting-out.

For map-transcending metaphor, reasoning via the source domain is needed, and again the ATT-Meta approach is a candidate. Although metaphoricity may in general only be finally decided late in processing, heuristic hints such as preference violations, if present, could be a useful guide as to whether the metaphorical processing is likely to be profitable. In addition, preference violations could give a clue as to what mapping links are involved. We saw in the car-drinking example that the fact that “drinks” prefers an animate creature as agent leads to the suggestion that the metaphorical view of MACHINES AS CREATURES is involved.

(D) Completely novel metaphor. That is, metaphor that does not rely on known metaphorical mappings (other than metaphor-unspecific mapping principles
such as VNMAs – see the section above on context-drivenness in ATT-Meta). Completely novel metaphor is probably quite rare, even in poetry (Lakoff & Turner, 1989), but it needs to be accounted for. We can make the following three observations. (i) A first conjecture about completely novel metaphors is that a large proportion of them are image metaphors, i.e. metaphors that merely consist of a similarity of shape or visual appearance. By extension we could include here other types of perceptual similarity, such as acoustic. Of the examples of metaphor in Goatly (1997) that I judge to be novel, many are, plausibly, image metaphors. An example might be "This pencil is a snake" if the pencil is flexible and wiggly, or "My car is a shrub" if the car is covered with branches and leaves after an off-road driving experience. It would appear that such metaphor needs to be dealt with by special perception-aware similarity processing, and is quite likely not to have any meaning other than the perceptual similarity itself. But one unresolved problem is how to distinguish image metaphor from non-image metaphor, if this is possible at all, in general. For example, a car could be a decrepit one stationed in a garden, and "My car is a shrub" could potentially be used to get at the point that birds build nests in it, without the car having much perceptual similarity to a shrub. Again, it could all be a matter of the issues that arise in the context. In addition, there is no reason why a metaphor should not combine a perceptual-similarity aspect with other aspects. (ii) Putting aside the special case of image metaphor, the existing computational approaches that are most relevant to completely novel metaphor are ones like those of Wilks (1978), Fass (1997), an aspect of Hobbs's (1990) proposal, and analogy-finding models such as SME (Falkenhainer, Forbus & Gentner, 1989) that discover structural analogies, rather than more holistic perceptual similarities, between source and target knowledge structures from scratch. However, if it is right that much of the point of a metaphor is often carried by VNMAs, then there is no reason to think that this does not apply to completely novel metaphor in particular. For example, with "my car is a rotten banana," one (defeasible) source-domain inference that might be drawn is that the banana is disgusting. The affect VNMA would then lead to the (defeasible) inference that the car is disgusting in some way. In suitable contexts this could be an appropriate, and perhaps even a highly important, inference. If VNMAs are important for novel metaphor, then it follows that establishing complex structural analogies between source and target is concomitantly less important. (iii) The processing of completely novel metaphor is likely to benefit on many occasions from detection of preference violations, as it is reasonable to think that the metaphor will often bring together widely disparate subject matters. The above four categories of metaphor, (A) to (D), form an overall, rough framework for seeing how different types of metaphor handling may fit into an overall picture of metaphor. The division into types of metaphor is not itself remarkable, and is not greatly different from categorizations produced by other
authors (e.g., Goatly, 1997). The point to note is the particular association of algorithmic methods to the categories. The categorization is language-user-relative in the sense that where a particular utterance lies in the framework is partly dependent on what the particular utterer or understander has in his/her/its lexicon and what metaphorical mappings he/she/it possesses. For example, whether a particular expression is common enough to count as a "stock" item, and what degree of stock-ness is used to warrant inclusion in a lexicon, are user-relative matters. The framework accords varying degrees of usefulness to preference-violation detection. This usefulness is potentially for two things: (1) noticing that there is metaphoricity, and hence for giving more weight than there might otherwise be for special metaphorical processing such as the analogy-finding of Wilks or Fass or the inferential processes of ATT-Meta; and (2) suggesting the specific metaphorical mappings that may be in play. Purpose (1) is potentially important for metaphor categories (C) and (D) and to a lesser extent for (B), but is not important for (A). Purpose (2) is often important when purpose (1) is. It could also be important for category (B) if the lexicon entries in question did not specify mapping links or state what rough domains or semantic types the mappings are between. But it should not be forgotten that heuristics other than preference violation are potentially helpful for metaphoricity detection as well. For instance, Goatly (1997) lists a number of morphological, lexical and general phraseological clues to metaphoricity.
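As a rough illustration of purposes (1) and (2), the sketch below shows how a selectional-preference check on the car-drinking example might both raise the suspicion of metaphoricity and suggest a candidate mapping. The preference table, the toy type assignments and the view table are invented for the example; they do not reproduce the Preference Semantics or ATT-Meta machinery.

```python
# Toy preference-violation detector: all tables below are invented.

AGENT_PREFERENCES = {"drink": "animate"}                   # "drink" prefers an animate agent
SEMANTIC_TYPE = {"car": "machine", "student": "animate"}   # toy type assignments
VIEW_HINTS = {("machine", "animate"): "MACHINES AS CREATURES"}

def check_agent(verb, agent):
    preferred = AGENT_PREFERENCES.get(verb)
    actual = SEMANTIC_TYPE.get(agent)
    if preferred is None or actual == preferred:
        return None                                         # preference satisfied: no hint
    # Purpose (1): a violation is weak evidence that metaphorical processing may pay off.
    # Purpose (2): the (actual, preferred) type pair suggests which mapping might be in play.
    return VIEW_HINTS.get((actual, preferred))

print(check_agent("drink", "student"))   # None
print(check_agent("drink", "car"))       # MACHINES AS CREATURES (a hint, not a verdict)
```

The returned view is only a heuristic hint, to be weighed against the other clues to metaphoricity mentioned above.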
Acknowledgments This research was supported in part by grants GR/M64208 and EP/C538943/1 from the Engineering and Physical Sciences Research Council (EPSRC) of the UK, and grant RES-328-25-0009 from the Economic and Social Research Council and EPSRC. I have benefited from collaboration with the other members of the Figurative Language Research Group at the School of Computer Science, University of Birmingham, UK – Sheila Glasbey, Mark Lee, Alan Wallington and Jane Zhang – and with former colleagues Steve Helmreich, Eric Iverson and Gees Stein, as well as Yorick Wilks, while I was at the Computing Research Laboratory, New Mexico State University. During that period the precursor to the metaphor research represented in this article was supported by NSF grant IRI-9101354.
References
Ballim, A., Wilks, Y. & Barnden, J. (1990). Belief ascription, metaphor, and intensional identification. In S.L. Tsohadzidis (Ed.), Meanings and Prototypes: Studies in Linguistic Categorization. New York: Routledge, Chapman & Hall. pp. 91–131. Ballim, A., Wilks, Y. & Barnden, J.A. (1991). Belief ascription, metaphor, and intensional identification. Cognitive Science, 15(1), 133–171. Barnden, J.A. (1998). Combining uncertain belief reasoning and uncertain metaphor-based reasoning. In Procs. Twentieth Annual Meeting of the Cognitive Science Society, pp. 114–119. Mahwah, N.J.: Lawrence Erlbaum Associates.
Barnden, J.A. (2001a). Uncertainty and conflict handling in the ATT-Meta context-based system for metaphorical reasoning. In V. Akman, P. Bouquet, R. Thomason & R.A. Young (Eds), Procs. Third International Conference on Modeling and Using Context, pp. 15–29. Lecture Notes in Artificial Intelligence, Vol. 2116. Berlin: Springer. Barnden, J.A. (2001b). Application of the ATT-Meta metaphor-understanding approach to selected examples from Goatly. Technical Report CSRP–01–01, School of Computer Science, The University of Birmingham, U.K. Barnden, J.A. (2001c). Application of the ATT-Meta metaphor-understanding approach to various examples in the ATT-Meta project databank. Technical Report CSRP–01–02, School of Computer Science, The University of Birmingham, U.K. Barnden, J.A. (to appear). Metaphor and artificial intelligence: Why they matter to each other. To appear in R.W. Gibbs, Jr. (Ed.), Cambridge Handbook of Metaphor and Thought, Cambridge University Press. Barnden, J.A. (2006). Metaphor and metonymy: A practical deconstruction. Technical Report CSRP–06–1, School of Computer Science, The University of Birmingham, U.K. Barnden, J.A., Glasbey, S.R., Lee, M.G. & Wallington, A.M. (2002). Reasoning in metaphor understanding: The ATT-Meta approach and system. In Procs. 19th International Conference on Computational Linguistics, pp. 1188–1193. San Francisco: Morgan Kaufman. Barnden, J.A., Glasbey, S.R., Lee, M.G. & Wallington, A.M. (2003). Domain-transcending mappings in a system for metaphorical reasoning. In Conference Companion to the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), pp. 57–61. Association for Computational Linguistics. Barnden, J.A., Glasbey, S.R., Lee, M.G. & Wallington, A.M. (2004). Varieties and directions of inter-domain influence in metaphor. Metaphor and Symbol, 19(1), 1–30. Barnden, J.A. & Lee, M.G. (2001). Understanding open-ended usages of familiar conceptual metaphors: An approach and artificial intelligence system. Technical Report CSRP–01–05, School of Computer Science, The University of Birmingham, U.K. Cameron, L. (1999). Operationalising “metaphor” for applied linguistic research. In L. Cameron & G. Low (Eds.), Researching and Applying Metaphor, pp. 1–28. Cambridge, U.K.: Cambridge University Press. Carbonell, J.G. (1982). Metaphor: an inescapable phenomenon in natural-language comprehension. In W. Lehnert & M. Ringle (Eds.), Strategies for Natural Language Processing, pp. 415–434. Hillsdale, N.J.: Lawrence Erlbaum. Carston, R. & Wilson, D. (2005). Metaphor and relevance: The “emergent property” issue. Talk delivered at New Directions in Cognitive Linguistics: First UK Cognitive Linguistics Conference, University of Sussex, Brighton, U.K., October 2005. Falkenhainer, B., Forbus, K.D. & Gentner, D. (1989). The structure-mapping engine: algorithm and examples. Artificial Intelligence, 41(1), 1–63. Fass, D. (1997). Processing Metaphor and Metonymy. Greenwich, Connecticut: Ablex. Fass, D. C. & Wilks, Y. (1983). Preference semantics, ill-formedness, and metaphor. J. Association for Computational Linguistics, 9(3&4), 178–187. Gibbs, R.W., Jr. & Tendahl, M. (2006). Cognitive effort and effects in metaphor comprehension: Relevance theory and psycholinguistics. Mind and Language, 21(3), 379–403. Giora, R. (1997). Understanding figurative and literal language: The graded salience hypothesis. Cognitive Linguistics, 8(3), 183–206. Goatly, A. (1997). The Language of Metaphors. London and New York: Routledge. Grady, J.E. 
(1997). THEORIES ARE BUILDINGS revisited. Cognitive Linguistics, 8(4), 267–290.
Hobbs, J.R. (1990). Literature and Cognition. Stanford University, CA: CSLI Press. Katz, J. & Fodor, J. (1963). The structure of a semantic theory. Language, 39, 170–210. Lakoff, G. (1993). The contemporary theory of metaphor. In A. Ortony (Ed.), Metaphor and Thought, 2nd edition, pp. 202–251. New York and Cambridge, U.K.: Cambridge University Press. Lakoff, G. & Johnson, M. (1980). Metaphors We Live by. Chicago: University of Chicago Press. Lakoff, G. & Turner, M. (1989). More than Cool Reason: A Field Guide to Poetic Metaphor. Chicago: University of Chicago Press. Lee, M.G. & Barnden, J.A. (2001a). Reasoning about mixed metaphors with an implemented AI system. Metaphor and Symbol, 16(1&2), 29–42. Lee, M.G. & Barnden, J.A. (2001b). Mental metaphors from the Master Metaphor List: Empirical examples and the application of the ATT-Meta system. Technical Report CSRP–01–03, School of Computer Science, The University of Birmingham, U.K. Leezenberg, M. (1995). Contexts of metaphor. ILLC Dissertation Series, 1995–17, Institute for Language, Logic and Computation, University of Amsterdam, The Netherlands. Martin, J.H. (1990). A Computational Model of Metaphor Interpretation. San Diego, CA: Academic Press. Mio, J.S. (1997). Metaphor and politics. Metaphor and Symbol, 12(2), 113–133. Moon, R. (1998). Fixed Idioms and Expressions in English. Oxford, U.K.: Clarendon Press. Musolff, A. (2004). Metaphor and Political Discourse: Analogical Reasoning in Debates about Europe. Basingstoke, UK: Palgrave Macmillan. Peleg, O., Giora, R. & Fein, O. (2001). Salience and context effects: Two are better than one. Metaphor and Symbol, 16(3&4), 173–192. Sperber, D. & Wilson, D. (1995). Relevance: Communication and Cognition, 2nd edition. Oxford: Blackwell. Stern, J. (2000). Metaphor in Context. Cambridge, MA and London, UK: Bradford Books, MIT Press. Vervaeke, J. & Kennedy, J.M. (2004). Conceptual metaphor and abstract thought. Metaphor and Symbol, 19(3), 213–231. Wilks, Y. (1975). A preferential, pattern-seeking, semantics for natural language inference. Artificial Intelligence, 6, 53–74. Wilks, Y. (1978). Making preferences more active. Artificial Intelligence, 11, 197–223. Wilks, Y., Barnden, J. & Wang, J. (1991). Your metaphor or mine: belief ascription and metaphor interpretation. In Procs. 12th Int. Joint Conf. on Artificial Intelligence (Sydney, Australia, Aug. 1991), pp. 945–950. San Mateo: Morgan Kaufmann. Wilks, Y., Barnden, J. & Wang, J. (1996). Your metaphor or mine: belief ascription and metaphor interpretation. In B.H. Partee & P. Sgall (Eds.), Discourse and Meaning: Papers in Honor of Eva Hajičová, pp. 141–161. Amsterdam/Philadelphia: John Benjamins. Zhang, L., Barnden, J.A., Hendley, R.J. & Wallington, A.M. (2006). Exploitation in affect detection in improvisational e-drama. In Procs. 6th International Conference on Intelligent Virtual Agents. Lecture Notes in Computer Science, 4133, 68–79. Springer.
3 Towards a New Generation of Language Resources in the Semantic Web Vision
Nicoletta Calzolari
Istituto di Linguistica Computazionale del CNR, Pisa, Italy
Abstract:
In this contribution I touch on issues related to: language resources (LR) and semantics, dynamic resources automatically acquired, and how to go for a new generation of LRs compliant with the Semantic Web (SW) vision, pointing at the potentialities and the need for cross-fertilisation between the two communities of Human Language Technology (HLT) and SW/ontologies. Many of these issues are related to Yorick's work on preferences, lexicons, semantic annotation, and recently to his ideas on the relation between HLT and SW.
Large scale LRs are unanimously recognised as the necessary infrastructure underlying language technology (LT) (Varile and Zampolli (eds.) 1997). Discussing a few major European initiatives for building harmonised LRs, I highlight how computational lexicons and textual corpora should be considered as complementary views on the lexical space, in the perspective of modelling a new type of resource which is both a lexicon and a corpus together. A "complete" computational lexicon should incorporate and represent our "knowledge of the world". I claim that it is theoretically impossible to achieve completeness within any "static" lexicon. Moreover, choices on the syntagmatic axis are pervasive in language. A sound language infrastructure must encompass both "static" lexicons, as the traditional ones, and "dynamic" systems able to enrich the lexicon with information acquired on-line from large corpora, thus capturing the "actually realised" potentialities, the large range of variation, and the flexibility inherent in the language as it is used. These are the challenges for semantic tagging, which is at the core of the SW vision of giving meaning, in a manner understandable by machines, to the content of Web documents.
Broadening our perspective into the future, the need for more and more "knowledge intensive" large-size LRs for effective content processing requires a change in the paradigm, and the design of a new generation of LRs, based on open content interoperability standards. The SW notion may be helpful in determining the shape of the LRs of the future, consistent with the vision of an open distributed space of sharable knowledge available on the Web for processing.
The approach to realise the necessary world-wide linguistic infrastructure requires coverage not only of a range of technical aspects, but also – and maybe most critically – of a number of organisational aspects. An essential aspect for ensuring an integrated basis is to enhance the interchange and cooperation among many communities that act now separately, such as LR and LT developers, Terminology, Semantic Web and Ontology experts, content providers, linguists and so on. This is one of the challenges for the next years, for a usable and useful "language" scenario in the global network.
3.1 Language Resources: A Peep into the Past
3.1.1 The Establishment of LRs as a 'Recognised' Field Within HLT
Even if LRs (in the widest sense, i.e. spoken, written and multi-modal resources) have a rather short history, they are nowadays recognised as being among the pillars of HLT. After many years of complete disregard – or even disdain and contempt – for LRs, due mainly to the prevalence and influence of the generativist school (see also Wilks 1975a, for an ante litteram claim concerning the need to "process realistic text sentences" and the need to "cope with the words of a normal vocabulary", as intended by his "Preference Semantics"), LRs have acquired larger and larger resonance in the last two decades, when more and more activities, both at the European level and world-wide, have contributed to substantial advances in knowledge and capability of how to represent, create, acquire, access, exploit, harmonise, tune, maintain, distribute, etc. large lexical and textual repositories. Core – and often large – lexical and textual repositories have been and are being built for many languages. Many of these came into existence in European projects – e.g. the Parole1 (Zampolli 1997, Ruimy et al. 1998) and Simple2 (Lenci et al. 2000, Ruimy et al. 2003) projects, or EuroWordNet3 (Vossen 1998), in which Yorick Wilks' group participated for the English language –, and continued in National Projects, thus creating the necessary platform for a future European LR infrastructure. European researchers have played an outstanding role in these initiatives, and the vision, strategic planning, and activities of Antonio Zampolli – to whom we also owe the term "language resources" – have been crucial not only to the establishment in Europe of many LR projects, among which is the Enabler4 network of national projects (Zampolli et al. 2000), but also to the recognition of the need to build what is now commonly called an infrastructure of LRs (Calzolari and Zampolli 1999). In Europe an essential role was played by the European Commission (EC) through a number of initiatives, many of which saw the participation of both the Sheffield and the Pisa groups, linked over the years by sharing common approaches and visions. The main lines of action related to LRs covered three aspects, which are strictly implied by the infrastructural role of LRs, i.e. (i) the need to base LR building on commonly accepted standards, (ii) the need to build a core set of LRs, designed in a harmonised way, for all the EU languages, (iii) the need to define a global strategy in the field, and to make the LRs which are created available to the community at large, i.e. the need for a distribution policy.
1 http://www.hltcentral.org/projects/detail.php?acronym=PAROLE
2 http://www.hltcentral.org/projects/detail.php?acronym=SIMPLE; http://www.ilc.cnr.it/clips/CLIPS_ENGLISH.htm
3 http://www.illc.uva.nl/EuroWordNet
4 http://www.enabler-network.org/index.htm
3.1.1.1 Standards Design (Projects such as EAGLES and ISLE, IMDI, Intera, Lirics) The value of agreeing on international standards was recognised in the early 90s, and I claim that this can be seen as a sign of maturity of the field. The importance of designing standards for LRs was and still is a pillar of Pisa activities since the time when Zampolli clearly envisioned the need for it and started the EAGLES, then ISLE initiatives (Calzolari and Zampolli 2003, Zampolli 2003). EAGLES5 (Expert Advisory Group for Language Engineering Standards) was a long-standing European initiative, carried out through a number of subsequent projects funded by the EC since 1993. ISLE6 (International Standards for Language Engineering), a continuation of EAGLES, was a transatlantic standards oriented initiative, under the EC-NSF HLT programme within the EU-US International Research Cooperation, in which also a number of Asian colleagues have actively – albeit voluntarily – participated. The Multilingual Computational Lexicons Working Group of ISLE proposed – as a lexicon standard – the MILE (Multilingual ISLE Lexical Entry) (Calzolari et al. 2003), a general schema for the encoding of multilingual lexical information, to be intended as a meta-entry, acting as a common representational layer for multilingual lexical resources. This standard is based on a very extended survey of common practices in lexical encoding, and is the result of cooperative work towards a consensual view, carried out by several groups of experts worldwide. Both EAGLES and ISLE – in contrast to the TEI – stressed the importance of reaching consensus on (linguistic and non-linguistic) “content”, in addition to agreement on formats and encoding issues, and began to address also the needs of content processing and Semantic Web technologies. The recommendations for standards and best practices issued within these projects are now going to become, through the Intera7 and mainly the Lirics8 project, ISO international Standards within the ISO TC37/SC49 committee, and/or W3C standards. There is world-wide recognition of the value of the results of these initiatives, that have placed the EU at the forefront in the areas of LRs and standards. Just as an example, we are now exporting our model for defining standards to a Japanese project, “Developing International Standards of Language Resources for Semantic Web Applications”, within the International Joint Research Program by the New Energy and Industrial Technology Development Organization (NEDO), to establish ISLE-like standards for Asian languages (Calzolari et al. 2002).
5 http://www.ilc.cnr.it/EAGLES96/home.html
6 http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm
7 http://www.elda.org/intera
8 http://lirics.loria.fr
9 http://www.tc37sc4.org
3.1.1.2 LR Building (Projects such as ACQUILEX, MULTILEX, DELIS, ONOMASTICA, PAROLE/SIMPLE, EuroWordNet, LC-STAR)
The LR repositories constructed in these projects (for overviews of these projects see Varile and Zampolli (eds.) 1992, the special issue of Literary and Linguistic Computing edited by Ostler and Zampolli 1994, Zampolli, Calzolari and Palmer (eds.) 1994, Zampolli 1998, Calzolari 1998) are rich in linguistic knowledge (and often in world knowledge), and are either based on consensually agreed best practices and standards or have themselves established de facto standards that were then submitted to the international community for recognition. Moreover, a model of synergy between EC and national initiatives was defined and established, by which the EC projects built core LRs for different languages, while it was up to national initiatives to extend the core to real-size LRs. This strategy had the clear benefit of "imposing" a common model on the various languages, thus de facto enforcing interoperability among LRs in different countries. The LR projects have been so far either in the field of written or of spoken LRs, thus probably contributing to maintain a distinction between the two communities. A new initiative is now underway within ELRA10 and its Production Committee, the so-called Unified Lexicon project, linking the LC-Star11 (spoken) and Parole (written) lexicons. Such an initiative goes in the direction of trying to bridge the gap between the two areas, with the double objective of establishing a methodology to link spoken and written LRs, and thus launching common standards and new models of LR distribution.
3.1.1.3 Infrastructural Issues (Projects Such as RELATOR, ELSNET, EUROMAP, HOPE, ENABLER)
These types of projects, dealing with policy and meta-level issues related to LRs and standards, have been instrumental in defining a coherent strategy for the LR field in Europe, and in giving Europe a central position in the LR area, leading also to the founding of independent associations such as ELRA (European Language Resources Association), the European counterpart of the American LDC.12 What these projects have achieved was also the emergence of a broader consciousness in the EU community both (i) of the interest raised by the topic of LRs vs. the inadequate approach to this topic common at that time, and – more importantly – (ii) of the aspects of consensual agreement vs. those involving more difficult theoretical or technological solution with respect to the state-of-the-art. Thus they were effective in spotting and bringing to light a number of commonly felt needs to which some solution had to be found. They were helpful in creating a more homogeneous community in Europe between the different groups interested in the LR area by compelling researchers from different countries and from public and private organisations to work together. There is certainly today a clear and growing industrial interest in the use of LRs and standards, in particular for multilingual applications.
10 http://www.elra.info
11 http://www.lc-star.com
12 http://www.ldc.upenn.edu/
A few signs of the wide resonance LRs have acquired in the last decade – as a result of all the above mentioned initiatives – can be found, among others, in a number of international initiatives: the LREC Conference13 (1000 participants in 2004 in Lisbon), bodies such as ELRA and LDC (Linguistic Data Consortium) or Cocosda14 (International Committee for the Coordination and Standardisation of Speech Databases and Assessment Techniques) and Write15 (Written Resources Infrastructure, Technology and Evaluation), the new international journal Language Resources and Evaluation, not to mention the vital role of LRs in statistical methods, in evaluation campaigns, and so on. On the one hand, such a solid position in the LR area must be maintained and reinforced, anticipating the needs of new types of LRs and quickly consolidating (e.g. through EAGLES/ISLE-like initiatives) areas mature enough for recommendation of best practices and standards. A virtuous circle should be established between innovation and consolidation. On the other hand, however, much stronger initiatives are needed to achieve true interoperability (see e.g. the issue of open architectures below), for which I envision the need of a new paradigm – in the sense of Kuhn – for the area of LRs.
3.1.2 Automatic Acquisition of Information: Back from the Late 80s
In a short historical note on LRs we cannot ignore the topic of automatic acquisition of information, which is critical in the area of large-scale LRs. Large corpora are needed for acquisition of linguistic information and of knowledge, and vice versa robust acquisition systems are needed for building realistic and adequate lexical resources. Automatic acquisition of lexical information was at the centre of Pisa activities since the late 70s / early 80s. We were the promoters (Calzolari, 1977, 1982, Calzolari and Moretti 1976) of an innovative – at the time – line of research aimed at acquiring lexical information from traditional printed dictionaries. This trend of research was started independently in the same years by Amsler (1981) in the US. This research area then became popular all over the world (in Europe, US, Japan) and continued for many years (among the first were Byrd et al. 1987, Nakamura and Nagao 1988, Boguraev and Briscoe 1989, Wilks et al. 1989). The Acquilex16 project – where natural language definitions in printed dictionaries were the texts to be analysed for acquiring syntactic and semantic information – constituted an essential step towards developing methodologies for acquisition, but also design and representation of computational lexicons (Boguraev et al. 1988). At the time, this process was referred to as going from machine readable dictionaries (MRD) to lexical databases (LDB) (Calzolari and Briscoe 1995).
13 http://www.lrec-conf.org/
14 http://www.cocosda.org
15 http://www.ilc.cnr.it/write
16 Funded within the European ESPRIT Basic Research Actions programme from 1988–1994, as two subsequent projects. http://www.ilc.cnr.it/viewpage.php/sez=ricerca/id=854
Yorick's group contributed substantially to this trend with work on "large-scale computational methods for the transformation of machine readable dictionaries (MRDs) into machine tractable dictionaries", i.e., MRDs converted into a format usable for natural language processing tasks (Wilks et al. 1989, Wilks and Nirenburg 1993). The most widely used MRD to perform automatic extraction was The Longman Dictionary of Contemporary English (LDOCE). Preferences – as we all know – were at the core of Wilks' famous "Preference Semantics" back in the 70s: "What is essential here is the inferential use of partial information; that is, information weaker than that in dictionaries and analytic (logically true) rules" (Wilks 1975b). Also in Acquilex we encountered the problem of a mismatch between the vagueness and implicit semantic density of natural language and the explicitness and comparative "poverty" of the formal representation language. The rigour and lack of flexibility of available representation languages caused difficulties when mapping into it natural language word-meanings, ambiguous and flexible by their own nature. One of the lessons learned was that it was somehow easier to acquire information than to represent it in a reusable manner, which seems to me a problem that has not yet been solved. Part of the results of meaning extraction, e.g. many meaning distinctions, which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level, and had to be blurred into unique features and values (Calzolari et al. 1993). Unfortunately, it is still today difficult to constrain word-meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries. The same inadequacy of the formal machinery – a typed feature structure (TFS) representation language – with respect to the complexity of the lexical information to be encoded emerged in the ET10 project on extracting syntactic/semantic information from the Cobuild dictionary, in the early 90s (Calzolari et al. 1995, but see also Barnbrook 2002, who wrote a grammar of Cobuild definition sentences and parsing software to extract their functional components). The necessity to formally represent all information automatically extracted from Cobuild raised the problem of the distinction to be made between constraining and preferential information. This distinction proved to be not inherent in the nature of the data, but related to their use: the same grammatical specifications (e.g. number or voice) must be seen and used either as constraints or as preferences in different situations. Unfortunately, constraint-based formalisms did not easily capture the distinction: preferences had to be either ignored or treated as absolute constraints. There were in the 90s – as said above – a large number of projects building large lexical and textual resources. LRs were not conceived as an end in themselves, but as an essential component to develop robust applications, and it was clear that they were the prerequisite and the critical factor for the emergence and the consolidation of the data-driven approach in HLT. These LR building projects started to provide the essential basic infrastructure of LRs, but it was also a recognised fact that these LRs did not have enough coverage, not only for practical reasons, but for more structural and inherent reasons.
No “static” lexical resource can ever be adequate and satisfying, from more than one perspective: (i) in extension: e.g. it cannot,
obviously, cover new formations, nor all possible domains; (ii) in depth: not even for the existing lexical entries can it provide all the required and useful linguistic information (e.g. not necessarily all the subcategorisation types actually occurring in a specific domain are covered by a general lexicon). For them to become really usable, it is essential that these generic, core LRs are built in such a way that they are really open to different types of enrichments and customisations, possibly performed automatically, and that information is granular enough so that different applications can extract what they need in the appropriate format. A message that we have repeated over the years is that the common generic platform of LRs – constituting the basic infrastructure – needs to be enhanced and fine-tuned in various ways – according to domain, task, system (IR, MT, QA, …), etc. – to become actually usable within specific applications. This makes it vital, for any sound lexicon development strategy, to accompany core static lexicons with dynamic means for enriching and integrating them – possibly on the fly – with the types of information which are known to be structurally and intrinsically missing from existing LRs. This global view eliminates an apparent dichotomy, i.e. the one between static vs. dynamically built (or incremental) resources, encompassing the two approaches in a more comprehensive perspective that sees the two as complementary and equally necessary facets of the same problem. This raised the need of working towards semi-automatic construction of a new generation of computational lexicons directly from corpora, otherwise coverage and/or accuracy would remain inadequate. The increasing availability and reliability of robust techniques (for chunking, shallow parsing, functional analysis, named entity recognition, etc.), and the ability to integrate them, made the exploitation of text corpora of greater relevance in many HLT tasks, and allowed the acquisition of lexical information complementing that available in static lexicons. Steps towards this objective have been taken, over the past years, by a consistent and constantly growing number of groups all over the world, with many varied research and development efforts aimed at acquiring linguistic and, more specifically, lexical information from corpora. Among the first EC projects working in this direction we mention Sparkle17 (Shallow PARsing and Knowledge extraction for Language Engineering), coordinated by me (1995–1998) (Federici et al. 1998) and Ecran18 (Wilks 1995), coordinated by Yorick, on tuning lexicons to domains and extracting selected types of information from written natural language texts, for instance information about joint ventures or new products from financial newswires. Sheffield and Pisa (Zampolli, Calzolari and Cignoni (eds.) 2003) were again going in the same direction, at the same time. Sparkle, combining shallow parsing and lexical acquisition techniques capable of learning (from large corpora) aspects of word knowledge required for LE applications, developed methodologies and techniques for application- or domain-dependent lexical resources to be acquired (semi-) automatically from texts (Briscoe et al. 1999). The aim was to explore how far simple robust phrasal parsing
17 http://www.ilc.cnr.it/sparkle/sparkle.htm
18 http://www.dcs.shef.ac.uk/research/ilash/Ecran
combined with classification techniques using limited and manageable linguistic knowledge and statistical data from substantial corpora can provide rich and reliable information about areas of word knowledge in which most extant conventional dictionaries, lexical databases and realistic NLP lexicons are demonstrably weak, e.g. predicate subcategorisation, argument structure and semantic preference. The central idea was to take advantage of the fact that text corpora contain hundreds or thousands of examples of word usages of intermediate frequency; by application of a partial parser, these examples are put in a form from which lexical information can be abstracted. This approach was then continued in an Italian National Project. Also in Sparkle it was evident that the acquisition issue is intimately related with the representation issue, and specifically with “standards” for annotation. Common annotation schemes were defined for three levels of syntactic analysis (chunking, phrasal parsing, and functional annotation), and a common description language for lexical encoding was designed. These technical standards were at the basis of a common evaluation framework defined both for the parsers and for the lexical acquisition systems (Lenci et al. 1999).
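A schematic sketch of the acquisition step described in this section is given below. It is not the Sparkle toolset itself: the parser output, the relation labels and the scoring are invented for illustration. The idea is simply that (verb, relation, head-noun) triples produced by shallow parsing can be accumulated into frequency profiles from which subcategorisation and semantic-preference information is abstracted.

```python
from collections import Counter, defaultdict

# Invented shallow-parser output; a real pipeline would yield millions of triples.
triples = [
    ("drink", "obj", "water"), ("drink", "obj", "beer"), ("drink", "obj", "gasoline"),
    ("build", "obj", "house"), ("build", "obj", "castle"), ("build", "obj", "theory"),
]

profiles = defaultdict(Counter)
for verb, relation, noun in triples:
    profiles[verb][(relation, noun)] += 1

def preference_strength(verb, relation, noun):
    """Relative frequency of a filler: a graded preference, not a yes/no constraint."""
    total = sum(count for (rel, _), count in profiles[verb].items() if rel == relation)
    return profiles[verb][(relation, noun)] / total if total else 0.0

print(preference_strength("build", "obj", "castle"))   # 0.33...
print(preference_strength("drink", "obj", "blame"))    # 0.0: unattested, not "forbidden"
```

Note that the resulting profile is deliberately graded, in line with the constraint-versus-preference distinction discussed above: an unattested filler simply has low evidence, it is not ruled out.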
3.2 Language Resources and Acquisition of Information: A Cyclical Relation 3.2.1 Linguistic Analysis vs. Statistical Approaches Automatic acquisition usually implies a bootstrapping methodology, but also a cyclical approach. On the one hand, it has been proved that more efficient extraction techniques presuppose some capability of automatically analysing the raw text in various ways. On the other hand, the induction phase must be accompanied and/or followed by a linguistic analysis and classification phase if the induced data is to be used and merged with already available resources, so that it can contribute to enrich them. This phase of linguistic analysis of statistical data is what in recent years was too often – and still is in my view – missing in many approaches (as seen in many conference papers). Corpus use raises additional interesting theoretical questions. One of the most interesting – and intriguing – aspects of using corpora to acquire lexical information is that one is immediately confronted with the impossibility, based on textual evidence, of using any type of description which is based on a clear-cut boundary between what is allowed and what is not. A consequence of the corpus-based approach (e.g. to lexicon building) is that it compels us to break hypotheses too easily taken for granted in mainstream linguistics. It is evident that, in the actual usage of a language, a large number of properties are displayed that behave as a continuum, and not as properties of the “yes/no” type. In fact, this is one of the main characteristics encountered in actual language usage. The same holds true for the so-called “rules”: we find in corpus evidence more of “tendencies” towards rules rather than precise rules, so that many of the “usual” theoretical rules
appear to be simplifications or idealisations which are in fact dispelled by real usage. A conclusion which must be drawn from these observations is that almost all information types must not be treated as absolute constraints, whose violation makes a sentence totally unacceptable, but as preferences that make a given sentence more or less acceptable in a given context, without affecting its grammaticality. This again poses a problem at the level of representation. Moreover, the evidence of actual usage is often in contrast with what one would expect if judgement were based solely on introspection. It is worth making a last observation: the implementation of such a cycle – analysis, acquisition, improved analysis – needs a strong compatibility both (i) between the lexical representation and the corpus annotation, and (ii) at the system/tools interface level (for input/output). From this consideration a clear need for interoperability standards and for common terms of reference emerges. Acquisition and standardisation must go hand in hand, if we want to profit in the best way of acquired information. 3.2.2 Relations Between Corpora and Lexicons Corpora help in particular in those areas where more delicate categories than subcategorisation and broad semantic classes are necessary. To this end, robust and flexible tools are needed for (semi-)automatic induction of linguistic knowledge from texts. Because of the mixture in the lexicon – as indeed in language in general – of (i) core phenomena which can be encapsulated in general rules (or tendencies), and (ii) peripheral but pervasive phenomena which are flexible, variable and with loose boundaries, we have to handle these two types in different but integrated ways. This is particularly true when we enter the realm of semantics. Regular patterns of usage can be described intentionally, by e.g. classifying them as members of lexical types. More elusive patterns can be at least partially extensionally recorded, e.g. as collocational patterns. This can be achieved for instance by listing (possibly with their frequencies) those words which fulfil a particular role in relation to the entry we are describing, but cannot be classified through a general semantic type or relation or as a normal metaphorical extension of a semantic type. We must be aware that it is just this type of data, i.e. collocational patterns of usage, which are often the real clue for proper selection of the correct translation when we move to bilingual or multilingual lexicons. When we look attentively at the various ways in which lexicon and corpus are related to each other, we cannot avoid highlighting the complexity of their mutual interactions (Calzolari 1991, 2003, 2004). According to different perspectives, the relation goes in one or the other direction. In any case, we cannot safely separate these two linguistic objects from one another as if they were independent entities. Computational lexicons, as well as printed dictionaries, often represent a sort of stereotypical language. Instead, a (computational or traditional) lexicon has to faithfully represent also the “irregular” facts and the divergences of usage from what is potentially and theoretically acceptable. We should not decide what to encode in the lexicon relying only on native speakers’ (even if lexicographers) intuition, since this
leads to a description of a "theoretical language" instead of the language as it is used; we must provide, in the lexicon, some representation of (and distinction between) what is allowed but only very rarely instantiated, and what is both allowed and actually used. With respect to this broad issue, a number of apparent dichotomies must not be considered as opposite views, but may be reconciled as complementary perspectives:
– rules vs. tendencies
– absolute constraints vs. preferences
– discreteness vs. continuum/gradience
– theoretical/potential vs. actual
– intuition/introspection vs. empirical evidence
– theory-driven vs. data-driven
– paradigmatic vs. syntagmatic
– symbolic vs. statistical.
I claim that the second element of the above dichotomies has to be highlighted, in order then to combine the two. Again we can’t avoid mentioning Yorick’s “preferential” rules, “in that they seek preferred entities but will accept those that do not satisfy the preferences” (Wilks 1979), linked to the famous “My car drinks gasoline” example, where “the action of drinking can be said to prefer an animate agent”, but “none of the senses available for ‘car’ are animate, and so the system simply accepts what it is given. I contrasted this approach with that of selection restrictions ”. 3.2.3 Lexicons as Dynamic Resources As a consequence, the promotion of a change of perspective on lexicons, from static resources towards dynamic entities, whose content is co-determined by automatically acquired linguistic information from text corpora and from the web, is deemed essential. We stress the need for using (semi-)automatic or machine aided methods wherever possible in resource work. This implies, for the machine learning community, developing new and stronger algorithmic methodologies to model textual statistics, integrating them with traditional NLP tools, and basing them on sound linguistic considerations. This trend of automatic acquisition of information has become today a consolidated fact in the HLT community, and we have moved from focusing on acquisition of “linguistic information” (as it was at the beginning) to broad acquisition of “general knowledge”, with more data intensive, robust and reliable methods. Today we can easily say that ontology learning, i.e. the practical feasibility of supporting knowledge acquisition in a domain, depends on developing automatic methods for acquiring conceptual representations from natural language text. Semantic Web initiatives are in fact also focussing on the building of ontological representations from texts, and in this respect show a large amount of conceptual overlap with the notion of a dynamic lexicon. Acquisition tools must be able, for example: to increase the available repositories with new words/terms,
possibly their definitions, domain, etc., from digital material on the web; to automatically identify collocational sets in corpora; to learn concepts from text – including automatic multi-lingual thesaurus building and ontology structuring; to improve existing semantic lexicons or wordnets by adding syntagmatic preferences and semantic qualia; and in general to tailor resources to specific needs. Agents will look for examples, identify uses in monolingual/multilingual web texts for glossary creation. Extensional descriptions will amount to creating virtual links between lexicons and examples (corpus/web samples, image samples, clips and videos, etc.). For the dichotomy, or better the coexistence in language, of static and dynamic features, we can mention also Patrick Hanks' Theory of Norms and Exploitations (Hanks 2004, Hanks and Pustejovsky 2005): "For an NLP resource, first we must identify the norms of a lexicon, which are relatively static; then we must develop sets of exploitation rules, which govern the dynamic use of words in unusual ways – i.e. rules that govern how each norm can be exploited in various ways – for rhetorical effect, metaphor, ellipsis, and so on. In both cases, the rules must be developed afresh, on the basis of empirical corpus analysis, replacing the navel gazing and invented examples that were so popular among generative linguists in the 1960s–1990s and that have turned out to be so misleading." A conclusion I would like to draw is based on the various experiences outlined above, and could be taken as a framework in which to also insert our future work strategy in the field of lexical/textual resources. We should push towards new and innovative types of lexicons: a sort of "example-based living lexicons" that partake of the properties of both lexicons and corpora. In such a lexicon redundancy is not a problem, but rather a benefit.
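One minimal way to picture such an "example-based living lexicon" is sketched below: a static core entry is enriched on the fly with collocates and attested examples harvested from a corpus sample. The field names and the data are invented for illustration and make no claim about any particular lexicon model.

```python
from collections import Counter

# A static core entry (invented), to be enriched dynamically.
entry = {
    "lemma": "drink",
    "pos": "verb",
    "senses": ["consume liquid (preferred agent: animate)"],
    "observed_objects": Counter(),   # filled from corpus evidence
    "examples": [],                  # redundancy is a benefit here, not a problem
}

def enrich(entry, corpus_pairs):
    """Record attested object collocates together with the attesting sentences."""
    for sentence, obj in corpus_pairs:
        entry["observed_objects"][obj] += 1
        entry["examples"].append(sentence)

enrich(entry, [
    ("She drinks green tea every morning.", "tea"),
    ("My car drinks gasoline.", "gasoline"),
    ("They drank the pub dry.", "pub"),
])

print(entry["observed_objects"].most_common())   # usage as actually realised in the corpus
```

The point of the sketch is only that the entry ends up describing the language as it is used – including exploitations such as the car-drinking example – rather than a purely "theoretical language".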
3.3 The Role of LRs in the Future HLT 3.3.1 A New Paradigm for LRs Broadening our perspective into the future, a further step and radical change of perspective is now needed, in order to facilitate the integration of the linguistic information resulting from all these initiatives, bridge the differences between various perspectives on language structure and linguistic content, put an infrastructure into place for content description and content interoperability at European level and beyond, and make lexical resources usable within the emerging Semantic Web scenario (Calzolari 2002, 2005, 2006). The ever-growing need for LRs for effective multilingual content processing requires a change in the paradigm, and the design of a new generation of LRs, based on open content interoperability standards. The effort of making available millions of “words” for dozens of languages is something that no single group is able to afford. This objective can only be achieved when working in the direction of an integrated Open and Distributed Linguistic Infrastructure, where not only the linguistic experts can participate, but which includes designers, developers and users of content encoding practices, and also many members of the society. It is
already proved by a number of projects that lexicon building and maintenance can be achieved in a cooperative way. We claim that the field of LR and LT is mature enough to broaden and open itself to the concept of cooperative effort of different set of communities (e.g. spoken and written, LT and Semantic Web, theoretical and application oriented). We can mention a few cooperative efforts to build language resources, in particular lexicons intended for human use, not for natural language processing. Wiktionary is “ a collaborative project to produce a free, multilingual dictionary with definitions, etymologies, pronunciations, sample quotations, synonyms, antonyms and translations. Wiktionary is the lexical companion to the open-content encyclopedia Wikipedia. we currently have 134,792 entries ”.19 It also contains community supplied translations for tens of thousands of words. There is also (the less open) Logos.it project,20 and there are efforts to extract lexicons and word resources from the web such as the Languages of the World project.21 3.3.1.1 Lexicons’ Integration and Interoperability: Towards a Cooperative Model The SW model of open data categories will foster LR integration and interoperability, through links to common standards. With the ISLE approach to lexical standards, and its definition of the MILE (Calzolari et al. 1993), new lexical objects can be progressively created and linked to the core set. We foresee an increasing number of well-defined linguistic data categories and lexical objects stored in open and standardised repositories, which will be used by different types of users to define their own structures within an open lexical framework. It is this re-usage of shared linguistic objects which will link new contents to the already existing lexical objects, while enabling shareability of distributed lexicon portions (Bertagna et al. 2004). The design of an abstract model of lexicon architecture will ensure a flexible model while working with a core set of lexical data categories. It will guarantee freedom for the user to add or change objects if that is deemed necessary, but will require an evaluation protocol for the core standard lexical data categories, and verification methods for the integration of new objects. This vision, enabled by MILE, will pave the way to the realisation of a common platform for interoperability between different fields of linguistic activity – such as lexicology, lexicography, terminology – and SW development. The lexicons may be distributed, i.e. different building blocks may reside at different locations on the web and be linked by URLs. This is strictly related to the SW standards (with RDF metadata to describe lexicon data categories), and will enable users to share lexicons and collaborate on parts of them.
19 http://www.wiktionary.org/
20 http://www.logos.it/
21 http://www.languagesoftheworld.org/
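As a purely illustrative sketch of the distributed, standards-linked lexicon outlined above, the fragment below (using the Python rdflib library) makes a single lexical entry in one provider's namespace point to shared data categories in another. Every URI is invented, and the data-category names are placeholders rather than any actual ISO, MILE or W3C vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

LEX = Namespace("http://example.org/mylexicon/")        # one provider's lexicon (invented)
CAT = Namespace("http://example.org/shared-datacats/")  # shared data categories (invented)

g = Graph()
g.bind("lex", LEX)
g.bind("cat", CAT)

# The entry itself lives in one repository...
g.add((LEX.drink_v, RDF.type, CAT.LexicalEntry))
g.add((LEX.drink_v, RDFS.label, Literal("drink", lang="en")))
g.add((LEX.drink_v, CAT.partOfSpeech, CAT.verb))
# ...but its categories (and further parts of its description) can reside elsewhere
# and be reused by other lexicons simply by referring to the same URIs.
g.add((LEX.drink_v, CAT.hasSyntacticFrame, CAT.transitiveFrame))

print(g.serialize(format="turtle"))
```

The re-use of shared URIs is what would link new content to already existing lexical objects while keeping the building blocks physically distributed.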
3.3.2 Some LR Priorities and Challenges
Recurring themes in all consultations about LRs (such as the recent Elsnet/Enabler22 and LREC200423 Roadmaps) are:
– make sure to define and provide basic LR coverage for all languages (see the BLARK/ELARK notions below)
– increase multilingual LRs
– investigate the possibility of developing an "Open Source" concept for LRs
– coordinate the creation of LRs (also across languages) with a view to interconnectivity and reusability, to enhance LR content interoperability
– give high priority to methods and tools to quickly develop LRs "on demand" (acquisition, annotation, merging, porting between domains or languages, …)
– develop LRs for evaluation purposes
– enhance metadata infrastructure and standards and set up infrastructures for comparative evaluation
– investigate IPR issues.
Many challenges exist, at various levels of complexity and with various priorities and weights. Some are challenges at the technological level, some are more at the organisational level. I quickly touch on some:
3.3.2.1 Mismatch Between LRs and LT
We have often experienced a gap – and a mismatch – between advancement in LRs and LT. There are times and situations where adequate LRs are missing, and others where there are no systems able to use "knowledge intensive" LRs effectively. The main shortcomings of this are (a) lack of usable implementations fully exploiting new types of LRs, (b) LR claims are not empirically evaluated. We must pursue a parallel evolution of R&D for both LRs and LT, which requires more overall coordination.
3.3.2.2 Lack of Communication Between the Communities of HLT and Semantic Web/Ontologies
The Semantic Web – conceived here just as a broad vision aiming at implementing a semantic structure behind the content of the Web – could act as an integration point of various efforts from different communities: HLT will highly benefit from the SW but the SW needs HLT, otherwise (i) there is a clear risk of "re-discovery" of what was done 20 years ago (if we look at the first issue of the International Journal on Semantic Web & Information Systems, 2005, we find some statements identical to ours in papers of the 80s!), or (ii) without the capability of automatic
22 http://www.enabler-network.org/final-workshop-program.htm
23 http://www.ilc.cnr.it/write/cocosda-iccwlre_meeting.htm
semantic mark-up, achieved with NLP techniques, authors or users have to mark up their content themselves. Yorick himself (2005) has written a critique of the Semantic Web's semantics, and argues that "there is a view that the SW will be the World Wide Web with its constituent documents annotated so as to yield their content or meaning structure. This view of the SW makes natural language processing central as the procedural bridge from texts to KR, usually via a form of automated Information Extraction." The relation between SW and HLT communities and initiatives clearly goes in both directions. Examples of relations from HLT to SW:
– Semantic mark-up: HLT is essential – and is robust enough – to start the task of adding meaning to Web data and make it usable for automatic processing.
– LRs as the basis for knowledge representation, sharing, and interoperability among knowledge based systems: On one side most ontologies are not linguistically motivated, while on the other side linguistic ontologies, or better semantic lexicons such as WordNets (Fellbaum 1998, Vossen 1998), are intrinsically suited for natural language interpretation, but may lack the expressiveness needed to represent knowledge for inferences, planning, etc. Both approaches – if taken apart – can fail during interpretation.
– Ontology learning, ontology design and evaluation of ontologies: LT, particularly Information Extraction, is a core technology for the extraction and creation of semantic content, while ontologists are often not paying attention to NLP methodologies and tools and their relevance to ontology building and evaluation.
Examples of relations from SW to HLT, i.e. of uses of SW technologies for improvement of NLP applications:
– LRs as web services, and the use of SW representation formalisms: How are traditional LTs or LRs changed by SW languages? The standard formal annotation of data or ontologies on the web (e.g. RDF) is of high value for natural language applications, and ontologies can provide a framework for structuring terminologies. The SW notion and its representation formalisms may crucially determine the shape of the new generation of LRs of the future, consistent with the vision of an open space of sharable knowledge available on the Web for processing.
– Open access paradigm, semantic interoperability, information integration: This is – in my vision – a must for the next decade of LRs, and implies a complete re-thinking of the current area of LRs.
3.3.2.3 Open and Distributed Architectures for LRs and LT, Interoperability, GRID Technology, and Standards
A new paradigm of R&D in LRs and LT is emerging, pushing towards the creation of open and distributed linguistic infrastructures for LRs and LT, based on sharing LRs and tools. It is urgent to create a framework – both technological and organisational – that enables controlled and effective cooperation of many groups on common tasks,
adopting the paradigm of accumulation of knowledge that has proved so successful in more mature disciplines, such as biology and physics. This implies the ability to build on each other's achievements, to merge results, and to have them accessible to various systems and applications. This is the only way to make a clear leap forward. This means emphasising interoperability among LRs, LT and knowledge bases, and using linguistic ontologies to enable the development of interoperable large scale distributed knowledge-based systems. To mention just one example, more and more initiatives are arising aimed at achieving international consensus on annotation guidelines: to merge diverse linguistic annotation efforts (such as PropBank, NomBank, Framenet, TimeML, Penn Discourse Treebank, …), and to produce a set of coherent, integrated, comprehensive linguistic annotations to be readily disseminated throughout the community. Standards – also for metadata – are again unavoidable. This may also mean application of GRID technology to the problem of processing extremely large quantities of "facts and their relations", of development of unprecedented large-scale annotated LRs, and of their dynamic linking across many different sources. A problem and a challenge is how to coordinate different information sources.
3.3.2.4 Specific (New) Types of LRs and Few Critical Issues
I'd like just to mention here a few types of LRs that should receive (more) attention in the next few years.
– Metadata (automatic creation, merging, …).
– Multilingual LRs (in the EU, also for East-European languages), in an open infrastructure, also for multilingual access to web.
– Multimedia resources, and multimedia indexing, content extraction, content-based search techniques, data mining, with integration of different modalities.
– The Web exploited as a multilingual corpus. We should pursue both massive practical state-of-the-art annotation but also experimental ideal annotation, with as complete semantic annotation as possible and in language as neutral as possible, serving as forward thinking to guide future efforts. This strategy can have significant implications e.g. on example-based machine translation.
– New types of "example-based" context sensitive LRs, Lexicon and Corpus together, dynamically created from heterogeneous resources and adapted to the flexibility and multidimensionality of meaning, exploiting new ways to extract reusable "value" from large linguistic repositories. As said above, choices on the syntagmatic axis are pervasive and we need to cope with language displaying many properties as a continuum, with vagueness and exceptional cases almost as the norm.
– Integration of Lexicons/Terminologies/Ontologies, towards Knowledge Resources linked to/based on linguistic means of expression of content. Overall, the lexicons will perform the bridging function between documents and conceptual categorisation.
– Facts and commonsense knowledge. There is usually a trade-off between a system's breadth of knowledge and its depth of reasoning. We must break this trade-off by creating a vast amount of commonsense knowledge, making better use of contextual information, to allow commonsense inference. An initiative in this direction is the artificial intelligence CYC project (Lenat and Guha 1990). Building practical commonsense reasoning systems could be pursued in a distributed and collaborative fashion by the community as a whole.
– Common sense in affective classification of text, with affective qualities of things, actions, events, and situations.
– Personal digital memory, with a broad-spectrum model of everyday life events.
And we must not forget two orthogonal basic issues, often neglected:
– Knowledge transfer across languages, to take advantage of LRs built for a few resource-rich languages to induce knowledge in languages for which few LRs are available.
– Maintenance of LRs (updating, tuning, etc.) should be organised, and is still a big issue.
3.4 Technical vs. Organisational/Strategic Issues
3.4.1 Basic LRs for all Languages
Enabler has adopted and strongly supported the BLARK (Basic LAnguage Resource Kit) concept, first launched through Elsnet (Krauwer 1998) and Nederlandse Taalunie (Binnenpoorte et al. 2002). The promotion of BLARK (Mapelli and Choukri 2003) requires us to:
– specify for every language the minimum set of LRs (in terms of text and spoken corpora, lexicons, basic tools to manipulate them, skills required, etc.) to be able to do any pre-competitive research for that language;
– spot the actual gaps to be filled (a matrix highlighting the gaps of LRs for many applications and languages will be accessible and modifiable directly from the ELRA Web site, to enable customers or providers of LRs to fill it, to identify available LRs and to promote the production of new LRs);
– present a summary of the technical, operational and organisational problems to be tackled and provide suggestions for an overall organisational framework for international cooperation.
For this notion to become a reality, it is clear that not only technical but also coordination and political initiatives are required. International cooperation will certainly be the most important factor for the field of LRs in the next few years. Moreover, BLARK must be considered an evolving notion. A further level is defined as the Extended LAnguage Resource Kit (ELARK), which will be extensively promoted to encourage its wider adoption.
3.4.2 Towards a True LR Infrastructure
The approach to realising a true LR infrastructure requires the coverage not only of a range of scientific aspects (e.g. pertaining to linguistic modelling), but also – and maybe most critically – of a number of organisational aspects. In order to set up the required world-wide language infrastructure on the web, an essential aspect for ensuring an integrated basis is to enhance interchange and cooperation among the many communities that currently act separately, such as LR and LT developers, Terminology, Semantic Web and Ontology experts, content providers, linguists and so on. This is one of the challenges for the next few years, for a usable and useful "language" scenario in the global network. Moreover, such a language infrastructure may be inherently market driven, since the most widely used language portions may be the best developed and supported, and this has to be seriously considered. Technical and scientific issues are obviously important, but organisational, coordination and political issues play a major role, as was highlighted in the Enabler project (Calzolari et al. 2004). Technologies exist and develop fast, but the infrastructure that puts them together and sustains them is still largely missing. For example, the absence of a specific HLT action line at European level in FP6 means not so much a change in the funding scene, but – more dangerously – a lack of opportunities to discuss meta-level issues on HLT, missing overall coordination, and difficulty in designing common global long-term strategies, with the risk of being merely opportunistic in R&D choices. There is therefore a pressing need for international research infrastructures for LRs and LT, and for forums in which to discuss a broad research agenda, priorities and strategic actions for multilingual and multimedia LRs and LT together, as may hopefully happen in the Cocosda and the new Write international Committees. The implementation of the notion of open distributed infrastructures for LRs and LT could act as a major technological and organisational challenge around which synergies must be sought (also with other communities), and can naturally lead to the creation of an International Forum in which to discuss strategies and priorities. We must in fact be aware that coordination, strategic and political issues acquire an increasingly decisive relevance with the growing maturity of our field, in particular in the sensitive area of LRs.
References Amsler, R.A. 1981. A taxonomy for English nouns and verbs. In Proceedings of the 19th Annual Meeting of the Association for Computational Linguistics. Stanford University, Stanford, California, USA, pp. 133–138. Barnbrook, G. 2002. Defining Language: A Local Grammar of Definition Sentences. Studies in Corpus Linguistics 11. John Benjamins. Bertagna, F., Lenci, A., Monachini, M. and Calzolari, N. 2004. “Content Interoperability of Lexical Resources : Open Issues and MILE Perspectives”. In Proceedings of LREC2004, pp. 131–134.
Binnenpoorte, D., De Vriend, F., Sturm, J., Daelemans, W., Strik, H. and Cucchiarini, C. 2002. “A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch”. In LREC 2002 Proceedings. Las Palmas, pp. 1862–1866. Byrd, R.J., Calzolari, N., Chodorow, M.S., Klavans, J.L., Neff, M.S. and Rizk, O.A. 1987. “Tools and Methods for Computational Lexicology”. In Computational Linguistics. ACL Journal, 13(3–4):219–240. Boguraev, B. and Briscoe, T. (eds.) 1989. Computational Lexicography for Natural Language Processing. Longman. Boguraev, B., Briscoe, E.J., Calzolari, N., Cater, A., Meijs, W. and Zampolli, A. 1988. Acquisition of Lexical Knowledge for Natural Language Processing Systems (ACQUILEX). Proposal for ESPRIT Basic Research Actions No. 3030. Cambridge, UK, p. 34. Briscoe, T., McCarthy, D., Carroll, J., Allegrini, P., Calzolari, N., Federici, S., Montemagni, S., Pirrelli, V., Abney, S., Beil, F., Carroll, G., Light, M., Prescher, D., Riezler, S. and Rooth, M. 1999. Acquisition System for Syntactic and Semantic Type and Selection. SPARKLE Deliverable 7.2. Pisa. P. 72. Calzolari, N. 1977. “An Empirical Approach to Circularity in Dictionary Definitions”. In Cahiers de Lexicologie, XXXI(2):118–128. Calzolari, N. 1982. “Towards the Organization of Lexical Definitions on a Database Structure”. In Ján Horecký (ed.), COLING ’82. (North-Holland Linguistic Series, 47). North-Holland, Amsterdam, pp. 61–64. Calzolari, N. 1991. “Lexical Databases and Textual Corpora: Perspectives of Integration for a Lexical Knowledge Base”.In Zernik, U. (ed.), Lexical Acquisition: Exploiting on-line Resources to Build a Lexicon. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 191–208. Calzolari, N. 1998. “An Overview of Written Language Resources in Europe: A few Reflection, Facts, and a Vision”. In Rubio, A., Gallardo, N., Castro, R., Tejada, A. (eds.), Proceedings of the First International Conference on Language Resources and Evaluation (LREC). Granada, Vol. I, pp. 217–224. Calzolari, N. 2002. “Computational Lexicons: Towards a New paradigm of an Open Lexical Infrastructure?”. In Willée, G., Schröder, B., Schmitz, H.C. (eds.), Computerlinguistik. Was geht, was kommt?. Computational Linguistics. Achievements and Perspectives. Gardez! Verlag, Sankt Augustin, pp. 41–47. Calzolari, N. 2003. “Corpus-based Lexicon Building: An Overview Across Projects, Problems, Approaches”. In Zampolli, A., Calzolari, N., Cignoni, L. (eds.), Computational Linguistics in Pisa – Linguistica Computazionale a Pisa. Special Issue of Linguistica Computazionale. IEPI, Pisa, pp. 79–116. Calzolari, N. 2004. “Computational Lexicons and Corpora: Complementary Components in Human Language Technology”. In van Sterkenburg, P. (ed.), Linguistics Today – Facing a Greater Challenge. John Benjamins, Amsterdam, pp. 89–107. Calzolari, N. 2005. “Language Resources: priorities and challenges”. In Symposium on Natural Processing and Image Recognition. National Institute of Information and Communication (NICT), Kyoto University, pp. 9–12. Calzolari, N. 2006. “Technical and Strategic Issues on language resources for a Research Infrastructure”. In Furui, S. (ed.), Proceedings of the International Symposium on Largescale Knowledge Resources (LKR2006). Tokyo Institute of Technology, pp. 53–58. Calzolari, N., Bertagna, F., Lenci, A., Monachini, M. (eds.) 2003. Standards and Best Practice for Multilingual Computational Lexicons. MILE (the Multilingual ISLE Lexical Entry). ISLE CLWG Deliverable D2.2&D2.3. Pisa, p. 194. 
http://lingue.ilc.cnr.it/EAGLES96/isle/. Calzolari, N., Lenci, A., Bertagna, F. and Zampolli, A. 2002. Broadening the Scope of the EAGLES/ISLE Lexical Standardization Initiative. In Calzolari, N., Choi, K., Lenci, A.,
Tokunaga, T. (eds.), Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Taipei, Taiwan, pp. 9–16. Calzolari, N. and Briscoe, T. 1995. “ACQUILEX I and II. Acquisition of Lexical Knowledge from Machine-Readable Dictionaries and Text Corpora”. In Cahiers de Lexicologie, 67(2):95–114. Calzolari, N., Choukri, K., Gavrilidou, M., Maegaard, B., Baroni, P., Fersøe, H., Lenci, A., Mapelli, V., Monachini, M. and Piperidis, S. 2004. “ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs”. In LREC 2004 Proceedings. Lisbon, pp. 937–940. Calzolari, N., Federici, S., Montemagni, S. and Peters, C. 1995. “Extracting, Representing and Using Syntactic-Semantic Information from the Cobuild Student’s Dictionary”. In Sinclair, J., Hoelter, M., Peters, C. (eds.), The Languages of Definition: The Formalization of Dictionary Definitions for Natural Language processing. Studies in Machine Translation and Natural Language Processing, European Communities, Brussels – Luxembourg, pp. 59–148. Calzolari, N., Hagman, J., Marinai, E., Montemagni, S., Spanu, A. and Zampolli, A. 1993. “Encoding Lexicographic Definitions as Typed Feature Structures”. In Beckmann, F., Heyer, G. (eds.), Theorie und Praxis des Lexikons. Foundations of Communication and Cognition. Walter de Gruyter, Berlin, pp. 274–315. Calzolari, N. and Moretti, L. 1976. “A Method for a Normalization and a Possible Algorithmic Treatment of Definitions in the Italian Dictionary”. In Proceedings of the 6th International conference on Computational Linguistics (COLING’76). Ottawa, No. 32, p. 13. Calzolari, N. and Zampolli, A. 1999. “Harmonised Large-scale Syntactic/Semantic Lexicons: A European Multilingual Infrastructure”. In MT Summit Proceedings. Singapore. Calzolari, N. and Zampolli, A. 2003. “The EAGLES/ISLE Initiative for Setting Standards: The Computational Lexicon Working Group for Multilingual Lexicons”. In Cole, C., Craig, H. (eds.), Computing Arts: Digital Resources for Research in the Humanities. University of Sydney, pp. 45–73. Federici, S., Montemagni, S., Pirrelli, V. and Calzolari, N. 1998. “Analogy-based Extraction of Lexical Knowledge from Corpora: the SPARKLE Experience”. In Rubio, A., Gallardo, N., Castro, R., Tejada, A. (eds.), Proceedings of the First International Conference on Language Resources and Evaluation. Granada, Vol. I, pp. 75–82. Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press. Hanks, P. 2004. “The Syntagmatics of Metaphor”. In International Journal of Lexicography, 17(3). Hanks, P. and Pustejovsky, J. 2005. “A Pattern Dictionary for Natural Language Processing”. In Revue française de linguistique appliquée, 10(2). Krauwer, S. 1998. “ELSNET and ELRA: A Common Past and a Common Future”. In ELRA Newsletter, 3(2). Lenat, D. and Guha, R.V. 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley. Lenci, A., Bel, N., Busa, F., Calzolari, N., Gola, E., Monachini, M., Ogonowsky, A., Peters, I., Peters, W., Ruimy, N., Villegas, M. and Zampolli, A. 2000. “SIMPLE: A General Framework for the Development of Multilingual Lexicons”. International Journal of Lexicography, 13(4):249–263. Lenci, A., Montemagni, S., Pirrelli, V. and Soria, C. 1999. “FAME: A Functional Annotation Meta-scheme for Multi-modal and Multi-lingual Parsing Evaluation”. In Proceedings of the ACL-IALL Workshop “Computer-mediated Language Assessment and Evaluation in Natural Language Processing”. Maryland, pp. 45–52.
Mapelli, V., Choukri, K. 2003. “Report on a (Minimal) Set of LRs to Be Made Available for as Many Languages as Possible, and Map of the Actual Gaps”. ENABLER Deliverable D5.1, Paris, p. 22. Nakamura, J. and Nagao, M. 1988. “Extraction of Semantic Information from an Ordinary English Dictionary and its Evaluation”. In COLING 1988. pp. 459–464. Ostler, N., Zampolli, A. (eds.) 1994. Literary and Linguistic Computing. Special issue. OUP. Ruimy, N., Corazzari, O., Gola, E., Spanu, A., Calzolari, N. and Zampolli, A. 1998. “The European LE-PAROLE Project: the Italian Syntactic Lexicon”. In Proceedings of LREC1998, pp. 241–248. Ruimy N., Monachini, M., Gola, E., Calzolari, N., Ulivieri, M., Del Fiorentino, M.C., Ulivieri, M. and Rossi, S. 2003. “A Computational Semantic Lexicon of Italian: SIMPLE”. In Zampolli, A., Calzolari, N., Cignoni, L. (eds.), Computational Linguistics in Pisa. Linguistica Computazionale. Vol. XVI–XVII. IEPI, Pisa, pp. 821–864. Varile, G.B. and Zampolli, A. (eds.) 1992. “Synopsis of American, European and Japanese Projects”. In Linguistica Computazionale, VIII, Giardini Editore, Pisa. Varile, G.B. and Zampolli, A. (eds.) 1997. Survey of the State of the Art in Human Language Technology. Sponsored by the Commission of the European Union and the National Science Foundation of the USA, Giardini Editori, Pisa and Cambridge University Press. Vossen, P. 1998. “Introduction to EuroWordNet”. Computers and the Humanities, 32:73–89. Walker, D., Zampolli, A. and Calzolari, N. (eds.) 1995. Automating the Lexicon: Research and Practice in a Multilingual Environment. Clarendon Press, OUP, Oxford, p. 413. Wilks, Y. 1975a. “An Intelligent Analyser and Understander of English”. In Communications of the ACM, 18(5):264–274. Wilks, Y. 1975b. “A Preferential, Pattern-Seeking, Semantics for Natural Language Inference”. In Artificial Intelligence, 6:53–74. Wilks, Y. 1979. “Making Preferences More Active”. In Artificial Intelligence, 11:197–223. Wilks, Y. 1995. “ECRAN: Extraction of Content: our Research at Near-market”. EU LRE Project. Wilks, Y. 1997. “Senses and texts”. In Computers and the Humanities, 31(2):77–90. Wilks, Y. 2005. “The Semantic Web as the apotheosis of annotation, but what are its semantics?”. AAAI Proceedings. Wilks, Y., Fass, D., Guo, C.-M., MacDonald, J.E., Plate, T. and Slator, B. 1989. “A Tractable Machine Dictionary as a Resource for Computational Semantics”. In Boguraev, B., Briscoe, T. (eds.), Computational lexicography for natural language planning. London: Longman, and as CRL Memoranda in Computer and Cognitive Science, MCCS-87-105. Wilks, Y. and Nirenburg, S. 1993. “Towards Automated Knowledge Acquisition”. In Proceedings of the Conference on Very large Knowledge Bases. Electronic Dictionary Research Institute, Tokyo. Zampolli, A. 1997. “The PAROLE Project in the General Context of the European Actions for Language Resources”. In Marcinkeviciene, R., Volz, N. (eds.), TELRI, Second European Seminar: Language Applications for a Multilingual Europe. Kaunas, Lithuania, pp. 185–210. Zampolli, A. 1998. “Introduction of the General Chairman”. In Rubio, A., Gallardo, N., Castro, R., Tejada, A. (eds.), Proceedings of the First International Conference on Language Resources and Evaluation (LREC). Granada, Vol. I, pp. xv–xxv. Zampolli, A. 2003. “Standards for Language Data Processing: An Historical Overview”. In Fiormonte, D. (ed.), Informatica Umanistica dalla Ricerca all’Insegnamento. Atti del Convegno Computer, Literature and Philology. Bulzoni, pp. 65–84.
Zampolli, A., Calzolari, N. and Cignoni, L. (eds.) 2003. Computational Linguistics in Pisa – Linguistica Computazionale a Pisa. Linguistica Computazionale, Special Issue, Vol. XVI–XVII, Vol. XVIII–XIX. IEPI, Pisa-Roma. Zampolli, A., Calzolari, N. and Palmer, M. (eds.) 1994. Current Issues in Computational Linguistics: in Honour of Don Walker. Linguistica Computazionale, Vol. IX–X, Giardini Editori, Pisa and Kluwer Academic Publisher, Norwell, MA, p. 595. Zampolli, A. et al. 2000. ENABLER Technical Annex, Pisa.
4 Information Access and Natural Language Processing: A Stimulating Dialogue
Robert Gaizauskas, Horacio Saggion and Emma Barker
Department of Computer Science – University of Sheffield – Sheffield – UK
Abstract:
In this paper we examine the interplay between the requirements of information seekers to access information in large digital text collections and the techniques developed by natural language processing researchers to support this access. In particular we examine how language processing technologies such as question answering, single and multi-document summarisation, and ontology-guided similar event searching can assist journalists in gathering information from news archives for the purpose of writing background to a breaking news event – the Cub Reporter scenario. Our thesis is that investigating real-world tasks with complex information access requirements, such as the Cub Reporter scenario, stimulates researchers to look beyond existing search engine solutions and drives the development and evaluation of novel language processing techniques; at the same time novel developments in language processing capabilities allow both conceptual insights into how to characterise information seeking behaviour and empirical insights based on observation of information seeking behaviour using new technologies.
4.1 Introduction Ever since digital text and then digital document collections began to emerge in the 1950s and 60s, questions of how users of such collections could access them in order to retrieve or to discover information within them has been of central interest to information scientists. Initially much work focussed on investigating whether indexing schemes manually devised by library scientists, such as those underlying traditional paper-based document collections, would prove best at supporting access to the new electronic document collections, or whether the new power of computers to index every word in every document would prove superior. These investigations, which led, via the Cranfield experiments, to the development of the evaluation framework for document retrieval still practised today, vindicated the use of automatic computer-based indexing techniques. This was an early triumph for the use of computers in processing natural languages (see Sparck Jones (1981) or Sparck Jones and Willett (1997) for discussions of the early development of information retrieval). The 1950s and 1960s also saw the beginnings of other approaches to using computers to assist human users in accessing information in electronic text. 85 K. Ahmad, C. Brewster and M. Stevenson (eds.), Words and Intelligence II, 85–105. © 2007 Springer.
Automatic summarisation is usually traced back to H.P. Luhn’s pioneering paper on the automatic creation of literature abstracts in 1958, in which he proposed a method for automatically generating an abstract of a scientific paper by extracting sentences from the full text of the paper (Luhn, 1958). Question answering, not only against structured data sources but also against textual data, also dates from this period. Simmons (1965), for example, describes a range of exploratory investigations on question answering from the early 1960s including three systems designed to use text collections as the information source to answer natural language questions. Finally, the roots of information extraction can be traced to this time as well, for example in the work of Sager at NYU in the late 1960’s on filling “information formats” from radiology reports and discharge summaries Sager (1981) and of Wilks on extracting template structures from texts (Wilks, 1964). Each of these four information access technologies – information retrieval (IR), summarisation, question answering (QA) and information extraction (IE)– has since grown into a research area in its own right. Workshops and conferences are organised around them, researchers identify their research programmes with one or more of them, and shared task definitions, evaluation methods, and annotated data resources have emerged around each of these four areas. Information access, then, as a source of challenging applications, has been and continues to be a long term stimulus to those seeking to develop human language technologies. However, information access raises a much more pointed challenge to a particular research community within the broader grouping of those researching language processing by computer – the natural language processing (NLP) community. It is that challenge which we wish to highlight in this paper. Of course in a broad sense the term “natural language processing” comprehends any sort of computer processing of human language. But this is not how the term has evolved historically, nor how most who either adopt or avoid the label in respect of their own work would chose to use it. From some time in the 1960s NLP became associated with computational linguistics (CL) and artificial intelligence (AI) and has been heavily influenced by whatever approaches were in the ascendant in these areas. For a 30 year period NLP was influenced via CL by formal linguistics and logic (for example, consider the effect of Chomsky on syntax in CL, Montague on semantics, to name just two of the most powerful influences). During the same period NLP was heavily influenced from AI by representationalist/logicist approaches (e.g. the use of frames, semantic nets, scripts for NL understanding/meaning representation; planning approaches to generation). During this time a separate IR community grew up, pursuing document retrieval using techniques developed largely without reference to linguistics or AI. IR, viewed as a research community advocating a set of techniques, began to see itself, and to be seen as, the antithesis of NLP. The legacy of this cultural split is still with us – two separate, though not disjoint, research communities – although it is fading due to the waning of influence of formal syntactic and semantic approaches and the resurgence in interest in data driven/statistical approaches to NLP. 
However, what remains is a belief that NLP techniques must, by definition, be something more than “mere” IR, by which is generally meant “bag of words” techniques. Such
techniques typically view a document or query as a multiset of terms (usually the terms are just the words) and then derive a weighted term-based representation from it using a weighting scheme incorporating distributional information from individual terms in the document/query itself and from the document collection as a whole. Such representations are used together with a similarity measure to measure document-document or document-query similarity. If NLP is not "mere" IR then what is it? (we are focussing here on differences in techniques not on tasks addressed – obviously NLP has historically interested itself in far more than document retrieval). It is unlikely that anything approaching a consensus could be obtained from those working in the field. Answers might range from the facetious "anything to do with the machine processing of human language that we can't do yet" (by analogy with those definitions of AI that try to distinguish it from computer science in a similarly negative way) to definitions that commit the field to some attempt to represent the "meaning" of texts or to get the machine to "understand" texts. Others would distinguish NLP from IR or text processing by a commitment to at least paying attention to, if not going as far as modelling, insights or observations from linguistics and/or cognitive psychology about the nature of the human language system and its functioning in human minds. In their introductory text on speech and language processing, subtitled "An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition", Jurafsky and Martin (2000) suggest that what distinguishes language processing applications from other data processing systems is not just what they process (spoken and written human language), but their use of knowledge of language. Thus, a simple program like Unix wc is a data processing application insofar as it counts bytes and lines, but a language processing application insofar as it counts words, as the notion of word is a linguistic one, and requires some knowledge of the language system in order to characterise it properly (indeed this knowledge is non-trivial even for English – how many words is can't? – and especially so for languages like Chinese where word delimiters are not part of the script). This is a useful distinction and how much knowledge of language is brought to bear in a given system gives us a better purchase on the difference between NLP and IR. One can characterise those pursuing NLP as those more willing to use techniques that involve some knowledge of language at one or more levels from that of words (e.g. a linguistically motivated model of morphology) to that of discourse (e.g. a theory of discourse relations). They have a native belief that, since the language processing system is complex, attempts to model its complexity will bear fruit, not just in terms of deeper understanding, but in terms of performance in applications. IR researchers, by contrast, are temperamentally inclined to be sceptical of the value of bringing more knowledge of the language system to bear, until its value in terms of their chosen application – document retrieval – can be clearly demonstrated. Thus, the question Can NLP techniques lead to better information access technologies than IR techniques? is a contentful, if not entirely precise, question, one asking about the value of using richer rather than poorer models of language processing in building information access systems.
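The following is a minimal sketch of this "bag of words" style of representation and matching (tf*idf term weighting with cosine similarity), offered purely as an illustration of the baseline against which knowledge-richer techniques are being measured; it is not a description of any particular system discussed in this paper.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build tf*idf weighted term vectors for a tiny document collection."""
    n = len(docs)
    bags = [Counter(doc.lower().split()) for doc in docs]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    idf = {term: math.log(n / df) for term, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags], idf

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["train crash inquiry opens", "rail crash kills three", "election results announced"]
vectors, idf = tf_idf_vectors(docs)
query = Counter("train crash".split())
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in query.items()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
print(ranking)  # documents ordered by similarity to the query
```

Whether bringing more knowledge of language to bear can improve on representations of this kind is precisely the question posed above.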
Furthermore, in our view, attempts to answer it form a stimulating dialogue which benefits both information
access and NLP. From the side of information access NLP is a well-spring of new ideas and approaches to language processing motivated by trying to stretch the capabilities of machine processing of human language. Many of these ideas may fail (e.g. as has been claimed for the application of word sense disambiguation in IR or the attempt to do text understanding to support IE), but this may not be forever, but only until implementations improve. Others may indeed succeed (e.g. morphological processing to support search engines for highly inflected languages, semantic tagging in QA). Similarly, NLP benefits from concrete challenges, such as those thrown up by the problems of information access, that force scientists/engineers to propose solutions and act as a reality check – if new techniques do not work or work no better than simpler techniques (in particular no better than non- or minimal NLP techniques) then they need to be abandoned/rethought. What constitutes “better”, however, is not always easy to say. Standard evaluation techniques have been developed for IR and subsequently for summarisation, QA, and IE. Evaluation methods for the latter three have been heavily influenced by the successful IR evaluation paradigm. In all cases the standard evaluations are intrinsic evaluations, in the sense of Sparck Jones and Galliers (1996). That is, they assess how well a system meets its design objectives, but do not directly assess how well the system functions in an application setting to allow its intended user to carry out some task. Typically they involve an artificial task abstracted away from any specific, real task context. Thus, ranking documents by relevance (IR) or selecting sentences from a text that most approximate the content of those that appear in a model summary (summarisation) or proposing text snippets that answer factual questions (QA) or filling template structures after reading texts (IE) are not tasks that humans typically perform. Rather it is assumed that if we had systems that could do these things automatically, then they would be of value to humans; and that the better a system does these things the more utility it would have. Such evaluations are of critical importance in technology development. However, they are problematic in a number of ways. First, they may tell us little about how useful a system is likely to be in a real task setting. This is particularly true if, as is the case, we do not know the lower bound on acceptable performance, i.e. what is the minimal performance level an IE system or a QA system must reach in order for it to be of any utility whatsoever? Secondly, such evaluations focus effort away from other aspects of system building, particularly interface design and the management of extended information seeking interactions, aspects that may be more important in terms of usability than small differences in precision and recall scores. Finally, evaluation results are not comparable across technologies – e.g. how can factoid QA be compared with IR for information access if one is assessed in terms of accuracy or mean reciprocal rank and the other in terms of mean precision at rank 10? Given a document collection and a novel information request (e.g. “prepare a brief on company X for the CEO”), could someone armed with a search engine produce a more or less comprehensive report in more or less time than someone equipped with QA and/or summarisation and/or IE technology? 
It is hard to see how technology-specific intrinsic evaluations could answer this question. Thus, taken together, these considerations suggest that what is needed in addition to intrinsic
evaluations is some form of extrinsic evaluation, difficult as this may be, that will enable the utility of various information access technologies to be compared and judged in a real, or at least more realistic, task setting than those used for intrinsic evaluations. These reflections lead us to the following point. To explore concretely the question of whether NLP techniques can lead to better information access technologies than “bag of words” techniques a number of things are required: (1) a real information seeking task scenario which throws up more serious challenges than those that can obviously be met by, e.g. a single simple Google search (such as “Find out what courses in Computer Science are offered at MIT”) (2) a large but stable text archive for controlled experiments (3) a good IR system that supports ranked document retrieval over the archive and can act as a baseline information access system (4) implementations of one or more NLP-based information access technologies (5) a uniform interface that supports information seeking using one or more of the IR and NLP information access technologies (6) a methodology for carrying out extrinsic evaluation of users performing the information seeking task using different information access system configurations. To realise this experimental setting as fully as possible we have embarked upon the Cub Reporter project. In the remainder of this paper we describe the project, including the information seeking scenario around which it is based, the experimental information seeking platform we have built, which incorporates a variety of information access technologies, and the methodology for extrinsic evaluation which we have developed. While it is too early to answer the question we have set, the process of answering it has been extremely illuminating.
4.2 The Cub Reporter Project Cub Reporter is a research project that aims to investigate how language technologies might help journalists to access information in a news archive in the context of a background writing task. The function of background material is to support and contextualise a breaking news story. The specific characteristics of the background-writing scenario suggest that recent work in areas of natural language processing such as question answering and text summarisation should be relevant to this task. In the project we set out to answer a number of research questions: (1) what are the essential components of a background story and how does background information relate to the “foreground” breaking news story? (2) how can background information for a breaking news story be accurately found in the archive given the initial breaking news story? (3) how can human language technology assist a journalist to access the vast amount of information in a news archive? in particular can recent advances in NLP technologies, in areas such as question answering, summarisation, and information extraction, offer advantages in gathering background that standard information retrieval cannot? (4) how is background writing quality affected by the use of human language technology?
To address these questions we have designed and implemented a prototype that incorporates a standard information retrieval engine as a baseline, as well as a question answering system and document summarisation technology. Information extraction technology is also used to extract structured representations of events which are in turn used to populate a database to support similar event search. These information access technologies are embedded in a browser-based graphical user interface which allows users to combine them flexibly in an iterative information seeking process. The main contributions of the work to date are: (1) a descriptive theory characterising the nature of background in the news and its relation to the foreground news story; (2) a design and prototype implementation of an information access platform that integrates information retrieval, summarisation, question answering and information extraction capabilities within a single system operating over a text archive of significant size; (3) a methodology for comparative evaluation of different combinations of language technologies for the task of background writing, allowing an assessment of the relative utility of more sophisticated natural language processing tools versus traditional information retrieval tools for the task of background writing.
4.3 The Cub Reporter Scenario 4.3.1 Information Seeking for Writing Background Journalists seek and use background information in a number of different task contexts. Some of these leave little obvious trace, such as when a journalist is preparing for an interview or inserting small amounts of background into a current story (e.g. John Doe, president of FooBar Inc. and former world tiddlywinks champion, ). Others leave clearer traces, such as when, in support of a big story, a journalist is instructed to prepare a background fact-sheet to assist other journalists putting the current story in context (for example, a list of previous train crashes), or even to write a dedicated, so-called “backgrounder”, which is an extended prose piece whose function is to contextualise the current news event. We have focussed on the latter – the task of writing a backgrounder – in order to acquire a better understanding of what background information is and also to gain some insights into the information seeking process. Interviews with journalists, observation during a controlled writing task and text analysis of a sizeable set of archived background stories have shown that journalists have two high level goals when seeking background for a news story: one is to provide simple descriptive information for entities or events that figure in the current story. For example, in background to a hurricane we see the proposition: A storm can only be classified as a hurricane if its wind speed is faster than 73 mph. A second is to find information about past events that can be used to frame the current event in a narrative that is both compelling and significant to the intended audience. Journalists describe this second goal as “angle seeking” and while they may begin the task not knowing what the ‘angle’ may be, they have an expert understanding of the kind of information that needs to be examined in order to develop and support an angle.
We can identify a number of types of information which are commonly used to provide and support angles in a news story, including: (1) chronological sequences of events, which lead up to the current event; (2) possible explanations for a particular outcome; (3) accounts of similar events (e.g. other train crashes, scandals of similar nature, etc.), especially extreme or distinguished examples; (4) interesting associations between groups of similar events or entities; (5) information that places a current event or entity on a scale of similar events or entities and (6) comment (quotes) on any of the preceding by notable individuals. In current practice journalists use standard search engine technology, incorporating IR techniques, to assemble background information from digital text collections. These systems return ranked sets of documents which the user must read in order to find information located within them. Typical functionalities for information access in such systems include text, keyword and topic search. Output to a specific query is presented as a ranked list of documents and associated ‘lead’ paragraphs. Access to the full document is done by following a link. While this technology might be adequate for information requests that are readily definable in key words and satisfied by single document searching, (e.g. the date Coetzee wins the Booker Prize for the second time), it is less good for the more complex information seeking activities we see in background research. Our study showed that the background task presented particular problems for current technology, for example: (1) answers may be distributed across multiple documents (e.g. Write a profile of Google Inc.); (2) answers may be buried within a long document; (3) in a retrieved document set the same information may occur redundantly in multiple documents, which the user may examine only to discard; (4) the user may need to learn about the topic or correct his misconceptions before precise requests can be asked (e.g. Write a brief about regional government in Sweden); (5) the user may need to discover information which is not explicit in the archive, i.e. by making novel associations between units of information in the texts (e.g. Are there any patterns in British Gas’s new ventures in the past year?); and (6) the user may have only a general idea of what he is looking for and therefore key word queries will not be effective, (e.g. When was the last time there was a disaster like this one?).
4.3.2 The News Archive In practice journalists use whatever text-based information sources seem relevant and to which they have access. Thus, they may use the Web (using a general Google search or focussing on particular news-bearing sites such as the BBC web-site), third party proprietary information tools and collections, as provided by companies such as Lexus-Nexus, or their organisation’s in-house archives. However, for purposes of development and evaluation, it is useful to have a controlled text collection. Such a collection should be large enough to offer significant challenges for information access technologies and so that no user is likely to know all that it contains or does not contain. However, it should stable, so that repeatable experiments can be
run, and not so large that research prototypes which have not been optimised for performance cannot process it in acceptable times. Through our collaboration with the UK Press Association (PA), the principal UK domestic newswire service, we have obtained access to 11 years of newswire copy from 1994 to 2004. The archive contains more than 8.5 million stories totalling 20GB of data. The raw corpus has been processed and encoded in XML following a strict Document Type Definition (DTD) specification which captures all meta-data delivered by the PA and which includes elements such as story date, category, topic, and structural information such as headlines, bylines, and paragraphs. The archive is organised by date, following the logical organisation of the PA wire where years are composed of months, months are composed of days, and there are a number of stories per day. Stories in the PA archive are classified into a number of topics or news categories from a controlled vocabulary representing the subject matter of the story (e.g., Courts, Politics). Within the same topic, stories are further identified by a number of free-text keywords, called catch-lines, which the journalists assign. When a "news event" occurs, a reporter writes a snap, a line of text summarising the news, and "moves" it to the wire. From that point on, stories follow an installment pattern where each installment carries an updated account of the story. Installments have names such as snapfull (a one- or two-paragraph text expanding the snap), lead (copy that summarises the major aspects of the story), and so on. These installment types reflect their position and significance in the publishing cycle of major newspapers. All of this material finds its way into the archive.
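As an illustration of the kind of structure involved, the following sketch parses a single, hypothetical story record. The element and attribute names are invented for the example; the actual PA DTD is not reproduced here.

```python
import xml.etree.ElementTree as ET

# A hypothetical story record: the tag names below are illustrative only,
# not the real element names defined by the PA DTD.
SAMPLE_STORY = """
<story date="2004-10-19" category="FOREIGN" topic="Iraq">
  <catchline>Iraq Hassan Kidnap</catchline>
  <headline>Aid chief seized in Baghdad</headline>
  <byline>By A Reporter, PA</byline>
  <installment type="snap"/>
  <paragraph>Kidnappers seized the head of an aid agency in Baghdad today.</paragraph>
  <paragraph>The Foreign Office said it was urgently seeking information.</paragraph>
</story>
"""

def story_metadata(xml_text):
    """Pull out the metadata and structural fields used for fielded search."""
    root = ET.fromstring(xml_text)
    return {
        "date": root.get("date"),
        "category": root.get("category"),
        "topic": root.get("topic"),
        "catchline": root.findtext("catchline"),
        "headline": root.findtext("headline"),
        "paragraphs": [p.text for p in root.findall("paragraph")],
    }

print(story_metadata(SAMPLE_STORY)["headline"])
```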
4.4 NLP Technologies for Information Access Interestingly, the background information gathering requirements outlined in Section 4.3.1 are similar to those addressed in recent NLP challenge tasks. For example, finding descriptive information for people or organisations – the so-called "definition question" task – is dealt with in recent TREC Question Answering evaluations (Voorhees, 2004) and Document Understanding Conferences (Over and Yen, 2004), and can be supported by solutions proposed in these contexts. Finding events similar to one reported in breaking news could be implemented with information extraction technology: text in the archive could be mapped off-line into structured representations which could be stored in a database for on-line searching (Milward and Thomas, 2000), perhaps using an ontology to guide the generalisation process. Question answering technology can be used to support fact gathering as well as fact checking in a background writing context. For example, consider the breaking news story about the "kidnapping of UK-born Margaret Hassan" in Iraq. Of considerable importance for the UK public are answers to the following (among other) questions: How many British citizens are living in Iraq? and Where was Margaret Hassan born?. Techniques used in factoid question answering should be relevant here. To explore the potential of NLP-based techniques to support journalists in the background writing task we have implemented a research platform which
incorporates a range of these techniques together with a conventional ranked retrieval search engine. The aim is to produce a system which integrates recent advances in NLP-based techniques, including some made in this project, as well as a conventional search engine into a browser-based graphical user interface comparable to those used by conventional searching tools. The resulting platform, which can be configured to include different combinations of technologies, then be used both for controlled experimentation and evaluation (see Section 4.5 below) and for gaining informal feedback from journalists as to the utility of the various component language technologies for information access. Cub Reporter comprises an off-line corpus processing subsystem and an on-line information access subsystem (see Figure 4.1). The off-line subsystem produces a text index for document retrieval, generic summaries at fixed length for each story, generic multi-document summaries for sets of known related stories, and logical forms for database population. The database is an entity-event-relation relational repository which stores the information resulting from a process of semantic interpretation of each story. The database contains tables to record references to entities (such as people and organisations), events, locations, and temporal information. Relations are a set of fixed grammatical relations including logical subject and object, apposition, qualification, etc. A table of attributes stores the different values that qualify entities and events such as adjectives, adverbials, and quantifiers. The
system architecture is shown in Fig. 4.1 (System Components): an off-line process in which structure analysis of the Press Association raw data feeds the PA archive, and GATE tools, an event extractor, indexer, summariser and profiler – supported by an ontology and WordNet – populate the database and text index; and a set of on-line components – text search, question answering, on-line summarisation, similar event search and profile search, driven by analysis of the user's input text and information need – whose results are returned through the user interface and results page. The
on-line system provides question answering, keyword search, similar-event, and further ad hoc summarisation capabilities. These two subsystems are discussed in more detail in the following sections. 4.4.1 Off-line Processing The whole archive is processed with tools adapted from the GATE Java library Cunningham et al. (2002). We perform tokenisation, sentence boundary identification, part-of-speech tagging, morphological analysis, and named entity recognition, keeping the results of the analysis for use by various language processing components. Two text indices are produced for the processed documents using Lucene1 , a Java-based open source tool for indexing and searching. One index is of the full text of each story; the other is for each paragraph. The metadata fields for each story are included in both of these indices. Search can be performed against the full text of the stories or restricted to any of the metadata fields alone or in combination, using boolean operators. Further linguistic processing of the archive is carried out with SUPPLE (Gaizauskas et al., 2005), a freely-available parser, integrated in GATE, and with an in-house discourse interpreter. SUPPLE uses a feature-based context-free grammar in order to produce syntactic representations and simplified quasi-logical forms. The grammar in use consists of a sequence of subgrammars for: Noun Phrases (NP), Verb Phrases (VP), Prepositional Phrases (PP), relative clauses (R) and sentences (S). The semantic rules produce unary predicates for entities and events and binary predicates for attributes and relations. Predicate names are: (i) the citation forms obtained during lemmatisation; (ii) forms used to code syntactic information (e.g. lsubj for the logical subject of a given verb); (iii) specific predicates are used to encode, for example, named entity information (e.g. name for the name of a person). The document semantics is further analysed by a discourse interpreter which maps entities into a discourse model and performs coreference resolution based on an ontology we are adapting for the purpose of this project. In the course of processing the system also attempts to determine the appropriate WordNet sense of each noun and verb in the archive. To do this we rely on: (i) the availability of centroids of topic signatures for each word sense (Agirre and Lopez de Lacalle, 2003), and (ii) a similarity metric such as the Cosine metric or the Jaccard co-efficient (Salton, 1988) which, given a word form and its word context, computes the semantic proximity of the word to each topic signature and decides the closest word sense for the word instance. The results of the semantic discourse analysis are transformed into records that are used to populate the database in order to support similar event/entity search (see Section 4.4.2.4). Summaries at fixed compression rate and ranked sentences (for on-line summary access) are computed for each story in the archive using an in-house single document summariser Saggion (2002). Sentences are ranked based on a sentence-summary worthiness score obtained by combining scores for various features including
1 http://jakarta.apache.org/lucene
sentence position, similarity of the sentence to the document headline, term distribution, named entity distribution, etc. Individual scores are combined using weights experimentally obtained from a training corpus. At present we carry out single document summarisation using three features which are computed for each sentence: (1) sentence position: sentences close to the beginning of the story are considered more relevant than those at the end of the story; (2) similarity of each sentence to the story headline: sentences containing words from the headline are considered more important than sentences which do not contain those words; and (3) similarity of each sentence to the keyword topics: sentences containing human-assigned story keywords are considered more important than those sentences not containing those keywords. We currently store in the database the three top ranked sentences as the document summary. These summaries are shown to the user in the user interface. Off-line multi-document summarisation is carried out on a set of story-related documents. The tool extends the single document summariser by implementing a centroid-based summarisation system (Saggion and Gaizauskas, 2004a) which computes the similarity of each sentence to a cluster centroid and combines this value with single document summarisation features. An n-gram similarity metric has been implemented to filter out redundant information, using a similarity threshold adjusted over training data. The weights used to combine the different features are trained over corpora (a simplified illustration of this kind of feature combination is sketched below).
4.4.2 On-line Processing
Access to the archive is through a user interface which is designed with the input text as its focus. The user enters a text which can be a sequence of keywords, a well-formed natural language question, or a short snap-like text such as the initial report of a breaking news story. The system first carries out full text analysis of the fragment, and depending on the result of the analysis, additional options are made available, including:
• access to full documents and summaries;
• answers and contexts to specific questions;
• profiles of persons, organisations, and locations;
• events similar to those described in the input.
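To make the description of the off-line summariser (Section 4.4.1) more concrete, the following sketch shows one plausible way of combining the three sentence features – position, headline overlap and keyword overlap – into a single score. The weights and the exact feature definitions are illustrative assumptions, not the experimentally trained values used in the project.

```python
def sentence_score(sentence, position, n_sentences, headline, keywords,
                   weights=(0.5, 0.3, 0.2)):
    """Combine position, headline-overlap and keyword-overlap features.

    The feature set mirrors the description in Section 4.4.1; the weights
    are made up for the example."""
    words = set(sentence.lower().split())
    position_score = 1.0 - position / max(n_sentences - 1, 1)   # earlier is better
    headline_words = set(headline.lower().split())
    headline_score = len(words & headline_words) / max(len(headline_words), 1)
    keyword_score = len(words & {k.lower() for k in keywords}) / max(len(keywords), 1)
    w_pos, w_head, w_key = weights
    return w_pos * position_score + w_head * headline_score + w_key * keyword_score

def summarise(sentences, headline, keywords, n=3):
    """Return the n top-ranked sentences as a crude extractive summary."""
    ranked = sorted(
        enumerate(sentences),
        key=lambda item: sentence_score(item[1], item[0], len(sentences), headline, keywords),
        reverse=True,
    )
    return [s for _, s in ranked[:n]]

story = ["A storm battered the coast overnight.",
         "Forecasters said wind speeds reached 80 mph.",
         "Residents were urged to stay indoors."]
print(summarise(story, "Storm batters coast", ["storm", "wind"], n=2))
```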
4.4.2.1 Access to Full Documents and Summaries In a pure document search situation – when the input text is a list of keywords – the journalist is presented with a results page containing access to full documents and to the previously computed story summaries, and installment multi-document summaries. The documents are ranked either by date or relevance – for the latter Lucene’s default scoring mechanism is used, the standard vector space model with cosine similarity measure between tf * idf weighted term vectors. During an interaction session with the system a user may save retrieved documents to a “Saved Item” store, analogous to the shopping basket functionality present in many internet shopping applications. At any point the user may request a multi-document summary
of these saved documents, which is then dynamically generated. Developed, but as yet unintegrated into the user interface, is further functionality to dynamically compute query-focused summaries, i.e. summaries that are tailored to the user’s input text in such a way that extracted sentences relate to the user’s assumed information need. Such summaries can be very effective when trying to identify the relevance of a document with respect to a query (Tombros et al., 1998). To produce these, the generic summariser is altered so that the scoring function used to rank sentences for inclusion in the summary combines generic sentence-summary worthiness features with a query-based feature which reflects similarity between the user query and the sentence. 4.4.2.2 Question Answering Question Answering (QA) functionalities are used to provide the journalist with short, text units that answer specific, well-formed natural language questions. Each answer is presented to the user along with the passages where the answer was found and with pointers to the full documents from which the passage was extracted. We use a two-layered QA architecture which consists of an information retrieval step coupled to a natural language analysis and answer extraction module. The essence of the approach is to pass the question unmodified to the information retrieval system which uses it as a query to do passage retrieval against the text collection. In this case we make use of the paragraph index created off-line. The top ranked paragraphs output from the IR system are then passed to a modified information extraction system. This system first carries out partial, robust syntactic and semantic analysis of these paragraphs and of the question (in which a specific “sought entity” is determined), transducing them both into a quasi-logical form (QLF) representation. Given these sentence level “semantic” representations of candidate answerbearing passages and of the question, a discourse interpretation step then creates a discourse model of each retrieved passage by running a coreference algorithm against the semantic representation of successive sentences in the passage, in order to unify them with the discourse model built for the passage so far. This results in multiple references to the same entity across the passage being merged into a single unified instance. Next, coreference is computed again between the QLF of the question and the discourse model of the passage, in order to unify common references. In these passage+question models, possible answer entities are identified and scored as follows. First each sentence in each passage is given a score based on counting matches of entity types (unary predicates) between the sentence QLF and the question QLF (similar to counting noun and verb overlap in word-overlap approaches). Next each entity from a passage not so matched with an entity in the question (and hence remaining a possible answer) gets a preliminary score according to (1) its semantic proximity (in Wordnet) to the type of the entity sought by the question and (2) whether or not it stands in a relation R to some other entity in the sentence in which it occurs which is itself matched with an entity in the question which stands in relation R to the sought entity (e.g. an entity in a candidate
answer passage which is the subject of a verb that matches a verb in the question whose subject is the sought entity will have its score boosted). An overall score is computed for each entity as a function of its preliminary score and the score of the sentence in which it occurs. Finally, the ranked entity list is post-processed to merge and boost the scores of multiple occurrences of the same answer found in multiple passages and the top scoring answer is then proposed as the answer to the question. Overall the intention is that the matching of candidate answer entities to the sought entity of the question be guided primarily by semantic type similarity (so “who” questions should have persons proposed as answers), then by lexeme overlap between question and answer-bearing sentence, and finally by sharing of grammatical relations where they can be identified. Redundancy of the answer across the candidate answer bearing passages is also taken into account.
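The following is a highly simplified sketch of the answer-scoring scheme just described. It abstracts away from the real QLF and SUPPLE machinery: question and passage analyses are given directly as sets of predicates and relation tuples, and a small look-up table stands in for the WordNet-based semantic proximity measure; all names and values in the example are invented.

```python
from collections import defaultdict

def score_candidates(question, passages, type_proximity):
    """Toy version of the answer-scoring scheme described above.

    Each passage sentence is a dict with a set of `types` (unary predicates)
    and a list of candidate `entities` as (name, semantic_type, relations)
    triples; `type_proximity` stands in for WordNet-based semantic proximity."""
    totals = defaultdict(float)
    for sentence in passages:
        # sentence score: overlap of unary predicates with the question QLF
        sentence_score = len(sentence["types"] & question["types"])
        for name, sem_type, relations in sentence["entities"]:
            prelim = type_proximity.get((sem_type, question["sought_type"]), 0.0)
            # boost if the candidate shares a grammatical relation with the
            # sought entity in the question (e.g. object of a matched verb)
            if relations & question["relations"]:
                prelim += 0.5
            totals[name] += prelim * (1 + sentence_score)
    # redundancy across passages is rewarded because scores accumulate
    return max(totals, key=totals.get) if totals else None

proximity = {("person", "person"): 1.0, ("organisation", "person"): 0.2}
question = {"types": {"kidnap", "person"}, "sought_type": "person",
            "relations": {("lobj", "kidnap")}}
passages = [{"types": {"kidnap", "person", "baghdad"},
             "entities": [("Margaret Hassan", "person", {("lobj", "kidnap")}),
                          ("CARE International", "organisation", set())]}]
print(score_candidates(question, passages, proximity))
```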
4.4.2.3 Profiles The PA archive contains pre-compiled person profiles which we have automatically identified and stored in database tables for quick access. Usually these profiles will have a catch-line matching the pattern keyword person Profile where keyword is an uppercase keyword such as “POLITICS” or “SHOWBIZ,” and person is usually the person’s surname (e.g. “Blair” for “Tony Blair”). Given one such story, we carry out named entity recognition and coreference resolution and extract coreference chains for each named person in the story. If the person name given in the catch-line matches one of the names in a coreference chains, then all names in that chain are considered aliases for the person, and records are created in the database containing each alias, the catch-line keyword, and the document identifier. As an example story, HSA1816 from the PA archive has as catch-line “DEFENCE Harding Profile”. The analysis of this story gives a coreference chain with the following names: “Harding”, “Sir Peter Harding”, “Sir Peter”, “Sir Peter Robin Harding” which correspond with the alias “Harding” in the catch-line. Records are created for all names in the chain, in this way the names “Harding” or “Sir Peter” will provide access to the profile of “Sir Peter Harding”. For persons whose profiles are not in the PA archive and for other entity types, profiles are created automatically using question answering and summarisation technology. At this point the only entities to be profiled automatically are persons, though the approach adopted can easily be extended to include organisations, locations and other entity types as well. Persons to be profiled are identified by first running a named entity recognizer across the corpus, including alias recognition, and then applying a parametrisable filtering process to identify those persons whose names are mentioned more than a threshold number of times and/or across more than a threshold number of successive years. To generate the profile content we follow a method we proposed for answering definition questions in TREC QA 2004. This approach involves three steps: (i) web-based knowledge acquisition for the target entity – terms are gathered which help in the process of identification of definitional passages, (ii) document retrieval for each target, and (iii) “nugget”
identification and filtering from the returned set of documents, where a nugget is a piece of information deemed relevant for the target.
During the web-based knowledge acquisition phase (see Saggion and Gaizauskas (2004b)) relevant terms associated with each target are identified by extracting terms that co-occur with the target in definition-bearing sentences found on Web pages, Wikipedia pages, and BBC News pages. The result of the process is a list of terms for each target. Each term t has an associated weight which is the number of times t and the target co-occur in the sources examined. The process produces a mined list of terms that includes all morphological variants and nominalisations of terms co-occurring with the target. Each term is recorded with its associated frequency of occurrence.
The nugget identification step looks for evidence to classify a sentence as 'definition/profile bearing'. The following sources of evidence are used: presence of the target or target alias in the sentence; presence of relevant terms found in external sources; and presence of definition patterns in the sentence. Two types of patterns are used: manually created lexical patterns ("X who is", "X whose", etc.) and part-of-speech and named entity patterns.
Inducing POS definition patterns
Because lexical patterns are far too constrained to match naturally occurring sentences, we induce definition/profile patterns from data. The induced patterns are sequences of four elements (which must include the target entity), where each element in the pattern is a part-of-speech tag, a date, the target entity, or a title. Each pattern has a weight which represents its relative "importance". Patterns are induced and used separately depending on the type of target (person or other entity). The weighted definition patterns are induced in the following way (a short sketch of the sequence-collection step follows the list):
• For each target and nugget from a human created corpus of targets and nuggets, a collection of sentences from a text collection is constructed. The target and the text nugget are used as a query submitted to an information retrieval engine and the top ten passages are retrieved (ranks one to ten).
• Each sentence retrieved is automatically marked with the target (TARGET), POS information (NNP, VB, etc.), dates, and titles.
• A coreference algorithm, provided in Gate, is run to identify references to the target and those coreferring expressions are marked as TARGET as well.
• Sequences of four elements are collected; the score associated with the sequence is 1/sentence_rank, so patterns found in the most 'relevant' sentences get score 1 and those found in the least relevant sentences get score 0.1.
• Scores for each pattern are summed over all instances.
• Patterns are translated into JAPE grammar rules for use in a Gate pattern recogniser.
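The sequence-collection and weighting step lends itself to a compact illustration. The following is a rough sketch only, under the assumption that sentences have already been rewritten as marker sequences; the marker names are invented for the example, and the real system emits JAPE rules for GATE rather than Python tuples.

```python
from collections import defaultdict

def induce_patterns(marked_sentences):
    """marked_sentences: list of (rank, tokens) pairs, where rank is the
    IR rank (1-10) of the passage the sentence came from and tokens is the
    sentence rewritten as markers such as 'TARGET', 'DATE', 'TITLE' or POS
    tags ('NNP', 'VBD', ...).  Returns 4-element patterns containing the
    target, weighted by summed 1/rank scores."""
    weights = defaultdict(float)
    for rank, tokens in marked_sentences:
        for i in range(len(tokens) - 3):
            window = tuple(tokens[i:i + 4])
            if "TARGET" in window:               # a pattern must include the target
                weights[window] += 1.0 / rank    # rank 1 -> 1.0, rank 10 -> 0.1
    return weights

# toy example (hypothetical markers, not real system output)
sents = [(1, ["TARGET", "VBD", "DATE", "IN", "NNP"]),
         (10, ["NNP", "TARGET", "TITLE", "VBD"])]
for pattern, w in sorted(induce_patterns(sents).items(), key=lambda kv: -kv[1]):
    print(pattern, round(w, 2))
```

In the actual system these weighted sequences are compiled into JAPE grammar rules so that the GATE pattern recogniser can apply them to new text.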
Nugget identification and filtering
Documents are retrieved from the archive using the target as a query. Each of the documents retrieved from the collection is analysed and the following scores are computed for each sentence in the returned documents:
• main entity score: is 1 if the sentence contains the target (i.e., "Franz Kafka"), 0.5 if the sentence contains a target alias (i.e., "Kafka"), and 0 otherwise;
• related terms score: is the sum of the frequency of the related terms occurring in the sentence;
• definition pattern score: is 1 if the sentence matches a definition pattern and 0 otherwise;
• POS pattern score: is the sum of the scores of the POS-patterns matching the sentence.
A small sketch of this scoring is given below.
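To make the combination of these four scores concrete, here is a small illustrative sketch. It is not the Cub Reporter code: plain substring matching stands in for the real term and pattern matchers, and the POS-pattern weights are assumed to come from the induction step described earlier.

```python
def nugget_scores(sentence, target, aliases, related_terms,
                  lexical_patterns, matched_pos_pattern_weights):
    """Return the per-sentence score tuple used to rank candidate nuggets:
    (main entity, related terms, lexical definition pattern, POS patterns)."""
    if target in sentence:
        main_entity = 1.0
    elif any(alias in sentence for alias in aliases):
        main_entity = 0.5
    else:
        main_entity = 0.0
    related = sum(freq for term, freq in related_terms.items() if term in sentence)
    definition = 1.0 if any(p in sentence for p in lexical_patterns) else 0.0
    pos = sum(matched_pos_pattern_weights)   # weights of induced patterns that fired
    return (main_entity, related, definition, pos)

# toy run with invented data
target, aliases = "Franz Kafka", {"Kafka"}
related_terms = {"writer": 12, "Prague": 7}
lexical_patterns = ["Kafka, who", "Kafka was"]
sentences = ["Franz Kafka was a writer who lived in Prague.",
             "The conference takes place in Prague."]
ranked = sorted(sentences,
                key=lambda s: nugget_scores(s, target, aliases, related_terms,
                                            lexical_patterns, []),
                reverse=True)
```

The tuple comparison in the final sort mirrors the ordering criteria described in the next paragraph; sentences would then be emitted in this order, subject to the length cap and the redundancy filter.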
Sentences are sorted in descending order by (in order) the main entity score, the related terms score, the lexical definition pattern score, and the POS-pattern score. So, no sentence under consideration is in principle rejected. However, this sorting is expected to rank relevant sentences higher than irrelevant ones. Sentences are output in rank order until a maximum number of characters is reached. Sentences are not included in the profile if they are regarded as too similar to a previously included sentence. Two sentences are considered too similar if they have 50% or more of their tokens in common. The profile is stored in the database for use during on-line access. This approach to profile creation was evaluated during TREC/QA 2004 and was ranked fourth among 28 participants (Voorhees, 2005).
4.4.2.4 Similar Event Search
Our research into background news writing has shown that users are likely to be interested in past events similar to the new event that is the focus of the breaking news story. One strategy for extracting similar events is to use the IR component with the snap as a query in the hope that stories describing similar events will be returned at high ranks. However, this may be problematic as by definition the breaking story, to be "news", must be new and hence different in significant respects from previous events. We propose a novel approach based on searching the database of extracted semantic representations of texts. Given a snap-like input text, a structured representation of the input is produced which includes a list of event-like representations which is then used to query the database. Consider the text fragment shown in Figure 4.2 and the output produced by the SUPPLE parser. From this output, records in the database are created for each noun (e2, e4, and e3) and verb (e1) and their relations: logical subject <e1,e2>, logical object, qualifications, apposition, and preposition (<e2,e3> and <e3,e4>)
Input paragraph: The head of Australia's biggest bank resigned today after a multi-million dollar foreign currency trading scandal rocked the institution and sent its shares plummeting.
SUPPLE's output: head(e2), name(e4,'Australia'), country(e4), of(e3,e4), bank(e3), adj(e3,biggest), of(e2,e3), resign(e1), lsubj(e1,e2)
Fig. 4.2. PA story paragraph and its SUPPLE interpretation
Objects
Document           Entity   Word        WordSense   Type      Start   End
20040202_HSA6142   e2       head        head#4      object    0       8
20040202_HSA6142   e3       bank        bank#1      object    12      36
20040202_HSA6142   e4       Australia   _           country   12      21

Events
Document           Entity   Word        WordSense   Type      Start   End
20040202_HSA6142   e1       resign      resign#2    event     37      51

Fig. 4.3. Entries for nouns and verbs in the database
(see Figure 4.3). The records contain: the document identifier, an entity identifier produced by the parser, the noun or verb root, the WordNet sense of that root (computed as described above in Section 4.4.1), the entity type (either object, event or named entity type), and the start and end offsets of the entity in the paragraph. Given this representation, similar events or objects can be selected by means of word senses and lexical relations using WordNet. For example, for an event such as "head of bank resigns", similar resignation events can be found not only by selecting from the database those records where the form "head" has been interpreted as the WordNet word-sense number 4 of "head", but also by selecting synonyms (from "head" to "chief"), or hyponyms (from "head" to "executive"), or hypernyms (from "head" to "leader"). The output of this process is a list of documents from which each matching event was derived. We are currently investigating methods for presenting the results to the user, e.g., ranking and clustering.
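The WordNet-based relaxation of a match can be sketched with NLTK's WordNet interface. This is an assumption-laden illustration: the real system queries the records shown in Figure 4.3 in SQL, and the alignment between stored sense numbers such as head#4 and NLTK's sense ordering is assumed here and may differ across WordNet versions.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus to be installed

def expansion_terms(lemma, pos, sense_number):
    """For a stored word sense (e.g. 'head', noun, sense 4), return the lemmas
    of its synset plus those of its direct hypernyms and hyponyms, so that a
    query on 'head' can also match words such as 'chief' or 'leader'."""
    synset = wn.synsets(lemma, pos=pos)[sense_number - 1]
    related = [synset] + synset.hypernyms() + synset.hyponyms()
    return sorted({lem.name().replace("_", " ")
                   for s in related for lem in s.lemmas()})

# e.g. expansion_terms("head", wn.NOUN, 4) - the exact terms returned depend on
# the WordNet version, but should include synonyms along the lines of "chief"
```

Expanding to hypernyms broadens recall (any "leader" resigning could match) at the cost of precision, which is one reason the ranking and clustering of the returned documents mentioned above matters.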
4.4.3 Prototype Implementation
The off-line processing results in a Lucene inverted index, summaries and structured semantic representations of each story in the archive. The summaries and semantic representations are held in relational tables in a mySQL database. The user interacts with the system through a web client which communicates with a web server (Tomcat). Both the text index and the relational database are accessed by the server as needed during on-line processing and web content is dynamically created for return to the client. A user database records details of users for security purposes and to allow search histories to be recorded and revisited in subsequent sessions. Session management in the server allows multiple concurrent access to the archive.
4.5 Evaluation
As noted in Section 4.1, intrinsic evaluations of information access technologies – evaluations which determine to what extent a component is meeting its design objective – play an important role in developing component technologies, even
while they may not give insight into whether a component or system built from evaluated components is successful in supporting a user in a real task setting. To that end various Cub Reporter components have been evaluated intrinsically. For example:
• the generic multi-document text summariser performed very well in DUC 2004 – it was the second best system in task 2;
• the profile-based multi-document text summariser performed reasonably well in task 5 of DUC 2004, coming among the top nine participants. We have recently implemented a new method for extracting biographical information from text and obtained improved performance (Saggion and Gaizauskas, 2005);
• the QA system has participated in TREC/QA and in particular the definitional component placed fourth in 2004;
• in spite of the fact that our parser has never been formally evaluated, it has contributed to many successful information extraction projects in the past. We are currently assessing two approaches to evaluation: one is the evaluation of the logical forms produced by the parser using a resource such as the SUSANNE corpus (Sampson, 1995); the other is to develop test suites for testing a range of grammatical phenomena and to support regression testing during grammar development.
Extrinsic evaluation, which attempts to assess whether an information access system is effective in supporting a user in a real task setting, is much more difficult to carry out. First, it requires identifying a task dependent on the outputs of the information access system for which a quantitative measure of success, or at least a judgement process that allows for rank orderings of task outcomes, can be agreed. Secondly, such evaluations require working with users in settings as close as possible to real settings – with all the attendant experimental design issues of abstracting away from individual differences and the logistical issues of organising human trials.
Our goal in building the Cub Reporter system was to gain insight into whether and how NLP techniques could help create information access systems that would better support journalists at the task of background information gathering for breaking news stories than conventional IR techniques. To design a controlled experiment to evaluate this is challenging. Since a key output from newswire journalists doing background information gathering is "the backgrounder" (Section 4.3.1), one way to proceed is to see if one can determine what constitutes a better backgrounder. In order to address this issue we are following two complementary directions. First, we are investigating whether independent expert journalist assessors can consistently rank and categorise backgrounders according to the assessor's informed, but unarticulated, professional judgement of what constitutes a good backgrounder. That is, we are trying to determine whether there is intersubjective agreement about which backgrounders are better than others – if such agreement does not exist then trying to use backgrounders in an extrinsic evaluation would be futile. Secondly, we are working to develop a descriptive theory of background in terms of observed semantic relations holding between content units in
backgrounders in relation to the breaking news event. Our hypothesis is that we will be able to define quality criteria for backgrounders in terms of this theory that will enable quality judgements to be made about backgrounders that correlate with (or enable us to predict) the quality judgements made by human assessors. Such a theory would serve two functions. First, it would give deeper insight into content selection and ordering in backgrounders, insights which could be exploited by a suitably powerful information access system to better support the journalist user. Secondly, judgements predicted by the theory could be used to validate or substitute for judgements made by high cost human assessors in evaluation exercises.
A pilot study has been conducted to investigate the first issue. The data for this study consisted of a collection of journalism student background writing assignments that were both ranked and categorised in terms of a five point scale for quality by three independent journalist evaluators. Preliminary results reported elsewhere (Barker and Gaizauskas, 2005) indicate reasonably high agreement among evaluators. Given the positive results of this experiment, we plan to construct a broader and more controlled corpus which will include different types of background written by professional journalists.
In order to develop a theory of background, a set of relations is needed which indicates not only the relation between background and breaking news event, but also the relations between the different content units of the background. We have investigated frameworks for annotating discourse relations such as those proposed by Wolf and Gibson (2004) or Marcu (2000), who have shown that it is possible to specify a set of discourse relations for text segments that are easy to code. However, the relations they propose do not appear entirely adequate to capture the specific character of the semantic relations holding between content units in background and foreground. In Section 4.3.1 we indicated six broad classes of types of related information typically found in backgrounders – it is these sorts of relations and refinements of them that need to feature in a descriptive theory of background. Thus we are elaborating a theory of discourse relations specifically for background that borrows from and builds on the Wolf and Gibson (2004) and Marcu (2000) frameworks. Given such a descriptive framework the next step will be to annotate a corpus of backgrounders. Work can then be carried out to correlate journalists' quality judgements of backgrounders with characteristics of backgrounders as expressed in terms of the descriptive theory.
Returning to the higher level goal of carrying out extrinsic evaluation of different information access technologies for the task of background writing, it is of course not necessary that a descriptive theory of background be constructed. All that is strictly necessary is that consistent, independent human judgements of quality be obtainable, something our pilot study provides initial support for believing is possible. Thus, while the implementation of the Cub Reporter technology platform is not yet sufficiently stable to be evaluated, once it is ready we will be able to carry out extrinsic evaluations with different journalists being given different technology configurations to carry out the same background writing task against the same text collection in a time-controlled setting. Quality judgements over the
resulting backgrounders together with user satisfaction data should lead to much deeper understanding of the strengths and weaknesses of NLP-based technologies for information access in relation to conventional IR-based techniques.
4.6 Conclusion
Supporting access to information in large digital text collections presents many challenges to language processing researchers, especially the effective support of complex, iterative exploration of the text collections as part of a task such as writing background, where material needs to be gathered and synthesised, and where the writer will be reformulating and refining his information request as he learns more about the topic. These challenges have been extremely productive in stimulating developments in language processing technology, particularly within the NLP research community. To these researchers the demanding challenges of complex information seeking tasks offer an opportunity to vindicate the use of techniques which bring to bear more knowledge of aspects of the human language system than those used in conventional IR-based searching. At the same time developments in language processing capability allow both conceptual insights into how to characterise information seeking behaviour (e.g. to what extent can we view background writing as multi-document summarisation?) and experimental observation of information seeking behaviour using new technologies. Such observation can in turn reveal weaknesses in the technology and yield further insights into the nature of information seeking in text and the sort of language processing required to support it. Hence NLP and the study of information access in text are engaged in a stimulating dialogue to their mutual benefit.
Work on the Cub Reporter project reported above supports this view. By focussing on a real, complex information seeking task – the background writing task – and analysing both the informational structure of the outputs and the process of constructing the outputs using various technologies, we can understand better both what sort of information needs to be accessed for this task and the limitations of, and requirements on, technologies used to support the task. Our work has contributed by (1) providing an in-depth examination of the background gathering task and an emerging analytical framework for characterising the content of background and its relation to foreground; (2) advances in specific NLP technologies that, prima facie, ought to be of use in building more advanced information access systems to support the background writing task; (3) a technology platform that incorporates a variety of information access tools, including a conventional ranked retrieval search engine, a question answering system, single and multi-document summarisation tools (including a tool which automatically builds profiles of various key entities in texts), and a similar event searching tool which supports ontology-guided search for events similar to a specified event; (4) a methodology for extrinsic evaluation of different information access technologies in support of the background writing task. The dialogue continues.
Acknowledgements
The authors would like to acknowledge the support of the UK Engineering and Physical Sciences Research Council, research grant R91465 which has made this work possible. Thanks also to Haotian ("James") Sun for his help in programming the prototype and to Jonathan Foster for his help in refining our understanding of the nature of background seeking in journalism.
References
E. Agirre and O. Lopez de Lacalle. 2003. Clustering WordNet Word Senses. In Proceedings of RANLP 2003, p. 121–130.
E.J. Barker and R. Gaizauskas. 2005. Evaluating Cub Reporter: proposals for extrinsic evaluation of journalists using language technologies to access a news archive in background research. In Proceedings of the COLIS 2005 Workshop on Evaluating User Studies in Information Access. To appear.
H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan. 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.
R. Gaizauskas, M. Hepple, H. Saggion and M. Greenwood. 2005. SUPPLE: A Practical Parser for Natural Language Engineering Applications. In International Workshop on Parsing Technologies.
D. Jurafsky and J.H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, Upper Saddle River, NJ.
H.P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research & Development, 2(2):159–165. Reprinted in Mani and Maybury (1999).
I. Mani and M.T. Maybury (eds.). 1999. Advances in Automatic Text Summarization. The MIT Press.
D. Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press, Cambridge, MA.
D. Milward and J. Thomas. 2000. From information retrieval to information extraction. In Proceedings of the ACL Workshop on Recent Advances in Natural Language Processing and Information Retrieval. Available at: http://www.cam.sri.com/html/highlight.html.
P. Over and J. Yen. 2004. Introduction to DUC-2004: An intrinsic evaluation of generic news text summarization systems. In Proceedings of the HLT/NAACL 2004 Document Understanding Workshop (DUC-2004). Available at: http://wwwnlpir.nist.gov/projects/duc/pubs/2004slides/duc2004.intro.pdf.
N. Sager. 1981. Natural Language Information Processing. Addison-Wesley, Reading, MA.
H. Saggion. 2002. Shallow-based Robust Summarization. In Automatic Summarization: Solutions and Perspectives, ATALA, December 14.
H. Saggion and R. Gaizauskas. 2004a. Multi-document summarization by cluster/profile relevance and redundancy removal. In Proceedings of the Document Understanding Conference, Boston, MA, May 6–7. NIST.
H. Saggion and R. Gaizauskas. 2004b. Mining on-line sources for definition knowledge. In Proceedings of FLAIRS 2004, Florida, USA. AAAI.
H. Saggion and R. Gaizauskas. 2005. Experiments on Statistical and Pattern-based Biographical Summarization. In Proceedings of the 12th Portuguese Conference on Artificial Intelligence – TeMA Workshop. Accepted.
G. Salton. 1988. Automatic Text Processing. Addison-Wesley Publishing Company.
G. Sampson. 1995. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press, Oxford.
R.F. Simmons. 1965. Answering English questions by computer: A survey. Communications of the ACM, 8(1):53–70.
K. Sparck Jones. 1981. Retrieval system tests: 1958–1978. In K. Sparck Jones (ed.), Information Retrieval Experiment, pages 213–255. Butterworths, London. URL http://www.nist.gov/itl/div894/984.02/projects/irlib.
K. Sparck Jones and J.R. Galliers. 1996. Evaluating Natural Language Processing Systems. Springer, Berlin.
K. Sparck Jones and P. Willett. 1997. Chapter 1: Overall introduction. In K. Sparck Jones and P. Willett (eds.), Readings in Information Retrieval, p. 1–7. Morgan Kaufmann, San Francisco, CA.
A. Tombros, M. Sanderson and P. Gray. 1998. Advantages of Query Biased Summaries in Information Retrieval. In Intelligent Text Summarization. Papers from the 1998 AAAI Spring Symposium, Technical Report SS-98-06, p. 34–43, Stanford, CA, USA, March 23–25. The AAAI Press.
E. Voorhees. 2004. Overview of TREC 2003. In Proceedings of the Twelfth Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255. Available at: http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf.
E. Voorhees. 2005. Overview of the TREC 2004 question answering track. In Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004), NIST Special Publication 500-261. Available at: http://trec.nist.gov/pubs/trec13/papers/QA.OVERVIEW.pdf.
Y. Wilks. 1964. Text searching with templates. Technical Report Memo ML.156, Cambridge Language Research Unit.
F. Wolf and E. Gibson. 2004. A response to Marcu (2003). Discourse structure: trees or graphs? Available at: http://web.mit.edu/fwolf/www/discourse-annotation/Wolf_Gibsoncoherence-representation.pdf.
5 Three Steps in Wilks Work: From Theory to Resources to Practice
Gregory Grefenstette
CEA LIST, Fontenay aux Roses, France
Abstract:
Some researchers are brilliant, able to couch their work in a theory, tracing a line through a long tradition of thinkers, culminating with the advances that they themselves make to the body of knowledge that we call science. Some others are head-down, hard workers, butting up against problems and wearing them down over time, hacking out a path through the unknown, a path that others may soon tread over. A very few rare scientists possess both qualities, an enormous capacity for work propelling an unflagging forward movement, and a wide scientific and philosophic culture that they can use to situate what they are doing and to explain why they are doing it. Yorick Wilks is such a scientist, whose body of work demonstrates both aspects of hard-working obstinacy and cultured brilliance.
5.1 Introduction
Though Wilks's publications cover a wide variety of topics in computational linguistics (CL), his constantly recurring theme is one of finding meaning through computational methods. The positive qualities mentioned in the abstract above can be appreciated in any of his publications but they will be illustrated here by examining and comparing three of these: An Intelligent Analyzer and Understander of English (Wilks, 1975) [hereafter Intelligent Analyzer], Machine Tractable Dictionaries as Tools and Resources for NL Processing (Wilks et al., 1990) [hereafter Tractable Dictionaries], and Information Extraction: Beyond Document Retrieval (Gaizauskas & Wilks, 1998) [hereafter Information Extraction]. I picked these documents (though almost any three others might have served my purpose) because the first shows the flight of brilliance, the second shows a reasoned response to difficulties faced, and the third shows the path that can be opened onto real progress through the combination of unrelenting engineering efforts and serious science.
5.2 Intelligent Analyzer
When I was starting my university education in the mid 1970s it was a heady time for Artificial Intelligence (AI), in general, and Natural Language Processing, then part of AI, in particular. Terry Winograd's (1972) description of his SHRDLU
program for manipulating a simple blocks world via natural language processing had just been published. With this system, you could give natural commands to a computer for moving blocks around in a virtual world, using a vocabulary of about fifty words (nouns like "block" and "cone", verbs like "pick up" and "move to", and adjectives like "big" and "green"). The program maintained an internal model of the scene and of the history of actions performed. In addition to using planning to accomplish the actions requested, the system could resolve pronoun use by both unifying properties of objects in the internal model and recent history and by using focus, leading to dialogs such as:
HUMAN: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
SHRDLU: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
SHRDLU: OK.
In this example, the program describes its preference for associating the pronoun "it" with the object of the command "find" rather than the block the computer was holding. In addition to displaying this understanding of its known world, SHRDLU could also learn the meaning of a new word. For example, if the human user used the unknown word "steeple", the computer would excuse itself for its ignorance and the human could define the new word in terms of known words:
SHRDLU: SORRY, I DON'T KNOW THE WORD "STEEPLE"
HUMAN: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
While this blocks world was limited in its direct applicative interest, its existence gave hope that more complicated tasks could be modeled, understood by the computer, and naturally handled. There was also, at this same time, the development of reasoning systems dealing directly with more portentous subjects. The first expert systems, such as DENDRAL (Lindsay et al., 1993) for organic chemistry or MYCIN (Shortliffe, 1976) concerning diagnosis of infectious diseases, appeared. Though the dialog in these systems was controlled by the expert systems themselves, with the system asking questions and the user usually responding with short answers, the user could ask the computer to explain its internal reasoning. For example, the following dialog could be produced by MYCIN:
MYCIN: HAS THE PATIENT RECENTLY HAD OBJECTIVE EVIDENCE OF ABNORMAL NEUROLOGIC SIGNS (MUSCLE RIGIDITY, COMA, SEIZURES, ETC.) DOCUMENTED BY PHYSICIAN OBSERVATION OR EXAMINATION? PLEASE ENTER CSF FINDING. IS MENINGITIS A HOSPITAL-ACQUIRED INFECTION?
HUMAN: WHY?
MYCIN: THIS WILL AID IN DETERMINING THE ORGANISMS (OTHER THAN THOSE SEEN ON CULTURES OR SMEARS) WHICH
MIGHT BE CAUSING THE INFECTION. IT HAS ALREADY BEEN ESTABLISHED THAT (2.1) THE INFECTION WHICH REQUIRES THERAPY IS MENINGITIS, AND (2.2) ORGANISMS WERE NOT SEEN ON THE STAIN OF THE PENDING CSF CULTURE, AND (2.3) THE TYPE OF INFECTION IS BACTERIAL, AND (2.4) NEUROSURGERY IS NOT ONE OF THE RELEVANT ITEMS FROM THE HISTORY OF THIS PATIENT, AND (2.5) THE AGE OF THIS PATIENT IS GREATER THAN TEN DAYS, THEREFORE, IF (2.6) THE INFECTION WAS ACQUIRED WHILE THE PATIENT WAS HOSPITALIZED THEN THERE IS EVIDENCE THAT THE ORGANISMS (OTHER THAN THOSE SEEN ON CULTURES OR SMEARS) WHICH MIGHT BE CAUSING THE INFECTION ARE E. COLI (75), STAPHYLOCOCCUS-COAG-POS (3), PSEUDOMONAS-AERUGINOSA (5).
Though this dialog consisted mostly of canned text, it gave promise that more complex world knowledge could be integrated into open command dialog systems such as SHRDLU. At the same time, Roger Schank (1975) and his team of researchers were developing a more elaborate representation of the semantics of everyday language, one that combined Katz's and Fodor's theories with the more traditional linguistic traditions of case grammars (Fillmore, 1968). In Schank's representation, that he called Conceptual Dependencies (CD), verbs were described by semantic primitives such as ATRANS, PTRANS, MTRANS (for abstract, physical, and mental transfers), PROPEL, MOVE, GRASP, INGEST, EXPEL, SPEAK, MBUILD, etc. These primitives were augmented graphically by diverse arrows indicating case relations with different nouns that were objects, instrumentals, locatives, datives, or subjects of the verbs. In this notation, the sentence "John gave the book to Mary" was visually represented as in Figure 5.1.
Fig. 5.1. A sample Conceptual Dependency diagram encoding the act of transferring the possession of book from John to Mary: John gave the book to Mary
These diagrams, in addition to being visually appealing, seemed to provide a way of describing the underlying semantics of words in a more abstract way, using a small repertoire of semantic primitives and relations that could easily be implemented on a computer. Further extensions of this approach to include scripts, or the frames of Minsky (1975) that fill in the details of background knowledge once the context of a text (for example, being in a restaurant) had been determined
were also being developed at the time. From this flurry of work in the mid 1970s, it seemed like computers were ready to become more human, ready to interact with humans in a natural way, speaking and understanding language as we do. But even then, it was visible that these above-mentioned approaches were all static, and had the snagging drawback that the whole world had to be deterministic, since the knowledge structures used were fixed and ultimately based on a simple logic of whether a structure could be applied in a given context or not. It was into this scene, that one of the most elegant and promising approaches appeared in the form of the Intelligent Analyzer which presents Wilks’s approach to natural language understanding based on his theory of Preference Semantics. The idea in Preference Semantics was that competing semantic structures embodying the meaning of the text remain active while a text was being processed. At any time during this processing, one partially instantiated structure was “best” and could be used to produce the current interpretation of the text. Any new word or sentence fragment subsequently encountered might change the preference ranking of the competing structures and thus change the meaning of the text. In the absence of certainty, the simplest structures were preferred. The attractiveness of this approach was its ability to use “partial information” and to do the best with what it understood, without having performed a complete logical derivation of facts, or even complete parses of the sentences encountered. The mechanisms that Wilks used to implement his working version of Preference Semantics were elegant: each word encountered contributed its semantic formulas (one formula per word sense) to the system; these formulas were matched against skeletal sentence-like templates of possible agent-action-object sequences; and partially instantiated templates were matched against paraplates, patterns involving more than one template, that permitted resolution of anaphors. Every structure in the system consisted of formulas involving about one hundred semantic primitives. The semantic primitives used by Wilks in this implementation were drawn from a set initially developed at the Cambridge Language Research Unit (CLRU), see Sparck Jones (2000), that can be classed in five main categories: a) Entities: MAN (human being), STUFF (substances), PART (parts of things), FOLK (human groups), etc. b) Actions: FORCE (compels), CAUSE (causes to happen), PICK (choosing), BE (exists), FLOW (moving as liquids do), etc. c) Type indicators: KIND (being a quality), HOW (being a type of action), etc. d) Sorts: CONT (being a container), GOOD (being morally acceptable), THRU (being an aperture), etc. e) Cases: TO (direction), GOAL (goal or end), SUBJ (actor or agent), OBJE (patient of action), IN (containment), POSS (possessed by),etc. These primitives could also be grouped in classes such as ∗ ANI covering the class of animate primitives MAN, BEAST, and FOLK.
A word sense is a LISP-like formula of pairs of semantic primitives and other formulas nested to whatever level is needed to express the meaning. There is dependence at every level of the left half of a binary pair on the right half. As an example, consider the word "drink", which has the formula:
"drink" (action) ∼ ((∗ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((∗ANI (THRU PART)) TO) (BE CAUSE)))))
This formula is glossed by Wilks as follows: it is to be read as an action, preferably done by animate things (∗ANI SUBJ); to liquids, or to substances that flow ((FLOW STUFF) OBJE); causing the liquid to be in the animate thing (SELF IN); and via (TO indicating the direction case) a particular aperture of the animate thing, the mouth, of course.
Here in this explanation, we see another aspect of Preference Semantics. The appearance of a word in a text contributes all the subparts of the formula, but each part need not be satisfied in the interpretation of the sentence. In particular, the case primitives such as SUBJ (subject) and OBJE (object) only indicate the preferred agents or objects of actions, but the system does not fail to interpret a text segment in the absence of such entities.
Wilks then goes on to demonstrate how he implements this liberalization of Katz and Fodor's selection restrictions (Katz & Fodor, 1963) using templates that serve both syntactic and semantic purposes. A template is a permissible, or acceptable, sequence of agent, action and object; a skeletal sentence expressed as a sequence of semantic primitives. Since the five categories of semantic primitives correspond roughly to syntactic categories, finding an applicable template can resolve part-of-speech ambiguity as well as word senses. Wilks shows how the competing formulas for "father" as a verb (having CAUSE as its head in its formula) and as a noun (having MAN as its head) give rise to two possible interpretations of the following sentence (expressed in abbreviated form with only the head semantic elements from each word's sense formulas):
Small men sometimes father big sons
KIND MAN HOW MAN KIND MAN
KIND MAN HOW CAUSE KIND MAN
A template such as MAN CAUSE MAN, which would cover a wide variety of agent-action-object utterances, will select the second reading of "father" as a verb, without the explicit notion of verb appearing in the system. Wilks then continues, adding a few more mechanisms (e.g. paraplates which connect templates, sequences of expansions of templates, semantic density of matches) to show how this weak unification of preferences can choose the correct interpretations (i.e. translations) of not only nouns and verbs, but prepositions and pronouns. In the following example, the English words "drinks it" are correctly translated into "le boit" in French because "drink" prefers a liquid object
and so “wine” is chosen over “table” as the referent of “it”1 and the masculine pronoun is produced in translation rather than the feminine “la” that would have been produced for “table”: I PUT THE WINE ON THE TABLE AND JOHN DRINKS IT OUT OF A GLASS. HE OFTEN DRINKS OUT OF DESPAIR AND THROWS THE GLASSES OUT OF THE WINDOW. JE METS LE VIN SUR LA TABLE ET JEAN LE BOIT DANS UN VERRE. IL BOIT SOUVENT PAR DESESPOIR ET JETTE LES VERRES PAR LA FENETRE. With this presentation of Preference Semantics, it seemed that some of the rigidity and brittleness of semantic representations such as Schank’s and Winograd’s had been resolved in a principled and computable way. It seemed, then in the last part of the 1970s that language understanding was on its way to solution, needing only a little more engineering to promote these systems from small restricted domains to general understanding of text. But many of the original researchers on the field began to find the problem more difficult that they had anticipated. By 1986, Winograd (Winograd & Flores, 1986) had come to the conclusion that contextualized meanings of words was much richer than their lexical meaning, and that simple techniques such as those presented in SHRDLU could not be extended to true language understanding. Schank also later abandoned the definition of semantic reasoning in terms of a small set of computer implementable primitives. Many computational linguists abandoned the pursuit of unrestricted language understanding and fell back to restricted domains such as weather reports, recipes and aviation manuals (Kittredge, 1982)
5.3 Tractable Dictionaries
Wilks realized that the main problem of these first demonstration systems was their hand-coded lexicons, very rich but very small. In these lexicons every "entry is constructed with foreknowledge of its intended use and hence of the knowledge it should contain. Being designed with only a specific purpose in mind, the knowledge representation runs into problems when scaled up to cover additional linguistic phenomena." And rather than running away from the problem, Yorick Wilks decided to attack it. First, in "Making Preferences More Active" (Wilks, 1978), he proposes an extension of his originally implemented system by integrating Roget's thesaurus into the process, a process which has also been extended with a frame-like structure (called pseudo-texts). A pseudo-text would be a collection of formulas, involving
semantic primitives and other words described also as formulas, that describe some facts about an object or situation. For example, a pseudo-text for a CAR might include a sequence of templates that describe that a MAN USE #tube as an instrument (INST) in order for that MAN to INJECT a #liquid into an ENGINE that USEs that #liquid to CAUSE the MAN inside the CAR to MOVE. In this notation, a word like INJECT refers to the formula for the word "inject" as seen above in the original discussion of Preference Semantics, while #tube and #liquid refer to lines of words, or rows, from Roget's one thousand thesaurus categories. Each row of words can also be considered as a line of formulas (for each word in the Roget class), and since Roget's categories are also classed in a shallow hierarchy of ten general sections, this organization would provide a large hierarchical structure of related formulas that could be used by a computer for reasoning (supposing that every word in Roget's were already defined via a semantic formula). In addition, one could conceive of having general pseudo-texts that could apply to all the elements in a thesaurus row, and inheriting these structures down the thesaurus hierarchy.
Given this thesaurus-induced semantic structure, when we are now confronted with preference breaking between the formulas connected to words in a text and the possible templates and combinations of templates (paraplates) that they can be unified with, the system can exploit the pseudo-texts to perform analogical reasoning. It can find the shortest path between a partial template USE #liquid and one for DRINK #liquid to justify metaphoric uses of words such as in "my car drinks gasoline". Though this system was only described and not implemented, it showed how some existing lexical resources, such as Roget's thesaurus, might be integrated into the effort to overcome the "knowledge acquisition bottleneck".
In the early 1980s research flourished around processing machine readable dictionaries and transforming them into machine tractable dictionaries, utilizable for natural language processing (Amsler, 1980; Michiels, 1982). Could these dictionaries provide the knowledge needed for natural language understanding systems? Yorick Wilks led a team of researchers (at the Computing Research Laboratory in New Mexico State University) who describe in Tractable Dictionaries how they attacked this problem of extracting the semantic information latent in the 55,000 entries of the Longman Dictionary of Contemporary English (LDOCE), a dictionary designed for learners of English as a second language. The lexicographers creating this full-sized dictionary strove to use a controlled vocabulary of about 2,000 words and a simple and regular syntax in creating their dictionary definitions.
Three types of information were extracted from this dictionary. One was a set of related words from the controlled vocabulary obtained by comparing co-occurrence data of the words in definitions. The second type of information was a formalization of dictionary entries produced by parsing the entries. Each of the 2,000 or so controlled vocabulary words had a semantic formula hand-built as in the formula for Preference Semantics shown above. A chart parser built to analyze the regular structure of the LDOCE definitions attempted to produce the phrase structure for each of the 55,000 dictionary entries, from which the genus (usually one of the controlled vocabulary words) and differentia were extracted.
For example, from the dictionary entry: “ammeter:: an instrument for measuring electrical current”
the formula for the word "ammeter" would inherit the formula for "instrument" and some GOAL involving, ultimately, the formulas for the words "measure" and "current". The third type of extraction from the LDOCE involved starting from a set of primitives (called the Key Defining Vocabulary) that were found to have been used to define the original 2,000 or so controlled vocabulary items, and then gradually spiralling upward through the definitions of the other words to create defining cycles, where ultimately, one must suppose, all words would be defined in terms of the original Key Defining Vocabulary primitives. This research did not produce the large-scale lexical resource needed to break the knowledge acquisition bottleneck and bring Preference Semantics out of the demonstration level into real use. It did, however, convincingly test the limits of what can be extracted from dictionaries and proved that this extraction was not going to be sufficient to realize the dreams of natural language understanding, 1970s style.
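As a rough illustration of the genus-and-differentia extraction described above for the "ammeter" example, the snippet below is a deliberately naive sketch: the CRL work used a chart parser over LDOCE's controlled-vocabulary definitions, not a regular expression, and the pattern here only covers definitions of the simple "a(n) GENUS for/that/which ..." shape.

```python
import re

# crude pattern for the regular, controlled-vocabulary style of LDOCE
# definitions; the real system used a chart parser, not regexes
GENUS_RE = re.compile(r"^(?:an?|the)\s+((?:\w+\s+)*?)(\w+)\s+(?:for|that|which|used to)\b")

def extract_genus(definition):
    """Return (genus, differentia) for a definition such as
    'an instrument for measuring electrical current', or None."""
    m = GENUS_RE.match(definition)
    if not m:
        return None
    genus = m.group(2)                       # head noun, e.g. 'instrument'
    differentia = definition[m.end():].strip()
    return genus, differentia

print(extract_genus("an instrument for measuring electrical current"))
# ('instrument', 'measuring electrical current')
```

Even this toy version makes the limitation visible: anything outside the regular defining syntax needs the full parser, and the extracted differentia still has to be mapped onto semantic formulas by hand-built resources.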
5.4 Information Extraction
After a brilliant beginning laying out Preference Semantics and a serious but unsuccessful effort to make it work on a large scale, one would not have been surprised to see Yorick Wilks devote more of his time to the philosophical investigations into meaning, which he had always continued throughout his most intense "computer science" days (Wilks, 1971; Wilks, 1976; Wilks, 1977; Wilks, 1984; Wilks, 1985; Wilks, 1987; Wilks & Ballim, 1987). But in the mid 1990s we find that Wilks had not given up on finding meaning in text, and he began a new computational attack on the problem under the guise of Information Extraction.
Information Extraction covers the problem of identifying and isolating typed information from natural language texts such as newswire articles. It became a hot topic in Computer Science after DARPA, a US defence funding agency for advanced research projects, began financing a series of Message Understanding Conferences (MUC) in the late 1980s. In these conferences, research groups are presented with the same texts and a shared task of extracting certain categories of information from these texts. For example, in MUC-3 and MUC-4, the task was to fill templates concerning terrorist attacks, finding out where the attack was, what type of attack it was (e.g., bombing, military assault), how many people were injured, etc. The next MUC involved finding information about financial mergers.2 The MUC tasks gradually spread out, identifying wider and wider varieties of entities and also the relations between them. Wilks and his team at the University of Sheffield began participating in the MUC conferences starting with MUC-5 in 1993, gradually developing a system called the Large Scale Information Extraction system (LaSIE).
2 The last MUC was MUC-7, held in 1998, mutating into the Topic Detection and Tracking (TDT) tasks under the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program. See the site www.nist.gov/speech/tests/tdt
In LaSIE, text is parsed using a bottom-up chart parser. A set of manually constructed context-free rules is then used to recognize multi-word structures such as organisations, persons, locations, dates, and monetary amounts. For example, a rule like
ORGAN_NP –> ORGANIZ_NP LOCATION_NP BIZTAG
is used to recognize the organisational name Merrill Lynch Canada Inc. and a rule like
PERSON_NP –> FIRST_NAME NNP
recognises the personal name Donald Wright. Given a sentence such as the following:
Donald Wright, 46 years old, was named executive vice president at this brokerage firm
the following semantic formulas, in which the labels e21, e22, etc. stand for specific entities or events, are instantiated by the system:
person(e21), name(e21,'Donald Wright')
naming(e22), lobj1(e21), lobj2(e22,e23)
title(e23,'executive vice president')
firm(e24), det(e24,'this')
These expressions can be read as there being a specific person (with label e21) whose name is 'Donald Wright' (actually names are decomposed into given and family names by the original rules) which is the logical object of a naming event (e22). This event also involves a second object which is a title (e23) for a firm (e24). Other steps in the system would attach this firm (e24) to a company name and a location.
With an Information Extraction system such as LaSIE, that can handle large quantities of text, it seemed that Wilks fulfilled some of the original promise of the Intelligent Analyzer. Though the domain of MUC-5 was restricted to certain types of information concerning people and other named entities found in text about mergers (see Figure 5.2 for part of the financial merger world model used by the system), the processing structure is open and can allow for the extraction of a wide range of relations.
Fig. 5.2. Part of the existing world model from LaSIE for Information Extraction in the domain of financial mergers from MUC-5
Major differences between this Information Extraction approach to text understanding and the approach used in Intelligent Analyzer are the separation of lexical processing and syntactic parsing tasks from semantic analysis, and the use of atomic lexical concepts arranged in a hierarchy (as in Figure 5.2) rather than having each concept decomposed into a small number of semantic primitives. The primary interest of moving away from the earlier monolithic system, in which everything is intertwined, to a modular system, in which language processing tasks are clearly identified and separated, is that this subdivision eases the scaling up
process, makes the system more robust since errors can be isolated and more easily corrected, and also permits the system to be reconfigurable for different applications. And here, in this new realisation that language understanding can mean different things for different applications, and that natural language processing involves a modular collection of clearly identifiable tasks, we see a third contribution of Wilks to the field of machine-based language understanding.
Wilks and his team at Sheffield realized not only that the division of natural language processing into these steps possessed these advantages, but also that the wider scientific community needed an open platform that allowed these processing functions to be filled by independently operating units. If one person had already built a part-of-speech tagger that performed well, why should another group have to redesign and re-implement this function? In a magnanimous contribution to the scientific community, Wilks and his colleagues built such an open platform called GATE (General Architecture for Text Engineering) (Cunningham et al., 1995) and offered it to the research community at large.3 The GATE platform allows existing natural language engineering components to be integrated into a chain of processing. It also offers a graphical user interface that facilitates the composition of a sequence of components into a usable application. The platform has been downloaded by more than 500 groups and has thousands of users to date.
3 The GATE software is licensed under the GNU Library General Public Licence. It can be downloaded from http://gate.ac.uk. Hamish Cunningham now manages all aspects of GATE. In addition to English, it supports languages from Hindi to Chinese, and from Italian to German.
5.5 Conclusion
These three publications, Intelligent Analyzer, Tractable Dictionaries, and Information Extraction, demonstrate a scientific career of brilliance, perseverance, and openness and largesse. The initial demonstration of Preference Semantics in Intelligent Analyzer that language understanding does not necessarily imply a logical resolution of all premises in a world model, but can be performed with uncertain and contradictory knowledge, reappears in the robustness of approaches to semantic extraction and anaphora resolution in Information Extraction. The attempt at scaling up Preference Semantics by bootstrapping resource acquisition from machine readable dictionaries in Tractable Dictionaries showed the limits of what was extractable from a dictionary, and posed the problems of dictionary-oriented meanings that were further explored in the SENSEVAL programs4 from 1998 onwards. Lastly, Information Extraction provides the scientific community with the fruits of this research on meaning, by providing an open system that allows new systems to build on what has been learned. The work that Yorick Wilks performed on understanding meaning in the 1970s did not lead to a dead-end as did many of the works from that time. It motivated subsequent work on dictionary exploitation and word-sense disambiguation, and finally led, under Wilks, to the most widely used platform for natural language processing, GATE. This trajectory shows what a true scientist concerned with advancing a whole field of investigation is capable of realizing.
4 From www.senseval.org: "Senseval is the international organization devoted to the evaluation of Word Sense Disambiguation Systems. Its mission is to organise and run evaluation and related activities to test the strengths and weaknesses of WSD systems with respect to different words, different aspects of language, and different languages. Its underlying goal is to further our understanding of lexical semantics and polysemy. Senseval is run by a small committee under the auspices of ACL-SIGLEX (the Special Interest Group on the LEXicon of the Association for Computational Linguistics)."
References
Amsler, R.A. (1980) The Structure of the Merriam-Webster Pocket Dictionary. PhD Thesis, University of Texas.
Cunningham, H., R.J. Gaizauskas and Y. Wilks. (1995) A General Architecture for Text Engineering (GATE) – a new approach to Language Engineering R&D. Technical Report CS–95–21, Computer Science, University of Sheffield.
Fillmore, C. (1968) The case for case. In: E. Bach and R. Harms (eds.) Universals in Linguistic Theory. New York: Holt, Rinehart, and Winston.
Gaizauskas, R. and Y. Wilks. (1998) Information Extraction: Beyond Document Retrieval. Journal of Documentation, 54(1):70–105.
Katz, J.J. and J.A. Fodor. (1963) The Structure of Semantic Theory. Language, 39:170–210.
Kittredge, R. (1982) Variation and homogeneity of sublanguages. In: R. Kittredge and J. Lehrberger (eds.) Sublanguage: Studies of Language in Restricted Semantic Domains. Berlin: de Gruyter, pp. 107–137.
Lindsay, R., Buchanan, B., Feigenbaum, E. and Lederberg, J. (1993) DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation. Artificial Intelligence, 61(2):209–261.
Michiels, A. (1982) Exploiting a Large Dictionary Data Base. PhD Thesis, Université de Liège, Liège, Belgium.
Minsky, M. (1975) A framework for representing knowledge. In: P.H. Winston (ed.) The Psychology of Computer Vision. New York: McGraw-Hill, pp. 211–277.
Schank, R.C. (1975) Conceptual Information Processing. Amsterdam: North-Holland Publishing Company.
Shortliffe, E. (1976) MYCIN: Computer-based Medical Consultations. New York: American Elsevier.
Sparck Jones, K. (2000) R.H. Richens: translation in the NUDE. In: W.J. Hutchins (ed.) Early Years in Machine Translation. Amsterdam: John Benjamins, pp. 263–278.
Wilks, Y. (1971) Logic, Linguistics and Computational Linguistics. In the Proceedings of the International Conference on Computational Linguistics, Debrecen, Hungary.
Wilks, Y. (1975) An intelligent analyzer and understander of English. Communications of the ACM, 18(5):264–274.
Wilks, Y. (1976) Frames, Scripts, Stories, and Fantasies. In the Proceedings of the International Conference on the Psychology of Language, Stirling, 1976, and in Pragmatics Microfiche 1977. Reprinted in H. Stegentritt (ed.) Regensburg Romanistentag. Berlin: De Gruyter.
Wilks, Y. (1977) Knowledge Structures and Language Boundaries. In the Proceedings of the Fifth International Conference on Artificial Intelligence. MIT Press.
Wilks, Y. (1978) Making Preferences More Active. Artificial Intelligence, 11:197–223.
Wilks, Y. (1984) Is Frege's Principle Trivial or False? In the Proceedings of the Annual Conference of the Linguistics Association of G.B., Essex University.
Wilks, Y. (1985) Relevance, Points of View and Speech Acts: An Artificial Intelligence View. In the Proceedings of the Cognitive Science Conference, Paris.
Wilks, Y. (1987) On Keeping Logic in its Place. In the Proceedings of the Third International Workshop on Theoretical Issues in Natural Language Processing (Tinlap3), Las Cruces, New Mexico.
Wilks, Y. and Ballim, A. (1987) The Heuristic Ascription of Belief. In the Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'87), Milan. A fuller version in Review of Cognitive Science, Vol. I, N. Sharkey (ed.), London: Ablex, 1989.
Wilks, Y., Fass, D., Guo, C.M., McDonald, J.E., Plate, T. and Slator, B.M. (1990) Providing Machine Tractable Dictionary Tools. Machine Translation, 5(2):99–154.
Winograd, T. (1972) Understanding Natural Language. New York: Academic Press.
Winograd, T. and F. Flores. (1986) Understanding Computers and Cognition. Norwood, NJ, USA: Ablex Publishing Corporation.
6 Preference Syntagmatics
Patrick Hanks
Masaryk University, Brno
Abstract:
This paper compares Yorick Wilks's theory of preference semantics with the evidence of English usage in a large corpus and reports the rationale of a project (in progress) that attaches meanings not to lexical items but to the contextual patterns in which each lexical item is normally found. These contexts are based on analysis of a large corpus and stored in a Pattern Dictionary. In addition to other influences, this work is partly inspired by Wilks's theory of semantic preferences of the 1970s, but there are significant differences. If meanings are attached to words in context instead of in isolation, the formulas needed to express them can express delicate distinctions without being excessively cumbersome. The Pattern Dictionary provides a resource for reducing lexical ambiguity in texts while maintaining interpretative delicacy. The meaning of a word in an unseen document can be estimated by matching its context to one or other of the normal contexts in the Pattern Dictionary, which are themselves explicitly linked to a meaning, called a primary implicature. In the past dozen years corpus analysis has shown with increasing clarity that, although the number of all possible syntagmatic combinations in which each word can participate is vast, and indeed perhaps unlimited, the number of normal syntagmatic combinations is manageably small. Examples are given of verb entries from the Pattern Dictionary.
6.1 Semantic and Syntagmatic Preferences
In the 1970s Yorick Wilks wrote a series of papers (Wilks 1973, 1975a, 1975b, 1977, 1978, 1980) in which he proposed, in contrast to the theory of selectional restrictions of Katz and Fodor (1963), a system of selectional preferences. He argued that, depending on context, a particular interpretation of a word may be preferred, but other interpretations must be accepted even if they do not satisfy the preference conditions. Among his examples are the following:
1. The adder drank from the pool.
2. My car drinks gasoline.
The verb drink prefers an animate agent, which invites the interpretation of adder in 1 as a snake rather than a calculating machine. However, in 2 no animate interpretation of car is possible, so any reader (including an AI computation) must accept a reading in which an inanimate entity is doing the drinking.
Like Sherlock Holmes, the interpreter, having ruled out all alternatives, must accept the only remaining explanation, however implausible, as the correct one. The distinction between preferred contexts and possible contexts is of the greatest importance for processing meaning in texts.1 Sentence 2 is grammatically well-formed, but it is not very normal or regular: cars don't drink. On the other hand, 1 represents a regularity: adders are animate entities and one of the things that animate entities normally do is drink. The Pattern Dictionary is not interested in whether or not this is a true fact about the world; the point is, more importantly, that it is a syntagmatic statement about English – dependent in part on the frequent recurrence of a particular set of nouns (animate entities) as grammatical subjects of drink and in part on an ontology that gives a list of animate entities. Syntagmatic regularities such as this can be observed in a large corpus and recorded in a Pattern Dictionary (see Hanks and Pustejovsky 2005). If a distinction is made between normal and abnormal usage, then normal uses such as 1 can be interpreted by matching to normal patterns, while the interpretation of abnormal uses such as 2 can be left to second-order semantic computation.
Not only adders but also other animate agents drink. What they drink is a liquid, typically water. In fact, this syntagmatic association is so strong that (even without the additional evidence of "pool") it is easy to supply a highly specific default semantic type for the missing direct object in 1. What did the adder drink? Water, of course. How do we know this? Because in any large collection of general texts a statistically significant association between drink and water will show up, especially if the subject of the verb drink is a non-human animate agent.
NLP researchers look to large ontologies such as WordNet to answer questions such as, "What is the set of all animate agents?" and "What is the set of substances that animate agents drink?" In WordNet, the hyponyms under the synset containing the expression animate thing provide a reasonably exhaustive list.2
1 Unfortunately, in subsequent research on word meaning Wilks and his colleagues did not build on this distinction. Instead, they focused on the computational manipulation of pre-existing dictionaries. A characteristic of these dictionaries is that they say comparatively little about the syntagmatics of words, while saying much (perhaps too much) about word meaning. Wilks himself now acknowledges (Wilks and Ide 2006) that work using such dictionaries as resources has not solved the so-called "WSD" (Word Sense Disambiguation) problem. However, Wilks and Ide do not go on to draw the conclusion that a different kind of dictionary might be needed, nor do they consider the possibility that the "WSD problem" may be badly formulated, i.e. embody false assumptions, and hence be insoluble. Instead, rather surprisingly, they speculate that lexicographers do not have the right kind of expertise for word sense disambiguation and that any attempt at fine-grained delicacy should be abandoned. In this paper I present the rationale for a new kind of dictionary, a dictionary of syntagmatic preferences, which is explicitly designed to show fine-grained, delicate, empirically well-founded connections between meaning and use in a machine-tractable way.
2 The hyponym list also contains a surprisingly large number of unused terms such as eutherian, acrodont, and pleurodont. Presumably, these rare or nonexistent words in WordNet do no harm if no one – neither human nor program – ever looks them up.
So far so good, but if similar questions are asked about the direct object of the verb drink and about the attributes of gasoline, WordNet's answers are less satisfactory. Attributes of the sense of a word are sometimes enumerated in WordNet as if they were separate senses, while in other cases they are not mentioned at all. Gasoline, for example, is listed in WordNet as a fuel but not as a liquid. Water occurs in six WordNet synsets, one of which is glossed as "a clear colorless odorless tasteless liquid" and another as "a fluid necessary for the life of most animals and plants". The latter sense seems to be the relevant one for our question, but unfortunately it is found only as a hyponym of food. You cannot tell from WordNet that the "fluid necessary for the life of animals and plants" is a liquid. The synset that includes water as a hyponym of liquid is not linked to drinkable liquids.3 To solve problems such as this, a possible strategy might be to lump all the synsets for a given target word together and treat them as a single semantic entity. But it turns out that this tactic is even more unsatisfactory, since it removes necessary constraints: for example, water is listed in another synset as a synonym of urine. Lumping the synsets for water together could lead to the unfortunate conclusion that urine is food.
To some extent, these are problems of performance by WordNet, at least some of which could probably be corrected fairly easily. However, there are also underlying problems of principle in WordNet, of which the two main ones are (a) that it ignores syntagmatics and (b) that it presents attributes (e.g. the drinkableness of water) as separate senses. What is needed is an ontology that links attributes to entities and that shows normal syntagmatics, e.g. that what most animate entities normally drink is water, that mammals also typically drink milk, and that humans also drink manufactured beverages of various kinds. It needs to enable a purposive link between gasoline (British: petrol) and motor vehicle, as well as showing that gasoline is a liquid. An ontology such as this is a necessary component of the practical application of preference semantics to processing free text. Such an ontology does not yet exist, but exploratory work is in progress on creating one at Brandeis University (Pustejovsky et al. 2004).
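For illustration, the kind of hyponym query discussed above can be approximated with the NLTK interface to a current WordNet. This is a minimal sketch, not part of the chapter's own apparatus: the synset name organism.n.01 is an assumption standing in for the "animate thing" synset, and results will differ across WordNet versions.

# Sketch: enumerating the hyponyms of a WordNet synset with NLTK.
# Assumes NLTK and its WordNet corpus are installed; 'organism.n.01'
# is a stand-in for the "animate thing" synset discussed in the text.
from nltk.corpus import wordnet as wn

animate = wn.synset('organism.n.01')

# Transitive closure over the hyponym relation: a rough answer to
# "what is the set of all animate agents?"
hyponym_synsets = set(animate.closure(lambda s: s.hyponyms()))
animate_words = {lemma.name() for s in hyponym_synsets for lemma in s.lemmas()}

print(len(animate_words), sorted(animate_words)[:10])

As the chapter notes, such a list answers the subject question for drink reasonably well; the corresponding object question (the set of drinkable liquids) has no equally direct answer in WordNet.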
6.2 WSD Revisited: Reducing Lexical Entropy
Wilks and his colleagues have devoted considerable energies over the years to the problem of word sense disambiguation (WSD). The problem, in a nutshell, is to enable an NLP application to decide, as far as is necessary for successful processing, what an ambiguous word means in a given text. If a word has more than one sense, which sense is activated when the word is used in a particular text, and how can we decide which sense to assign it to, and how can we decide what counts as a sense anyway? The first two of these questions presuppose that each content word in a language has a finite set of distinct, mutually exclusive senses.
3 The term drinking water is so linked, but this is unsatisfactory. What the adder drank was not 'drinking water' but 'water'.
This supposition seems at first sight to be supported by standard dictionaries which list numbered senses, but in fact it is based either on a misreading of those dictionaries or on a misrepresentation of the language by those dictionaries, or both. The idealizations of competing meanings that are sufficient to trigger appropriate interpretations by human readers disguise the extent to which competing meanings of words overlap. As long ago as 1972 Wilks acknowledged this difficulty very clearly, using the example of stake:
It is very difficult to assign word occurrences to sense classes in any manner that is both general and determinate. In the sentences "I have a stake in this country" and "My stake in the last race was a pound", is "stake" being used in the same sense or not? If "stake" can be interpreted to mean something as vague as "Stake as any kind of investment in any enterprise", then the answer is yes. So, if a semantic dictionary contained only two senses for "stake": that vague sense together with "Stake as a post", then one would expect to assign the vague sense for both the sentences above. But if, on the other hand, the dictionary distinguished "Stake as an investment" from "Stake as an initial payment in a game or race", then the answer would be expected to be different.4 So, then, word sense disambiguation is relative to the dictionary of sense choices available and can have no absolute quality about it. – Wilks 1972
Thanks to corpus evidence, we can now see that the lexical ambiguity of the vast majority of content words is of this kind (overlapping polysemy, often in long chains of "family resemblances"). Mutually exclusive cases such as stake1 (= "post") vs. stake2 (= "investment"), or bank1 (= "financial institution") vs. bank2 (= "side of a river"), are rare. The mutually exclusive cases are only a small part of the problem, but according to Wilks and Ide (2006), these are the only cases that have been successfully disambiguated by computational methods. They go on to say:
There is rarely a need to make distinctions below the homograph-like level for understanding, human or automated.
This is a counsel of despair. Fine-grained distinctions are indeed fuzzy and they do indeed overlap, but any kind of understanding worthy of the name must face up to the problem, not turn away from it. Understanding, properly so-called, involves being able to compute, on a probabilistic, best-likelihood basis, the meaning or implicatures of words in context, on the basis of text input (see below for examples).
4 Splitters in the lexicographic game will point out differences in implicatures: for example, one stands to lose one's stake in a race or game (to the bookmaker or winner), whereas one's stake in one's country cannot be lost in the same way. – PWH
Elsewhere, Wilks and Ide comment:
Some linguists (e.g., Lakoff 1987; Heine 1987; Malt et al. 1999) have proposed that polysemy develops via a chain of novel extensions to previously known senses, each building on its predecessors. This idea, and computational methods for it surveyed and discussed in Wilks and Catizone (2002), follows nicely on from proposals for the generative lexicon proposed by Pustejovsky (1995) and others, but adds the notion that at some point, senses diverge enough to deserve independent representation in the lexicon (either computational or mental). The "chain-of-novel-extensions" model is plentifully supported by etymological evidence, painstakingly developed in the nineteenth and twentieth centuries by Indo-European philologists (see OED passim). However, despite its undoubted philological interest, this model does not tell us how to recognize the point at which senses "diverge enough to deserve independent representation". The problem, of course, is in identifying the point at which two senses become distinct enough to warrant separation for the purposes of NLP (or, for that matter, in dictionaries and the mental lexicon). – Wilks and Ide, op. cit.
Lexicographers have long recognized that it is unusual for the senses of a word to "become distinct enough to warrant separation" and that therefore any dictionary entry that implies such a separation is simultaneously both an oversimplification and an overelaboration, written for the convenience of human users, not for NLP. However, corpus lexicographers have also noticed something else, the implications of which are more important than the rarity of mutually exclusive sense distinctions. It is this: despite the undeniable existence of indeterminate examples, the majority of uses of each word, in particular verbs and adjectives, fall into a small number of highly distinctive patterns. The patterns consist of syntagmatic regularities. These syntagmatic regularities consist of argument structures or valencies, with alternations such as active/passive, causative/inchoative, object-drop, etc. Each argument of each predicator is populated in normal usage by a set of words that recur frequently and a more or less open-ended set of words which recur rarely or not at all. The recurring words normally share some common feature, either their semantic type or some other semantic property.
In other words, identifying senses that are mutually exclusive is not the heart of the word-meaning problem. The heart of the problem is identifying the influence of normal context on senses that have only partly diverged, and distinguishing normal contexts from abnormal contexts. With this in mind, the WSD problem can be reformulated. Instead of attempting to match uses of words in free text directly to a dictionary that states word meanings, we can match word uses to the normal contexts of those words. For this to be possible, we need two tools: a dictionary of syntagmatic preferences ("a pattern dictionary") and an empirically well-founded ontology that shows the semantic types and other properties of the words that populate the patterns.
Up to now, attempts to describe semantic values of verb arguments (for example PropBank and VerbNet; see Kingsbury and Palmer 2002) have been based on introspection (notably on so-called "Levin classes", Levin 1993).
These do not stand up well to scrutiny in the light of actual usage. More often than not, they are partial and/or inaccurate. (For examples, see Baker and Ruppenhofer 2002; Hanks and Pustejovsky 2005). Examples of fine-grained contrasting implicatures (preceded by "=") are the following.
2a. A person fires a gun (= the gun stays where it is)
2b. A person fires a bullet at something (= the bullet moves)
3a. He shook his fist at them (= his own fist)
3b. He shook his hand (= someone else's hand)
4a. A lawyer files a lawsuit (= activates a procedure)
4b. A clerk files some papers (= puts them away)
4c. A reporter files a news story (= submits it for publication)
In principle, NLP ought to be able to deal with such distinctions. Examples such as these have provoked fears of unmanageably large lists. Such fears are groundless. The Pattern Dictionary currently contains 15 patterns for the verb fire, 36 patterns for the verb shake, and 14 patterns for the verb file; these are sufficient to represent all normal implicatures of these verbs at the sort of fine-grained level indicated here and are no less manageable than the number of senses in machine-readable dictionaries. Moreover, since the patterns for each verb are (mostly) mutually exclusive, no matter how many patterns there are, no combinatorial explosion arises. Many verbs have only one or two patterns. It is true that light verbs such as give and take have over 100 patterns; this is because a) the patterns for phrasal verbs such as take off are subsumed under the main verb and b) light-verb combinations such as take notice, take [[Fact]] into account, take account of [[Fact]], and take the plunge are listed as separate patterns. This is not a problem provided that distinctive criteria for recognizing each pattern uniquely can be given.
Although word senses may overlap (i.e. they share components), patterns generally do not. Patterns are generally mutually exclusive. That is, the relationship between patterns and senses is normally many-to-one: a single sense of a word may be associated with several different patterns. The converse (one pattern mapping to two or more senses) does occur, but rarely. When it occurs (other than as a lexicographer's error), it represents a real ambiguity in the language. (See the discussion of "he drinks" below.)
A checklist of words and their normal contexts is a resource for a variety of applications, including inferring entailments, text summarization, idiomatic text generation, and machine translation. In such a checklist, meanings, implicatures, synonyms, translations, and any other desired features are attached to contexts or "patterns", rather than to words in isolation. Putting a word into a normal context greatly reduces (and often completely eliminates) its lexical-semantic entropy. In isolation, the noun orange, for example, could have any of several meanings. Put it together with the verb eat and the entropy is reduced if not eliminated. Put it together with the verb paint [plus a surface or physical object as direct object] and
a different meaning of orange is activated. Put the verb eat together with jealousy, as in 5, and a different meaning of eat is generated.
5. It is Lachlan, eaten up by jealousy, who is plotting against my heir.
The pattern underlying 5 is "[[Emotion=Bad]] eat [[Human]] (up)", where the completive-intensive particle up is optional. The semantic role of each argument contrasts with another, more common pattern, "[[Human Animal]] eat [[Food]]". So it is with most words. The number of stereotypical collocations for most if not all verbs is both small enough to be collected in a dictionary and widely used enough to reduce and very often eliminate uncertainty of meaning. We now know that human linguistic behaviour is much more highly patterned than was recognized before the statistical analysis of large and representative collections of text became possible. The pattern dictionary provides patterns of preferred contexts, to which actual contexts can be matched. Of course, there is not always a perfect fit between each actual occurrence and any of the patterns in the pattern dictionary, so a preferential approach must be adopted.
Let us return for a moment to 1 and 2 and ask, are both these sentences normal? It is intuitively obvious that animate entities drinking water is normal and cars drinking gasoline is abnormal, and this obvious observation is supported by corpus evidence. It is necessary to distinguish systematically between the two. The tension between actual context and preferred context in unusual uses of a word might be expected to result in incoherence, but in fact it very often contributes metaphorical resonance to the meaning of the text in which the word occurs. Such metaphorical resonance is evidently part of a word's linguistic Gestalt – but this must be the subject of a future investigation. The exploitation of semantic and phraseological preferences is clearly rule-governed, but exploitation rules remain to be explicated systematically. The rules governing the exploitation of preferences can only be properly explored once the preferences themselves have been satisfactorily identified. That is the current task of the team building the Pattern Dictionary.
The proposition "Animate entities drink water" at first sight appears to be reminiscent of those found in CYC (Lenat and Guha 1990), a large knowledge base of common-sense propositions about terms, their meanings, and their entailments, organized according to theories and microtheories.
6. Fred saw the plane flying over Zurich
7. Fred saw the mountains flying over Zurich
CYC claims that it can distinguish the different meanings of flying in 6 and 7 because it knows that Fred is a human and humans can fly, while mountains don't. However, CYC is not evidence-based and does not use syntagmatic organization to support its microtheoretical organization of propositions about terms. It is hard to see how the coarse-grained syntactic parser used by CYC could be made more delicate (more fine-grained) without some statistically well-founded form of syntagmatic lexical clustering. Thus the semantic component of CYC is probably overloaded in
the case of normal everyday words, though it is hard to be sure. This may be one reason why CYC claims successful applications in the processing of terms only in restricted domains such as business and physics, rather than in general language processing. As far as one can tell, the usefulness of CYC in processing free text is impaired by a combinatorial explosion of the possible senses of common words. If this is true, then syntagmatic filtering by matching free text with normal patterns may be at least part of the solution.
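To make the matching step described above concrete, the following is a minimal sketch, not the Pattern Dictionary's actual encoding: the pattern format, type names, and scoring are assumptions introduced purely for illustration. An observed use of a verb, represented by the semantic types of its arguments, is scored against the patterns recorded for that verb, and the implicature of the best match is returned.

# Sketch of syntagmatic matching: pick the pattern whose argument types
# best fit an observed use of a verb. Pattern format and type names are
# illustrative, not the actual Pattern Dictionary encoding.
PATTERNS = {
    "drink": [
        {"subj": "Human", "obj": "Beverage",
         "implicature": "[[Human]] takes [[Beverage]] in through the mouth and swallows it"},
        {"subj": "Animate Entity", "obj": "Water",
         "implicature": "[[Animate Entity]] takes [[Water]] in through the mouth and swallows it"},
        {"subj": "Human", "obj": None,
         "implicature": "[[Human]] habitually takes in alcohol"},
    ]
}

# A toy type hierarchy: each observed type maps to itself plus its supertypes,
# so that 'Adder' satisfies a pattern asking for 'Animate Entity'.
SUPERTYPES = {
    "Human": {"Human", "Animate Entity"},
    "Adder": {"Adder", "Animate Entity"},
    "Water": {"Water", "Liquid"},
    "Beverage": {"Beverage", "Liquid"},
}

def satisfies(observed, required):
    """True if the observed argument type satisfies the pattern's requirement."""
    if required is None:                      # pattern expects no object at all
        return observed is None
    if observed is None:
        return False
    return required in SUPERTYPES.get(observed, {observed})

def best_pattern(verb, subj_type, obj_type):
    """Return the recorded pattern that best fits the observed argument types."""
    candidates = PATTERNS.get(verb, [])
    if not candidates:
        return None
    return max(candidates,
               key=lambda p: satisfies(subj_type, p["subj"]) + satisfies(obj_type, p["obj"]))

# "The adder drank water" matches the [[Animate Entity]] drink [[Water]] pattern.
print(best_pattern("drink", "Adder", "Water")["implicature"])

An abnormal use such as "my car drinks gasoline" would score zero against every pattern for drink, which is precisely the signal that second-order (coercion) processing is needed.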
6.3 Why Standard Dictionaries Won't Do
Despite his 1972 reservations, Wilks and his colleagues devoted considerable efforts to making an English learner's dictionary (Longman Dictionary of Contemporary English, LDOCE: Procter et al., 1978) machine-tractable and massaging the information in it for various purposes, including distinguishing the meanings of polysemous words. As Wilks and Ide (2006) make clear, this approach has been at best only partially successful, in part for the reasons stated in the preceding section, but also because essential information about context is not given in standard English dictionaries such as LDOCE. For human users it is not necessary to state the obvious, but for AI applications and linguistic computing it is. So, for example, the LDOCE entry for the verb drink does not explicitly state that an animate subject is expected. The entries for human, animal, beast, animate, etc., quite rightly do not explicitly state that humans, animals, and beasts drink. The entry for petrol says that it is "a liquid used to supply power to the engine of cars and other vehicles", but not that to make a car run someone has to put petrol into the petrol tank, nor that this is a normal meaning of the phrasal verb fill up.
The current focus of the Pattern Dictionary is on recording salient syntagmatic patterns and their primary implicatures at a sufficiently delicate level to enable understanding of normal text. Having stated the primary implicature of a pattern, it is always possible to add further, secondary implicatures. Having stated all the patterns in which a given word participates, it is also possible to add more, if evidence for them is found – with the proviso that any additional pattern should, ideally, contrast effectively with all other existing patterns for that word.
6.4 Corpus Analysis and Pattern Identification
Analysis of the phrasal verb fill up shows that, in normal English, all sorts of containers are filled up with appropriate contents, ranging from bottles with booze to freezers with food and removal vans with furniture. It is therefore necessary to link the verb fill up to a canonical set of containers. This is done wherever possible by linking the lexical sets in patterns to a semantic ontology. Items that are not normally classified as containers (e.g. wallet) may be coerced to be honorary members of the set of containers when used with fill up, e.g. wallets may be filled up with money and people with food, but interpretation of such uses relies on this coercion. Given the mechanism
of coercion, it is not necessary to add the lexical items wallet and person under the semantic type [[Container]] and indeed it would be wrong to do so. At the same time, filling up a car with petrol is so conventionalized that it is generally underspecified, i.e. the actual container (the petrol tank) is not mentioned and must be inferred from the context. Part of the job of the Pattern Dictionary is to record explicitly elements of a sentence that may be underspecified in everyday usage (as in 9 below). It is normally clear from the context (both the physical situation of a speaker and textual collocates such as driver, car, or garage) that the utterance "Fill her up" means "pour petrol into the petrol tank of the car", not "pour petrol into the car (i.e. the passenger compartment)", still less "pour liquid into my female passenger". In context, speakers regularly underspecify messages. At a petrol station, petrol does not need to be mentioned, unless it be contrastively (to distinguish it from diesel fuel), and in English (as opposed to Czech or Russian) people more often than not do not specify that it is the petrol tank that is filled up. In everyday conversation at petrol stations, the holonymy passes unnoticed and unremarked. Such underspecified utterances trade on established linguistic (not merely situational) conventions. The Pattern Dictionary provides a mechanism for supplying the missing semantic values and arguments of such arguments.
Some patterns are mutually exclusive, but more often the contrast is between general and particular:
8. [[Human]] fill [[Container]] {up} {with [[Stuff [PL]Phys Obj]]}
9. [[Human]] fill [[Road Vehicle = Petrol Tank]] {up} (with [[Petrol]])
Semantic ontologies record that petrol tank is a [[Container]] and that petrol is [[Stuff]]; thus Pattern 9 is merely a subset of Pattern 8. The point of listing them separately is to provide a mechanism for showing semantic correlations for the marked case – here, petrol and car or tank in relation to fill up. In the pattern dictionary, the specific is always preferred to the general. These two patterns then contrast with further patterns. Having specified all the normal patterns of usage for a given verb, together with their implicatures, the Pattern Dictionary is ready for use. It states primary implicatures (e.g. that Pattern 9 implies that a person inserts the nozzle of a petrol pump into the petrol tank of a car and pumps petrol into the petrol tank) and allows for the addition of any number of secondary implicatures (e.g. that this event takes place at a filling station or that this is necessary in order to provide fuel to make the car's engine operate).
Now let us look at how the Pattern Dictionary deals with a more fine-grained distinction.
10. [[Human]] drink [[Liquid = Beverage]]
implicature: [[Human]] takes [[Liquid = Beverage]] in through the mouth and swallows it.
11. [[Animate Entity]] drink [[Liquid = Water]]
implicature: [[Animate Entity]] takes [[Liquid = Water]] in through the mouth and swallows it.
secondary implicatures: [[Animate Entity]] does this in order to quench thirst. Doing this is necessary for the survival of [[Animate Entity]].
12. [[Human]] drink [NO OBJ]
implicature: [[Human]] takes alcohol in through the mouth and swallows it.
secondary implicatures: [[Human]] becomes drunk as a result of doing this. [[Human]] regularly becomes drunk. Doing this is bad.
comment: only in simple (not continuous) tenses.
These three patterns illustrate the distinction between lumping and splitting in lexicography. Distinguishing such closely overlapping senses enables delicate semantic implicatures to be represented. A more coarse-grained pattern dictionary (of the kind that seems to be the current goal of PropBank) would settle, perfectly reasonably, for a single pattern, namely 11, without secondary implicatures. (Note, however, that 10 is the most frequent pattern for drink (such is the anthropocentric nature of human language), while 11 is the least marked and 12 is the most marked.)
By distinguishing 12 as a separate pattern from 10 and 11, the following semantic representation can be made. When an antelope or an adder drinks (intransitive), we have a simple object-drop alternation of the transitive verb, whereas when a human drinks (intransitive), there is an ambiguity. It may be a simple object-drop alternation (as when someone at a tea party picks up their cup and drinks), or it may imply getting drunk or even habitual intake of alcohol, typically in excessive quantities. Pattern 12 provides a phraseological hook on which to hang the implicature that, if someone drinks, he or she may have alcoholic tendencies. Other textual clues (but not the argument structure, which is the central concern of the Pattern Dictionary) may enable English-language users to distinguish between a single occasion of drinking (in the sense of getting drunk) and habitual drinking (in the sense of being or risking being an alcoholic).
There is a stable set of just over 6,000 verbs in normal use in English, depending on whether phrasal verbs are counted separately.5 An account of all normal lexico-syntagmatic patterns of these verbs is the first target of the Pattern Dictionary. Identifying the normal usage patterns of 6,000 linguistic items is indeed a large and ambitious undertaking, but considerably less ambitious than compiling a standard dictionary. By contrast, identifying the normal usage patterns of nouns would be a task 15 or 20 times larger, and would risk missing the interrelatedness of nouns and verbs.
5 This figure excludes verbs such as abscise, absquatulate, attemper, attorn, auscultate, and auspicate, which are found in WordNet and some dictionaries but not in ordinary usage.
6.5 Linking Patterns to an Ontology
Lexical items are plugged into a semantic ontology or "type system". The model for this is the Brandeis Semantic Ontology (BSO) (see Rumshisky et al., 2006). In the BSO meaning elements and the links between them are organized independently of the lexical items that are associated with each type. At the same time, each sense of a verb is associated with one or more patterns, which express the meaning not only of the verb but also of its arguments, in terms of semantic types. Consider one of the patterns mentioned above for the verb drink:
13. [[Human]] drink [[Liquid = Beverage]]
This pattern is linked to three types in the Brandeis Semantic Ontology:
14. [[Animate Entity]] (which has [[Human]] as one of its subtypes)
15. [[Drink Activity]], a subtype of [[Ingest Activity]], a subtype of [[Activity]], a subtype of [[Event]]. This type lists as its typical subject role [[Animate Entity]] and typical object role [[Beverage]].
16. [[Beverage]] = a subtype of [[Liquid]] (a subtype of [[Material Substance]], a subtype of [[Entity]]). This type lists as its telic role [[Drink Activity]].
It will readily be seen that BSO shows that the event [[Drink Activity]] has a preferential association with [[Beverage]] and vice versa. However, it would be wrong to say that [[Human]] has some kind of preferential association with [[Drink Activity]]. Most people do many other things besides drinking. It may well make sense to create a high-level link between [[Animate Entity]] and [[Activity]] – both people and animals do habitually engage in motivated actions (activities), unlike other physical objects such as rocks – but the wisdom of making such additions to the type system remains to be considered.
Under the Type [[Beverage]] a long list of beverages is given, including the subtype [[Alcoholic Beverage]]. BSO does not attempt to list all the words and names – past, present, and future – that denote, have denoted, or could possibly one day denote a human being. Instead, the partial list of humans in semantic roles (builder, doctor, etc.) is supplemented by procedures for named entity recognition.
Now consider two coercions (perfectly well-formed and meaningful but abnormal sentences):
17. John drank some gasoline.
Here again, Event=drink=[[Drink Activity]]. The Brandeis Semantic Ontology (correctly) does not type gasoline as a [[Beverage]], but as [[Fuel]]. It should also say6 that the constitutive of gasoline is [[Liquid]].
6 At the time of writing the BSO does not say this – probably an oversight.
The mutual preference mapping of {[[Animate Entity]] <–> [[Drink Activity]]} and {[[Drink Activity]] <–> [[Liquid Substance]]} would then provide the correct interpretation (namely that John has engaged in a [[Drink Activity]]) and we could call for the ambulance to take John to the hospital.
18. John's car drinks gasoline.
In 18 the default event type must be coerced to [[Use Activity]] – for the literal meaning of the sentence is that cars use gasoline – so in this case the argument types are not inherited from the Event type, but vice versa. This is done by linking [[Driving Vehicle]] to [[Use Activity]] & [[Fuel]]. The combination [[Driving Vehicle]] plus [[Use Activity]] & [[Fuel]] outweighs the normal interpretation of the verb drink as a [[Drink Activity]] and coerces it to [[Use Activity]].
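A minimal sketch of how such a coercion rule might be stated computationally follows; it is an illustration only, with invented names, and is not the BSO's actual machinery. The idea is simply that when the argument types match a coercion rule, the event is re-typed; otherwise the default reading of the verb stands.

# Sketch: type coercion of an event reading driven by argument types.
# Type names and the rule format are illustrative only.
DEFAULT_EVENT = {"drink": "Drink Activity"}

# Each coercion rule: if the subject/object types match, re-type the event.
COERCIONS = [
    {"verb": "drink", "subj": "Driving Vehicle", "obj": "Fuel",
     "event": "Use Activity"},
]

def event_type(verb, subj_type, obj_type):
    for rule in COERCIONS:
        if rule["verb"] == verb and rule["subj"] == subj_type and rule["obj"] == obj_type:
            return rule["event"]          # coerced reading wins
    return DEFAULT_EVENT.get(verb)        # otherwise the normal reading stands

print(event_type("drink", "Human", "Beverage"))        # Drink Activity
print(event_type("drink", "Driving Vehicle", "Fuel"))  # Use Activity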
6.6 Refining Preferences by Corpus Analysis
Wilks (1980: p. 141) states that the verb fire at prefers an animate object, and discusses sentence 19.
19. John fired at a line of stags.
The notion that fire at has a preference for an animate object is widely repeated in the literature, but as it happens it is not true. It simultaneously overrestricts and underrestricts. On the one hand, targets that are fired at regularly include vehicles and locations, not just animates. On the other hand, the words found in the prepositional object slot (whether animate or inanimate) very often have the semantic property military (soldiers, tanks, aircraft, defensive positions, etc.). The property military is not a necessary condition of this argument,7 but military is nevertheless a criterial feature, whereas animate is not. If we compare Wilks's example (19) with uses of fire at in a broad spectrum of general texts such as those of the British National Corpus, we see that a line of stags is an outlier in the cluster of terms that populate this slot. Possibly, the situation would be different in a domain-specific corpus of texts, for example, hunting magazines. We shall return to the question of domain-specific semantic values shortly.
Facts such as these present the lexical analyst with a choice, which is resolved by the intended purpose of the lexicon. On the one hand, if the purpose of the lexicon is to express semantic preferences with reasonable certainty, the semantic types should be couched in terms of maximum generality, as in 20. (This is, in fact, the position taken by Wilks in the 1970s and subsequently.)
20. [[Human]] fire_at [[PhysObj]]
7 Examples of prepositional objects of fire at with the property hunted_animal, e.g. stag, do of course also occur, and, in a way that is typical of the elastic nature of lexical sets, the set extends outwards to encompass all sorts of physical_objects in addition to the kinds mentioned here.
On the other hand, if the purpose of the lexicon is to enable specific – but necessarily probabilistic rather than certain – predictions to be made about the likely meaning of utterances such as "unknown fired at unknown", information about likely semantic roles is needed, as in 21:
21. [[Human (= military)]] fire_at [[PhysObj (= animate (= human) vehicle location) & military]]
A problem for a dictionary based on statistical analysis of corpus data, such as the Pattern Dictionary, is that (despite occasional protestations to the contrary) users in the NLP and AI communities, just like any other group of human users, tend to expect that statements in dictionaries represent necessary conditions, not probabilities. Perhaps this expectation can best be countered by adding corpus-based, corpus-monitored statements of observed comparative frequencies automatically to each semantic value, as in 22.
22. [[Human 93% (= military 25%)]] fire_at [[PhysObj 94% (= human 49% vehicle 20% location 14%) & military 26%]]
Informally, this means that 93% of subjects of fire_at in the sample were identified as Human, and 25% as Military. At the same time, 94% of the direct objects were identified as Physical Objects, subdivided as to 49% Human, 20% Vehicles (cars, aircraft, tanks, ships, etc.), and 14% Locations (military positions, buildings, Israel, etc.). Separately, 26% of the direct objects were identified as having the property Military (enemy troops, tanks, rebel positions, etc.). Statistics of this kind can form a serious basis for calculating inferences with measurable probability. Thus, corpus evidence shows that Wilks's "ani obje" for this verb pattern should be replaced (in his terms) by "physobj obje", since the value ani here is neither one thing nor the other – neither a strong preference nor a weak probability.
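The comparative frequencies in 22 can be produced mechanically from a sample of annotated instances. The following sketch is an illustration only; the field names, types, and counts are invented, not the Pattern Dictionary's actual annotation scheme.

# Sketch: deriving slot percentages like those in example 22 from a sample of
# annotated corpus instances. Field names, types, and counts are illustrative.
from collections import Counter

# Each instance records the types and properties assigned to the object slot
# of "fire at" for one corpus line.
instances = [
    {"types": ["PhysObj", "Human"], "properties": ["Military"]},
    {"types": ["PhysObj", "Vehicle"], "properties": []},
    {"types": ["PhysObj", "Location"], "properties": ["Military"]},
    {"types": ["PhysObj", "Human"], "properties": []},
]

def slot_percentages(instances):
    n = len(instances)
    type_counts = Counter(t for inst in instances for t in inst["types"])
    prop_counts = Counter(p for inst in instances for p in inst["properties"])
    as_percent = lambda counts: {k: round(100 * v / n) for k, v in counts.items()}
    return as_percent(type_counts), as_percent(prop_counts)

print(slot_percentages(instances))
# ({'PhysObj': 100, 'Human': 50, 'Vehicle': 25, 'Location': 25}, {'Military': 50})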
6.7 Domain-specific Patterns
Yarowsky et al. (1992) argued that words tend to have a unique interpretation in each document. This claim, known under the catch phrase "One sense per discourse", has been vigorously disputed by Krovetz (1998), who cites data showing that Yarowsky's argument (that a very high percentage of words occur in only one sense in any document) is wrong and that, outside Yarowsky's chosen domain of encyclopedia articles, it is quite common for words to appear in the same document bearing different senses. Nevertheless, Yarowsky's point contains a germ of truth. In most documents, at least some of the polysemous content words do have only one meaning, while in other cases the distribution of the different patterns varies according to the domain of the document. We find both domain-specific norms and domain-specific preferences. Thus, the sense of the words in a document is to some extent pre-selected by the domain to which the document belongs, a pre-selection that is reinforced as the document proceeds.
Patterns are invoked in a hierarchy of markedness. To take just one of the groups of norms associated with the verb advance (the one where it is literally a verb of movement), detailed analysis of a BNC sample shows that the domain is normally military and the grammatical subject (the "external argument") is an army moving in a particular direction towards an objective.8 Closely associated with the military norm, and patterned similarly, is a norm in the domain of everyday social behaviour, in which a person advances on another person. In addition, there are domain-specific norms. In banking and borrowing, advancing money is the norm; in discourse on philosophical and political topics, advancing ideas is the norm.
"Normal combinations" must allow for domain-specific preferences; for example, in the subset of the BNC composed of reports on soccer matches, the verb fire exhibits a preference for an adverbial argument with PPs past the goalkeeper, past the post, into the back of the net, while in the same subcorpus the verb climb exhibits a preference for the PP above the defenders. It is perfectly possible for these soccer-specific preferences to occur in other kinds of texts, but, as a matter of empirically observed fact, they generally don't. Since every text is about something, the role of domain and document in setting lexical preferences needs to be investigated more thoroughly than it has been up to now. Possible but non-preferred combinations become less and less observable as semantic distance from the normal complementation patterns increases. This simple fact underlies all serious work in statistical lexical analysis, from Church and Hanks (1990) to Kilgarriff et al. (2004).
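One widely used measure of such association strength, going back to Church and Hanks (1990), is pointwise mutual information, I(x, y) = log2( P(x, y) / (P(x) P(y)) ). A minimal sketch of the calculation follows; the counts are invented for illustration.

# Sketch: pointwise mutual information for a verb-object pair, after
# Church and Hanks (1990). The counts below are invented for illustration.
import math

def pmi(pair_count, x_count, y_count, corpus_size):
    """log2( P(x,y) / (P(x) * P(y)) ), estimated from raw counts."""
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# e.g. how strongly a verb such as hazard selects guess as its direct object
# (a preference discussed in the next section).
print(round(pmi(pair_count=120, x_count=200, y_count=5000, corpus_size=10_000_000), 2))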
6.8 The Ubiquity of Preferences
Preferences are everywhere in language systems. Preferences govern both meaning and linguistic behaviour. That is to say, the entire process of constructing an interpretation for a linguistic utterance is governed by a contextual network of interacting preferences. There are no necessary conditions governing the associations between words and meanings, but independent confirmation from different preference systems and indeed from within the same preference system can combine to yield an impression of certainty. The following preference systems at least need to be taken into account:
Lexical preferences: Words prefer the company of certain other words. Example: the verb hazard overwhelmingly prefers the noun guess as its direct object, but the role of guess is also fulfilled, with minor variations in meaning, by other nouns in the same semantic class and by a that-clause: hazard a conjecture, hazard a definition, hazard a suggestion, it seemed sensible to hazard that a man of this standing would have held property in the area.
8 Even though the objective of a military advance is very often left unspecified, it is always implicitly present in the semantics of the verb advance in this contextual pattern: the difference between an army moving southwards and an army advancing southwards is that in the latter case a military objective is implied.
It might be tempting to substitute a semantic class, say [[Speech Act]], for the lexical item guess in the direct-object relation to the verb hazard, but this would lose specificity of meaning. The verb has a preference for one particular lexical item, namely guess, and that only: more than 50% of uses of hazard as a verb in both British and American corpora have guess as their direct object. The verb is indeed used with any of a number of other (nonpreferred) nouns denoting a speech act in the direct-object slot, but in the context of this verb, these nouns mean more than just [[Speech Act]]. The strong association between hazard and guess colours the interpretation of alternative words such as hazard a definition and hazard a suggestion. In this context, even a definition has a guess-like quality. If the direct object slot is broadened to include all kinds of speech acts, the meaning of the phrase becomes blurred or lost.
Syntactic preferences: The verb allow prefers complementation with a direct object and a to-infinitive ("Do not allow a thief to get into your house"), but it allows other complementations (e.g. ditransitive: "Do not allow a thief access to your house"; prepositional: "Do not allow a thief into your house"; clausal: "do not allow your house to be accessible to a thief"). Any representation of the underlying syntactic and semantic regularity here must represent both the fact that allow governs an Event as its argument and the fact that the Event is normally expressed as a to-infinitive.
Domain-specific preferences: As indicated above, for many words the preferred complementation varies from domain to domain: the verbs fire and climb have a preferred sense and pattern in soccer journalism that is quite different from the normal sense in other domains. In financial journalism climb more often refers to the rising value of stocks and shares than to going up a mountain. These facts, too, need to be taken into account in any representation of the lexicon.
6.9 Conclusions
The main argument of this paper is that the so-called "WSD problem" has been formulated in such a way as to guarantee its own insolubility. The assumption that each word has a finite list of meanings that can be "disambiguated" is plausible but wrong. Meanings cannot be assigned effectively to words in isolation, and that is the source of the WSD problem. The problem needs to be re-formulated before it can be resolved. The reformulation proposed here relies on the fact that each word is associated with an observable, measurable set of normal contexts. In this way, a distinction between normal and abnormal contexts can be established. This brings us back to Wilks's theory of preference semantics, but instead of trying to formulate word meanings directly, the alternative proposed here is to measure the syntagmatics first – i.e. to compute similarity of the textual context surrounding a word in a document to the best match in an inventory of normal contexts in the Pattern Dictionary – and then to select the meaning given for the best-match syntagmatic pattern. The meanings may be more or less delicate, according to (a) practicalities, e.g. the needs of the intended application; and (b) the constraints imposed by the language itself on what can be recorded in the Pattern Dictionary.
Argument structures are expressed in the Pattern Dictionary as far as possible in terms of the semantic types and attributes of the Brandeis Semantic Ontology, which states the semantic type of each sense of the word in itself, together with certain of its properties (for example, its “telic” or purpose). Given adequate pre-processing (part-of-speech tagging, parsing, and resolution of pronominal anaphora), the pattern of the arguments is very often sufficient to compute the fine-grained meaning of a word in a document with reasonable accuracy. Occasionally, however, it is necessary to invoke additional evidence, such as the domain or text type of the document or, very occasionally, the wider context.
References
Baker, C.F., and J. Ruppenhofer. 2002. 'FrameNet's Frames vs. Levin's Verb Classes' in J. Larson and M. Paster (eds.), Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society, pp. 27–38.
Church, K., and P. Hanks. 1990. 'Word association norms, mutual information, and lexicography' in Computational Linguistics 16(1).
Hanks, P., and J. Pustejovsky. 2005. 'A Pattern Dictionary for Natural Language Processing' in Revue française de linguistique appliquée 10(2).
Katz, J.J., and J.A. Fodor. 1963. 'The Structure of a Semantic Theory' in Language 39.
Kilgarriff, A., D. Tugwell, P. Rychly, and P. Smrz. 2004. 'The Sketch Engine' in Proceedings of Euralex, Lorient, France, July 2004, pp. 105–116.
Kingsbury, P., and M. Palmer. 2002. 'From Treebank to Propbank', LREC-02: Third International Conference on Language Resources and Evaluation. Las Palmas.
Krovetz, R. 1998. 'More than One Sense per Discourse'. NEC Princeton NJ Labs. Research Memorandum.
Lenat, D., and R.V. Guha. 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. New York: Addison-Wesley.
Levin, B. 1993. English Verb Classes and Alternations. University of Chicago Press.
Procter, P., et al. 1978. Longman Dictionary of Contemporary English, 1st edn. Harlow: Longman.
Pustejovsky, J. 1995. The Generative Lexicon. Cambridge, MA: MIT Press.
Pustejovsky, J., and P. Hanks. 2001. 'Very Large Lexical Databases'. ACL Workshop, Toulouse.
Pustejovsky, J., P. Hanks, and A. Rumshisky. 2004. 'Automated Induction of Sense in Context'. COLING 2004. Geneva, Switzerland.
Rumshisky, A., C. Havasi, and J. Pustejovsky. 2006. 'Constructing a Corpus-based Ontology using Model Bias'. 19th International FLAIRS Conference.
Wilks, Y. 1972. Grammar, Meaning, and the Machine Analysis of Language. London: Routledge.
Wilks, Y. 1973. 'Understanding without Proofs'. IJCAI, pp. 270–277.
Wilks, Y. 1975a. 'Preference Semantics' in E.L. Keenan (ed.), Formal Semantics of Natural Language. Cambridge: Cambridge University Press.
Wilks, Y. 1975b. 'A Preferential, Pattern Seeking Semantics for Natural Language Inference' in Artificial Intelligence 6.
Wilks, Y. 1977. 'Good and Bad Arguments about Semantic Primitives' in Communication and Cognition 10(3/4).
Wilks, Y. 1978. 'Making Preferences More Active' in Artificial Intelligence 11(3); reprinted in N.V. Findler (ed., 1979), Associative Networks. New York: Academic Press.
Wilks, Y. 1980. 'Frames, Semantics, and Novelty' in D. Metzing (ed.), Frame Conceptions and Text Understanding. Berlin, New York: De Gruyter.
Wilks, Y., B. Slator, and L. Guthrie. 1996. Electric Words: Dictionaries, Computers, and Meanings. Cambridge, MA: MIT Press.
Wilks, Y., and R. Catizone. 2002. 'Lexical Tuning' in Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002), Mexico City, Mexico. Lecture Notes in Computer Science 2276. Springer.
Wilks, Y., and N. Ide. 2006. 'Making Sense about Sense'. Chapter 3 of E. Agirre and P. Edmonds (eds.), Word Sense Disambiguation: Algorithms and Applications. Berlin: Springer Verlag.
Yarowsky, D., W. Gale, and K. Church. 1992. 'One Sense Per Discourse' in Proceedings of the 4th DARPA Speech and Natural Language Workshop.
Web Sites for Lexical Resources
Corpus Pattern Analysis: http://nlp.fi.muni.cz/projects/cpa/
CYC: http://www.cyc.com
FrameNet: http://framenet.icsi.berkeley.edu/
Propbank: http://www.cis.upenn.edu/~mpalmer/project_pages/ACE.htm
VerbNet: http://www.cis.upenn.edu/~kipper/VerbNet/
WordNet: http://wordnet.princeton.edu/
7 Historical Ontologies
Nancy Ide1 and David Woolner2
1 Department of Computer Science, Vassar College, Poughkeepsie, NY, USA
2 Marist College, Poughkeepsie, NY, USA
Abstract:
Static ontologies cannot capture the relevant contextual knowledge required for search and retrieval of historical documents because the entities in the world and the relations among them change over time. This demands that information represented in the ontology is temporally contextualized and that relations among entities that are relevant during different temporal intervals are available to support user queries. Furthermore, it is necessary to account for the fact that the course of the ontology's evolution and the processes that have effected it are a part of the knowledge that should be brought to bear on the analysis of information at any given time. This chapter outlines a model for historical ontologies that is intended to meet these requirements.
7.1 Introduction
Ontologies have played a role in Natural Language Processing (NLP) since the heyday of symbolic AI and NLP in the 1970s, given that language understanding was assumed, particularly at that time, to require complex information about concepts and the relations among them. Since then, ontologies (and their lesser sibling, the taxonomy) have been incorporated into various language processing applications. However, as McGuinness (2003) and many other authors have pointed out, ontologies have recently become the center of attention of a much broader community, due to the pivotal role they play in the vision of the Semantic Web (Berners-Lee et al., 2001; Fensel et al., 2003). As a result, there has been an unprecedented flurry of activity in the past few years focused on construction of ontologies and the development of tools to edit and reason with them, as well as standards for their representation and access via the World Wide Web. Most current work involves the rapid construction of ontologies for specific domains by a small team of "ontology engineers", who typically identify the relevant concepts and relations a priori and then impose them on the system. The resulting ontologies are typically static, representing a set of objects/concepts, relations, and properties that remain constant. However, in some cases, the domain modeled by an ontology changes over time, and, depending on the time perspective, a different ontology is relevant for retrieval. For example, in order to retrieve all relevant information for a query about Germany, it may be necessary to recognize that
instances of East Germany and West Germany in an ontology representing geopolitical entities prior to 1989 are each a part of a single entity, Germany, in a later ontology. This demands both that information represented in the ontology is temporally contextualized and that relations among entities that are relevant during different temporal intervals are available to support user queries. Furthermore, it is necessary to account for the fact that the course of the ontology's evolution and the processes that have effected it are a part of the knowledge that should be brought to bear on the analysis of information at any given time.
The FDR/Pearl Harbor project is developing means to support enhanced search and retrieval from a set of documents drawn from the Franklin D. Roosevelt Presidential Library (FDRL). The documents in our collection refer to situations and events over the 10-year period prior to the bombing of Pearl Harbor, during which the definitions of and relations among "entities" – especially, geo-political entities – were in a state of constant flux. A single, fixed ontology cannot capture the relevant contextual knowledge for all of the documents in our collection, because the entities in the world and the relations among them – and consequently, the configuration of the ontology – differ depending on the date of the document. Furthermore, a query concerning a given entity such as the US Secretary of State may demand retrieving information concerning the different persons who filled this role prior to the US declaration of war on Japan.
In this chapter, we first provide an overview of the FDR/Pearl Harbor project and a discussion of the requirements for an ontology to support historical research. We then define a framework for historical ontologies intended to address these requirements. Finally, we describe the FDR historical ontology and outline open problems and future work.
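One simple way to realize such temporal contextualization is to attach a validity interval to every relation instance, so that queries are evaluated with respect to a date. The sketch below is an illustration under that assumption, not the framework the chapter itself goes on to define; the dates and relation names are examples only.

# Sketch: time-qualified relations, so that facts hold only during an interval.
# Relation names and the data are illustrative.
from datetime import date

# (subject, relation, object, valid_from, valid_to); None = open-ended
FACTS = [
    ("East Germany", "part_of", "Germany", date(1990, 10, 3), None),
    ("West Germany", "part_of", "Germany", date(1990, 10, 3), None),
    ("Cordell Hull", "holds_role", "US Secretary of State",
     date(1933, 3, 4), date(1944, 11, 30)),
]

def facts_at(when, relation=None):
    """Return the facts valid on a given date, optionally filtered by relation."""
    out = []
    for subj, rel, obj, start, end in FACTS:
        if relation is not None and rel != relation:
            continue
        if start <= when and (end is None or when <= end):
            out.append((subj, rel, obj))
    return out

# Who held the role of US Secretary of State on the day of the attack?
print(facts_at(date(1941, 12, 7), relation="holds_role"))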
7.2 The FDR/Pearl Harbor Project
7.2.1 Overview
The work reported here was undertaken in the context of the FDR/Pearl Harbor Project,1 which is enhancing a range of image, sound, video and textual data drawn from the Franklin D. Roosevelt Presidential Library (FDRL). The data in the FDRL represents one of the most significant collections of historical material concerning the history of America and the world in the twentieth century. The centerpiece of the collection is a body of textual material known as the "President's Secretary's Files" (PSF). The PSF is the most sought after collection of papers held in the FDR Library. The PSF is comprised of 150,000 documents, including letters, diplomatic correspondence, intelligence reports, memoranda, newspaper clippings, photographs, and other historical materials.
1 Supported by US National Science Foundation grant ITR-0218997.
The FDR/Pearl Harbor Project is concerned with a collection of 1,446 internal administration documents concerned with US-Japanese relations between 1931 and 1941, including memoranda of conversations, letters, diplomatic correspondence, intelligence reports, and economic reports.
The documents in our collection are used by historians, political scientists, policy-makers, diplomatic and intelligence analysts and others studying Japanese-American relations over the 10-year period leading up to the Japanese attack on Pearl Harbor. Research on Japanese-American relations during this period focuses on the nature of the diplomatic, military/strategic, and economic relations between the two nations – not only in isolation, but also in terms of the interactions among them. Historians and political scientists therefore rely on the documents in the FDRL as a primary resource to study the interplay of the dialogue between the two countries and between high-level officials in the Roosevelt Administration. Piecing together this type of information, however, is very time consuming. The vast majority of material an historian looks at is of no use to his or her research – it represents the chaff that must be sifted through before the researcher finds a kernel of information that is of real value. Through this lengthy process of elimination, the historian slowly builds his case, often with the result that it can sometimes take years to sift through all of the information required to arrive at a conclusion.
Information retrieval and extraction provide the potential to sift the information in the FDRL documents in ways that will greatly enhance the speed and utility of the research process. In a large body of data, enhanced search and retrieval techniques might uncover new evidence (or a body of evidence not previously viewed in its totality) that when analyzed could result in a shift in the historical interpretation of a given event. The Japanese decision to pull out of the London Naval Talks in December 1935, for example, might take on new importance in the history of US-Japanese inter-war diplomacy if the data were to show that this decision played a significant part in the deterioration of relations between the two states.
As the preceding example shows, historians and others studying historic documents seek not only to uncover facts, but also attitudes and opinions that result from, and/or lead to, events within the historical period during which they were produced. In particular, they want to know what the "actors" – the people, governments and military leaders and policy-makers – were thinking at the time. The FDR Project is therefore concerned with identifying evidence of such attitudes in the wording of documents in the corpus, and attributing this information to the appropriate person or entity. Attitude analysis can provide provocative information for the historian or political scientist to explore further by examining the documents themselves. For example, a quick analysis of a literal transcription of a statement by FDR to Japanese Ambassador Nomura on August 17, 1941 concerning the Japanese use of force in Southeast Asia shows a significantly higher percentage of words denoting strong (emphatic) language and power than a transcription of his remarks to Nomura on December 2, 5 days before the attack on Pearl Harbor (see Figure 7.1). The same is true to a lesser degree for the language in Secretary Cordell Hull's report of Nomura's replies in the same memoranda.
CATEGORY     FDR/AUG17   FDR/DEC2   NOMURA/AUG17   NOMURA/DEC2
Positive         5.76        7.61        4.76           3.49
Strong          17.07        9.78       14.29           8.72
Power            7.98        2.17        7.94           5.23
Negative         2.44        3.26        0.79           2.33
Hostile          1.55        1.45        0              1.74

Fig. 7.1. Percentage of words in several categories

One might have expected that the conversations would become more heated as the attack on Pearl Harbor approached, but in fact the language of both FDR and the Ambassador appears to become more conciliatory. At the same time, FDR's language on December 2 contains more "positive" words than on August 17, while the reverse is true of Hull's report of Nomura's language. Thus the analysis suggests a picture that runs counter to the widely held public perception that FDR "conspired" to force the United States into the war by provoking the Japanese in the final months leading up to the attack on Pearl Harbor.

7.2.2 Entity and Event Annotation

The FDR/Pearl Harbor Project corpus has been annotated for a wide range of linguistic phenomena and entities, including persons, titles, dates, locations, and organizations (military, government, civilian, etc.), as well as documents, treaties, policies, ships and other military apparatus, raw materials, monetary references, etc. Lexical units in the corpus are additionally annotated for a wide range of semantic categories, including words indicative of opinion and attitude. In addition to entities, the corpus is annotated for events, including major historical events, which may or may not be mentioned in the documents; minor events referred to in the documents, such as a visit by the Japanese Ambassador to the Secretary of State; and communication events. Because a large portion of the documents in the collection are so-called "memoranda of conversations", many are near-transcriptions of meetings between Japanese and US officials. We have therefore focused on communication events down to the level of the utterance (e.g., "X asked that ...") and apply attitude-recognition procedures to each utterance attributed to a given speaker. Note that the memoranda themselves represent complex communication events, in which several layers of subjectivity may exist. For example, a memorandum may comprise a report from Secretary Welles to FDR summarizing what the Japanese Ambassador said during a meeting with Secretary Hull and how the Secretary replied. Event and entity annotation of the FDR documents was accomplished using the General Architecture for Text Engineering (GATE) system developed at the University of Sheffield (Cunningham et al., 2002). Our work involved considerable extension of the pattern-matching rules and gazetteer lists in the ANNIE entity recognition system provided with GATE to handle our data. Automatic annotation of the full corpus was bootstrapped using machine learning based on hand-validated annotations in a 100-document (10,000-word) sub-corpus.
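To give a concrete (if highly simplified) picture of what gazetteer-driven entity annotation involves, the following minimal Python sketch tags text spans by lookup against entity lists. It is not the project's actual pipeline – that work used GATE's ANNIE components and hand-extended JAPE rules and gazetteer lists – and the entries shown here are invented placeholders.

```python
# Minimal gazetteer-style entity tagger (illustrative only; the project itself
# used GATE/ANNIE, not this code). The gazetteer entries are hypothetical.
GAZETTEER = {
    "PERSON":       ["Cordell Hull", "Nomura", "Sumner Welles", "Franklin D. Roosevelt"],
    "ORGANIZATION": ["Department of State", "Imperial Japanese Navy"],
    "TREATY":       ["Tripartite Pact"],
}

def annotate(text):
    """Return (start, end, type, surface) tuples for every gazetteer match in the text."""
    annotations = []
    for entity_type, entries in GAZETTEER.items():
        for entry in entries:
            start = text.find(entry)
            while start != -1:
                annotations.append((start, start + len(entry), entity_type, entry))
                start = text.find(entry, start + 1)
    return sorted(annotations)

if __name__ == "__main__":
    memo = "Secretary Cordell Hull reported that Nomura referred to the Tripartite Pact."
    for ann in annotate(memo):
        print(ann)
```

In a GATE-style pipeline such lookups are only a starting point: pattern-matching rules combine them with contextual evidence before annotations are committed, which is where most of the extension work described above was done.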
7.2.3 Attitude Analysis

Our work builds on the increasing body of research concerned with the detection of attitudes and opinions in text2 to provide information to the historian about the orientation of language in the documents and document segments in our corpus. Because the content of all the documents in the collection is linked to the time-line representing the progression of events, it provides a means to study the impact of events on attitudes and vice versa, as well as the overall progression of attitude change over time. Much of the current work on attitude/opinion analysis is concerned with identifying favorability/unfavorability (polarity) toward a given topic, and/or an indication of the author's emotional state with respect to that topic. Methodologically, this work expands approaches to content analysis undertaken by social scientists in the 1960s and 1970s, by largely relying on pre-defined lists of categorized words, phrases, collocations, etc. constructed either by hand or automatically. Most of this research has focused on polarity of opinion toward the document topic (Pang et al., 2002; Turney, 2002) or in individual sentences (e.g., Wiebe et al., 2004; Yu and Hatzivassiloglou, 2003), although there have been some recent attempts to address smaller text segments using deeper linguistic analysis (Bethard et al., 2004). Because our documents deal with a limited domain and are (more or less) stylistically consistent, we are developing means to automatically identify attitude and opinion by exploiting the rich syntactic and semantic annotation of our corpus and its supporting ontology. Rather than whole documents or single sentences, we focus on contiguous text segments that are attributed to a given point of view or "profile", the different types of which include direct statements (letter content that is not reported speech or quoted material), reported speech, and third-hand reported speech, as noted above, in order to determine whose attitude is represented.

2 For a good overview of recent work, see Exploring Attitude and Affect in Text: Theories and Applications. Papers from the 2004 AAAI Spring Symposium, Technical Report SS-04-07, AAAI Press, 2004.
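The kind of category counts reported in Figure 7.1 can be illustrated with a minimal sketch of the underlying computation: for one speaker's utterance, count the proportion of tokens that fall into each category list. The category lexicons below are tiny invented stand-ins for the project's actual resources, and the real system first attributes utterances to profiles before counting.

```python
# Sketch of word-category percentages of the kind shown in Fig. 7.1.
# The category word lists here are small, invented stand-ins for real lexicons.
CATEGORY_LEXICON = {
    "Positive": {"peace", "friendly", "cooperation", "hope"},
    "Strong":   {"must", "demand", "insist", "force"},
    "Power":    {"control", "occupy", "embargo", "fleet"},
    "Negative": {"regret", "refuse", "fail"},
    "Hostile":  {"attack", "threat", "war"},
}

def category_percentages(utterance):
    """Percentage of tokens in each category for one speaker's utterance."""
    tokens = [t.strip('.,;:"').lower() for t in utterance.split()]
    total = len(tokens) or 1
    return {cat: 100.0 * sum(t in words for t in tokens) / total
            for cat, words in CATEGORY_LEXICON.items()}

print(category_percentages(
    "The United States must insist that Japan withdraw its forces and end the threat of war."))
```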
7.3 Ontology Support for Historical Research

Ontology support for historical documents poses particular problems due to changes in the "facts" about the world that the ontology represents. One approach to these problems is to construct a series of independent ontologies, each of which is called into play for the appropriate documents. There has been some work within the ontology community on "evolving" or "dynamic" ontologies (e.g., Heflin and Hendler, 2000; Kahng and McLeod, 2000; Davies et al., 2002), primarily because of the practical need to modify ontologies as they are developed to correct errors, enlarge or shrink the set of included concepts, or adjust concepts to reflect changes in the domain. Such ontologies exist as a series of versions, similar to versions of computer software (Klein and Fensel, 2001; Klein, 2002; Stojanovic and Motik, 2002),
with the underlying assumption that, like computer software, the relevant or “best” ontology is the most recent one. In contrast, ontology “versions” have equal status in our application; the validity of a given version is dependent on the temporal context of the user’s query. Temporal contextualization can be treated as a special case of “multiple views” into an ontology. Multiple ontology views are most often intended to enable users to access information relevant to a given situation or viewpoint rather than to a given timeframe. Nonetheless, the same strategies used to support multiple views can be used to represent temporally contextualized information using a mapping mechanism such as that provided in the web ontology language (OWL) (Bechhofer et al., 2004; Patel-Schneider et al., 2004). However, while OWL and similar representation mechanisms provide means to map different ontology versions, they provide no means to express the semantics of the differences.3 In our application, the type of change that modifies the ontology is as important as fact of the change itself – that is, each change can be viewed as a particular type of event that is represented in the ontology and therefore accessed for reasoning and retrieval. The work on ontologies that is most relevant to our application has been done in the geo-spatial domain, which models objects of the different branches of geography, including physical and political geography, geology, geomorphology, climatology, meteorology, etc. Within this field, ontologies of change and process relations have been developed and used to interconnect ontologies representing different “snapshots” of the geographical world as it unfolds over time, in order to enable spatio-temporal reasoning (Grenon and Smith, 2004; Kauppinen and Hyvönen, 2007). These models are necessarily focused on processes that affect geographical entities and the specifics of their spatial characteristics and overlaps as they evolve over time. As such, this work does not address some of the problems of handling the FDR data, and at the same time addresses areas irrelevant to our domain. For example, Grenon and Smith’s SNAP and SPAN spatial ontology allows for widely variant views of the geographical realm depending on granularity, from large-scale geographical features such as mountains and oceans, down to the level of particular vehicles, buildings, and even individuals. Kauppinen and Hyvönen’s ontology time series, which is most related to our historical data, is primarily concerned with computing geospatial overlaps among geopolitical regions over a historical time period. However, in both of these projects the need to deal with entities that change over time is critical, and we have been able to adapt and extend many of the fundamental ideas and methods of these models to handle historical data.
3 Extensions to OWL have recently been proposed to provide means to represent change semantics (Avery and Yearwood, 2003).

7.4 The Historical Ontology

The historical ontology provides two perspectives on the domain it models: a synchronic view, representing the state of the world during a given time interval, and a diachronic view that traces the changes in the domain as they unfold over
time. The synchronic perspective is provided by a series of snapshot ontologies that are linearly ordered over an encompassing time span – in our case, the 10-year period from 1931 to 1941 – each of which provides a snapshot of the world during a temporal sub-interval. The diachronic perspective is provided through a single, encompassing time and event ontology that covers the entire time span covered by the series of snapshot ontologies. The time spans associated with the snapshot ontologies in our model are not fixed in duration; rather, they are determined by the occurrence of historical events that change the knowledge represented in the ontology, such as the Japanese invasion of Manchuria (which changes Manchuria's status) or the signing of the Tripartite Pact (after which Japan becomes one of the Axis Powers). These key events may or may not be referenced in the documents themselves. Any event referenced in the documents is instantiated in the time and event ontology and associated with its date of occurrence. Figure 7.2 shows an overall view of the historical ontology.

Fig. 7.2. Overview of the historical ontology: snapshot ontologies defined over successive temporal intervals, the instances that populate them, and the time and event ontology with its key events.

7.4.1 A Model for Historical Ontologies

We define an historical ontology H = <OS, OE, T>, where OS is a temporal ontology series, OE is a time and event ontology, and T is the time span covered by H.

7.4.1.1 The Temporal Ontology Series

The basis of the historical ontology is a temporal ontology O = <S, T>, which consists of the set of instances S representing entities that exist at any time during the time span T. Each instance i ∈ S is associated with a temporal interval Ti ⊆ T and is represented as a triple <identifier, Ti, P>, where P is a set of properties associated with the instance. Instances in O represent endurant entities in the world, that is, entities that have continuous existence and a capacity to endure, including
physical entities such as a particular person, a ship, a document, an army, etc., as well as governments, countries, and cities. In our application, T is the 10-year interval covered by our documents. The limits (i.e., start and end points) of the temporal spans associated with instances in O define exclusively and exhaustively a set of change points. Change points are zero-length temporal intervals that mark the points at which one or more instances in O come into or go out of existence, or are modified. For example, in 1940 Vichy France comes into existence as a geo-political entity, whereas French Indochina as a French colonial protectorate ceases to exist as a unified entity. Modifications to an instance occur when property values change, as, for example, when Cordell Hull takes on the role of US Secretary of State.

A temporal ontology series OS is a tuple <O, C>, where O is a temporal ontology and C = (t0, ..., tn) is an ordered sequence of change points with ti < ti+1, 0 ≤ i < n, so that T = [t0, tn]. Thus, OS includes a series of n "snapshot" ontologies defined by contiguous, non-overlapping temporal intervals covering the entire span of time represented by T. Each instance included in O belongs to and persists over at least one snapshot ontology. A snapshot ontology for a given time span Ts = [ti, ti+1] is created by identifying all instances that persist over Ts.4 Figure 7.3 provides a graphic representation in which the temporal limits associated with four ontology instances define five change points, t0 through t4, which in turn define four distinct and contiguous temporal spans. A snapshot ontology representing the "state of the world" during a given span is constructed from all resources that persist over that span. For example, in Figure 7.3 instance i1 is associated with interval [t1, t3], instance i3 is associated with interval [t1, t4], and instance i4 is associated with interval [t2, t3]. This information can be used to construct ontology O3 = <{i1, i3, i4}, [t2, t3]>.
Fig. 7.3. A temporal ontology series: the temporal extents of instances i1–i4 define change points t0–t4, which delimit the four snapshot ontologies O1–O4.
4 The model for representing temporal ontologies and ontology series is adapted from the model for ontology time series defined in Kauppinen and Hyvönen, 2007.
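The definitions above translate fairly directly into a small data model. The following Python sketch is an illustration, not the project's implementation: it represents instances with their temporal intervals, derives change points from the interval limits, and builds one snapshot ontology per sub-interval. The extent assumed for i2 is invented, since the text only specifies the intervals for i1, i3 and i4.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """An endurant entity and the temporal interval [start, end] over which it exists."""
    identifier: str
    start: int          # change points are modelled here simply as integers
    end: int
    properties: dict

def change_points(instances):
    """The limits of the instances' intervals, ordered, define the set of change points."""
    return sorted({t for i in instances for t in (i.start, i.end)})

def snapshot(instances, t_i, t_next):
    """Instances that persist over the whole sub-interval [t_i, t_next]."""
    return [i for i in instances if i.start <= t_i and i.end >= t_next]

def ontology_series(instances):
    """One snapshot ontology per contiguous sub-interval between change points."""
    points = change_points(instances)
    return [((t_i, t_next), snapshot(instances, t_i, t_next))
            for t_i, t_next in zip(points, points[1:])]

# Toy version of Figure 7.3: i1 over [t1,t3], i3 over [t1,t4], i4 over [t2,t3];
# the extent of i2 ([t0,t2]) is an assumption made for this sketch.
instances = [
    Instance("i1", 1, 3, {}), Instance("i2", 0, 2, {}),
    Instance("i3", 1, 4, {}), Instance("i4", 2, 3, {}),
]
for interval, members in ontology_series(instances):
    print(interval, [m.identifier for m in members])
# The sub-interval (2, 3) yields ['i1', 'i3', 'i4'], i.e. the O3 of the example above.
```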
7.4.1.2 The Time and Event Ontology

In addition to a temporal ontology series, the historical ontology includes a time and event ontology OE = <R, T>, where R is a set of instances representing events that occur in time, such as a communication between Roosevelt and his Secretary of State, the attack on Pearl Harbor, the imposition of an oil embargo on Japan, etc., together with temporal intervals. T is the entire span of time represented in the temporal ontology series described above. Event instances are associated with temporal intervals. Historical events may be categorized either as KeyEvents, such as a military invasion or a change in an administrator, which cause modifications to the ontology and are therefore associated with temporal intervals that are change points, or as InformationalEvents that do not modify the ontology, such as a communication event involving two government officials or a visit to the White House by the Japanese Ambassador. Note that some historical events represented in OE are not mentioned in the documents in the collection, but rather comprise a portion of general world knowledge concerning major events prior to and during World War II.

7.4.1.3 Ontology Change Events

A class of OntologyChangeEvents is defined that provides mappings between instances in different snapshot ontologies, including, for example, unification, separation, name change, change of political control, change of administrator, etc. OntologyChangeEvents are conceptually similar to the change bridges described in Kauppinen and Hyvönen (2007); however, we follow Grenon and Smith (2004) in making OntologyChangeEvents entities in their own right. This enables reasoning involving the change events themselves, including patterns of causal relations, etc., which are of particular interest to historians and political scientists. OntologyChangeEvents may be qualitative or substantial. Qualitative changes include the following:

– Change in property value: In many cases the value of one or more properties of an entity will be instantiated by different values at different times. For example, after the Japanese invasion of Manchuria, the value of the governed-by relation changes, but the country or area itself remains the same and is the subject of the change – that is, the property itself remains associated with the entity while transitioning through successive values.
– Qualitative creation: Henry Lewis Stimson takes on the role of Secretary of State.
– Qualitative destruction: Henry Lewis Stimson ceases to be Secretary of State.

Qualitative changes are temporally transitive – that is, the entity affected by the change persists across time spans. For example, Stimson is still the same person instantiated in the ontology before and after he becomes Secretary of State. Similarly, Japan is still Japan even after it becomes an Axis Power.
Substantial changes occur when entities are created or destroyed, as, for example, when a geo-political entity is divided up so as to produce two or more new geo-political entities, or when two or more geo-political entities are unified. For example, French Indochina was once one entity, governed by French colonial rule; after the Japanese invasion in 1940, its territory is divided into Northern Indochina and Southern Indochina, each with its own government and control. Effectively, at the point of change the concept of French Indochina as a geo-political entity becomes obsolete, while two new entities representing Northern Indochina and Southern Indochina are introduced. Substantial changes are temporally intransitive.

7.4.1.4 Temporal Intervals

Temporal intervals are associated with both historical events (KeyEvents and InformationalEvents) and OntologyChangeEvents. Change points are a special sub-class of temporal interval used to delimit the temporal spans associated with members of OS, as defined above. Events in OE are related to temporal intervals via a "TemporalLocation" property. Time in our framework is regarded as a linear continuum; temporal intervals are related to one another with the primitive relation "before", a strict total order that holds between two temporal intervals when one is earlier than the other. In principle, temporal intervals can be defined at any level of granularity and may overlap or be discontiguous. In our current implementation, we have taken a simplified approach wherein each instance of a temporal interval in the ontology represents a single date in month-day-year form. In the terminology of interval temporal logic, we can say that every temporal interval has length zero and therefore represents a temporal point consisting of a single day. Each event in OE is associated with one such temporal interval; events in our ontology are thus regarded as strictly punctual, with no meaningful duration. Durational events – for example, the Japanese occupation of Northern Indochina or the war between Great Britain and Germany – can be inferred on the basis of events (e.g., a country is at war with another power if it has declared war on that country and no truce has been established) or of information in OS (e.g., the change points associated with the instance of Northern Indochina whose properties indicate that it is under military occupation by Japan define the interval during which this property holds).

7.4.2 Relations Among Ontologies

The historical ontology framework includes several types of relations within and among ontologies. Intra-ontological relations exist between constituents of a single ontology, for example, the relation of part to whole between Southeast Asia and the Pacific Region. Trans-ontological relations exist between entities that are constituents of different ontologies, that is, between two or more snapshot ontologies or between the endurant temporal ontology and OE. A typical example
involves a participation relation between an agent in one of the endurant ontologies and an event in OE, such as the Japanese military force's participation in the invasion of Manchuria. Finally, meta-ontological relations exist between whole ontologies or between an ontology and an entity. An example of a between-ontology relation is the relation of temporal order between different snapshot ontologies in OS.

Figure 7.4 gives an example of some relations among components of OS (in white) and OE (in gray). Instances are depicted as squares and classes as ovals. The center of the figure is a separation event ("separation21"), which is an entity in OE. It is associated with a temporal interval that is a change point ("t6"), and through the "before" and "after" relations it provides a bridge from the instance of French Indochina to the instances of Northern and Southern Indochina into which it is split after change point t6. These three instances represent endurant entities, and are therefore a part of OS. French Indochina is a constituent of the snapshot ontology associated with (indexed by) change point t5, and potentially of other snapshot ontologies associated with change points prior to t5 as well. Northern and Southern Indochina are constituents of the snapshot ontology indexed by t6 and may be constituents of snapshot ontologies with later change point indexes.5 Figure 7.4 also shows the key event from OE that is associated with t6.

Fig. 7.4. Ontology relations: the instance separation21 (of class CHANGE EVENT, type SEPARATION) links FRENCH INDOCHINA (before) to NORTHERN INDOCHINA and SOUTHERN INDOCHINA (after); its has-time value is the change point t6, which is also the time of the JAPANESE INVASION key event.

5 It should be noted that it would be possible for French Indochina to be a constituent of a snapshot ontology or ontologies with change points later than t6 if Northern and Southern Indochina were to have re-merged.
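The kind of bridge that separation21 provides can likewise be sketched as a first-class object linking instances across a change point. The class and property names in the following Python sketch are simplified approximations of those discussed in this chapter, not the actual encoding used in the project.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyChangeEvent:
    """A first-class change event bridging instances across a change point."""
    identifier: str
    change_type: str            # e.g. "separation", "unification", "name-change"
    change_point: str           # e.g. "t6" (a zero-length temporal interval)
    before: list = field(default_factory=list)   # instances valid before the change
    after: list = field(default_factory=list)    # instances introduced by the change
    caused_by: str = ""         # optional link to a KeyEvent in the time/event ontology

# Simplified rendering of the situation depicted in Figure 7.4
separation21 = OntologyChangeEvent(
    identifier="separation21",
    change_type="separation",
    change_point="t6",
    before=["FrenchIndochina"],
    after=["NorthernIndochina", "SouthernIndochina"],
    caused_by="JapaneseInvasionOfIndochina",   # the associated KeyEvent
)

def successors(instance_id, change_events):
    """Follow change-event bridges forward from an instance (substantial changes)."""
    return [a for ev in change_events if instance_id in ev.before for a in ev.after]

print(successors("FrenchIndochina", [separation21]))
# -> ['NorthernIndochina', 'SouthernIndochina']
```

Because the change event is itself an object with a type and a cause, queries such as "all separation events caused by invasions" become simple filters over these instances, which is the kind of reasoning over change events that the chapter argues historians need.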
7.5 The FDR Ontology

The FDR/Pearl Harbor project is building an historical ontology, based on the model presented in the previous section, representing the entities and events in its document collection. The starting point for the ontology is the Suggested Upper Merged Ontology (SUMO) (Niles and Pease, 2001) together with the Mid-Level Ontology (MILO) and several ontologies available from the Agent Semantic Communication Service (ASCS).6 Our current adaptation includes only the information from these ontologies that is relevant to our domain, which comprises a relatively small subset – for example, of the over 1,500 classes in MILO, we use only 185. As such, our strategy for building our ontology is both top-down, in the sense that we begin with the upper-level concepts defined in ontologies like SUMO, and bottom-up, in that we rely on the entities identified in our document collection to determine the set of classes and relations to include. Our most substantial modification to the ontologies is extension; for example, we extend the "form of government" class to include "collaborative governments", such as Vichy France or northern Indo-China between July 1940 and July 1941, and "governments-in-exile," such as the Dutch Government during the same period. We also eliminate inapplicable classes such as "Former Soviet or Eastern European Country" and a sizable number of nations/governments that did not exist until after the end of World War II. All of our ontologies were developed within GATE (Bontcheva et al., 2004) using Protégé.7

The FDR temporal ontology includes only those endurant entities that are explicitly mentioned in the document collection or appear in the associated metadata (author, recipient, location, date, etc.), which fall into the following general categories:

– geopolitical entities, such as countries, cities, and regions.
– geopolitical organizations, primarily governments, their major subdivisions (diplomatic, executive, legislative, etc.), departments, and officials.
– military organizations, most of which are associated with governments. This category includes the various military branches and forces, and military positions.
– military vehicles/apparatus, such as ships, tanks, aircraft, etc.
– geographical objects, including major forms such as continents, lakes, oceans, mountains, and islands. Note that geographical objects also include geopolitical entities such as countries and cities.
– geographical artifacts, including major artifacts with strategic importance such as roads, bridges, canals, etc.
– documents, such as treaties, pacts, modus vivendi, etc., regarded as physical objects.
– agreements/contracts/cooperations, which include pacts, treaties, etc. as entities that depend for their existence upon the countries involved. Some agreements can exist as a physical document without also being a contract, as, for example, the modus vivendi presented by the Japanese to the US several weeks prior to the bombing of Pearl Harbor.
– people, most of whom fill government or military positions in the US, Japan, Great Britain, France, Germany, and Russia, but also several others who served as formal or informal advisors to the Roosevelt Administration.
– political organizations, including entities such as the Nazis and the Axis and Allied Powers.

Entities in the time and event ontology include both events mentioned in the documents and major historical events that cause modifications to the domain model. An important class of events in the ontology is communication events, which are sub-classified at a relatively fine level of granularity due to their importance in our data. A major goal of the FDR/Pearl Harbor project is to enable retrieval of communication content (typically individual statements but also including whole documents), to apply attitude analysis to determine its orientation along dimensions such as power/control, submission, hostility, cooperation/friendliness, etc., and to map attitudinal orientation over time. For example, the historian may wish to investigate changes in attitude in responses by any representative of the Japanese government to a question posed by a US government official between June 1941 and December 1941. We have adopted a portion of the ontology of communication verbs in the FrameNet database (Ruppenhofer et al., 2005), which includes sub-classes such as request, statement, judgment-communication, etc., as a starting point to which we have added further sub-classifications.8 In particular, we have made distinctions on the basis of polarity, which is not accounted for in FrameNet (e.g., "acclaim" and "condemn" are both categorized as judgement_communication verbs). Communication event classes in FrameNet were mapped to subclasses of LinguisticCommunication in SUMO where possible, or used to extend SUMO.

6 http://reliant.teknowledge.com/DAML/
7 http://protege.stanford.edu
8 We utilized a clustering algorithm to automatically generate sub-classes of communication events and sense-tag the verb sets with WordNet senses; see (Ide, 2006) for a description of the method.
7.5.1 Open Problems and Future Work

The creation of the FDR ontology poses some interesting challenges beyond accounting for temporal contextualization. For example, in some cases the temporal transitivity of a given change – i.e., the persistence or non-persistence of an entity as a result of the change – is not clear-cut. This problem arises in the representation of France, which is divided after its submission to Nazi Germany in 1940 into two politically distinct regions: the northern area occupied and controlled by the Germans, and the southern area that is officially ruled by the puppet Vichy government. However, it is inappropriate to eliminate from the ontology after June 1940 the original instance of France as a geo-political entity associated with its entire land area, since the concept of the whole of France as a sovereign nation remains, if only due to the existence of the Free French Forces of Charles de Gaulle based in London, which claimed to be the legitimate government of France during this period. There is no straightforward way to model this situation in existing systems such as SUMO without violating constraints that are obviously appropriate in most cases, such as the requirement that a geopolitical area has only one government, or without losing the information that the geopolitical area represented by the whole of France corresponds to the two geopolitical areas represented by Vichy France and German-occupied France. This raises the general question of the degree to which the identity of a geopolitical area persists as it undergoes political and geographical changes; that is, need there be a general concept of, say, France that persists over time despite such changes? Conversely, at what point are changes substantial enough to effectively create a different or new concept?

Our current implementation supports modifications and the creation and deletion of ontology instances, but it does not support modifications to ontology classes. In principle, the model outlined above could be extended to enable modifications to ontology classes as well as instances, but in practice such modifications would involve considerable maintenance overhead and could potentially lead to conflicting or invalid instance declarations. However, to handle historical data covering significantly greater time spans, it may be necessary to provide for the modification of class definitions.9 We are currently exploring this possibility.

9 Creation and destruction of classes in the current model would be handled by associating classes with the temporal intervals during which they are valid.
7.6 Conclusion

The historical ontology framework outlined in this chapter is designed to support historical research involving the documents drawn from the FDRL that relate to Japanese-American relations between 1931 and 1941. With an ontology in the background, historians have a previously unavailable capability for "generic search", involving, for example, requests to see documents in which any representative of the Japanese government (rather than a specific person or list of people) is mentioned.
In addition, the historical ontology provides for temporal contextualization of query terms, in order to retrieve all relevant results. For example, if a query concerns the Secretary of State, the results will include documents referring to Henry Lewis Stimson prior to 1933, and Cordell Hull thereafter. Temporal contextualization can also be explicitly provided in the query itself; for example, a query referring to members of the Axis Powers that is constrained to the period between 1936 and 1939 will yield results for Germany and Italy only, since Japan became an Axis Power only after September, 1940, the date of the signing of the Tripartite Pact. The two-part architecture of the historical ontology separates temporally-defined entities, including events and temporal intervals themselves, from endurant entities that exist in the physical and geo-political realms. We feel that this makes the task of domain modeling somewhat more manageable, since this division corresponds conceptually to familiar perspectives on reality, that is, the world of objects and entities (in the snapshot ontologies) vs. events. Also, the inclusion of OntologyChangeEvents as a first class object in the time and event ontology enables retrieval of information represented in the ontology itself about the changes to an entity or property over time, such as the course of change in the Asian regions under Japanese control between 1935 and 1941, or all “invasion events” that occurred in 1939.
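A minimal sketch of the temporal contextualization of query terms just described might resolve a role mentioned in a query against the holder valid at a document's date. The lookup table below is an invented Python stand-in for the snapshot ontologies: the office-holding dates for Stimson and Hull are a matter of historical record, but the structure is not the project's actual implementation.

```python
from datetime import date

# Role holders with validity intervals (a tiny stand-in for the snapshot ontologies).
ROLE_HOLDERS = {
    "US Secretary of State": [
        ("Henry Lewis Stimson", date(1929, 3, 28), date(1933, 3, 4)),
        ("Cordell Hull",        date(1933, 3, 4),  date(1944, 11, 30)),
    ],
}

def resolve(role, document_date):
    """Return the holder(s) of a role valid at the date of the document."""
    return [name for name, start, end in ROLE_HOLDERS.get(role, [])
            if start <= document_date <= end]

print(resolve("US Secretary of State", date(1932, 1, 7)))   # ['Henry Lewis Stimson']
print(resolve("US Secretary of State", date(1941, 8, 17)))  # ['Cordell Hull']
```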
References

Avery, J. and J. Yearwood. 2003. dOWL: A Dynamic Ontology Language. ICWI 2003: 985–988. Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuiness, D., Patel-Schneider, P. and L.A. Stein. 2004. Owl Web Ontology Language Reference. On-line publication at http://www.w3.org/TR/owl-ref/ Berners-Lee, T., Hendler, J. and O. Lassila. 2001. The semantic web. Scientific American 284(5): 34–43. Bethard, S., Yu, H., Thornton, A., Hatzivassiloglou, V. and D. Jurafsky. 2004. Automatic extraction of opinion propositions and their holders. Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text. AAAI Press. Bontcheva, K., Tablan, V., Maynard, D. and H. Cunningham. 2004. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering 10(3/4): 349–373. Cunningham, H., Maynard, D., Bontcheva, K. and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Davies, J., Duke, A. and A. Stonkus. 2002. Ontoshare: Using Ontologies for Knowledge Sharing. Proceedings of the International Workshop on the Semantic Web at the Eleventh International World Wide Web Conference. Fensel, D., Hendler, J., Lieberman, H. and W. Wahlster. 2003. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. The MIT Press. Grenon, P. and B. Smith. 2004. SNAP and SPAN: Towards Dynamic Spatial Ontology. Spatial Cognition and Computation 4(1): 69–104.
Heflin, J. and J.A. Hendler. 2000. Dynamic Ontologies on the Web. Proceedings of AAAI/IAAI 2000, 443–449. Ide, N. 2006. Making Senses: Bootstrapping Sense-Tagged Lists of Semantically Related Words. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing. Lecture notes in Computer Science 3878, Springer, pp. 13–27. Kahng, J. and D. McLeod. 2000. Dynamic classification ontologies. In Arbib, M.A. and J. Grethe (eds.), Computing the Brain: A Guide to Neuroinformatics, Academic Press. 241–254. Kauppinen, T. and E. Hyvönen. 2007. Modeling and Reasoning about Changes in Ontology Time Series. In Kishore, R., Ramesh, R. and R. Sharman (eds.), Ontologies: A Handbook of Principles, Concepts, and Applications in Information Science. Springer, 319–338. Klein, M. 2002. Supporting Evolving Ontologies on the Web. In Lindner, W. and J. Stuller (eds.), Proceedings of the EDBT 2002 PhD Workshop, 51–58. Klein, M. and D. Fensel. 2001. Ontology Versioning on the Semantic Web. Proceedings of the International Semantic Web Working Symposium (SWWS), 75–91. McGuinness, D.L. 2003. Ontologies Come of Age. In Fensel, D., Hendler, J., Lieberman, H. and W. Wahlster (eds.), Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 171–94. Niles, I. and A. Pease. 2001. Towards a Standard Upper Ontology. In Welty, C. and S. Barry (eds.), Formal Ontology in Information Systems, ACM Press. 2–9. See also http://www.ontologyportal.org Noy, N. and M. Klein. 2003. Ontology Evolution: Not the Same as Schema Evolution. Knowledge and Information Systems 5. Pang, B., Lee, L. and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of EMNLP 2002, 79–86. Patel-Schneider, P., Hayes, P. and I. Horrocks. 2004. Owl Web Ontology Language Semantic and Abstract Syntax. On-line publication at http://www.w3.org/TR/owl-semantics/ Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D. and A. Kirilov. 2004. KIM – A Semantic Platform for Information Extraction and Retrieval. Journal of Natural Language Engineering 10: 3–4. Cambridge University Press, pp. 375–392. Ruppenhofer, J., Ellsworth, M., Petruck, M. and C. Johnson. 2005. FrameNet: Theory and Practice. On-line publication at http://framenet.icsi.berkeley.edu/ Sider, T. 2001. Four-Dimensionalism. An Ontology of Persistence and Time. Oxford: Clarendon Press. Stell, J.G. and M. West. 2004. A 4-dimensionalist Mereotopology. In Varzi, A.C. and L. Vieu (eds.), Formal Ontology in Information Systems. IOS Press, 261–272. Stojanovic, L. and B. Motik. 2002. Ontology Evolution Within Ontology Editors. Proceedings of the Onto Web-SIG3 Workshop at the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW), 53–62. Turney, P. 2002. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification Reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 417–424. Wiebe, J., Wilson, T., Bruce, R., Bell M. and M. Martin. 2004. Learning subjective language. Computational Linguistics 30:3. Yu, H. and V. Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003).
8 An Amorphous Object Must be Cut by a Blunt Tool

Makoto Nagao
National Institute of Information and Communications Technology, Tokyo, Japan
8.1 Example-Based Machine Translation

I first proposed Example-Based Machine Translation (EBMT) as an artificial intelligence approach to language translation, and named it the "machine translation by analogy principle." It was presented at a NATO workshop on Artificial and Human Intelligence which was held in France in 1981. In the late 70s I was struggling to write analysis, transfer and generation grammars for machine translation between Japanese and English. The basic grammar rules are not many, but if we want to handle real texts we have to write many additional grammar rules – say, several hundred. Still the grammar is not complete: expressions appear which cannot be handled by such a set of grammar rules. They are called extra-grammatical sentences. I realized that no one has written a complete grammar of English, for example. A language is always changing, and there exists no notion of a "complete" grammar for a language; therefore we have to introduce the concept of "learning" into a language system.

Another problem I was confronted with was the quality of the translated sentences. There were various reasons for the poor quality of the translation. One obvious reason is that machine translation systems had so far basically depended on the compositionality principle: an input sentence is decomposed into minimum units (words), these minimum units are translated unit by unit, and the results are then synthesized into a sentential string of the target language by referring to the structure of the input sentence of the source language. In such a machine translation system, which is called a rule-based machine translation (RBMT) system, improving a grammar is a hard task. We have to find where the real reason for the failure of analysis of a sentence lies, and have to change some of the existing rules or add new rules. By doing this, analysis of that particular sentence will become successful, but bad side effects may arise, such that some sentences which were analyzed successfully before cease to be analyzed well. In RBMT the addition or change of a rule has a very complex effect on the systematicity and consistency of the grammar as a whole. This process of grammar
improvement is very time consuming and becomes almost impossible when the set of grammar rules grows to several hundred or more. Another difficulty with RBMT is that too many parsing results are produced and there is no reliable method to choose a proper structure for an input sentence. To overcome these difficulties I was forced to consider other approaches to machine translation.

At school, students learn a second language by memorizing a lot of phrases and sentences with their translations as pairs, and by utilizing these for the translation of new expressions. The process is first to find an expression in memory similar to a given expression, and then to produce a similar translation expression by referring to the translation expression in memory. When a person cannot find expressions in his or her memory similar to a given expression, or when a sentence translated according to his or her memory is pointed out to be wrong and a correct translation is given by a teacher, he or she memorizes it as a pair with the original expression. Usually expressions in memory are not very long; they are nowadays called chunks. Therefore, when a long sentence is given for translation, we first find chunks in the sentence similar to those in an example database and replace them with their translations. Then we utilize grammar rules to synthesize the translation expressions of the chunks into a translation sentence. These grammar rules are at the sentential level. In these processes we have to do morphological analysis/synthesis and some other grammatical treatment. This translation process is analogous to the human translation process, and I named it analogy-based machine translation. Later it came to be called example-based machine translation (EBMT).

Analogy-based machine translation has the good characteristic that translation quality can be improved step by step by accumulating example translation pairs. This is a kind of learning process for the system. Another good point is that the system does not depend on the compositionality principle and so can produce expressions which are rather free from the words and phrasal structures of the source language expressions. This is the main reason why EBMT can achieve high quality translation. For example, the expression "I'm the one who decides." is translated word by word – (I) (decide) (the one) – by RBMT, but a better translation, an emphatic expression in the target language, can be given by EBMT.

One more difficulty we encountered in the development of RBMT was writing a transfer grammar from one language to another. In systems for language pairs such as French and Italian, or Japanese and Korean, the transfer grammar is not so complicated because there is a structural parallelism between the two languages. However, for a language pair such as Japanese and English, writing a transfer grammar is very difficult. We have so far had no linguistically sound study of this, but we have done a great deal of trial-and-error work on this part of RBMT. EBMT does not have a transfer grammar component; it just has an example database which is composed
of expression pairs of a source language and a target language, and this acts as a transfer from one language to another. Elimination of the transfer part is a great economy in the construction of a system. When I proposed the idea of EBMT, computer power was very poor and there was no memory space to keep hundreds and thousands of example expressions. Also, there was no accumulation of parallel text corpora from which to extract enough proper example pairs in the two languages. We had to wait fifteen or more years to realize, at least partially, the idea of example-based translation.
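The translation-by-analogy process described in this section can be caricatured in a few lines of code: look up chunks of the input in an example database of source-target pairs and stitch the stored translations together. The following Python sketch uses an invented English-French toy database (rather than the Japanese-English pairs discussed here), and its greedy longest-match strategy is only one simple way of selecting chunks.

```python
# Toy example database of source-target chunk pairs (invented; a real EBMT system
# would extract such chunks from an aligned bilingual corpus).
EXAMPLES = {
    "the red car": "la voiture rouge",
    "is parked": "est garée",
    "in front of the station": "devant la gare",
}

def translate(sentence):
    """Greedy longest-match over example chunks; uncovered words pass through."""
    words = sentence.lower().rstrip(".").split()
    output, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):          # try the longest chunk first
            chunk = " ".join(words[i:j])
            if chunk in EXAMPLES:
                output.append(EXAMPLES[chunk])
                i = j
                break
        else:                                       # no example covers this word
            output.append(words[i])
            i += 1
    return " ".join(output)

print(translate("The red car is parked in front of the station."))
# -> "la voiture rouge est garée devant la gare"
```

As the chapter notes, a real EBMT system also needs morphological processing and sentence-level grammar rules to merge the translated chunks; the sketch simply concatenates them.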
8.2 Characteristics and Problems of EBMT

There are roughly three basic methods in machine translation, namely rule-based MT (RBMT), example-based MT (EBMT) and statistics-based MT (SBMT). RBMT is based on linguistic theory and is a basic approach to machine translation. Everybody wishes that RBMT would develop soundly. However, as discussed in the previous section, there are difficulties in improving a grammar, particularly a transfer grammar, and in producing high quality translation. EBMT is better than RBMT in these respects: EBMT can be improved gradually by increasing the example pairs, and can achieve better quality in translation.

Let us consider how frequently each grammar rule in RBMT is used in the analysis of sentences. The figure below shows, rather symbolically, the frequency curve of the use of rules. Basic rules are used frequently, and specific rules are used less frequently. From this figure we can see that when the number of rules is increased, an additional rule is not used very often; it just corresponds to a very few specific expressions. In the extreme, a rule corresponds to one idiomatic expression and its variants. This can be regarded just as the case of example-based analysis of a sentence.

[Figure: frequency of use plotted against the addition of rules – basic rules are used frequently, while rules for specific expressions (of which there are very many) are each used rarely.]
In RBMT we have to struggle to add a new rule which keeps the consistency of a grammar as a whole. Compared to this difficulty we can add just an example pair to the memory in EBMT to cope with a new expression. SBMT basically utilizes the frequency property of sequences of three or more words. It may behave comparatively well for language pairs in the same language family such as French and Italian, but may have difficulty for language pairs like Japanese and English. Another difficulty of SBMT is to get reliable statistics for all existing three word sequences because available parallel text corpora are very limited.
People started serious research on EBMT quite recently and do not yet have enough experience, so one cannot yet definitively declare the priority of EBMT over RBMT and SBMT. I discussed some good points of EBMT over RBMT and SBMT above, but there are many problems in EBMT which remain to be solved in the future. The following are some of them.

We do not know what kinds of expressions are to be stored as good examples or chunks for EBMT, or how many such chunks are necessary to cover a text category. Are they several thousand, or more than one million? It will depend on the text category chosen for machine translation. Nowadays there are fairly large parallel text corpora for English and French, English and Japanese, etc., but not for many other language pairs. We also have to develop an automatic system to extract effective chunks from a parallel corpus.

To compare the similarity of two expressions, a thesaurus is used. It represents a kind of distance between two words. A thesaurus is generally organized as a tree structure, and the distance between two words is usually calculated as the number of steps needed to move from one word to the other in this thesaurus tree. Therefore the quality of the thesaurus tree directly influences the calculation of similarity between two chunks, and thus the quality of translation.

Similar expressions in a source language do not necessarily have similar expressions in a target language. For example:

( ) man picture wall : (a man) hang a picture on a wall.
( ) man bridge river : (a man) build a bridge on a river.

In such a case both expression pairs are stored in an example database, and a new expression is compared to these examples:

( ) I clothes hanger

In this case the distances between clothes and picture, clothes and bridge, hanger and wall, and hanger and river are calculated by using a thesaurus. If "clothes and picture" and "hanger and wall" are judged closer than the other pairs, the verb "hang" is chosen for the translation. If not, the result is the choice "build", and "I build the clothes on a hanger." is proposed as the translation. But this translation is wrong, and so the correct expression "I hang the clothes on a hanger." is given to the example database.
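The distance calculation described in this example can be sketched as follows. The hypernym tree and the stored example pairs in this Python sketch are invented miniatures; a real thesaurus would be far larger, and a real system would compare whole chunks rather than two slot fillers.

```python
# A tiny hypernym "thesaurus": each word points to its parent class. Both the tree
# and the stored examples are invented stand-ins for a real thesaurus and example base.
PARENT = {
    "picture": "artifact", "bridge": "structure", "clothes": "artifact",
    "wall": "structure", "river": "natural-object", "hanger": "artifact",
    "artifact": "object", "structure": "object", "natural-object": "object",
}

def path_to_root(word):
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def distance(w1, w2):
    """Number of steps through the thesaurus tree between two words."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = next(n for n in p1 if n in p2)           # lowest shared ancestor
    return p1.index(common) + p2.index(common)

# Stored examples: (object, location) slot fillers -> chosen English verb
EXAMPLES = [({"object": "picture", "location": "wall"}, "hang"),
            ({"object": "bridge", "location": "river"}, "build")]

def choose_verb(obj, loc):
    scored = [(distance(obj, ex["object"]) + distance(loc, ex["location"]), verb)
              for ex, verb in EXAMPLES]
    return min(scored)[1]

print(choose_verb("clothes", "hanger"))
# -> 'hang', because clothes is closer to picture than to bridge in this toy tree
```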
The Japanese verb in the examples above has more than fifteen different correspondences to English verbs, such as hang, build, spend, sit down on, multiply, start, wear, lock, play, pour on, spread, apply, tie, ..., and we do not have enough experience to clearly distinguish these by the accumulation of example expressions. However, we will be able to solve this problem by increasing the example pairs.

To get flexibility or efficiency in the calculation of similarity, some words in an example phrase can be replaced by variables with parameters which indicate the property/meaning of the replaceable words at specific positions in an example. But the specifications of these parameters are very difficult to give, and once parameters are introduced, changing or improving the parameters in example phrases becomes almost impossible. EBMT basically does not need to introduce such semantic parameters. For the discrimination of the meanings of a word, example phrases including the word just need to be stored.

The next problem will be to find the best set of chunks which covers an input sentence well. When an example database becomes large, this process will take a lot of time. The final merge of all allotted chunks and the remaining components in a sentence may also be done by a chunk at the sentential level which includes variables corresponding to chunks at the phrase level. This final merge will be done by grammar rules at the sentential level. We have to study chunks at the sentential level, because there are so many long sentences which have different structures.

In summary, the construction of a machine translation system by EBMT will be much easier, faster and more cost-effective than system construction by RBMT, because in EBMT we have just to accumulate example pairs step by step, while in RBMT we have to write and improve/tune grammars of analysis, transfer and synthesis. The change of some parts in RBMT has effects on other parts of the grammars and the system, and thus the improvement efforts sometimes have a chaotic impact.
8.3 Language is Amorphous

Language is produced in a human brain which has linguistic competence, and this linguistic competence can be specified by a rigorous formal system called a grammar. This is the position not only of Chomsky but also of many other linguists. However, we can question whether there is any inevitability to specifying the human language function by such a rigorous system as a grammar. A grammar represents just one aspect of a language, disregarding meaning, extra-linguistic functions and other aspects of a language. Human brain function is in fact not so rigorous. My supposition is that a basic brain function is the function of perceiving the similarity between two elements (sentential strings, sounds, figures, story processes, etc. – anything). I guess every human brain activity can be explained by this function operating on a huge amount of past experience in the brain. In language translation the human brain judges simply that these two expressions are similar, so that a similar translation expression can be produced. There is no strict criterion for the
similarity judgment in the brain, and people therefore make mistakes sometimes. Human brain function is very vague and not amenable to the same sort of rigor as phenomena usually treated by physics. I think that human brain function must be grasped not by a strict analytical attitude, but rather vaguely. EBMT was proposed from this standpoint. I would say this approach is quite Asiatic or Japanese, contrasted with very rigorous Western approaches, such as RBMT or SBMT. People respect theories which are supported by logics and mathematics rather than their common sense or experiences and their vast amount of knowledge. But there are many things or phenomena which cannot be explained by theories completely. Many human activities are in this category. EBMT is not theoretical, and may not give satisfaction to people who prefer theoretical approaches. But I think that it reflects more straightforwardly human brain activity. I don’t know in fact what approach reveals the language function best, but would like to take the side of EBMT because it is not analytic but more natural in the handling of what one wants to express. I doubt whether we can analyze such a vague object as a language strictly, and construct a necessary and sufficient set of grammar rules for it. We know that an unfixed/amorphous object cannot be cut by a sharp knife, but can be cut by a blunt tool more easily. This suggests that a vague object is to be handled by a vague method, and not so rigorously. This is the reason why example-based machine translation seems to be more successful than rule-based or statistics-based machine translation. Anyway, this can be seen as a typical Japanese style approach.
9 Homer, the Author of The Iliad and the Computational-Linguistic Turn

Sergei Nirenburg
Institute for Language and Information Technologies, University of Maryland, Baltimore County, MD, USA

Abstract:
This paper analyzes two sets of opposing opinions about the nature of meaning representations and knowledge resources. The first of these axes of disagreement is the opposition between an ineffable, “revealed” language of thought in the Fodor tradition and Wilks’ position that (using its strongest formulation) elements of the language of knowledge representation are essentially elements of a natural language. The second opposition is between a “scientifically” defined ontology, in Guarino’s sense, and human-oriented resources of knowledge about language, such as MRDs or WordNet. An attempt will be made to clarify some of the motivation behind these differing opinions. I will try to formulate my own positions on the above issues and will use as illustrations some modules of ontological semantics, a computationally-tractable theory of meaning, as implemented in the OntoSem text analyzer and the knowledge resources that support it
This paper continues an ongoing discussion of the nature of natural language processing (NLP), specifically, AI/NLP, that strain within NLP that is closest to the concerns of artificial intelligence (AI). AI/NLP studies issues of how to build machines that understand and generate language in a coherent manner approaching that of people. I happen to adhere to a representationalist approach that is concerned with explicit modeling of the processes of understanding and generating text.1

1 Such modeling relies on knowledge that might include but is not reducible to stochastic methods, a form of case-based text processing carried out essentially by reference to human performance recorded in large text corpora. In all essentials, the questions that such case-based reasoning asks are best formulated in terms of text generation, e.g., "What did people say/write next given what has already been generated up to the current point in the current text/sentence?" In this paradigm, it is not necessary (and not possible, without major modifications to the supporting knowledge) to seek the meaning of the text (for analysis) or the intention of the speaker (for generation). In fact, complex statistics of co-occurrence of natural language elements in documents is sometimes (as in latent semantic analysis, Landauer et al. 1998) declared to constitute the meaning of natural language texts. Case-based approaches may be viable, at least, in conception, for tasks that can be interpreted as relying, at the core, on comparisons of textual strings. This is why the successes of such approaches include such applications as detecting cheating in student essays, establishing text authorship or (possibly, more controversially) machine translation.
The representationalist approach involves modeling memory and language behavior using methods that potentially can help to provide answers to questions like What does this text mean? What goals does the author pursue? What new knowledge do I gain from it about my model of the world, including knowledge about types and about tokens of elements of that model?, How does this text alter my agenda of goals and plans? etc. This approach requires a knowledge apparatus that is a combination of knowledge about the world, knowledge about language and knowledge about the speech situation, in the broad sense that applies also to reading and writing, not only dialog. Knowledge about the world includes knowledge about imaginary, hypothetical and non-existent entities as well as those that really exist. It is also multi-faceted and includes knowledge about types of object and events and their instances. The knowledge is used to support text understanding and generation in a variety of NLP systems that rely on methods beyond sophisticated comparisons of textual strings. My points of departure in this paper are some of Wilks’ (2001) arguments against Fodor and Lepore’s (1998) criticism of Pustejovsky’s (1995) generative lexicon and his criticism (Wilks, 2002) of Guarino’s (1997, 1998; Gangemi et al. 2001) program of making the nature and format of ontologies for NLP sufficiently precise. Fodor, Lepore and Guarino are representationalists, just like Wilks (and Pustejovsky), so in this sense all the protagonists belong to the same large camp that stands in opposition to various forms of asemanticism – behaviorism, connectionism, statistical NLP, etc. But within the representationalist camp, further divisions appear. Thus, Wilks and, to a large degree, Pustejovsky are what in AI circles is informally known as “scruffies,” on account of their preference for dealing with the messy subject matter first and with developing clean metalanguages for the description of this subject matter later. Within the representationalist camp, the “scruffies” stand in opposition to the “neats” who start with idealized, almost always logic-based, languages and only having honed these ideal tools by studying their properties, turn to using them as metalanguages for the description of messy reality. Guarino is one of many “neats.” For the purposes of this paper, we will induct Fodor and Lepore into this cohort, too (though they might be offended to be so cavalierly bunched with AI types). Of course, the delimitation of camps or factions is always relative and unstable – consider, e.g., Sowa’s (2000) demonstration of similarities between the theories propounded by such apparent opposites as the formal philosopher Montague and the quintessential “scruffy” Schank. I would like right away to make a disclosure: my work in the area of Ontological Semantics (Nirenburg and Raskin 2004) makes me squarely a “scruffy.” It is not surprising, therefore, that I broadly agree with the opinions and especially the methodological and paradigmatic positions of Wilks (and Pustejovsky). But there are a few, as it were, intra-factional disagreements on which I would like to comment. Some of my arguments will revisit the topics of the dialog with Wilks in Nirenburg and Wilks (2001) concerning the nature of symbols in meaning representations. 
I will comment on some issues related to the nature of the representation language, the purpose of procedures in largely declarative knowledge representations and ambiguity and paraphrasing in static knowledge resources for NLP. In doing so,
I will make use of some examples of representational solutions in OntoSem, the most recent implementation of ontological semantics. I will also suggest that, while a century ago analytical philosophy turned to the study of language as a safer and more productive way of studying reality (or, at least, as a necessary precursor to studying reality), one way to bring philosophy into the twenty-first century is for it to describe a computational-linguistic turn and start discussing the use of natural language and languages of meaning representation by intelligent agents – either computer programs or people who build, study or use them.
9.1 Language ↔ LANGUAGE

Here is what Wilks has recently had to say about the nature of symbols in knowledge representations. "Meaning in the end is best thought of as other words and that is the only position that can underpin a lexicon and procedure based approach to meaning and one should accept this – whatever its drawbacks – because the alternative is untenable and not decisively supported by claims about ostension" (Wilks, Fodor). "The persistent, and ultimately ineradicable, language-likeness of purported ontological terms means that we cannot ever have purely logical representations, purged of all language-like qualities" (Wilks, Ontotherapy). So, for Wilks, language itself supplies the lexical stock of the metalanguage for the description of meaning. This is, incidentally, not a claim that all metalanguage is a natural language; it is a claim about the nature of its symbols.

Can the lexical elements of a metalanguage really be just "other words"? If so, there seem to exist a number of peculiarities that make these "metawords" quite dissimilar from the words that appear in regular texts and those whose meanings are defined in human-oriented dictionaries (let us call them simply words). First, metawords would have to be learned by native speakers of the language in question, or else those speakers would not be able to make simple inferences with them. This is true if we assume that, as is traditionally held, the metalanguage for representing meaning should not be ambiguous. Indeed, the actual interpretation of table (elements of the metalanguage are presented here in small caps) will fail if this metaword is not somehow disambiguated at least between the furniture and diagram senses. At the same time, something like table-furniture and table-diagram, if used as metawords, are already not quite elements of the object language – while I have not checked it, I do not think that such words will be attested in many corpora of English. Moreover, the meaning of even such complex metawords would often be only approximately understood by the native speaker who does not have access to their definitions (cf., e.g., the symbols for-profit-corporation or involuntary-olfactory-event in the OntoSem ontology). These difficulties have been observed over and over again during the process of ontology acquisition in the OntoSem environment.
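The point can be made concrete with a small illustrative sketch (my own illustration, not OntoSem code; the sense inventory is invented for the example): as long as the metalanguage symbol is just the word table, nothing licenses an inference until the symbol is split into metawords such as table-furniture and table-diagram.

# A word of the object language may map to several senses; a metaword names exactly one.
WORD_SENSES = {
    "table": ["table-furniture", "table-diagram"],
    "gasoline": ["gasoline"],
}

def as_metaword(word):
    # A word can serve as a metalanguage symbol only if it is unambiguous.
    senses = WORD_SENSES.get(word, [])
    if len(senses) == 1:
        return senses[0]
    raise ValueError(f"'{word}' has senses {senses}; it must be disambiguated "
                     "before it can act as a symbol of the metalanguage")

print(as_metaword("gasoline"))        # the word can double as a metaword
try:
    print(as_metaword("table"))       # the bare word cannot: it is ambiguous
except ValueError as problem:
    print(problem)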
One can, of course, argue that the symbols of the metalanguage need not be unambiguous and support this claim by observing that the meaning of a text could be generated dynamically – just as people do it! – from a collection of ambiguous symbols appearing in it, and that, therefore, the symbols of the metalanguage need not be anything but words in a language. Of course, this position concentrates on word senses and their disambiguation; under this approach, as under others, there will still be a need for extracting and representing elements of meaning that do not, in the general case, reduce to the meanings of words and phrases (e.g., causality, modality, and many other phenomena). But even within that rather narrow purview, there are problems with this approach. It simply passes the buck from the static knowledge component of the system to the processing component. As a corollary, any indexing, co-referring or use of such "potential" meaning representations in reasoning becomes inefficient and cumbersome. Indeed, to use such a representation as a source of heuristic knowledge in the left-hand side of an inference rule, one will first have to run the dynamic disambiguation procedure, and do it every time. This model may be an acceptable hypothesis in studying human language understanding. However, when one must deal with the constraints of computer hardware, one is best advised to follow the precepts of dynamic programming and to store any intermediate results overtly so that they can later be accessed and used efficiently. In general, it seems to me that those who insist on the word-like character of symbols tend to take for granted the amazing meaning-extraction ability of people and do not make sufficient allowance for the major constraints under which computer modeling of these processes must operate: the unfortunate need to make all knowledge used by computers explicit and unambiguous.²

² Given these premises, it is not surprising that over the last 50 years or so suggestions, plans and promises have repeatedly been made concerning teaching computers to understand just like people. However logically simple and appropriate this desire may be, it is only attainable if the machine can indeed learn, and the easiest way toward that is for the machine to be able to understand language. Attempts to bootstrap this ultimate learning from very partial means may lead to some local advances but do not seem to hold much promise for the main goal of building an intelligent machine that can manipulate language with a facility similar to that of people. It does not seem that this task can be finessed in any serious way: one must create much more knowledge in machine-tractable form before the machine is able to understand and use this ability – among other things – to enhance its own understanding ability autonomously.

Second, metawords can be treated as regular terminology (e.g., chemical nomenclature: indeed, a lay person cannot be expected to know the names of chemical substances), and ontological concepts can be explained using natural language in a manner similar to the content of human-oriented lexicons (as is advocated, e.g., in Brewster et al. 2005). This will result in lexicons that are primarily compiled to support the operation of computer systems but are also intended for people. To underscore the special status of basic semantic primitives, Wilks (Wilks et al. 1996: 19) uses the example of instructions on how to use the phone in a foreign
country. He says that the "text assumes you already understand all the 'manipulative primitives' such as hold-receiver-to-ear or insert-coin." A lexicon/ontology intended for a computer program would have to find a way of explaining, in a formal metalanguage, these and other notions that are assumed to be known by humans. However, since the entries cannot presuppose human background knowledge, they will end up irritatingly detailed, inelegant and difficult for people to read.

Third, if metawords are viewed as regular terminology, then, because of their use by the linguistic community, sense shifts, sense splits and other modifications will be attested over time. Such changes will have to be recorded in human-oriented lexicons. But people, unlike machines, have other means of learning new senses of words – from "live" text understanding. As machines lack this ability, there is no nontrivial sense in which metawords used in a machine-oriented knowledge resource will change their meaning over time. Semantic interpretations might "take a life of their own" in applications and lead to unexpected closures and inferences, but this would be due to human error (or, more likely, inconsistency and lack of anticipation of the modes of use of the knowledge). In the case of machine-oriented resources such errors can be amplified much more than in human-oriented lexicons or thesauri, where the most that such errors may lead to is a situation in which the reader may gloat in his or her self-perceived superiority to the lexicographer on seeing an inconsistency or a definition that is too tightly circular.

So, if meaning is indeed other words, as Wilks suggests, then these metawords are really different from "regular" words in their use: they must be learned anew by native speakers and they should resist any semantic shifts or other "unplanned" fluctuations. This, plus the needs of an NLP system, imparts a special status to the metawords and in reality brings them much closer to the status they are assigned in NLP-oriented ontologies.

How far should the metaphor of representation languages being natural languages be stretched? There is some circumstantial evidence that people separate word senses from meanings. Consider the verbatim recall experiments, most of which show that people are able to recall the meaning of texts much better than the exact way that meaning was formulated in the text. For example, Chafe (1977) reports results of delayed-recall experiments demonstrating that people not only forget the actual words but also habitually report not the actual meaning of the utterance they are asked to recall but rather a presupposition or an entailment of it. This suggests that the results of text (or speech) understanding are recorded in a language-independent way after (usually subconscious) disambiguation, and that when they are used as components of input to the process of text (speech) production, the (lexical, syntactic and other) means of their realization in language are not directly recalled from the results of earlier analysis but are rather selected from the typically broad inventory of synonymic realization means available to the language producer/consumer. Recall experiments have even been used in forensics (e.g., Johnston 2004). They suggest that people store knowledge in terms other than natural language word senses.
People who have suffered temporary aphasia have reported a disconnect between the language of thought and the language of communication. Thus, the nineteenth-century French clinician Jacques Lordat went through a period of aphasia and recorded his observations as follows: "Within twenty-four hours all but a few words eluded my grasp. Those that did remain proved to be nearly useless, for I could no longer recall the way in which they had to be coordinated for the communication of ideas … I was no longer able to grasp the ideas of others, for the very amnesia that prevented me from speaking made me incapable of understanding the sounds I heard quickly enough to understand their meaning … Inwardly, I felt the same as ever … I used to discuss within myself my life work and the studies I loved. Thinking caused me no difficulty whatever … My memory for facts, principles, dogmas, abstract ideas, was the same as when I enjoyed good health … I had to realize that the inner workings of the mind could dispense with words" (quoted in Kapur 1997). Sacks (2005) reports other similar cases. While it is possible that the language of thought used by Lordat was the same as the one he could not temporarily use for communication, it is equally possible that the two functions of thinking and communication are actually carried out using different languages.

The Iliad was written not by Homer but by another man with the same name. Why is this funny? A plausible answer is: because this sentence violates the hearer's expectation of unambiguity of representational symbols (or structures represented by symbols) in a felicitous dialogue.³ (It is probably beside the point that this answer also casts doubt on the degree to which designators in this most real of the possible worlds are really rigid; moreover, I am not sure that Kripke (1982) and others after him considered this rigidity to be a scalar value.) Analysis of this sentence is not just a matter of simple reference resolution and certainly not of ostension – few of us can boast to be able, or to have ever been able, to point to a person and felicitously say that this is Homer (or even "This is Homer Simpson"). The language symbol is ambiguous (it can refer to any person named Homer). However, if we substitute, say, John for Homer (or "This novel" for "The Iliad"?) in the input text, it stops being funny. Moreover, in the absence of conversational context, the sentence with John instead of Homer should elicit a clarification question: "John who?" It is only because we have a separate, non-language-related representation for our knowledge about Homer the bard that the above sentence works as a joke. In this representation, the name itself is just one of its properties. Others will identify this individual, among other things, as the author of the Iliad, as having been blind and – even – as possibly never having existed.

³ On the semantic analysis and generation of humor see, e.g., Raskin (1986) and Hempelmann et al. (2006).

This representational structure is, incidentally, not a part of the ontology (or the lexicon or the thesaurus). It is a part of a model of human long-term memory of instances of ontological concepts, with their own values of ontological properties defined in the ontology. In ontological semantics this collection of assertions is called the fact repository (FR). FR is an important source of heuristics for both general reasoning and semantic
analysis itself. Thus, ontological semantics includes knowledge obtained from FR in determining the preferences of preference semantics.
9.2 Science and Metaphors

Experience shows that making an intelligent agent's knowledge sound and consistent, as Guarino recommends, and at the same time expecting it to be sufficiently deep and broad to support realistic reasoning applications is a tall order. The same argument goes for the agent's reasoning processes. If one has to decide which of these conditions should be relaxed, I side with Wilks in believing that breadth and depth of coverage are more important than formal discipline.

Guarino's requirement is imported from formal logic and its extensions, such as model-theoretic semantics. Is logic necessary for AI/NLP? I agree with Wilks that "[s]emantic wellformedness is not a property that can decidably be assigned to utterances in the way that truth can to formulas in parts of the predicate calculus and as it was hoped for many years that syntactically well-formed structures would be assignable to sentences" (Wilks, Fodor). Still, importing methods, biases and metaphors from other fields of science and technology is often a valid strategy and has a venerable history. For example, in the seventeenth century Harvey arrived at the idea of blood circulation by making a mental analogy with pumps. In the late nineteenth century the Young Grammarians studied laws of language change under the influence of The Origin of Species. In the twentieth century Sapir and Whorf overtly used the relativity metaphor from physics. It is therefore plausible that the requirements that Guarino would like to impose on an object that he would be happy to call an ontology are the result of metaphorical thinking and projection from another field, in this case, formal logic. (It is immaterial for this argument that formal logic, at its inception, was motivated by the desire to analyze natural language; this desire could be interpreted in the light of seeking formal ways to distinguish texts about texts from texts about the world, a concern associated with the linguistic-turn mindset in analytical philosophy. Formal logic is a discipline unto itself and its methods are true imports into the study of language, as is corroborated by the title of McCawley's (1981) book: "Everything that linguists have always wanted to know about logic (but were ashamed to ask).")

Sometimes such metaphorical thinking and the importation of ideas and methods is fruitful. In other cases, it mostly succeeds in reminding one of the well-known saying to the effect that if one has a hammer then everything looks suspiciously like a nail. When such cross-pollination succeeds, it almost always requires significant modification of the original metaphor: there are probably more differences than similarities between actual hearts and actual pumps. In the field of NLP systems, such modification practically always involves the introduction of large amounts of descriptive detail to the original statements of the imported theory. Often, the needs of such description lead to extensions of the original theory (e.g., to take the example application of text generation, extensions of systemic functional grammar
theory (e.g., Halliday 1985) by Mann and Matthiessen (1983) or of the Meaning-Text theory (e.g., Mel'čuk 1995; Mel'čuk et al. 1995) by Iordanskaja et al. (1991)). In the end, the underpinnings of the resulting systems may bear only a passing resemblance to the original theoretical statements. In fact, it is this phenomenon that explains Wilks' seemingly paradoxical "theorem" stating that "there is no linguistic theory, no matter how bad, that cannot support the development of a successful NLP system."

The above does not mean that metaphors should not be used to drive progress in science, simply that some such metaphors are a better fit than others. Using formal logic as the means of "shaping up" descriptions of world knowledge and knowledge of language is a corollary of the desire to eliminate inconsistencies and to make the descriptions usable by current formal reasoning systems. But this position assumes the expressive power of representations to be a handmaiden to the formal properties of the metalanguage. If we are to make progress in attaining a human level of performance in intelligent software agents, such a paradigm is too limiting. People operate reasonably well with knowledge bases that are incomplete and unsound. So, instead of constraining the expressive power of representations to gain immediate partial results, one may consider keeping expressive power unchained and developing richer, if possibly less general and more knowledge-dependent, methods of reasoning, including reasoning for text understanding. (An interesting example of a graduated attitude to the tension between expressivity and computability is the decision to include several "dialects" in OWL, the knowledge representation language for semantic web applications (Java et al. 2005, 2006).)

I think that there is a more apt metaphor for AI/NLP work than the formal-logic one. It is the psychological metaphor introduced into economics by Kahneman, Tversky and their associates (e.g., Kahneman et al. 1982). It suggests, roughly, that homo economicus should not, as was done previously, be viewed as a fully rational agent operating using well-defined utility functions but rather as a complex cluster of reason, habit and emotions. As a result, predicting the behavior of people and groups in economics-related areas is becoming even more complex, as it involves even more variables than previously thought. The Nobel prize awarded in 2002 to Kahneman and others for this work is an oblique endorsement of the primacy of descriptive adequacy over available reasoning methods.

Indeed, it is difficult to expect that a realistic model of an intelligent agent in general can be created solely on the basis of deductive reasoning over sound and complete knowledge bases. This became clear to logicians in the early twentieth century, and they have been hard at work on alleviating this state of affairs. A number of modifications of reasoning tools (in the form of specialized logics) have been suggested. Thus, allowing abductive reasoning helps to partially alleviate the brittleness of strong logical methods. Attaching probabilities to knowledge and inference processes also helps with applications but excludes from the purview the task of explaining and motivating the interpretation of utterances. All such tools still presuppose some version of soundness and consistency of the data. Otherwise, it is believed, the knowledge representation and reasoning enterprise would be judged as not scientific.
Guarino's concern is strictly scientific: enforcing formal constraints on ontologies (and, by extension, on all formal representations of knowledge about the world and the language). Wilks' polemical position is that this desire is misguided and futile, in part on the grounds that the complexity of the world and language makes it unlikely that any strict formalism will adequately describe it. And if this approach is called "unscientific," so be it.

It is not entirely clear that being labeled "unscientific" should under all circumstances be considered an insult. Newton and Galileo were acclaimed as the first scientists not because they knew more than their old-style contemporaries or predecessors but (at least in part) because they changed the purview of their inquiries. While the "medieval scholars" attempted to catalog and reconcile a vast number of individual facts, thoughts and observations into a coherent whole, the scientists curtailed the purview of their pursuits and concentrated on a smaller set of phenomena about which they could make coherent statements (theories) corroborated by experimentation and successful applications. Thus, science should be associated, among other things, with reliance on the notion of the "wastebasket" – a set of phenomena that it, consciously or not, chooses to ignore. In other words, science can be considered to be the art of the possible. If, as a result, the description of the phenomenon under investigation is at least sufficient to support applications, fine. If there are important applications that are not supportable, then the method should be somehow extended (or relaxed) to allow reformulation under less stringent requirements.
9.3 Ontology ↔ ONTOLOGY

"Items in ontologies and taxonomies are and remain words in natural languages – the very ones they seem to be, in fact – and this fact places strong constraints on the degree of formalisation that can ever be achieved by the use of such structures" (Wilks, Ontotherapy). This statement argues against the plausibility of Guarino's program of "firming up" existing NLP-oriented ontologies by applying formal constraints, such as identity criteria, or of making the concept of an ontology more precisely defined, for instance, on the basis of set inclusion and eschewing the notion of properties.

It is not clear that it is productive to talk about whether thesauri and lexicons are different from ontologies, because this does not help, beyond maybe establishing battle lines. In fact, I am quite prepared to admit that OntoSem represents meaning using not an ontology but an entirely different construct referred to by the same name. Traditional definitions tell us that ontologies must describe what is there in the real world. The world of a transparently intelligent software agent should include many very different things. It would be in some sense like my world, which includes, among a vast number of other entities, knowing that violets are blue, that when people feel sudden pain they might say "Ouch" in English, and that Horatio Nelson died in 1805 – but also that Horatio was a friend of Hamlet's and studied at Wittenberg, that I am glad I remembered to send my mom flowers for her birthday the other day, and that my friend Jack does not like swimming.
The agent's world view, just like that of a human agent, would also accommodate contradictions. Indeed, people can and do operate rather well with incomplete and ambiguous knowledge; I would like a truly intelligent agent to attain the same capabilities. It is clear that the state of the art in 2006 cannot support this level of sophistication. However, it seems important to me that the ultimate goal of building such an agent remains central – especially because much of the recent and current work in AI and NLP eschews such "remote" objectives and concentrates on picking the "low-hanging fruit."

In an argument about what kind of knowledge to impart to an intelligent software agent – an ontology or a lexicon/thesaurus – there is a third option. One can require the agent to have both, and then some. Semantic and pragmatic interpretation must rely not only on knowledge of lexical meaning; it also requires contextual knowledge and world knowledge. I will touch on processing context in connection with procedural semantics later. In this section, I will suggest how factual knowledge that was not directly mentioned in the text can and should be leveraged for extracting text meaning. To prepare the reader for this discussion, I must take a brief detour to introduce the terms and notions of ontological semantics.

The OntoSem system is the latest implementation of the theory of ontological semantics (Nirenburg and Raskin 2004). It is a text-processing environment that takes as input unrestricted raw text and carries out preprocessing, morphological analysis, syntactic analysis, and semantic analysis, with the results of semantic analysis represented as formal text-meaning representations (TMRs) that can then be used as the basis for many applications. Text analysis relies on:

• The OntoSem language-independent ontology, which is written using a metalanguage of description and currently contains around 8,500 concepts, each of which is described by an average of 16 properties.

• An OntoSem lexicon for each language processed, which contains syntactic and semantic zones (linked using variables) as well as calls for procedural semantic routines when necessary. The semantic zone most frequently refers to ontological concepts, either directly or with property-based modifications, but can also describe word meaning extra-ontologically, for example in terms of modality, aspect and time.⁴ The current English lexicon contains approximately 30,000 senses, including most closed-class items and many of the most frequent and polysemous verbs, as targeted by corpus analysis. The base lexicon is expanded at runtime using an inventory of lexical rules. (An extensive description of the lexicon, formatted as a tutorial, can be found at http://ilit.umbc.edu.)

• An onomasticon, or lexicon of proper names, which contains approximately 350,000 entries.

• A fact repository, which contains real-world facts represented as numbered "remembered instances" of ontological concepts (e.g., speech-act-3366 is the 3366th instantiation of the concept speech-act in the world model constructed during the processing of some given text(s)).

• The OntoSem syntactic-semantic analyzer, which covers preprocessing, syntactic analysis, semantic analysis, and the creation of TMRs. Instead of using a large, monolithic grammar of a language, which leads to ambiguity and inefficiency, we use a special lexicalized grammar created on the fly for each input sentence (Beale et al. 2003). Syntactic rules are generated from the lexicon entries of each of the words in the sentence, and are supplemented by a small inventory of generalized rules. We augment this basic grammar with transformations triggered by words or features present in the input sentence.

• The TMR language, which is the metalanguage for representing text meaning.

⁴ As concepts (specifically, abstract-objects), aspect, modality and time do belong to the ontology. But as features used in describing the meanings of texts in ontological semantics, they do not carry ontological status, because their values pertain to contextual meaning, not to persistent knowledge about objects and events.

OntoSem knowledge resources are at this time acquired primarily manually (though note that the knowledge acquirers use a variety of efficiency-enhancing tools – graphical editors, enhanced search facilities, capabilities for automatically acquiring knowledge for classes of entities on the basis of manually acquired knowledge for a single representative of the class, and the like). The ontology has been under continuous development, with varying levels of effort, for around 20 years. It took approximately two and a half years of work by a PhD-level linguist to compile the current lexicon. (Although the OntoSem environment has always utilized an English lexicon, previous versions aimed for a coarser grain size of description and did not reflect recent theoretical and practical advances.) The onomasticon was extracted automatically from corpora and structured sources. The fact repository is populated automatically from text-meaning representations. Knowledge acquisition is largely driven by lacunae found during the processing of actual texts; it is expedited using OntoSem's DEKADE environment (see McShane et al. 2005). We are currently working on developing a "push me pull you" knowledge acquisition strategy that incorporates machine learning (ML) of lexicon and ontology into our knowledge-rich environment: the more knowledge we learn with the help of ML, the more resources we will have to support the learning of still more knowledge. We do not consider the "knowledge bottleneck" to be anywhere near the impasse that many make it out to be: acquiring knowledge simply requires effort, no different from or more extensive than the effort currently being exerted in creating annotated corpora.

A high-level view of OntoSem text processing is illustrated in Figure 9.1.

Fig. 9.1. A high-level view of the core OntoSem architecture

TMRs represent propositions connected by causal, temporal, rhetorical and other relations (see Nirenburg and Raskin 2004, Chapter 6 for details). Propositions are headed by instances of ontological concepts, parameterized for modality, aspect, proposition time, overall TMR time, and style. Each proposition is related to other instantiated concepts using ontologically defined relations (which include case roles and many others) and attributes. Coreference links form an additional layer of linking between instantiated concepts.

Now we are ready to discuss the use of non-ontological world knowledge in text analysis. It might seem trivial to state that world knowledge (the extratextual
context) is needed for text meaning analysis. But while most people acknowledge this need, few actually propose operational models for doing so. The fact repository in OntoSem is a core component of such an operationalization. A module such as the OntoSem fact repository has not prominently featured in earlier semantic interpretation approaches, including Wilks’ (e.g., 1975, 1977) preference semantics. I will illustrate its use in text understanding by showing how it helps to treat reference resolution, a notoriously difficult NLP problem. Within OntoSem, processing reference is understood as detecting all referring expressions in a text or a corpus and associating them with their anchors in the fact repository (FR), which is a collection of interlinked real-world instances of objects and events extracted from text after it has been interpreted by the OntoSem analyzer. The information in the FR both supports the processing of any given text (it is a substrate of computer-tractable knowledge) and is supplemented by information from that text. Under this conception of full resolution of reference, the text string Colin Powell is not resolved until it is linked to its anchor in the FR, if there is one, or until a new FR anchor is instantiated, if none yet exists. Thus, the OntoSem engine must try to link every pronoun, relative date (last week), relative time (later), definite description (that man), etc., not only to other co-referential elements in the given text, but to the actual anchor in the ever-growing world model. This is reference beyond co-reference.
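As a rough sketch of the bookkeeping involved – not of OntoSem's actual matching machinery – the following toy fact repository either finds an existing anchor whose stored properties do not conflict with those extracted from the text, or instantiates a new remembered instance. All names and the matching criterion are simplifications assumed for the example.

class ToyFactRepository:
    # A stand-in for the FR: remembered instances of ontological concepts.

    def __init__(self):
        self.instances = {}   # anchor id -> stored properties
        self.counters = {}    # concept -> how many instances are remembered

    def find_anchor(self, concept, properties):
        # Return an existing anchor whose stored properties do not conflict.
        for anchor_id, stored in self.instances.items():
            if stored["concept"] != concept:
                continue
            if all(stored.get(key, value) == value for key, value in properties.items()):
                return anchor_id
        return None

    def new_instance(self, concept, properties):
        # Instantiate a new remembered instance, e.g. human-fr1, human-fr2, ...
        count = self.counters.get(concept, 0) + 1
        self.counters[concept] = count
        anchor_id = f"{concept}-fr{count}"
        self.instances[anchor_id] = {"concept": concept, **properties}
        return anchor_id

    def anchor(self, concept, properties):
        # Reference resolution proper: link to an anchor, creating one if needed.
        return self.find_anchor(concept, properties) or self.new_instance(concept, properties)

fr = ToyFactRepository()
first_mention = fr.anchor("human", {"has-name": "Colin Powell"})
later_mention = fr.anchor("human", {"has-name": "Colin Powell", "citizenship": "usa"})
print(first_mention, later_mention)   # both mentions resolve to the same anchor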
The fact repository contains a list of "remembered instances" of ontological concepts. In other words, whereas the ontology has the concept city, the FR contains entries for London, Paris and Rome that are instances of the ontological concept city; whereas the ontology has the concept sports-event, the FR has an entry for the Salt Lake City Olympics. Facts from the TMRs for input texts are converted (possibly after some filtering) into persistent objects in the FR. The FR thus becomes the assertion component of the overall knowledge base, just as the ontology forms its description component and the lexicon connects elements of both the ontology and the fact repository with their realizations in a language.

Below is a pretty-printed version of the automatically generated TMR for two short sentences that illustrate a difficult case of reference resolution (most sentences we process are much longer):

Colin Powell met with Jack Straw. The American official asked for support.

Ontological concepts are written in small caps. For orientation, all references to Colin Powell are in boldface, while all references to Jack Straw are in italics.

Colin Powell met with Jack Straw.

human-79
   agent-of       meeting-80
   has-name       [first "Colin"] [last "Powell"]
   fr-reference   human-fr24
   root-words     "Colin Powell"

meeting-80
   time           < find-anchor-time
   agent          human-81
   agent          human-79
   root-words     "meet with"

human-81
   has-name       [first "Jack"] [last "Straw"]
   agent-of       meeting-80
   fr-reference   human-fr40
   root-words     "Jack Straw"

The American official asked for support.

nation-104
   has-name       "United States of America"
   root-words     "American"
   fr-reference   nation-fr213

social-role-105
   agent-of             request-action-107
   relation             nation-104
   authority-attribute  0.7
   root-words           "official"
   co-reference         human-79

request-action-107
   agent          social-role-105
   theme          support-22
   time           find-anchor-time
   root-words     "ask"

support-22
   theme-of       request-action-107
   root-words     "support"

These TMRs are read as follows. In the first sentence, the input "Colin Powell" instantiates the concept human, which is appended with the number 79 since this is the 79th time a human was instantiated during this run of the analyzer. When the system checked the FR for Colin Powell, it found a suitable match (what we call an anchor), which is called human-fr24 (the 24th human stored permanently in the FR). Below is an excerpt:

human-fr24
   has-first-name        "Colin"
   has-last-name         "Powell"
   has-middle-initial    "L"
   has-middle-name       "Luther"
   social-role           secretary-of-state
   citizenship           nation-fr213
   coreference           cabinet-member-fr2, military-officer-fr1, military-officer-fr3, etc.
   has-city-of-birth     city-fr1465
   has-nation-of-birth   nation-fr213
   gender                male
   marital-status        married
   has-date-of-birth     absolute-time (year 1937)
   has-spouse            human-fr2134
   has-children          human-fr2323, human-fr2324, human-fr2325
   agent-of              attend-academic-institution-fr3, earn-degree-fr356,
                         attend-academic-institution-fr4, earn-degree-fr400,
                         speech-act-fr151, speech-act-fr152, request-action-fr23, etc.

Thus, a given object or event has a fleeting number associated with it for the given text and a static FR number associated with it. The concept meeting is instantiated for the lexical string "meet with" because the syntax of the input matched the following verbal sense of meet in the lexicon:

meet-v3
   def        "to meet with s.o."
   ex         "She met with her boss about the upcoming deadline."
   syn-struc
      subject   root $var1   cat n
      root $var0   cat v
      pp        root $var2   cat prep   root with
                (obj   root $var3   cat n)
   sem-struc
      meeting
         agent   value ^$var1
         agent   value ^$var3
      ^$var2   null-sem +   ; the meaning of the preposition has already
                            ; been accounted for in the sem-struc

The time in the TMR for sentence 1 is specified as "< find-anchor-time", which means "before the anchor time of the given text". OntoSem attempts to determine the anchor time using procedural semantic routines that rely on the dateline of the given article and other such heuristics. The agents of meeting-80 are Colin Powell and Jack Straw, respectively, as can be seen by tracing the concept numbers to their specifications. It is hoped that this brief walk through the first TMR is sufficient for general orientation.

The most interesting aspect of the sample pair of sentences is the reference connection between the American official and Colin Powell, which the OntoSem analyzer can automatically establish using lexical, ontological and world knowledge. More specifically:

FR information

human-fr24                           ; Colin Powell
   social-role    secretary-of-state
   citizenship    nation-fr213       ; USA

Lexical information

One of the meanings of "official" is described as social-role with authority-attribute = 0.7 (that is, a social role with a great deal of authority). The adjectival form of a nation (e.g., American) followed by a social-role (e.g., official) means that the social-role has citizenship in that nation (this is lexically encoded as a sense of American, Canadian, and other such adjectives).

Ontological information

secretary-of-state is a kind of governmental-official.

The analyzer exploits this knowledge roughly as follows. It searches for antecedents that are semantically compatible with the meaning social-role with authority-attribute 0.7 (e.g., it would reject "peon", which is a social-role with authority-attribute 0; it would reject any abstract-objects; etc.). Once it has a list of potential antecedents, it attempts to narrow
it down to exactly one. One way of doing this – as in our example – is to match features about an entity noted in the text to features about it stored in the FR (the same way as humans do, by the way). Here, the information we have is that the official is American, which we know refers to citizenship based on the selected sense of American in our lexicon. So the analyzer checks the citizenship of each candidate entity in the FR and finds that Colin Powell matches whereas Jack Straw does not. This translates into a high-confidence preference for Colin Powell as the antecedent for the American official. If we did not have the necessary disambiguating evidence in the FR, we would use a combination of other heuristics to favor one or another coreference link. Many of our heuristics are those used widely by the knowledge-lean approaches (see, e.g., Mitkov 1998), but we add semantic heuristics available only in our knowledge-rich environment. Although it is not immediately evident from this one example, our approach to using FR information for reference resolution is actually generalized – that is, it extends beyond cases of citizenship and social roles. However, since we are less than a year into this particular type of work, we are still discovering new types of cases and ways of exploiting the FR information to its best advantage.

As our example above has illustrated, difficult cases of reference resolution require extra-textual knowledge: in fact, the need for extra-textual knowledge is precisely what makes certain cases of reference difficult for computers (the same reference resolution is not difficult for people because they bring all the necessary knowledge to the table; or, if they don't, they seek it: "Is Jack Straw or Colin Powell an American?"). Therefore, even if one attempts only to establish textual co-reference links, the knowledge to do that will often derive from repositories of world knowledge about the given entity. Our previous work in reference resolution has only strengthened our belief that, when it comes to reference, there can be no hard line drawn between the text (including the knowledge therein) and the world knowledge brought to bear when interpreting it.
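The logic of this example can be caricatured in a few lines. The sketch below is an invented illustration of the two steps – filtering candidates by semantic compatibility and then matching FR features such as citizenship – and not the OntoSem analyzer itself; the property names and values merely echo the Colin Powell example.

# Invented stand-in for FR entries; the values echo the example in the text.
FACTS = {
    "human-fr24": {"name": "Colin Powell", "social-role": "secretary-of-state",
                   "authority-attribute": 0.7, "citizenship": "usa"},
    "human-fr40": {"name": "Jack Straw", "social-role": "foreign-secretary",
                   "authority-attribute": 0.7, "citizenship": "uk"},
}

def resolve_official(candidates, minimum_authority, citizenship):
    # Step 1: keep only antecedents semantically compatible with "official".
    compatible = [c for c in candidates
                  if FACTS[c]["authority-attribute"] >= minimum_authority]
    # Step 2: use extra-textual FR knowledge (citizenship) to narrow the list to one.
    matching = [c for c in compatible if FACTS[c]["citizenship"] == citizenship]
    return matching[0] if len(matching) == 1 else None

# "The American official": a social role with high authority and US citizenship.
print(resolve_official(["human-fr24", "human-fr40"],
                       minimum_authority=0.7, citizenship="usa"))   # human-fr24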
9.4 Being Happy With What You Have

I share Wilks' concern about covering the actual language and world phenomena even at the expense of not conforming to the discipline desired by Guarino. I agree with Wilks that such discipline is both impossible to impose on realistic knowledge resources and may be, in the final analysis, immaterial. I do not share his pessimism about the attainability of broad and deep coverage. This pessimism and the understandable reluctance to undertake the complex task of creating resources with a depth of description sufficient for AI/NLP needs lead Wilks to two parallel conclusions: (a) acquisition of such knowledge resources is possible only if done automatically and (b) acceptance of available resources, such as MRDs and WordNet, stressing their utility over their obvious shortcomings (see comments on WordNet below). As MRDs and WordNet use natural language as their metalanguage (either entirely or, in the case of WordNet, predominantly), the theoretical position of allowing representation languages to be similar to natural languages gets another, teleological, boost in addition to its motivation by the linguistic turn.
Wilks stresses that ontological and lexical resources (e.g., WordNet or other MRDs) can themselves be objects of research. What is meant here is that useful information can be mined from such sources to support various NLP applications. This is indeed manifestly true for statistics-based processing and possibly in support of non-semantic analysis of text. But it is not so evident when such knowledge is supposed to serve AI/NLP applications. The survey of MRD research by Ide and Véronis (1993) covered the period before statistics-based NLP really took off. It described work devoted to creating machine-tractable lexicons from MRDs, not to providing features to support classification. The survey had a suggestive title, "Extracting knowledge bases from machine-readable dictionaries: Have we wasted our time?", and concluded: "The previous ten or fifteen years of work in the field has produced little more than a handful of limited and imperfect taxonomies." I think that this statement is basically appropriate today, too: the utility of MRDs in developing computational-semantic lexicons has been truly limited, in spite of successes declared in the contributions in Guo (1995) or the upbeat picture in Wilks et al. (1996).

It might be this inability to seriously bootstrap the development of a computational-semantic lexicon from MRDs that led Wilks to take the position that "imperfect resources are better than no resources." This position is reasonable enough. However, the level of imperfection in various resources is highly variable. And some of them, for example WordNet, fall seriously short of being able to support semantic interpretation. Here is a brief sampling of reasons why.

It has become common practice to consider it self-evident that the extensive citing of WordNet in the NLP literature is proof of its utility. That conclusion is, actually, unfounded: people are certainly trying their best to find good uses for it since it is available, but that does not imply that their attempts have shown great promise or that success will improve with better machine learning techniques. A common result of machine-learning efforts with and without WordNet is a small increase in results using WordNet and no indication of where the given work can proceed. Take as examples two experiments from the realm of word sense disambiguation (WSD): Stetina et al. (1998) achieve 75.2% accuracy by choosing the first lexical word sense in a dictionary, and 80.3% using WordNet, and Mihalcea and Moldovan (1998) reach 58% precision in WSD using semantic density in WordNet. However, here the experiments stop: the ML methods have been used, they do the best they can with the available resources but are still far from 100% or from human performance on natural tasks, such as understanding and disambiguation in vivo (as opposed to tasks that are unnatural for people, such as fitting a word sense into a Procrustean bed of a predefined set of senses). These relatively low ceilings on results are only to be expected if the difficult problems of NLP are approached using resources that do not target the difficult problems, and using procedures that – because they do not use sufficient amounts of deep knowledge – have to be satisfied with results that may be state-of-the-art but are unimpressive in absolute terms. Naturally, the argument from the other side is that the field – not to mention society – needs results right away, and there is no time to build large knowledge resources.
Our response is that time will be spent either way, and if time is spent
on developing the resources that the community really needs for higher-end applications, in the long run it will be well worth the effort. I believe that the amount of annotation work required to allow statistical methods to tackle higher-end applications exceeds the amount of work needed to create broad and deep knowledge resources for knowledge-based NLP. I firmly believe that the best way to use statistical methods is to apply them to the task of aiding the knowledge acquisition efforts of knowledge-based NLP. When the results of statistical analysis of corpora are validated and used by humans, the overall efficiency of knowledge acquisition is significantly enhanced.⁵

No available resources that claim to provide semantic support for NLP have proved directly applicable to the OntoSem analyzer, though some have been indirectly useful: e.g., WordNet is among the many on-line and paper sources of synonyms that OntoSem acquirers can use during manual acquisition. A comparison between the representation of verbs expressing change in WordNet and in OntoSem will serve as an illustration of the difference in semantic richness of these two resources. In describing the presentation of verbs of change in WordNet, Fellbaum (1999: 252) writes: "Verb phrases like change magnitude, change shape, and change surface were entered [as nodes in WordNet] on the basis of purely semantic considerations. These concepts were needed to distinguish three groups of verbs that were otherwise all daughters of one node containing the verb change. To have represented verbs like increase, dwindle, and wax as sisters of verbs like flatten, bend and twist as well as of verbs like buckle, fold, and smoothen just did not seem felicitous and seemed to result in a semantically non-homogenous class."

OntoSem takes the semantic specification of verbs denoting change a large step further, representing these notions beyond iconic listing in a hierarchy. All verbs of change in OntoSem are lexically mapped to the ontological concept CHANGE-EVENT, but their respective lexicon entries specify their meaning in terms of preconditions and effects. Take, for example, the verb increase, whose meaning depends on the theme of the increase. E.g., if the THEME of the increase is mapped to a SCALAR-ATTRIBUTE – like price (mapped to COST) or height (mapped to HEIGHT) – then the PRECONDITION has a lower value on the given abstract scale (0–1) than the EFFECT does. A call to a meaning procedure that incorporates the correct scalar into the representation of the change event is listed in the lexical entry for all change events. So, a TMR for the price increased (in presentation format) will be:

change-event
   theme    cost
   precondition.cost.value < effect.cost.value
   time     < speech-act.time
⁵ In the OntoSem environment, the use of statistical methods in knowledge acquisition is a central direction of work at the time of writing. Incidentally, OntoSem also uses statistical techniques in the processing itself, as a source of heuristic decisions for the cases when other heuristics are weak or non-existent.
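A hedged sketch may help fix the idea of the precondition/effect reading of increase; the concept and property names echo the discussion above, but the code itself is an invented illustration rather than an OntoSem meaning procedure.

def increase_event(scalar_attribute, value_before, value_after):
    # Toy frame for "increase": the precondition value must be lower than the effect value.
    if not (0.0 <= value_before <= 1.0 and 0.0 <= value_after <= 1.0):
        raise ValueError("scalar attributes are assumed to range over the abstract 0-1 scale")
    if not value_before < value_after:
        raise ValueError("an increase requires the effect value to exceed the precondition value")
    return {
        "concept": "change-event",
        "theme": scalar_attribute,                                   # e.g. cost or height
        "precondition": {scalar_attribute: {"value": value_before}},
        "effect": {scalar_attribute: {"value": value_after}},
        "time": "< speech-act.time",
    }

# "The price increased": cost moves to a higher point on the scale.
print(increase_event("cost", value_before=0.4, value_after=0.6))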
To return to the broader picture, I will mention just four of the many coverage-related and organizational limitations of WordNet. First, it effectively (though apparently not by design) follows the methodology of splitting word senses, with all the concomitant problems that this poses for NLP. Of course, as WordNet was developed specifically for human use, this methodological choice cannot be held against it. Second, WordNet does not handle the semantics of adjectives well, as reported in Fellbaum (1999); compare this with OntoSem's fundamental treatment of even the most polysemous of adjectives, as described in Raskin and Nirenburg (1999). Third, different diatheses of a given verb are presented in different parts of the hierarchy: e.g., active sell has a superordinate of exchange while middle sell has a superordinate of be – which Fellbaum describes as a result of the design of WordNet (Fellbaum 1999: 256–257). She further notes that "Researchers who have tried [to] find the semantic properties that are both necessary and sufficient to characterize the class of verbs that can undergo middle formation have not been completely successful" (259). The search for such semantic overlap is, in our opinion, an invented problem: there need not be any such properties, and an environment for representing semantics should best start from the needs presented by the language rather than from the restrictions of a given formalism. Fourth, complex expressions and complex notions (even if expressed succinctly) cannot be integrated, as reported by Fellbaum (1998) (who focuses on idioms, but the same issues arise with semantically compositional expressions). Among the types of excluded entities are: (a) idioms that do not fit into any of WordNet's categories N(P), V(P), Adj(P) or Adverbial(P), e.g., the more the merrier; (b) structures that require negation, like not give a hoot; (c) full sentences; (d) idioms that contain variables, like blow one's stack; (e) idioms that express concepts that cannot be paraphrased by a single notion, like drown one's sorrows; and (f) idioms meaning become smth., as in hit the roof. OntoSem, for comparison, permits all of these types of entities, with their corresponding semantic representations, to be expressed in lexical entries that can include variables, optional elements, and expressions of any length or complexity.

In short, OntoSem imposes no limits on the granularity of semantic (not to mention syntactic) expressiveness: semantics can be expressed by any combination of ontological mappings, preconditions and effects, property values, values of mood or aspect, etc.; and if a means of representation does not exist, we create it to fill a practical need.

But the main issue with WordNet is its weakness in supporting ambiguity resolution. Ambiguity is, in our view, the main challenge for NLP. It is, therefore, reasonable to say that if a lexicon and an ontology used for NLP do not support disambiguation, they cannot be sufficient for truly high-level applications. WordNet's inability to support ambiguity resolution is understandable because ambiguity poses virtually no problem for humans, and WordNet seeks to depict how humans organize lexical knowledge. In other words, if WordNet accurately depicts how humans organize lexical knowledge, then use of the resource presupposes all of the world knowledge, pragmatics, goals and general analytical skills possessed by humans. Machines, however, do not have these advantages.
A relevant comparison is the utility of a thesaurus to a native speaker versus its relative opaqueness to a language learner. WordNet is used by many as a source of knowledge in NLP simply because it is there. Its actual efficacy varies among applications: e.g., Vieira and Poesio (2000) found it of little help in reference resolution, and its utility in query expansion for information retrieval has been mixed (see below). The widespread use of WordNet for NLP has spurred efforts to make it a better NLP resource, with version 2.0 including more noun-verb links and a topical organization for certain domains. However, the nature of this resource as a hierarchy of semantically undefined lexical items remains, we believe, an insurmountable disadvantage for machine processing. In sum, the fact that WordNet combines some ontological knowledge with lexical knowledge does not, by itself, disqualify it from use in applications. What limits its utility is the lack of knowledge (e.g., about selectional constraints) that is required to support automatic extraction of meaning from text.
9.5 Ambiguous Symbols ↔ Unambiguous Expressions

If one did not know about Wilks' track record in his "day job" as a practicing NLPer (e.g., Wilks 1975 on compositional procedural semantics using preferences or Wilks and Stevenson 1997 on word sense disambiguation), from the papers under discussion one could get the impression that he does not consider the issue of ambiguity resolution central to the enterprise of AI NLP. His position is that atoms (symbols) in the representation language can be (and indeed cannot but be) ambiguous, while representation-language expressions can be made unambiguous, and that this ambiguity "would have to be resolved by the processor that used them" (Nirenburg and Wilks 2001). If so, the processor must have access to sufficient knowledge to make disambiguation decisions. But simply saying that something will be the responsibility of the processor does not absolve Wilks from the responsibility of describing how this is to be done. This looks suspiciously like passing the buck.

What does the position that, in a representation language, atoms are (ambiguous) natural language words while expressions are unambiguous entail? It entails that there is no representation that is declaratively unambiguous, and therefore whenever a representation-language statement is required to support some processing (either related to text meaning extraction or to application-oriented reasoning), it must first be disambiguated by running some procedure to understand what specific inferences it supports. Note that this disambiguation will have to be carried out every time an element of representation is called upon for reasoning – because in its static form the representation is not disambiguated. Thus, faced with a representation-language statement of the kind "car consume gasoline," any system will first have to understand that in this context car does not mean railroad car and consume does not mean the intake of foodstuffs through the mouth by higher animals (supposing that gasoline is not ambiguous). It is only after such disambiguation that the inference maker will be able to use this knowledge (together with other knowledge elements,
certainly) to establish (among a number of other entailments) that when somebody says "It's a long drive, we must buy gasoline," the reason for buying gasoline is to make sure that the car has enough gasoline to consume in order to complete the drive. Note that if such a finding is recorded, once again, using ambiguous words as representation atoms, then every time these inference results are accessed, another quantum of disambiguation will have to be run. Even without going any deeper, this is surely a less-than-efficient proposition.

Another practical consideration at this point involves supplying outputs that are amenable to processing by currently available reasoning systems. These systems typically require the absence of ambiguity in representations, as well as completeness and soundness of the knowledge, before any discussion of the possible utility of the reasoning systems themselves can begin. One example of such a procedure was the AQUA project within the AQUAINT R&D program devoted to question answering. AQUA used the JTP reasoning system (Fikes et al. 2003). The knowledge over which this system reasoned in this project was automatically produced by the OntoSem semantic analyzer (Beale et al. 2003), whose results were then automatically converted into the representation language used by JTP. Of course, the conversion between the two representation-language formats was largely semantics-free. Another example is the SemNews system (Java et al. 2005), which uses OntoSem to annotate RSS news feeds with TMRs and make them available on the internet. The format conversion here was between the custom metalanguage of OntoSem and a dialect of OWL, the representation language of the Semantic Web.
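The efficiency argument can be illustrated with a toy sketch, invented for this purpose and not drawn from any of the systems mentioned: if a stored fact keeps ambiguous word atoms, the disambiguation procedure must be re-run on every access, whereas a representation disambiguated once can simply be reused.

from functools import lru_cache

# Invented sense inventory; the stored fact keeps raw, ambiguous words as its atoms.
SENSES = {
    "car": ("automobile", "railroad-car"),
    "consume": ("burn-fuel", "ingest-food"),
}
STORED_FACT = ("car", "consume", "gasoline")

def disambiguate(word, context):
    # Stand-in for a costly, context-sensitive disambiguation procedure.
    print(f"  disambiguating '{word}' ...")
    senses = SENSES.get(word, (word,))
    return senses[0] if context == "fuel" else senses[-1]

def interpret_each_time(fact, context):
    # Ambiguous storage: the disambiguation quantum is paid on every access.
    return tuple(disambiguate(word, context) for word in fact)

@lru_cache(maxsize=None)
def interpret_once(fact, context):
    # Unambiguous storage: the disambiguated form is computed once and reused.
    return tuple(disambiguate(word, context) for word in fact)

for _ in range(2):
    interpret_each_time(STORED_FACT, "fuel")   # runs the procedure twice
for _ in range(2):
    interpret_once(STORED_FACT, "fuel")        # runs it only on the first access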
9.6 Ambiguity in Representations and its Causes

"[T]he key distinction between Wordnet and an ontology is this: Wordnet has lexical items in different senses (i.e. that multiple appearance in Wordnet in fact defines the notion of different senses) which is the clear mark of a thesaurus. An ontology, by contrast, is normally associated with the claim that its symbols are not words but interlingual or language-free concept names with unique interpretation within the ontology. However, and given that this issue is one of great antiquity, the position argued here is that, outside the most abstract domains, there is no effective mechanism for ensuring, or even knowing it to be the case, that the terms in an ontology are meaning unique." (Wilks, Ontotherapy)

This statement combines the issue of natural language as metalanguage with the issue of whether ontologies are different from thesauri. Wilks motivates the opinion using the example of Lisp, a programming language, rather than a natural language: "For example, it is generally agreed that in the basic original forms of the Lisp programming language the symbol 'NIL' meant at least false and the empty list, though this ambiguity was not thought fatal by all observers. But it is exactly this possibility that is denied within an ontology (e.g. by Nirenburg) though there is no way, beyond referring to human effort and care, of knowing that it is the case or
not.” (Ontotherapy). The fact that NIL started to be used in Lisp in several different meanings might indeed have happened initially by chance. But this situation was then reviewed by the developers of the language (and apparently the observers mentioned by Wilks), who adjudged this situation benign. It was recognized that these uses were in a complementary distribution and that the interpreter and the compiler would not face conflicts in evaluating expressions with NIL. As a result of this judgment, the ambiguity of NIL was retained for the sake of convenience, tradition and parsimony. It is beyond question that if this ambiguity interfered with or hampered processing, it would have been eliminated. The situation with ambiguity in knowledge resources used in AI/NLP is more complex. Indeed, there are several kinds of ambiguity that are apparently impossible to eliminate in building ontologies, lexicons and fact repositories capable of supporting realistic applications. This is, again, a case of a conflict between the desire to have “clean” knowledge resources and the need to achieve coverage and depth. In discussions of scientific method, the two most important desiderata concern fidelity to empirical evidence and simplicity and consistency of logical formulation. But fidelity to evidence takes precedence in cases of conflict (e.g., Caws 1967; see also Nirenburg and Raskin 2004: 73–74). In other words, it is a good idea to strive for “clean” knowledge resources, but when the needs of description make this goal impossible or too difficult in building an application (though not necessarily if the project is purely theoretical), it should be waived. In the rest of this section, I will present a few examples of synonymy and potential ambiguity in OntoSem TMRs, ontology and fact repository. Practice has shown that the relation between texts and text meaning representations is not one-to-one. It can clearly be one-to-many, as text meaning representations are subject to paraphrasing (that is, exhibit synonymy). One possible cause is the possibility of expressing certain meanings with different levels of specificity (or, alternatively, vagueness). For example, in the OntoSem ontology, the event embargo is a leaf node on the path: embargo → sanction → commerce-event → financial-event → social-event → event Each of these concepts is used to explain the meaning of different words in the lexicon, and these links are primary clues for the construction of TMRs. So, for the input The US has imposed an arms embargo on Somalia. the TMR will contain an instance of the event concept embargo. For the input The US has imposed sanctions on Somalia. the TMR would contain an instance of sanction (note that the concept sanction expresses only one of the two senses of the English sanction, the other being explained in terms of the ontological concept approve). sanction will also anchor the TMR for
The US has sanctioned Somalia by prohibiting arms imports into the country. Since sanctioning imports is, in fact, embargoing, the meaning of this past sentence could be paraphrased by a TMR using embargo, a more specific concept than sanction. The connection between paraphrases in TMR may even be more remote; for example, the input can be paraphrased as: The US has prohibited arms imports into Somalia. The TMR for this sentence will feature the concept import (which is not a parent of embargo but its ontological first cousin once removed: it is a part of the path import → commerce-event → ). The question is, how to resolve and use these paraphrases. It is a practical question because such resolution would facilitate a correct answer to the question “What sanctions were imposed on Somalia?” when the fact repository contains only the statement that US imposed an arms embargo on Somalia. One algorithm that can be applied to the resolution of paraphrases is described by Mahesh et al. (1997), where it is used to provide additional support for basic disambiguation algorithms in OntoSem. For example, if the system could not disambiguate sanction between sanction and approve, due to the corresponding constraints being insufficiently strong, an attempt was made to specialize the TMR by moving down the ontological hierarchy and using ontological descendants of both sanction and approve to see whether only one of the candidates conformed to the selectional and other constraints. In the above examples, embargo would work fine, while approve would not. The algorithm in question has been used not only in cases of ambiguity but also when the basic analysis fails to find a single coherent interpretation because no solution conforms to all the required constraints. In such cases, the algorithm essentially relaxes constraints by moving up the ontological hierarchy. Thus, it might involve using commerce-event instead of either import or embargo. If a successful TMR “anchor” is found using this method, then the resulting TMR will be effectively ambiguous between several candidate meanings. This state of affairs is accepted because OntoSem prefers to output vague (but not necessarily inconsistent) solutions rather than producing no solution at all. Now to a few brief comments about ambiguity in static knowledge resources. In an ontology, ambiguity can result from multiple inheritance, which causes problems for the interpretation of the [admittedly, clumsy] sentence “Don’t hit me with the newspaper you work for” if the lexicon has just one sense of newspaper. A particular fact repository could (and almost certainly will!) have multiple entities that the system failed to co-refer. It is entirely plausible and to be expected that Walter Scott could be stored there as human-FR334 and “the author of Waverley” could be stored as human-FR5298. Moreover, some knowledge about Walter Scott may be also stored under author-FR94, with some of the information about him appearing
in the human-FR334 frame and some other information in author-FR94. This is another illustration of why establishing co-reference relations among FR entities is a crucially important task.
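As a rough illustration of the kind of hierarchy walking described above — not the actual algorithm of Mahesh et al. (1997), whose details are in that paper — consider the following Python sketch; the parent links, the extra concept approve and the constraint test are simplified inventions based on the ontological path given in the text.

# Toy fragment of the ontology path given in the text; parent links,
# the concept 'approve' and the constraint test are simplified inventions.
PARENT = {
    "embargo": "sanction",
    "import": "commerce-event",
    "sanction": "commerce-event",
    "commerce-event": "financial-event",
    "financial-event": "social-event",
    "social-event": "event",
    "approve": "event",
}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

def descendants(concept):
    for child in CHILDREN.get(concept, []):
        yield child
        yield from descendants(child)

def specialize(candidates, satisfies):
    """Move *down* the hierarchy: keep a candidate if it, or one of its
    descendants, satisfies the selectional constraints of the context."""
    return [c for c in candidates
            if satisfies(c) or any(satisfies(d) for d in descendants(c))]

def relax(concept, satisfies):
    """Move *up* the hierarchy until some (vaguer) ancestor is acceptable."""
    while concept is not None and not satisfies(concept):
        concept = PARENT.get(concept)
    return concept

# Invented constraint: only 'embargo' fits the context ("...by prohibiting arms imports").
fits = lambda c: c == "embargo"
print(specialize(["sanction", "approve"], fits))          # ['sanction']
print(relax("import", lambda c: c == "commerce-event"))   # 'commerce-event'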
9.7 Procedures in Representations Procedural attachments appear in a broad variety of declarative knowledge representation schemata and are traditional enough to be described in introductory textbooks (e.g., Brachman and Levesque 2004). In AI/NLP, one of the reasons to use them is to compensate for descriptive difficulties that result from the desire to help text analysis by allowing an analyzer to choose from among a small set of senses for each word. To make the sets of senses small, it is often necessary to “bunch” them, that is to group somewhat distinct senses into coarser entities. Difficulties in processing arise here because, while it is easier to select one of three rather than one of sixteen during processing, the actual interpretation of that sense can become unfocused. Pinpointing the actual meaning is then delegated to procedures that take into account additional (usually from the immediate context) information that cannot be readily stipulated in static knowledge resources. In this section I discuss connections between sense discrimination, lexicon content and the use of procedures to enhance the static knowledge resources. A major point of disagreement between Fodor and Lepore on the one hand and Wilks and Pustejovsky on the other hand is their attitude to procedures as components of lexical semantics. Fodor and Lepore do not believe in them. Pustejovsky uses them essentially because he thinks that it could be done, with all the theoretical and practical benefits accruing. Wilks believes that lexical specification cannot be carried out without procedural knowledge, in part because he does not believe that a purely declarative metalanguage could be devised. Moving from the language side to the metalanguage side, Guarino criticizes WordNet and OntoSem ontologies because, among other issues, they do not provide identity criteria for all concepts. Wilks argues that this requirement (among other restrictions) is only tangentially useful for reasoning in NLP; moreover, he doubts the possibility of constructing a realistic and useful ontology conforming to Guarino’s discipline. The decision to include procedures in the representational apparatus can render some of Guarino’s criticism moot because there may remain fewer word senses (and concepts in an ontology) and, therefore, fewer cases of unspecified identity conditions. But to compensate for that, Guarino’s task will be made more complicated because now “cleaning” the knowledge resources would involve entities (namely, procedures) of a different kind from declarative concepts. The fact that the English words window, book or newspaper (“Don’t hit me with the newspaper you work for!”) exhibit regular polysemy does not necessarily require that their meanings be represented by a single ontological concept, a fact that attracted Guarino’s critical attention). Also, in general, lexical elements (atoms) of the ontological metalanguage (ontological concepts) do not necessarily correspond to word senses in natural language. Moreover, as mentioned above, many word
senses are interpreted not by pointing to an ontological concept but through the use in the text meaning representation language of parametric, extra-ontological features, such as speaker attitudes (called modalities in OntoSem), co-reference relations, rhetorical relations or by procedural means – for example, by calling a meaning procedure attached to a lexicon entry to determine the contextual interpretation of a word sense. With respect to delimiting the number of word senses to be encoded in the computational-semantic lexicon, there are three options: bunching, splitting and passing the buck. Bunching would imply positing one sense for window (to be interpreted as both an aperture and as a frame), expressed using one ontological concept with multiple inheritance; this is a widely recommended strategy (e.g., Palmer et al. forthcoming; Ide and Wilks forthcoming; Nirenburg and Raskin 2004: 331–344) for use in computational linguistic applications. Splitting involves trying to capture as many lexical-semantic distinctions as possible and therefore positing as many different senses for a word as such distinctions would require. This is the approach taken by many lexicographers in developing human-oriented dictionaries. Finely split senses, if used in a semantic analyzer, lead to the necessity of describing each of them in sufficient detail for an analyzer to be able automatically to select the appropriate one. This approach leads to obvious logistical difficulties. It also is exposed to the same criticism to which Weinreich (1966) subjected the semantic theory of Katz and Fodor (1963), namely, that there is no good criterion to limit polysemy and thus stop splitting senses. Nirenburg and Raskin 2004: 331–332, comment: “ [H]aving determined that one of the senses of eat is ‘ingest by mouth’, should we subdivide this sense of eat into eating with a spoon and eating with a fork, which are rather different operations? Existing human-oriented dictionaries do not have theoretically sound criteria for limiting polysemy of the sort Weinreich talked about. It might be simply not possible to formulate such criteria at any but the coarsest level of accuracy.” Finally, “passing the buck” means, in the case of window, positing one sense but not committing to a particular ontological concept or concepts as explanation in the representation itself; instead, encoding the decision as a procedure that decides on the appropriate meaning when called during actual text analysis (that is, when textual context is present). The assumed existence of such procedures is a major prerequisite of the generative lexicon approach of Pustejovsky (1995). The main claim of this approach is twofold: (a) word senses of a lexeme are systematically related by general relations from a specified inventory; and (b) once this inventory is established, it becomes possible to split a lexeme into fewer senses because all the other senses will be derivable using lexical rules based on these senses. Nirenburg and Raskin (2004: 115–133) examine Pustejovsky’s claims in some detail and conclude, among other things, that in many cases there is no effective difference between the amount of work necessary to compile a generative lexicon and that required for a more traditional enumerative lexicon. Indeed, while some lexical rules – for example, rules capturing phenomena from derivational morphology – have wide applicability and have been shown (e.g., Viegas et al. 
1996) to be useful in practice, the utility of many other rules, in particular, those reflecting regular polysemy – most famously, the “grinding” rule (Briscoe and
Copestake 1991) – is much less obvious because of the number of "exceptions" that must be individually encoded and for which the application of the rule must be explicitly blocked. In these cases, the purported advantage over enumerative lexicons does not materialize. Another important class of phenomena – in OntoSem referred to as semantic ellipsis (e.g., McShane 2005), cf. I forgot his name vs. I forgot (to bring) the umbrella – is equally difficult for both enumerative and generative lexicons. Fodor and Lepore contend that a rule like:
X wants Y → X wants to do with Y whatever is normally done with Y
cannot be formulated because its application to particular lexemes, e.g., beer in John wants some beer, cannot be safely recorded anywhere but with the appropriate sense of beer. They advocate a totally individual and separate element of the language of thought for each of the word senses. But this criticism applies equally to the case when drink is marked as the typical purpose of beer in the lexicon and to the case when determining this constraint is delegated to an inference rule operating over the ontology and the lexicon. My criticism is from the opposite side – it seems that one can never guarantee that sufficient disambiguating constraints can be obtained from the content of the lexicon entry (no matter whether the lexicon is enumerative or generative) because one can normally do things with beer that are not normally done with, say, more generic drink or liquid – for example, John might want to make Carbonnades a la Flamande. So, the disambiguation process must take into account not only the knowledge in the ontology and the lexicon but also the results of processing the broader context. This realization is ultimately a vote for the use of scripts, typical event sequences, in automatic text understanding. While a number of scripts have been acquired for the OntoSem ontology, the semantic ellipsis resolution algorithms described in McShane et al. 2004 do not yet use script-based knowledge in the left hand sides of their inference rules. Incidentally, in OntoSem the approach to resolving semantic ellipsis is mixed – some cases invoke the use of meaning procedures (see below) while others are treated through lexicalization. To illustrate this lexicalization method with just one example from the above paper, consider the verb invite. When followed by a direct object indicating a human (or, by extension, an organization) and a prepositional phrase or adverb indicating location (or destination) directly or metonymically, it actually means "invite someone to come/go to that place"; the verb of motion is semantically elided. Examples include the following:
• Civilians invited into the prison by the administration to help keep the peace were unable to stanch the bloodshed.
• "If they invited us back tomorrow to govern the mainland, frankly we would hesitate," Vice Foreign Minister John H. Chang told a US governor's delegation.
• All 13 OPEC oil ministers were invited to the meeting.
• He often is one of a handful of top aides invited into the Oval Office for the informal sessions at which President Bush likes to make sensitive foreign-policy decisions.
The OntoSem lexicon entry that covers this use of invite is as follows, in presentation format:

invite-v2
  def "+direct object (human) + pp of destination, implies 'invite to come'"
  ex "She invited him to the meeting/to Paris"
  syn-struc
    subject
      root $var1
      cat n
    v
      root $var0
    directobject
      root $var2
      cat n
    pp-adjunct
      root $var3
      cat prep
      root (or to onto into on)
      obj
        root $var4
        cat n
  sem-struc
    invite
      agent
        value ∧ $var1
      theme
        come
          destination
            value ∧ $var4
          agent
            value ∧ $var2
    ∧ $var3
      null-sem+

The syntactic structure (syn-struc) says that this sense of invite requires a subject, direct object and PP, and that the PP must be headed by the word to, onto, into or on. The semantic structure (sem-struc) is headed by an invite event, whose agent is the subject of the clause (note the linked variables) and whose theme is come. The agent and destination of come are the meanings of the direct object and prepositional object, respectively, of the input clause. (We gloss over formal aspects of the entry that are tangential to the current discussion.) Note that there is no verb of motion in the input text: come is lexically specified since it is a predictable semantically elided aspect of meaning in the given configuration. For a description of procedural treatment of ellipsis in OntoSem, see McShane (2005).
In summary, inclusion of procedural knowledge in semantic analysis and in computational lexicography may not lead to economies in knowledge acquisition. In this, I am in agreement with Fodor and Lepore. A choice of granularity of sense distinctions in the lexicon, however, does not influence the decision about the nature of representational symbols in any way. On the other hand, the position that the representation language has, unlike Fodor's, some explanatory power in terms of a core of undefined properties, does not necessarily mean that its elements belong to a natural language. In the OntoSem ontology, objects and events are given denotation in terms of unique sets of property-value pairs that describe them. Thus, the question of the nature of symbols relates only to the meta-meta-language of semantic description,
that is, to the inventory of ontological properties. In OntoSem currently only about 350 concepts out of the total of over 8,500 concepts (with, on average, 16 properties specified for each) belong to meta-meta-language. The properties are weakly and circularly defined in terms of constraints on their domains and ranges. For example, the domain of the property mass is constrained to the ontological subtree with the root at physical-object and the range of mass is constrained to a positive real number or interval. Based only on the formal comparison of the constraints on their domains and ranges, the property mass is indistinguishable from the property length (and a few other properties). These properties are, however, distinguished by the inventories of inference rules (including meaning procedures within OntoSem proper and external inferences in any reasoning system that uses OntoSem knowledge resources) in which they occur. But it is still incorrect to contend that length or range as elements of ontology are also senses of natural language words. Within the semantic analysis and reasoning environment, they are simply labels, names of variables used to formulate various constraints and inference rules. The system does not have an understanding of what their meaning in natural language may be. It is the knowledge acquirer who must take into account the intended meanings of these properties. So, it is only to people that these primitives can be considered elements of natural language, not from the point of view of the system. Incidentally, the preference for using meaningful (for people) names for variables in computer programs does not impinge on the success or failure of the program, though their denotational semantics may be formulated, informally, using the appropriate senses of words. This is the use of denotational semantics intended by McDermott (1978): “The semantics is for our use, as a tool for analyzing knowledge representations,” and is not intended to be an integral part of the representational or reasoning system itself. So, a compromise on the issue of the nature of symbolic representations may be reached around the statement: “The defining vocabulary of the metalanguage of symbolic representations, its ‘meta-meta-language,’ can receive its denotational semantics in terms of word senses in a natural language.” Of course, some of these word senses will be quite different from what we usually think of as word senses. Indeed, in many cases the knowledge acquirer will have actually to learn from the acquirer-oriented definitions (which are similar in form, content and intent to definitions in regular human-oriented dictionaries) the meaning of properties in the ontology. Thus, for example, the definition for the OntoSem property measures-property is given the following definition in the OntoSem ontology: “this property is used to specify what property is measured when one measures something: e.g., if one measures a table, it might be its length, width, height; with verbs other than ‘measure’, there can be even more ambiguity: ‘evaluate the race conditions’ can mean measuring the air temperature, slipperiness of the surface, etc. Thus, measures-property is generally used in conjunction with an indication of the thing measured: e.g., measure theme: table, propertymeasured: height.” The meaning of text elements and combinations thereof depends to varying extents on the meanings of the elements of the context in which they appear.
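Picking up the point made just above about weakly and circularly defined properties, the following small Python sketch — with invented names, values and rules — illustrates how mass and length can be formally indistinguishable by their domain and range constraints alone, and come apart only in the inference rules that mention them.

# Toy sketch (invented names and values): MASS and LENGTH have identical
# formal domain/range constraints; only the inference rules that mention
# them give the two labels different behaviour.

properties = {
    "mass":   {"domain": "physical-object", "range": "non-negative-number"},
    "length": {"domain": "physical-object", "range": "non-negative-number"},
}

print(properties["mass"] == properties["length"])   # True: same constraint signature

def rules(frame):
    """Invented inference rules that treat the two properties differently."""
    inferred = {}
    if "mass" in frame and "acceleration" in frame:      # F = m * a
        inferred["force"] = frame["mass"] * frame["acceleration"]
    if "length" in frame and "width" in frame:           # area of a face
        inferred["area"] = frame["length"] * frame["width"]
    return inferred

print(rules({"mass": 2.0, "acceleration": 9.8}))   # {'force': 19.6}
print(rules({"length": 2.0, "width": 0.5}))        # {'area': 1.0}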
Whereas outside of context many nouns (e.g., sheep) and verbs (e.g., dance) conjure a relatively stable image, the adverbials approximately, very and nearly depend crucially on their context to concretize their meaning. In the Ontological Semantic (OntoSem) text processing environment, context-dependent meaning is arrived at by a combination of static lexical descriptions (which link to a language-independent ontology) and procedures that attempt to specify text meaning based on ontological, contextual and other information. In OntoSem, the procedural elements are encoded in meaning procedures (MPs). Meaning procedures are used to account for reference resolution, final assignment of case roles, resolution of absolute time and location, specification of approximations, etc. As an example, the meaning procedure delimit-scale is used in the lexicon entries for words such as very, quite, somewhat, etc. The meaning of a word like very modifies the value of the property that is the meaning of the word that very modifies. For example, in very big, the property will be size, a scalar attribute in OntoSem nomenclature. The meaning of big in OntoSem is represented as a value (an interval) on the abstract scale {0, 1} that is defined as the legal filler of the range slot of the scalar attribute size. The meaning of very is procedural: it is a command to make the scale interval representing big narrower and to move it toward the nearest extreme of the scale (the actual numerical calculation will not be detailed here). If, suppose, the meaning of big is the interval {.8, 1}, then the meaning of very big may be {.9, 1}. More formally, here is (in a simplified presentation format) the OntoSem lexicon entry for very:

very-1
  cat adv
  def "toward the more extreme end of the given scale"
  ex "very big, very late, very small"
  syn-struc
    mods $var0 (cat adv)
    root $var1 (cat (or adj adv))
  meaning-procedure
    delimit-scale
      (value ∧ $var1)
      extreme
      .1

The syn-struc (syntactic structure) says that very ($var0) is an adverb that modifies an adjective or an adverb ($var1). Unlike typical lexicon entries, this one has no static sem-struc (semantic structure) zone, since the meaning of very relies on its composition with what it modifies. Instead, it has a meaning-procedure zone that calls the delimit-scale MP with three arguments:
• the value of the meaning of $var1 (indicated by ∧ $var1), which is a value between 0 and 1;
• whether the scalar value is shifted toward the extreme or the mean of the given scale;
• the amount by which the value is augmented.
So, very small would be calculated by taking the value of small (size {0, .3}), and shifting it to the extreme by .1, returning a value of (size {0, .2}). Analogously, moderately small would be calculated by taking the value of small (size {0, .3}) and shifting it toward the mean by .1, returning a value of (size {.1, .4}). (The MP for moderately has the 2nd argument as "mean" rather than "extreme".) An interesting situation occurs if one modifies a scalar such that its value is off the scale. For example, extremely is defined as shifting the scalar value by .2 toward the extreme, so an extremely extremely expensive car will be calculated as follows:
extremely + extremely + expensive
.2 + .2 + .8 = 1.2
The value 1.2 lies outside of the {0, 1} scale; however, this is exactly what we want as a semantic interpretation of extremely extremely: a value that exceeds any expectation for the property given the specific object in context. OntoSem-style meaning procedures seem to be a natural way of encoding the meaning of context-dependent elements that are too often excluded from the purview of both formal semantics and work on the semantic annotation of corpora.
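What follows is a toy re-implementation, in Python, of the delimit-scale arithmetic just described. It is not OntoSem code; the function name, the interval encoding and the rounding are choices made only for this illustration, while the numeric values are the ones given in the text.

# Toy sketch (not OntoSem code) of the delimit-scale arithmetic described above.

def delimit_scale(interval, direction, amount):
    """Shift a scalar interval on the abstract {0, 1} scale.

    direction == "extreme": narrow the interval towards its nearest extreme (0 or 1).
    direction == "mean":    move the whole interval towards the middle of the scale.
    """
    lo, hi = interval
    if direction == "extreme":
        if (lo + hi) / 2 >= 0.5:                  # nearest extreme is 1
            return (round(lo + amount, 2), hi)
        return (lo, round(hi - amount, 2))        # nearest extreme is 0
    if direction == "mean":
        shift = amount if (lo + hi) / 2 < 0.5 else -amount
        return (round(lo + shift, 2), round(hi + shift, 2))
    raise ValueError(direction)

big, small = (0.8, 1.0), (0.0, 0.3)
print(delimit_scale(big, "extreme", 0.1))    # (0.9, 1.0) -- "very big"
print(delimit_scale(small, "extreme", 0.1))  # (0.0, 0.2) -- "very small"
print(delimit_scale(small, "mean", 0.1))     # (0.1, 0.4) -- "moderately small"

# "extremely extremely expensive": two shifts of .2 applied to expensive = .8
# push the value past the end of the scale, which is the intended reading.
print(round(0.8 + 0.2 + 0.2, 2))             # 1.2 -- off the {0, 1} scale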
9.8 Computational-linguistic Turn Wilks’ positions on the interaction of language and knowledge – which he has held with remarkable consistency at least since his dissertation work (Wilks 1968) – have been repeatedly expressed in his polemics with Fodor, Guarino, myself and many others. These positions seem to be strongly influenced by Wittgenstein’s (1953) views on the treatment of meaning using the metaphor of “language games,” by ordinary language philosophy (e.g., Ryle 1953; Austin, 1962) and in general by what has been called by Bergmann (1964) – and popularized by Rorty (1967) as – “linguistic turn.” When linguistic turn is applied to philosophical deliberations, this means, in simple terms, that studying meaning in language is essentially studying the world, and that the world can be described only through the use of elements of language. “To execute the linguistic turn one would have to subscribe to a substantial metaphilosophical thesis namely, as Rorty (1967) puts it, ‘that philosophical problems are problems which may be solved (or dissolved) either by reforming language or by understanding more about the language we presently use’. Rorty and Bergmann take the view that philosophical problems are problems of language as the ‘least common denominator’ of the metaphilosophical positions of both camps in analytic philosophy, Ideal Language Philosophy and Ordinary Language Philosophy.” – Watzka (2002). Could it be that Fodor’s position ascends to ideal language philosophy even as Wilks’ reflects a form of ordinary language philosophy views?
One of the motivations for the linguistic turn and the development of ordinary language philosophy was to get away from the obfuscatory, esoteric and overly technical and terminological language used by earlier philosophers. Another was to distinguish ordinary language from formal or conceptual language. It is not easy, however, to pinpoint precisely what language is ordinary and what language is not ordinary. And it does not help that Ryle (1971), as illustrated by Hacker (1996: 160), distinguishes the “use of ordinary language” from the “ordinary use of language” and “ordinary linguistic usage.” The distinction between ideal language and ordinary language was formulated essentially before the age of computation. While it is difficult for people to maintain the distinction between the ideal and the ordinary, it seems that for a computer system this distinction is essentially moot because no language is ordinary for a computer program. It cannot rely on the powerful reasoning mechanisms that humans possess (and that allow them, for example, to make sense of difficult philosophical discourse – whether it was conducted in ordinary or ideal language or – as seems more plausible – in some mixture of well-defined and ordinary, that is, ambiguous, language). Maybe it is time to suggest a modification of the linguistic turn for the age of the computer: a computational-linguistic turn, whereby philosophers will formulate their arguments about the world and language in a metalanguage that will be suitable for a software agent in a society that includes both other software agents and people. Software agents can be programmed to communicate among themselves in the metalanguage and with people in language. Philosophers then would be able to observe transcripts of both kinds of communication and judge whether the messages are meaningful and appropriate using, if they so choose, their favorite version of denotational semantics.
9.9 Stand Up (And Be Counted) Philosophy Yorick Wilks is as close to being a polymath as anybody in the fields of computational linguistics and AI. He has developed, alone or leading research teams, a number of computer systems in machine translation, cognitive modeling, information retrieval and extraction, dialog and other application areas. He has formulated important AI theories, notably that of preference semantics. He has contributed to the conception and development of a number of NLP-oriented resources, such as machine-tractable dictionaries, and researchers’ toolkits, such as GATE. What makes him truly stand out among AI/NLP practitioners is his remarkable ability to assess developments in the field, their importance and their promise for solving long-standing problems in applications. This is the mark of a philosopher, which is what Yorick is and has been all along, first and foremost. What makes him different from most other philosophers interested in AI, cognitive science, computational linguistics and related disciplines is that he does not shy away from standing up and being counted among the creators, not only assessors and critics of systems and theories. Add to this the high culture of argument and the sense of timing and style
worthy of the best public performers in any field, and it becomes clear that Yorick is in a class by himself among colleagues. The field should be looking forward to many more of his contributions.
Acknowledgement Many thanks to Marge McShane, Patrick Hanks and an anonymous reviewer for fine-grained and very constructive criticism.
References Austin, J.L. 1962. How to do Things with Words. Oxford: Clarendon. Beale, S., S. Nirenburg and M. McShane. 2003. Just-in-time Grammar. In: Proceedings of the 2003 International Multiconference in Computer Science and Computer Engineering. Las Vegas, Nevada. Bergmann, G. 1964. Logic and Reality. Madison, WI: University of Wisconsin Press. Brachman, R. and H. Levesque. 2004. Knowledge Representation and Reasoning. San Francisco: Morgan Kaufmann. Brewster, C., J. Iria, F. Ciravegna and Y. Wilks. (2005) The Ontology: Chimaera or Pegasus, In: Proceedings of the Dagstuhl Seminar on Machine Learning for the Semantic Web. Briscoe. E. J. and A. Copestake. 1991. Sense Extensions as Lexical Rules. In: Proceedings of the IJCAI Workshop on Computational Approaches to Non-Literal Language. Sydney, Australia. Caws, Peter. 1967. Scientific Method. In: Paul Edwards (Editor-in-Chief), Encyclopedia of Philosophy. Vol. 7, New York: Macmillan, pp. 339–343. Chafe, W.L. 1977. Creativity in verbalization and its implications of the nature of stored knowledge. In: R.O. Freedle (Ed.), Discourse Production and Comprehension. Norwood, NJ: Ablex, pp. 41–56. Fellbaum, C. 1998. Towards a Representation of Idioms in WordNet. In: Proceedings of the COLING-ACL Workshop on the Usage of WordNet in Natural Language Processing Systems. Montreal. Fellbaum, C. 1999. Verb semantics via conceptual and lexical relations. In E. Viegas (Ed.), Breadth and Depth of the Lexicon. Dordrecht, Holland: Kluwer Academic Publishers, pp. 247–262. Fikes, R., J. Jenkins and G. Frank. 2003. JTP: A System Architecture and Component Library for Hybrid Reasoning. In: Proceedings of the Seventh World Multiconference on Systemics, Cybernetics, and Informatics. Orlando, Florida, USA. Fodor, JA. and E. Lepore. 1998. The Emptiness of the Lexicon: Critical Reflections on J. Pustejovsky’s The Generative Lexicon. Linguistic Inquiry, 29: 2. Gangemi, A., Guarino, N. and A. Oltramari. 2001. Conceptual Analysis of Lexical Taxonomies: The Case of WordNet Top-Level. In: Proceedings of FOIS 2001. Maine. Guarino, N. 1997. Understanding, building and using ontologies. International Journal of Human-Computer Studies, 46. Guarino, N. 1998. Formal ontologies and information systems. In: Proceedings of the First International Conference on Formal Ontologies in Information Systems. Trento, June. Guo, C. 1995. Machine tractable dictionaries: Design and construction. Norwood, NJ: Ablex. Hacker, P.M.S. 1996. Wittgenstein’s Place in Twentieth Century Analytic Philosophy. Oxford University Press.
Halliday, M.A.K. 1985. An Introduction to Functional Grammar. London-Baltimore: E. Arnold. Hempelmann, C., V. Raskin, and K. E. Triezenberg. 2006. Computer, Tell Me a Joke ... but Please Make it Funny: Computational Humor with Ontological Semantics. In: Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, Melbourne Beach, Florida, USA, May 11–13. AAAI Press, pp. 746–751. Ide, N. and J. Véronis. 1993. Extracting Knowledge Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? In: Proceedings of the First International Conference on Building and Sharing of Very Large-Scale Knowledge Bases (KB&KS’93). Tokyo, Japan. Ide, N. and Y. Wilks. forthcoming. Making sense about sense. In E. Agirre and P. Edmonds. (Eds.), Word Sense Disambiguation: Algorithms and Applications. Springer. Iordanskaja, L., R. Kittredge and A. Polguère. 1991. Lexical selection and paraphrase in a meaning-text generation model. In: C. Paris, W. Swartout and W. Mann. (Eds.), NaturalLanguage Generation in Artificial Intelligence and Computational Linguistics. Boston: Kluwer Academic Publishers. Java, A., T. Finin and S. Nirenburg. 2005. Integrating Language Understanding Agents into the Semantic Web. In: Proceedings of the First International Symposium on Agents and the Semantic Web. Arlington. VA, November. Java, A., T. Finin and S. Nirenburg. 2006. Text Understanding Agents and the Semantic Web. In: Proceedings of the 39th Hawaii International Conference on System Sciences. Johnston, R. 2004. Ice cream verbals. The Journal of the Law Society of Scotland, June. Kahneman, D., P. Slovik and A. Tversky (Eds.) 1982. Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press. Kapur, N. (ed.) 1997. Injured Brains fo Medical Minds: Views from Within. Oxford University Press. Katz, J. J. and J.A.Fodor. 1963. The structure of a semantic theory. Language, 39:1. Kripke, S.A. 1982. Naming and Necessity. Harvard. Landauer, T.K., P.W. Foltz and D. Laham. 1998. Introduction to latent semantic analysis. Discourse Processes, 25. Mahesh, K., S. Nirenburg and S. Beale. 1997. If You Have It, Flaunt It: Using Full Ontological Knowledge for Word Sense Disambiguation. In: Proceedings of Theoretical and Methodological Issues in Machine Translation (TMI-97). Santa Fe, NM. Mann, W.C. and C. Matthiessen. 1983. NIGEL: A Systemic Grammar for Text Generation. Technical Report ISI/RR-85–105, Information Sciences Institute, Marina del Rey, California. McCawley, J. 1981. Everything that Linguists have Always Wanted to know About Logic (but were ashamed to ask). Chicago: University of Chicago Press, and Oxford: Blackwell. McDermott, D. 1978. Tarskian semantics, or No notation without denotation. Cognitive Science, 2:3. McShane, M. 2005. A Theory of Ellipsis. Oxford University Press. McShane, M., S. Beale and S. Nirenburg. 2004. OntoSem Methods for Processing Semantic Ellipsis. In: Proceedings of the Workshop on Computational Lexical Semantics at HLT-NAACL 2004. Boston, May. McShane, M., S. Nirenburg, S. Beale and T. O’Hara. 2005. Semantically Rich Human-aided Machine Annotation. In: Proceedings the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, ACL-05, Ann Arbor, MI. Mel’ˇcuk, I. 1995. The Russian Language in the Meaning-Text Perspective. Vienna/Moscow: Wiener Slawistischer Almanach.
Mel’ˇcuk, I., A. Clas and A. Polguère. 1995. Introduction à la lexicologie explicative et combinatoire. Louvain-la-Neuve: Duculot. Mihalcea, R. and D. Moldovan. 1998. Word sense disambiguation based on semantic density. In: Proceedings of the COLING-ACL Workshop on the Usage of WordNet in Natural Language Processing Systems. Montreal. Mitkov, R. 1998. Robust Pronoun Resolution with Limited Knowledge. In: Proceedings of ACL. Nirenburg, S. and V. Raskin. 2004. Ontological Semantics. Cambridge, MA: MIT Press. Nirenburg, S. and Wilks, Y. (2001) What’s in a symbol: ontology, representation and language. Journal of Experimental and Theoretical Artificial Intelligence (JETAI), 13(1): 9–23. Palmer, M., H. Dang and C. Fellbaum. forthcoming. Making fine-grained and coarsegrained sense distinctions, both manually and automatically. Journal of Natural Language Engineering. Pustejovsky, J. 1995. The Generative Lexicon. Cambridge, MA: MIT Press. Raskin, V. 1986. Semantic Mechanisms of Humor. Dordrecht: Reidel. Roy, J.-M. 1998. Cognitive Turn and Linguistic Turn. In: Proceedings of the 20th World Congress of Philosophy. Boston. August. Ryle, G. 1953. Ordinary language. Philosophical Review LXII. Ryle, G. 1971. Ordinary language. In: G. Ryle, Collected Papers. London: Hutchinson. Sacks, O. 2005. Recalled to Life. The New Yorker, October 31. Sowa, J.F. 2000. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co. Stetina, J., S. Kurohashi and M. Nagao. 1998. General word sense disambiguation method based on a full sentential context. In: Proceedings of the COLING-ACL Workshop on the Usage of WordNet in Natural Language Processing Systems. Montreal. Viegas, E., Onyshkevych, B., Raskin, V. and S. Nirenburg. 1996. From Submit to Submitted via Submission: On lexical rules in large-scale lexicon acquisition. In: Proceedings of ACL-96. Vieira, R. and M. Poesio. 2000. Processing definite descriptions in corpora. In: S. Botley and T. McEnery (Eds.), Corpus-based and Computational Approaches to Anaphora. Benjamins, Amsterdam. Watzka, H. 2002. Did Wittgenstein ever take the linguistic turn? Revista Portuguesa de Filosofia, 58. Weinreich, U. 1966. Explorations in semantic theory. In: T.A. Sebeok (Ed.), Current Trends in Linguistics. Vol. III. The Hague: Mouton. Wilks, Y. 1968. Argument and Proof in Metaphysics, from an empirical point of view. Unpublished PhD thesis (Professor R.B. Braithwaite), University of Cambridge. Wilks, Y. 1975. A preferential pattern-matching semantics for natural language. Artificial Intelligence, 6: 53–74. Wilks, Y. 1977. Making preferences more active. Artificial Intelligence, 11: 197–223 Wilks, Y. 2001. Fodor – “Fodor” strikes back. In: F. Busa and P. Bouillon. The Language of Word Meaning. Cambridge, UK: Cambridge University Press. Wilks, Y. 2002. Ontotherapy or how to stop worrying about what there is, Invited presentation, Ontolex 2002, Workshop on Ontologies and Lexical Knowledge Bases, 27th May, Held in conjunction with the Third International Conference on Language Resources and Evaluation – LREC02, 29–31 May, Las Palmas, Canary Islands. Wilks, Y., B. Slator and L. Guthrie. 1996. Electric Words: Dictionaries, Computers and Meanings. Cambridge, MA: MIT Press.
Wilks, Y. and M. Stevenson. 1997. Sense tagging: semantic tagging with a lexicon. In: Proceedings of ANLP-97. Washington, DC. Wittgenstein, L. 1953. Philosophical Investigations. Oxford: Blackwell. Woods, W. 1975. What’s in a Link: Foundations for semantic networks. In: D. Bobrow and A. Collins (Eds.), Representation and Understanding: Studies in Cognitive Science. New York: Academic Press.
10 Philosophical Engineering
Nigel Shadbolt
School of Electronics & Computer Science, University of Southampton, Southampton, UK
When Newton wrote his great scientific treatise in 1687 it was entitled The Mathematical Principles of Natural Philosophy. As a term Natural Philosophy was understood to mean the objective study of the physical world. As the scientific revolution took hold new terms came into existence; physics and chemistry and the even later terms science and scientist gained widespread currency. Prior to the nineteenth century science simply referred to knowledge of the world. The scientific method came to refer to a particular inductive method of empirical enquiry. Essentially this was the formulation of theories capable of generating predictions susceptible to empirical confirmation or refutation. As theories became endorsed by observation the underlying principles and axioms assumed a privileged status – we came to talk of laws. A body of experimental observation gave rise to systems of laws that enabled the derivation of large scale and predictable behaviours from small scale regularities. Laws were discerned that governed the behaviour of matter, the orbits of the planets and the processes that power the stars; the sciences of chemistry and physics assumed a fundamental position in our understanding of the universe in which we live. Topics once described as branches of philosophy came to have an independent authority of their own. This authority rested on a powerful blend of practical experimental observation, peer review publication, the demand for replication and the ever present possibility that the claims of science could be challenged or refuted (Gower, 1997). Whilst science can be seen to have emerged from natural philosophy the practice of engineering might appear much more divorced from philosophical speculation. If engineering emerged from anywhere it was a wide variety of crafts ranging from masonry and architecture to iron working and watch making (Hill, 1984). Moreover, the transformational technologies of the eighteenth century were often developed literally out of the hands of craftsmen – whether these were Stephenson's locomotives or Harrison's chronometers. Rail and accurate navigation were two reasons the British Empire came to dominate the world. As a result in the nineteenth century engineering was an admired and respected profession (Rae and Volti, 1993). What makes our modern world distinctive is a coming together of science and engineering (Buchanan, 1985). Questions that we attempted to understand using
pure reason have fallen to the scientific and empirical approach. The science of thermodynamics allowed a radical breakthrough in technologies from the internal combustion engine to the refrigerator. The science of aerodynamics enabled enormous advances in the design of aircraft. Sometimes the engineering preceded the science and on other occasions the science was needed before the engineering could commence. However, in the modern era there has always been an intimate connection between the two – between synthesis and analysis. This is a theme we will return to.
10.1 Computing Meaning
It is a matter of history that chemistry, physics and biology were all once the province of philosophy. However, there were other topics in philosophy that were regarded as much less amenable to a scientific or engineering approach. How could one operationalise concepts such as meaning or knowledge? But this was precisely what the philosophers at the turn of the last century had in view. Philosophers such as Frege, Russell and Wittgenstein were all trying to put mathematics itself on a sound logical footing (Dummett, 1991). Indeed logic was thought to provide a full account of the meaning of language and not just mathematics. The history of these attempts is well documented (Kneale and Kneale, 1962). What is not so readily recognised is that this project led to a basis for meaning not in humans but in computational devices. The approach could trace its origins back to Aristotle; we refer to it as a realist position. The assumption is that we engage directly with an objective reality. Reality consists of pre-existing objects with attributes. Our engagement may be via reflection, perception or language. Wittgenstein's Tractatus enshrines a view that both language and logic picture and represent the world directly (Wittgenstein, 1922, translated by Pears and McGuinness, 1961). On this view logic was the means to provide meaning for both mathematics and language. It provided a way of explaining how humans could come to a common understanding of the world. Importantly it promised a language for science. Wittgenstein was working against a background in which others such as Frege and Russell were looking to axiomatise all of mathematics. It was a context in which members of the Vienna Circle and Unity of Science movement – philosophers such as Frank, Hahn, Neurath, Schlick, Carnap, Waismann – sought reductive accounts of science and all behaviour into sets of laws and inferences. The understanding of logic and mathematics that came out of this work – the work of Carnap and Tarski (Tarski, [1936]1994) in the 1930s – provided a natural semantics for the computational devices that would be built a few decades later. Church and Turing both used essentially declarative and logical formulations for their characterisation of the fundamental properties of computation and computational devices. All of these formulations were ultimately rooted in Tarskian semantics. This required the following ingredients: (i) a set of symbols in a language L, (ii) a set of rewrite rules that enabled sentences in L to be generated, (iii) a universe of
discourse comprising individuals, (iv) the truth values T and F, and (v) an interpretation function I from the symbols to individuals in a universe of discourse. The compositionality of the language ensured that sentence fragments mapped to various sets of individuals. Sentences mapped to truth values. Tarskian model theory provided a semantics rooted in denotation for programs. Symbols in the programs of computational devices stand in a direct sense for objects and relations, events and processes in the universe of discourse. This universe can be the natural numbers or else it can more directly represent other objects and relations – those in the world. We can contrast the realist or positivist view of meaning with another tradition – this asserts that there is no simple mapping into external objects and their attributes in the world. As humans we construct the objects and their attributes of interest to us. This construction may be via intention and perception; it may be culture- and species-specific. Philosophical exponents of this constructivist view include Husserl, Heidegger and intriguingly the later Wittgenstein (see Anscombe and Rhees, 1953). As Wittgenstein studied language and human behaviour he began to have real doubts that his earlier positivist or logical account could work as a full explanation of meaning. He began to view language interactions as games, complex procedures, contextualised functions that construct a view of the world. One could only understand the meaning of words if one fully appreciated all the contexts of their use.1 In particular, he argued that unless we were literally born and brought up as humans we could never understand the variety and depth of meaning associated with the terms of our language. To understand meaning was to share in a form of life – this could never be fully captured in mechanical devices or logical denotations.
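Before turning to the Semantic Web, the ingredients of a Tarskian interpretation listed above can be made concrete with a small, purely illustrative Python sketch; the toy universe, the predicates and the sentences evaluated are invented and stand in for no particular formal system.

# Purely illustrative sketch of a Tarskian-style interpretation:
# symbols are given denotations in a toy universe, and atomic sentences
# built from them are evaluated to the truth values T and F.

universe = {"fido", "felix", "tweety"}                     # individuals
interpretation = {
    "dog":    {"fido"},                                    # unary predicates denote sets
    "cat":    {"felix"},
    "bird":   {"tweety"},
    "chases": {("fido", "felix"), ("felix", "tweety")},    # binary predicates denote pairs
}

def holds(predicate, *args):
    """Map an atomic sentence to True (T) or False (F) via the interpretation."""
    denotation = interpretation[predicate]
    return (args[0] if len(args) == 1 else args) in denotation

# Compositional evaluation of a few toy sentences:
print(holds("dog", "fido"))                                        # True  (T)
print(holds("chases", "fido", "felix") and holds("cat", "felix"))  # True  (T)
print(holds("chases", "tweety", "fido"))                           # False (F)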
10.2 Meaning in the Semantic Web
Against these major philosophical traditions let us consider one of the most sustained efforts to analyse and synthesise meaning within artefacts – the branch of computer science we call Artificial Intelligence (AI). From the outset of AI as a distinct discipline there has been a sustained effort to build rich knowledge representation languages that could encode meaning about the world and the processes within it (Brachman and Levesque, 2004). Early efforts encoded knowledge as nodes and links – the nodes representing the objects and classes of importance whilst the links encoded relations between those objects. Later efforts put more structure into the nodes resulting in so-called frame languages. Another school broke with an object-oriented perspective and promoted the use of rules and logic-based grammars as a more powerful method of
1 Wilks (2005) argues that aspects of a view of meaning as use can be found in approaches to computational language processing that pay full attention to the statistical patterns of occurrence of terms in very large corpora of human language text.
representing and reasoning with knowledge. Others argued for hybrid approaches that unified the best of object and rule-based representations. It should be said that many regarded such structured representations as misguided – they argued that knowledge was better represented in networks of connection strengths; representations became distributed over these networks (Rumelhart and McClelland, 1986). More generally some argued that our representations could only be approximate or statistical in nature (Jordan and Weiss, 2002). Notwithstanding these concerns AI Knowledge Representation Languages generally made the assumption of Tarskian semantics. It provided a guaranteed and powerful way to build levels of formal meaning both on and between machines.2 We as the system designers injected the denotations by agreeing between ourselves on the interpretation of particular symbols and the operations over them. Thus in the program below (Figure 10.1) if we understand the symbols in a particular way – as denoting certain objects and encoding certain relations – then we have defined certain familial relationships.3 And any computer running the same program with the same interpretation of symbols is also representing the same concepts and relationships. This is indeed an old idea in AI.
% Assumes facts parents(Child, Mother, Father), parent(Child, Parent) and married(X, Y).
equal(X, X).

% Siblings share both parents but are not the same individual.
sibling(X, Y) :- parents(X, M, F), parents(Y, M, F), \+ equal(X, Y).

% An aunt or uncle is either a sibling of one of the child's parents...
auntORuncleDirect(C, A) :- parent(C, P), sibling(P, A).

% ...or the spouse of such a sibling.
auntORuncleMarriage(C, A) :- auntORuncleDirect(C, X), married(X, A).
auntORuncleMarriage(C, A) :- auntORuncleDirect(C, X), married(A, X).

auntORuncle(C, A) :- auntORuncleDirect(C, A).
auntORuncle(C, A) :- auntORuncleMarriage(C, A).
Fig. 10.1. PROLOG encoding of family relationships4

2 Wilks has over the years argued against too simple an adoption of this logicist position (Wilks, 1990, 2000), whilst at the same time advancing a range of proposals that have procedural and use-based characterisations of meaning at their centre (Wilks, 1991).
3 In this case, after defining the notion of equality and of being a sibling, we are able to provide declarative definitions for the concept of an aunt or uncle. These definitions explicitly state that one can be an aunt or uncle directly, as the sibling of a child's parent, or else via marriage to a sibling of a child's parent.
4 Example due to Lloyd Allison.
The point that it is important to note is that this beguiling and powerful abstraction is very much at the heart of the Semantic Web enterprise. In order to support data interoperability between machines we aim to establish a clear determination of the meanings of the terms we use. The computer scientist’s recruitment of the term “ontology” to mean an agreed conceptualisation is one of the cornerstones of the Semantic Web (Berners-Lee et al., 2001). If we agree as a community of computer scientists that particular sorts of classes and relationships between them are important ways of describing a domain of interest then using a language such as OWL (Hendler, 2004) or indeed a weaker format such as Resource Description Framework Schema (RDFS) can provide a framework within which the terms are linked to instances of that term as data that is exchanged between our systems – machine to machine or person to machine (Shadbolt et al., 2006). RDFS (Brickley and Guha, 2004) is a simple triple-based language for defining ontologies. The triples all define relations between objects; i.e. object1 – predicate – object2. The instances of the ontology are also RDF-triples (relating instances of objects with objects and other instances). Each component of a triple can be expressed as a URI (Berners-Lee et al., 2005). Thus the particular attraction of RDFS is that it gives us RDF statements that are capable of being referenced as data anywhere on the Web (Shadbolt et al., 2006). We see this realised in the fragment of knowledge representing an individual in RDF below (Figure 10.2). In this fragment the first few lines define the particular syntax being used and then the concept of a contact is referenced from a URI http://www.w3.org/2000/10/swap/pim/contact# – this actually defines a vocabulary for personal information management. We are asserting series of triples about the contactable entity http://www.w3.org/People/EM/contact#me – namely his full name, his mailbox and his title. In this world we are engineering and operationalising knowledge about the world. Epistemology and Ontology become changed – we construct information processing systems that capture particular conventions about the world. The Semantic Web seeks to make these conventions very explicit through the mediation of an ontology. Thus it might appear that we have recruited a logical realist philosophy to support
Fig. 10.2. Fragment of RDF describing contact details for Eric Miller
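An approximate rendering of the kind of fragment described here — based on the well-known W3C contact example that the surrounding text follows — is given below; the exact markup of the original figure may differ in detail.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>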
the Semantic Web enterprise. Wilks has argued that too often researchers seem to believe that this is not only necessary but also sufficient for operationalising meaning on the Web or in machines (Nirenburg and Wilks, 2001; Wilks, 2004). If we look in more detail at a concrete example of a Semantic Web application we will see how, up to a point, this is the case. But any actual application also reveals a constructivist character that would be much more congenial to researchers such as Wilks. As we shall see in the example below the meanings we encode are agreements based on complex social conventions that can and do change. When this happens our encodings may, in turn, have to be reinterpreted or else changed.
10.3 An Example
As part of the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (www.aktors.org) we have been researching the use of ontologies to support both machine interoperability and the deployment of intelligent web services and content to users. A particular project, MIAKT (www.aktors.org/miakt), funded as part of the UK e-Science initiative, illustrates how both a realist and constructivist stance can be discerned with respect to the meanings encoded within our computational systems. The MIAKT project (Dupplaw, in press; Dasmahapatra et al., 2006) attempts to support clinical decision-making and information management in the area of symptomatic breast disease. Within the UK this area of medicine is carried out using multidisciplinary teams of medical experts. They follow a protocol in which they jointly integrate a range of information from across clinical and medical domains to come up with recommendations of treatment and care. A patient may have been subject to a variety of diagnostic, screening and imaging tests. It is common for specialists in X-ray imaging, ultrasound, and Magnetic Resonance Imaging (MRI) to be considering a case with specialists in histopathology, oncology and surgery. Teams of between three and six subject matter experts are common at these multidisciplinary meetings. These experts all have their own developed vocabularies and conceptual frameworks. These in turn will reflect a particular pattern of training; it can depend on the geographical region within which they practice. From country to country one notices the effect on the terms used of different ways of delivering and paying for diagnosis and treatment. Within MIAKT the approach was not to construct one overarching conceptual framework within which each term was unambiguously placed. Rather a number of viewpoint ontologies were constructed. For example one ontology would relate to the features, attributes and relationships seen using X-ray imaging. Another ontology would describe the features, properties and inter-relations seen when using Magnetic Resonance Imaging (MRI). A third ontology expresses the features and properties observed using ultrasound. A fourth the features, properties and patterns seen in histopathology – the structure of the cells and their context as seen after staining and viewing using microscopy. A fifth related to the various medical procedures and the information records relevant to a patient and their treatment history.
This last ontology was the one used to bring the others into play as required. Mappings would exist between some elements of one ontology and another; some classes and concepts were common across ontologies. The relationship of these various views is shown in Figure 10.3 below. Within MIAKT each imaging module is composed of a set of image feature descriptors, a set of diagnosis descriptors (capturing high-level abstract features) and a set of concepts for describing meta-image information. We can refer to a particular region of interest shown on a particular X-ray using such a feature label – indeed we can use the label to retrieve an image or else use its presence to invoke another computational process. Concepts used in this way are often the outcomes of classifications. When constructing our ontology we enter them as explicit or declarative concepts. The presence of such a concept when two computational processes are in play might be a significant condition necessary for the invocation of one process by another. They are being used as denotational labels. The types of computational service that we could invoke across the web using our MIAKT framework included: image analysis services able to delineate various parts of an image; classification services that would suggest labels for aspects of the images under review; image registration services that sought to overlay images taken across time to detect potential changes; and data retrieval services that would aggregate information held across the various systems. In the process of actual clinical decision making there is a complex interplay of factors. If one takes a particular medical image, then deciding whether a Region of Interest (ROI) is suspicious does not depend simply on the presence of a specific set of features or descriptors. A conclusion of suspicious ROI may depend on a range of inter-related factors. As one introduces further image types and information sets, clinicians rely on experience to help them make judgements. It is not just the features present within the image that are important; so too are the procedures that produced the images. When we consider where the terms in our ontologies originate we see a complex pattern of interactions. The use of a concept in medical practice requires the recognition of instances as instances of appropriate classes. The classes, features and
Fig. 10.3. MIAKT ontologies
labels embodied in our ontologies are the end products of decision-making procedures in a particular community of practice (Dasmahapatra et al., 2006). A concrete example in breast cancer is the identification of apocrine cells. These cells are a common source of misdiagnosis – they are often classified as malignant. One way to ensure that the cells are correctly identified is to have followed a particular staining protocol. “Recognition of the dusty blue cytoplasm, with or without cytoplasmic granules with Giemsa stains or pink cytoplasm on Papanicolaou or haematoxylin and eosin stains coupled with a prominent central nucleolus is the key to identifying cells as apocrine.” (NHS Guideline) In such a case the meaning of the observation or classification of a tissue sample as malignant or non-malignant has to be associated with the means by which the observation was made. If such behavioural rule-following is not adhered to, the degree of inter-subjective agreement decreases. In one reported breast cancer study, agreement between pathology experts could fall to around 50% when diagnosing, for example, a ductal carcinoma in situ (Fechner, 1997). Rule-following makes concept labelling reproducible. Within medical practice the desirable property of reproducibility is enhanced by this type of rule-following. Anything that promotes behavioural reproducibility is likely to be preferred and therefore assumes a normative status. Such forces will tend to sift good from bad ontologies since they become associated with effective decision-making outcomes. These norms affect the way information is both collected and recorded. Within medicine this means adhering to specific protocols. Of course these protocols are themselves always subject to scrutiny and revision. The upshot is that for a majority of terms in our domain of interest we have essentially incomplete knowledge. We cannot know that a term or concept will remain defined in a particular way over time. The philosopher and logician Waismann used the term vagueness (Waismann, 1945) to refer to this essential property of our conceptualisations. He argued that this conferred an inherent indeterminacy or open texture (Waismann, 1968) of meaning, and that this was not a weakness of language but a strength. In practice there will be some tension between procedures and norms developed locally within teams of experts and those advocated by various co-ordinating bodies at a national level – for example the various guidelines issued by the NHS (NHS, 1997), one of which was quoted above. The ontology in MIAKT was also in part determined by the requirement to integrate particular knowledge-intensive services: for example the classification services for X-ray and MRI, the suspicious region delineation services, MRI registration, natural language generation of patient summaries and so on. The modelling of the target domain has also been tailored to meet the requirements and idiosyncrasies of human users. It is task centric. For the computer scientist, “the formalization of knowledge in declarative form begins with a conceptualization, which includes the objects (concrete or abstract, primitive or composite, ‘real’ or fictional) presumed or hypothesized to exist in some area of interest and their assumed interrelationships (functions or relations).” This
classic definition of an ontology (Genesereth and Nilsson, 1987) is widely acceded to in knowledge engineering and AI. However, this does not commit a computer scientist to a naïve realist view of how our statements about the world and the world itself connect (Wilks, 2005). Anyone actually engaged in knowledge modelling faces subtle and difficult decisions. Firstly, in a world of continuously expanding knowledge, what is the boundary of a particular conceptualisation? Any answer to this question will have a profound effect on any ontology. It is not just the boundary of a particular ontology that is at issue; the challenge extends to the components themselves. If there are entities we can individually identify, how are concept (category) boundaries defined? Do concepts really reflect genuine invariants existing in the world beyond conceptualisation? Is a concept an exact account of the instances falling into it? In our MIAKT application the breast cancer imaging ontology (BCIO) is a product of compromise between generality and specificity, constrained by the actual resources available to do the modelling. The approach taken in the MIAKT project was to focus on a set of core concepts. The relevance of non-core concepts is based on their distances from the core set. In BCIO, we have defined this metric as the number of property references from the core concepts to the non-core ones. There is a long debate in philosophy about the existence and relationship of physical and mental realities. Our position is that in the medical domain, cells, tissues, organs, and human bodies have material existence, and this offers an intuitively obvious basis for providing “ground truth” for a knowledge representation system. This does not mean, however, that organs such as the heart and liver have some kind of natural primacy and that any intelligent system would be bound to delineate reality in this particular way. The requirement of conveying (relevant) information upon invocation of a particular conceptual term guides our selection of what to retain in our ontology. Since relevance is tied to a combination of our interests and goals, as well as a shared conceptual and cognitive apparatus that assimilates perceptual input into stable worlds, what we privilege as ground truth can only be (explicitly or otherwise) justified by keeping these contributing dynamics in mind. So, for instance, hearts and livers do not show up in our ontology as they are hardly, if ever, invoked in managing breast cancer. In describing possible metastases of the cancer to the liver, for example, it is the existence of metastases that makes its way into an ontological category (in the TNM classification of tumours), not the specificity of the affected organ being the liver. Scientists and their view of the world are having an increasingly strong influence on the way that we think. This is not a new view; it was articulated over 50 years ago by Quine (1953). The entities that scientists are committed to in their theories of the world, in many cases, largely shape our understanding and thus the ontologies that we build. Moreover, scientific disciplines do not present clean boundaries. This makes delimiting the domain of discourse a difficult task. For instance, a breast cancer imaging ontology might or might not contain information on patient management, post-treatment, and prognosis, depending on the relevance criteria.
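The relevance metric described above, counting property references from the core concepts out to the non-core ones, can be sketched very simply. In the following illustration the toy “ontology”, the concept names and the cut-off value are all invented for the purpose of the example; they are not the actual BCIO content or the project’s implementation.

```python
from collections import deque

# Toy view of an ontology: concept -> concepts reachable via one property reference.
# Concept names here are illustrative inventions, not the real BCIO classes.
property_refs = {
    "Abnormality": ["RegionOfInterest", "Diagnosis"],
    "RegionOfInterest": ["ImageFeature"],
    "Diagnosis": ["TreatmentRecord"],
    "ImageFeature": [],
    "TreatmentRecord": ["Prognosis"],
    "Prognosis": [],
}

def distance_from_core(core, refs):
    """Breadth-first search: distance = fewest property references from any core concept."""
    dist = {c: 0 for c in core}
    queue = deque(core)
    while queue:
        concept = queue.popleft()
        for target in refs.get(concept, []):
            if target not in dist:
                dist[target] = dist[concept] + 1
                queue.append(target)
    return dist

core_concepts = {"Abnormality", "RegionOfInterest"}
distances = distance_from_core(core_concepts, property_refs)

# Keep only concepts within an (arbitrary) relevance cut-off
relevant = {c: d for c, d in distances.items() if d <= 2}
print(relevant)
```

The point of the sketch is only that “relevance” becomes an operational, graph-theoretic notion once the ontology is in place; where the cut-off is drawn remains a matter of judgement and available resources.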
Loosely speaking, ontology tells us “what is” while epistemology gives us “how to”. However, in practice it is hard to keep these separate, and indeed the “what is” is very often determined by the procedure followed. For instance, breast cancer screening protocols specify not only what exists in the domain of discourse but also how it is elicited, e.g. what should be added, when, by which means and in what amount. It is necessary to distinguish ontological and epistemological knowledge, yet equally necessary to model the latter for repeatability of a medical procedure (Bodenreider et al., 2004). Different imaging modalities effectively provide different views on the same entities, which are restricted by some sort of physical boundaries, e.g. a cyst with a clearly defined margin. How to determine such co-localised entities is non-trivial (Masolo et al., 2003). Researchers have been attacking this issue using different approaches: the multiplicativists consider co-localised entities as different individuals, while the reductionists propose that they are different views of the same spatio-temporal entity. In order to explicitly model the correlation between abnormalities and regions of interest (ROIs), we require an eclectic mixture of the above. In our approach, we introduce several layers of abstraction (Figure 10.3). Entities at each layer are abstracted from those at lower layers and constitute the evidence for those at higher layers. This application nicely demonstrates the limitations of defining a concept only extensionally. Although we can enumerate several symptoms of a particular breast disease, e.g. carcinoma in situ, it is simply impractical to list all known physical and pathological observations associated with a particular disease. Also, because of expanding domain knowledge, it is infeasible to define a concept solely extensionally. What this experience shows is that no formalisation approach is complete, nor can any be said to fully capture reality. All modelling encounters this problem; when we model we represent and record, but we also determine what features and aspects of reality to disregard. We disregard or set aside aspects of the world for many reasons. They may not appear relevant to our current interests or else are simply too complex to characterise. Rather than a formal definition we may offer back the original object of interest (the MRI scan and histopathology slide) and invite the human to reinterpret the material in the context of a particular task and within a particular professional context. We rely on our ability to interpret within a rich context of experience and expertise.
10.4 Conclusions The purpose of this short essay has been to show how, in an age where information is held, transacted and reasoned over computationally, we are encountering long-standing and deep philosophical issues. We have looked at an example where medical ontologies classify and encode relationships between concepts invoked in medical procedures and operations. These
conceptual structures support understanding between human experts with different backgrounds, as well as between human and software components and between software components alone. We have also seen that when we attempt to model the world and move towards an extensional characterisation we confront substantial problems of formalisation. This does not mean that formalisation has no role; indeed we have seen how it can be used effectively in our knowledge-intensive applications. Once the information is indexed against an ontology it can be treated declaratively: our machines can interoperate using these agreed terms, and the specialists and experts will recognise them as relevant and legitimate. But they come into being procedurally, against social and institutional norms. One of the most substantial challenges to formalisation is of course the fact that our science and technology is moving so very fast. In all fields we see that information gathered and understood at a particular time comes to be regarded very differently as new discoveries are made. In many of the scientific arenas in which the Semantic Web is being used, the meanings that we are encoding are likely to change. New forms of instrumentation, ever more powerful imaging methods and novel computational analysis will all contribute to changes in the conceptualisation, indeed in the very concepts used. However, this does not mean that conceptualisation should not be attempted. Notwithstanding any of our computational work, the large metaphysical questions remain. What is the essence of being and being in the world? But our science and technology is moving questions that were originally only philosophical in character into practical contexts. I suggest that this is akin to what happened to natural philosophy from the seventeenth century onwards, as it gave rise to chemistry, physics and biology. As our science and technology evolves, new philosophical possibilities emerge. We now live in an age where we can and do engineer meaning.
Acknowledgments Research for this paper was in part funded by the British Engineering and Physical Sciences Research Council (EPSRC) under the MIAKT grant GR/R85150/01 and under the AKT IRC grant GR/N15764. The author is grateful to other members of the MIAKT and AKT projects, in particular, Kieron O’Hara, Bo Hu, David Dupplaw, Paul Lewis and Srinandan Dasmahapatra. The author also gratefully acknowledges many discussions on the nature of meaning and the philosophy of language with Yorick Wilks. The idea for this essay arose in discussions with Tim Berners-Lee who first proposed the use of the term Philosophical Engineering.
References Berners-Lee, T., Fielding, R. and Masinter, L. 2005. Uniform Resource Identifier (URI): Generic Syntax, IETF RFC 3986 (standards track), Internet Eng. Task Force, Jan. 2005; www.ietf.org/rfc/rfc3986
Berners-Lee, T., Hendler, J. and Lassila, O. 2001. The Semantic Web,Scientific Am., May 2001, pp. 34–43. Bodenreider, O., Smith, B., Kumar, A. and Burgun, A. 2004. ‘Investigating subsumption in DL-based terminologies: A Case Study in SNOMED CT’ in Proceedings of First International Workshop on Formal Biomedical Knowledge Representation (KR-MED 2004) Whistler, BC, Canada. CEUR Workshop Proceedings 102. 12–20. Brachman, R. and Levesque, H. 2004. Knowledge Representation and Reasoning. Morgan Kaufman. Brickley, D. and Guha, R.V. 2004. RDF Vocabulary Description Language 1.0: RDF Schema http://www.w3.org/TR/rdf-schema/ Buchanan, R.A. 1985. The rise of scientific engineering in Britain. British Journal for the History of Science, 18: 218–233. Dasmahapatra, S., Dupplaw, D., Hu, B., Lewis, H., Lewis, P. and Shadbolt, N. 2006. Facilitating multi-disciplinary knowledge-based support for breast cancer screening. International Journal of Healthcare Technology and Management, 7(5): 403–420. Dasmahapatra, S. and O’Hara, K. 2005. ‘Interpretations of Ontologies for Breast Cancer’ Triple C 4(2): 293–303. Dupplaw, D., Dasmahapatra, S., Hu, B., Lewis, P. and Shadbolt, N. (in press). A Distributed, Service-Based Framework for Knowledge Applications With Multimedia. International Journal of Human Computer Studies. Dummett, M. 1991. Frege and Other Philosophers. Oxford: Oxford University Press. Fechner, R.E. 1997. History of Ductal Carcinoma in Situ, in M. Siverstein (ed.) Ductal Carcinoma In Situ of the Breast. Williams & Wilkins: Philadelphia, PA, pp. 13–23. Feferman A. and Feferman S. 2004. Alfred Tarski: Life and Logic. New York, NY: Cambridge University Press. Genesereth, M. and Nilsson, N. 1987. Logical Foundations of Artificial Intelligence. Morgan Kaufmann: Los Altos, CA. Gower, B. 1997. Scientific Method, An Historical and Philosophical Introduction. London: Routledge. Hendler, J.A. 2004. Frequently Asked Questions on W3C’s Web Ontology Language (OWL), www.w3.org/2003/08/owlfaq Hill, D. 1984. A History of Engineering in Classical and Medieval Times. La Salle, IL: Open Court. Hodges, W. 1997. A Shorter Model Theory. Cambridge University Press, ISBN 0-52158713-1 Hu, B., Dasmahapatra, S. and Shadbolt, N. 2003. From Lexicon To Mammographic Ontology: Experiences and Lessons. In: D. Calvanese, G. De Giacomo and E. Franconi (Eds.), Proceedings of the International Workshop on Description Logics (DL’2003), pp. 229–233. Jordan, M. and Weiss, Y. 2002. Graphical Models: Probabilistic Inference. In: M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, 2nd edition, Cambridge, MA: MIT Press. Kirkham, R. 1992. Theories of Truth. Cambridge, MA: MIT Press. Kneale, W. and Kneale, M. 1962. The Development of Logic. London: Oxford University Press. Kuhn, T. 1962. The Structure of Scientific Revolutions. Chicago, IL: University of Chicago Press, 2nd edition 1970. 3rd edition, 1996. Masolo, C., Bango, S., Gangemi, A., Guarino, N. and Oltramari, A. 2003. “Wonderweb Deliverable D18 – Ontology Library”.
NHS, 1997. National co-ordinating group for breast screening pathology. Pathology Reporting in Breast Cancer Screening, Sheffield. Available from http://www.cancerscreening.nhs.uk/breastscreen/publications/qa-07.html. Nirenburg, S. and Wilks, Y. 2001. What’s in a symbol: ontology, representation and language. Journal of Theoretical and Experimental Artificial Intelligence 13(1): 9–23. Quine, W. 1953. On What There is. In: From a Logical Point of View. New York: Harper & Row. Rae, J.B. and Volti, R. 1993. The Engineer in History. New York: Peter Lang. Rumelhart, D.E. and McClelland, J.L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: Bradford Books. Shadbolt, N., Berners-Lee, T. and Hall, W. 2006. The Semantic Web Revisited. IEEE Intelligent Systems, May/June. Tarski, A. 1994. Introduction to Logic and to the Methodology of the Deductive Sciences (new edition of a book originally published in Polish in 1936), New York: Oxford University Press. Waismann, F. 1945. Are There Alternative Logics? Proceedings of the Aristotelian Society 46, pp. 77–104. Waismann, F. 1968. Verifiability. In: Flew, A. (Ed.), Logic and Language. Oxford: Basil Blackwell. Wilks, Y. 1990. Form and Content in Semantics. In: Synthese, 81, 373–389, Amsterdam: Kluwer Academic Publishers. Reprinted in J. Fetzer (Ed.), Epistemology and Cognition. Amsterdam: Kluwer, 1991 and in R. Johnson and M. Rosner (Eds.), Advances in Formal Semantics. Cambridge: Cambridge University Press, 1992. Wilks, Y. and Ballim, A. 1991. Artificial Believers. Norwood, NJ: Erlbaum. Wilks, Y. 2004. Are ontologies distinctive enough for computations over knowledge? In IEEE Intelligent Systems, Trends and Controversies. Wilks, Y. 2005. What Would a Wittgensteinian Linguistics be Like? In Proceedings 9th International Conference on Pragmatics, Lake Garda. Wittgenstein, L. 1922. Tractatus Logico-Philosophicus, translated by D.F. Pears and B.F. McGuinness (London: Routledge and Kegan Paul, 1961). Wittgenstein, L. 1953. Philosophical Investigations, G.E.M. Anscombe and R. Rhees (Eds.), G.E.M. Anscombe (trans.), Oxford: Blackwell. Ziman, J. 2000. Real Science: What it is and What it Means. Cambridge, UK: Cambridge University Press.
11 Machine Translation and the World Wide Web
Harold Somers
School of Informatics, University of Manchester, Manchester, UK
11.1 Introduction Yorick Wilks has been involved in Machine Translation (MT) throughout his career. He chose the task of MT as the ultimate test of understanding for his early work on Preference Semantics (Wilks 1973a, b; 1975a, b), a model of language processing; in the late 1970s he was involved in evaluations of Systran (Wilks, 1978, 1992a);1 while at NMSU he worked on the ULTRA system (Farwell and Wilks, 1991), and in collaboration with CMU and ISI on the Pangloss system (Wilks, 1992b; Helmreich et al., 1993); most recently he has achieved some notoriety as a critic of Statistical MT (Wilks, 1992c). Always a good source of a quotable quote, he has had two of his pronouncements in particular widely reproduced: the first, from Wilks (1990a), is the pair of observations that “any theory (however absurd) can be the basis of an MT system”, and “MT systems rarely operate on the basis of their stated principles”. The second concerns evaluation, which Wilks (1994) reported to be “for all its faults, probably in better shape than MT itself”. Wilks first became interested in MT as a student in the Cambridge Language Research Unit under Margaret Masterman (Wilks, 1990b; 2000). Masterman was an early proponent of the interlingua approach to MT, and expressed the belief that any representation of language for the purposes of MT “should be fundamentally semantic in nature (i.e. based on meaning rather than syntax) and that those semantic structures should be used in the parsing process itself.” (Wilks, 2000: 289). These ideas are clearly mirrored in Wilks’s own Preference Semantics approach to MT, and the insistence on more than shallow processing has remained a theme in his observations on MT over the years. It was little surprise that he should find the statistical approach so unsatisfactory, and he was one of the first (Wilks, 1992c) to predict hybrid approaches as the way forward. Perhaps at least as significant as any of the technical advances in MT research, however, has been the availability, since 1994, of MT free online via the Internet. This is the unifying theme of this chapter, which looks at the impact of free online MT from various perspectives.
1 Based on his 1979–1980 evaluation of Systran for the US Air Force.
11.2 Free Online MT Macklovitch (2001: 27) talks of “the spectacular growth and pervasiveness of the World Wide Web” leading to a “democratization” of Machine Translation (MT) which has “profoundly transformed the MT business”. Free Online MT (FOMT) was first offered by CompuServe in collaboration with Systran in 1994 (Flanagan, 1996), the success of which led to AltaVista’s better-known Babelfish service from 1997 onwards (Yang and Lange, 1998). Since then, more and more services have appeared, some using the same underlying system, or only marginally differing versions of it, while others appear to be little more than on-line dictionaries, and sometimes not even particularly sophisticated ones. Major providers include Lycos, Voila (concentrating on language pairs with French), FreeTranslation, ProMT and Google Translate. Babelfish originally offered ten combinations of European languages, while currently online MT providers between them cover about 30 language pairs including Chinese, Japanese, Korean and Arabic (not including sites offering language pairs which, on inspection, turn out not to be “proper” MT systems in some sense!). As is well known, and as Figure 11.1 shows, far fewer than half of internet users are English speakers (just under 30%), although the majority of web pages are still in English.2 While there are many people in non-English-speaking
Fig. 11.1. Languages used online (based on a total of 1.085 billion users) (left) and percentage of webpages in various languages (from a total of 313.2 million web pages) (right) (Internet World Stats, 2006 (www.internetworldstats.com) and ClickZ Network, July 2000 (www.clickz.com))
2 It is extremely difficult to get exact figures, and this chart from 2000 is woefully out of date. Figures mentioned in more recent articles (usually without citing a source) suggest that English is still dominant, and even if Nunberg is correct when he says “it’s certain that English will account for less than half of Web content within a few years” (Nunberg 2005), it will still outstrip by far any other single language in web presence.
countries who are more or less proficient in English, one survey3 suggests that in China 85% of users prefer to access the web in their native language; the figure for Japan is 84%, 82% for Brazil and Spain, 81% for Argentina and Peru and 79% for Germany. Translation of web pages is big business, which in turn has revolutionized the MT world, creating a whole new and significantly large community of MT users, mostly with little or no knowledge or understanding of how MT works or even, in some cases, how language works. Few surveys have investigated to any extent what use is made of online MT. Shortly after its installation, Yang and Lange (1998) surveyed the use of Babelfish, inviting feedback and comments from users. Predictably, reaction was mixed, ranging from delight (from users suddenly able to get the gist at least of foreignlanguage web pages) to disgust (from users able to judge the quality of the translations, though some such users actually offered help and advice). Besides translating web pages, users can type in text to be translated, and many do, for example using MT to send emails, play games of round-trip translation Chinese whispers (more about this below), do their homework (ditto), translate rude words (a surprisingly high percentage) and no doubt various other uses, some of which (such as the user who asked for a translation of the text of of of of of) can only be guessed at. Traditionally, commentators have sought to distinguish carefully between two uses of MT, for assimilation and for dissemination, or reading and writing, to put it simply. The profile of MT for these two groups of users differs considerably. Crucially, “readers” generally want translation into one language, from a variety of languages, on a variety of topics, with no control over the quality of language input to the MT system; fortunately they can tolerate low quality MT, which they can tidy up if necessary. On the contrary, “writers” generally want translation from one language, perhaps into a variety of languages, and can in principle control the quality of language input to the MT system; usually, they want good quality MT because they cannot check the translation themselves. Our first perspective looks at how this neat distinction is undermined by web-page designers who put a link to an MT system on their web page without appreciating what that MT system might do to their web page.
11.3 What Web-page Designers Should Know about MT Free online MT is ideally aimed at web surfers who come across a webpage in a foreign language and would like to get a rough idea of what it is about. Such use perfectly fits the profile of MT for assimilation sketched above. Increasingly however, web-page designers want to target an international audience by making their webpage available in several languages. If the content of the web-site is constant, it makes better sense for them to have it translated professionally. 3
www.todaytranslations.com/index.asp-Q-Page-E-Language-Usage-on-the-Internet– 99144709. Article not dated, but quotes data mainly from 2004. Accessed 14.11.05.
However, if the material is volatile, being regularly updated for example with news items, weather details or other changeable data, online MT must seem like a reasonable alternative. It is now increasingly commonplace to find that web-page designers have incorporated an explicit link to one or more online MT services. In this section we briefly consider (a) good and bad ways of linking to MT, (b) what happens to a typical web page when the surfer clicks on the link, and (c) what the web-page designer could do about it. While there have been a number of evaluations of FOMT from the end-user’s perspective, only a few researchers have focussed on factors other than translation quality which are often in the hands of the web-page designer. In one of the first such evaluations, Miyazawa et al. (1999) illustrated translation errors due to the poor handling of HTML by MT systems (now largely rectified by companies developing web-translator versions of their systems), for example translating text inside HTML tags, or relocating angle brackets so that they do not pair up. O’Connell (2001) gives a number of tips for web-page designers designed to improve the end result should users have the page translated by MT. Gaspari (2004a) evaluates how a selection of web sites have integrated MT. 11.3.1 Linking to Online MT It is not at all uncommon on web pages to find links to free online MT in small writing, tucked away near the bottom, with a text such as “Click here to have this page translated” in the same language as the rest of the web page. The obvious first point to make is that a reader who needs to have the page translated into another language is unlikely to understand (or to notice) such an instruction. More thoughtful web-page designers will have a flag icon and/or a message in the target language. Such icons are a standard method for indicating other-language versions of genuinely multilingual websites, so in fairness to users, designers should indicate whether the link is to a genuine translation, and indeed some webpages do include such a kind of disclaimer. Amusingly, there is often evidence that these messages have been translated using FOMT, as in Figure 11.2. Often, the links will take you to the website of the MT portal, so the users have to type in the URL for themselves. A more user-friendly link would dynamically generate the translation using the “?” convention in the URL. Another feature which is not so easy to implement at the moment would be to have any links from the translated page automatically translated: the idea is that if you are browsing in translation, you presumably want subsequent pages also translated. 11.3.2 How Well are Web Pages Translated? As with any text submitted to MT, the quality of translation of web pages translated by FOMT systems will depend to some extent on the source text. FOMT is no more or less susceptible to a misparsed sentence or a wrong word-sense selection, except inasmuch as web pages are more likely to contain text which is difficult to translate due to poor spelling and grammar or colloquial and informal style. On the
Translate Our Camcorder Batteries to any language. Free! Traduzir Nossa [Camcorder] Pilhas a qualquer língua. Grátis! Prevesti Naš [Camcorder] Tuci to bilo koji jezik. Slobodan! Chápat Náš Videokamera Bít až k jakýkoliv jazyk. Drzý! Oversætte Vor [Camcorder] Akkumulatorer hen til hvilken som helst sprog. Omkostningsfrit! Vertalen Onze [Camcorder] Rammeien voor ieder taal. Zonder kosten! Traducir Nuestro [Camcorder] Pilas hasta cualquier lenguaje. Libre! Kääntää Meidän [Camcorder] [Batteries] ilm. suuntaa kukaan kieli. Vapaa! Traduire Notre [Camcorder] Piles à tout langue. Libérer! Übersetzen Unserer [Camcorder] Batterien zu Landessprache. Umsonst! Tradurre Nostro [Camcorder] Batterie verso qualsiasi lingua. Libero! Traduzir Nossa [Camcorder] Pilhas a qualquer língua. Grátis! Chyfieitha 'n Camcorder Chyflegrau at unrhyw dafodiaith. Rhyddha!
If you would like to view our web pages in your language, please choose a link below to translate our website to many different languages. Just enter the web page address (http://...) and these websites will translate it to the language of your choice.
Si vous aimeriez à la vue nos pages de toile dans votre langue, s'il vous plaît choisir un lien au dessous à traduit notre website à beaucoup de langues différentes. Seulement entrer le (http://.. d'adresse de page de toile.) et ces websites le traduira à la langue de votre choix.
Please note that some of these translators work better than others, please feel free to experiment with them all and find the one that works best for you.
S'il vous plaît la note qu'une partie de ces traducteurs travaillent mieux que les autres, s'il vous plaît sans sens à l'expérience avec eux tout et trouve l'un que les travaux mieux pour vous.
Fig. 11.2. Example of link to MT (found on http://www.batterybank.com/translate.html. Last accessed 21.11.05)
other hand, certain features typical of web pages may cause particular problems for MT. We mentioned above some problems with MT systems mishandling HTML tags, leading to inaccessible links and missing graphics. MT systems are nowadays more tuned to HTML and “know” not to translate the contents of tags. But there are still HTML-related problems. For example, MT systems cannot generally translate “around” mark-up; consider the sequence “<b>B</b>old”, which would appear as “Bold” (with the initial letter in bold face) on a web page: Babelfish translates this into German as B alt (literally “B old”) rather than Fett. FOMT also has difficulties with web-page features such as pop-ups, floats, tables and forms. Figure 11.3 for example shows two portions of a web-page translated from Russian along with the original. The top section shows a drop-down menu embedded in a graphic, in which the items in the menu have been transformed into strange characters. The search button next to the text-box below the date has not been translated at all. In the bottom section, the abbreviations for the weekdays
Fig. 11.3. Portions from the Russian newspaper webpage (http://www.kp.ru/, as at 21.11.05), together with Babelfish translations
have been transliterated, while the items in the drop-down menu “Our headings” have this time been translated. 11.3.3 Improving Web-page Translatability There has been a fair amount of research recently on “translatability” (Gdaniec, 1994; Bernth, 1999a, b; Bernth and McCord, 2000; Underwood and Jongejan, 2001; Bernth and Gdaniec, 2002; O’Brien, 2005). Research has focussed on identifying “translatability indicators”, stylistic or grammatical linguistic features that are known to be problematic for MT (so a more transparent name would perhaps be “translation difficulty indicators”). For example, mid-sentence
parenthetical statements or the use of the passive voice could respectively be classified as stylistic and grammatical indicators. While such measures are of use to linguists, and to designers of controlled languages, they mean little or nothing to the average lay-user or web-page designer. We will return to this question in the next section, but for our present purposes it is probably more helpful to outline some obvious features of web pages which will impact on their translatability. The following list suggests some mostly self-explanatory tips to web-page designers which will improve the chances of their web pages being translated reasonably.
• Avoid including text in graphics and pdf files as these will not be translated.
• Web pages containing frames may be only partially translated.
• Keep sentences short and simple.
• Get spelling and punctuation right (including accents for languages that use them).
• Formatting (e.g. italics for emphasis) may not have the same interpretation in the target language, or may be rendered meaningless by word-order changes.
• Avoid idioms, puns, slang, and jokes.
• Use connecting words like that, which, who to identify relative or subordinate clauses.
• Proper names are often handled quite well by MT systems if a title such as Mr, Dr, President is included.
• Avoid anaphoric references.
• Avoid acronyms.
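By way of illustration, a few of these tips can be turned into rough automatic checks. The sketch below flags over-long sentences, probable acronyms and some pronouns that may be anaphoric; the threshold, the regular expressions and the pronoun list are arbitrary assumptions for the example, and real translatability indicators of the kind discussed above are considerably more sophisticated.

```python
import re

PRONOUNS = {"it", "they", "them", "this", "these", "those"}  # crude anaphora cues

def translatability_warnings(text, max_words=25):
    """Return rough warnings about features that tend to hurt MT quality."""
    warnings = []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for i, sentence in enumerate(sentences, 1):
        words = sentence.split()
        if len(words) > max_words:
            warnings.append(f"Sentence {i}: longer than {max_words} words")
        if re.search(r"\b[A-Z]{2,}\b", sentence):
            warnings.append(f"Sentence {i}: contains a probable acronym")
        if any(w.lower().strip(".,;") in PRONOUNS for w in words):
            warnings.append(f"Sentence {i}: contains a pronoun that may be anaphoric")
    return warnings

print(translatability_warnings("The CEO visited the school. It was a great success."))
```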
Interestingly, many of these tips are also found in guidelines for (monolingual) web-page readability (Krug 2000; Nielsen 2000; Gaspari 2004b). Readers of web pages tend to scan the page rather than read linearly, which is why shorter selfcontained sentences are to be preferred. Regarding proper names, many MT systems now include a way of marking sequences not to be translated: it would be of great benefit if these could be standardized and incorporated into HTML. Babych and Hartley (2004) have shown how named-entity recognition can improve the quality of MT output by identifying proper names and other sequences that should not be translated.
11.4 Evaluating MT Using Round-trip Translation In the previous section we mentioned web-page translatability: both readers and designers of web-pages might like to know how well the web pages are translated by FOMT systems, either to help choose the best system for the job or, in the case of designers, as an indication of whether they need to simplify the text. While the MT community has invested a considerable effort into developing evaluation methods and metrics – cf. Wilks’s (1994) comment that evaluation is “for all its faults, probably in better shape than MT itself” – all of these depend on human judges who can read and understand the target language, or one or more model translations in the target language. Obviously, for both types of FOMT
user, such methods are unsuitable. A very frequently suggested alternative is the intuitive technique of “round-trip translation” (RTT), or “back-and-forth translation”, in which a given text or sentence is translated into some foreign language by the MT system (the “forward translation”, henceforth FT), then the result is translated back into the original language by the same system (the “back translation”, BT). Popular articles on MT by journalists and other lay-users all too frequently use this technique to “evaluate” MT, with results which are, depending on your predisposition, hilarious or infuriating. A recent example is from the Biomedical Translations website (Anon, 2003), where the author explains the technique, and suggests that “In theory, the back translated English should match the original English.” Several garbled examples are then given, and the article concludes “Would you trust your surgeon using these instructions?” Another website recognizes the problem “Machine translations can produce text that is garbled or hilariously inaccurate”, and suggests as a resolution “Test the precision of your translated text by sending a phrase on a round trip through the translation engine.” (Anon, 2005). Although it is widely agreed in the MT community that RTT is a bad technique, and equally widely suggested in the lay community that it is an effective way to evaluate systems, there has been little or no work to demonstrate empirically whether RTT is in fact as misleading as it is claimed. A series of small experiments first reported in Somers (2005) have attempted to do this. 11.4.1 Evidence that RTT Does or Does Not Work The dangers of the RTT approach have long been appreciated: O’Connell (2001) gives the following sound advice on an IBM website: “A common misunderstanding about MT evaluation is the belief that back translation can disclose a system’s usability. [ ] The theory is that if back translation returns [the source language] input exactly, the system performs well for this language pair. In reality, evaluators cannot tell if errors occurred during the passage to [the target language] or during the return passage to [the source language]. In addition, any errors that occur in the first translation [ ] cause more problems in the back translation.” As O’Connell and other commentators who understand how MT works have pointed out, RTT could be misleading for three reasons: First, if the round trip is bad, you cannot tell whether it was the outward journey or the return trip where things went wrong. For example, (1) shows an RTT from English to Italian and back again using Babelfish. The resulting BT (1c) is garbled, but in fact the FT into Italian is really quite acceptable. (1)
a. Select this link to look at our home page.
b. Selezioni questo collegamento per guardare il nostro Home Page.
c. Selections this connection in order to watch our Home Page.
Secondly, a bad FT can nevertheless lead to a quite reasonable BT. So the fact that the round trip gives a good result does not necessarily tell you anything about the outward journey. This can be illustrated in (2), again using Babelfish, where the idiomatic phrase is translated literally into meaningless Portuguese (2a) and then “perfectly” back into English (2c). (2)
a. tit for tat
b. melharuco para o tat
c. tit for tat
The third point is that of course the basic premise of RTT is flawed: even a pair of human translators would not be expected to complete a perfect RTT, in the sense that the return translation would be word-for-word identical to the original source text. It is easy to show RTT not working. But equally we should acknowledge that sometimes, RTT does appear to work, producing a quite understandable paraphrase and, if only we knew it, a reasonable translation on the way. Examples (3) and (4) (which will be familiar to many readers) translated by Freetranslation, illustrate this. (3)
a. The spirit is willing but the flesh is weak.
b. Дух желает, но плоть слаба.
c. The spirit wishes, but the flesh is weak.
(4) a. My car drinks petrol.
b. La mia automobile beve la benzina.
c. My automobile drinks the gasoline.
11.4.2 Can RTT Tell us Which MT System is Best? Even if RTT does not always work, we might hope that the quality of the RTT will reflect the quality of the FT: if this is true, then at least RTT could be used to help lay-users to decide which system to use, when they are faced with a large number to choose from. In order to explore this hypothesis, Somers (2005) took four texts representing various language pairs, translated them each using five FOMT systems,4 then translated the resulting FT back into the original language using the same system. Two standard measures were used to evaluate the results, the familiar BLEU metric (Papineni et al., 2002), and Turian et al. (2003) F-score metric, both of which compute the n-gram overlap between the translations to be evaluated and one or more reference (“gold standard”) translations. To clarify, in the description that follows, FT is the process of translating the original source text (S) into a target text (T), and we evaluate it by comparing T to reference translations (R) of the original text. BT is the process of translating this translated text T back into the original language, resulting in text B, and the evaluation compares B to the original text S. The term RTT refers to the whole process. 4
4 Babelfish, Freetranslation, Systran, ProMT, and Worldlingo.
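The scoring set-up can be sketched as follows, using NLTK’s corpus-level BLEU in place of the original implementations. The miniature data mirror example (1) above: S is the source, T the forward translation and B the back translation quoted there, while the Italian reference translation R is invented here purely for illustration; none of this is the actual experimental data or code.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# S: original source; R: reference translation of S (invented for this sketch);
# T: forward translation (FT) produced by the system; B: its back translation (BT).
S = ["select this link to look at our home page".split()]
R = ["seleziona questo collegamento per visualizzare la nostra home page".split()]
T = ["selezioni questo collegamento per guardare il nostro home page".split()]
B = ["selections this connection in order to watch our home page".split()]

smooth = SmoothingFunction().method1  # smoothing is needed for such short texts

# FT quality: compare T with the reference translation(s) R
ft_score = corpus_bleu([[r] for r in R], T, smoothing_function=smooth)

# BT quality: compare B with the original source S, which acts as its own reference
bt_score = corpus_bleu([[s] for s in S], B, smoothing_function=smooth)

print(f"FT BLEU = {ft_score:.3f}, BT BLEU = {bt_score:.3f}")
```

The interesting question is then whether the two columns of scores move together across systems and texts, which is what the following comparison examines.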
The texts were as follows: extracts from the French web pages of the Tourist Offices of Marseilles and Barèges (a skiing resort) for translation into English, and two passages from the Europarl corpus of European Parliament Proceedings 1996–2003, one in English, for translation into German, and one in French, for translation into English. All the texts were around 100 sentences long. The reference translations for the Europarl texts were taken from the parallel corpus, while those for the tourism texts were produced by the present author. Figures 11.4 and 11.5 show the BLEU and F-scores for the 20 pairs of translations, FT (i.e. S → T, score T:R) and BT (i.e. T → B, score B:S), grouped by text, and ordered within each group. The order of the systems is different for each text, but since, for our purposes, we are only interested in seeing whether the scores for the FTs and BTs correlate, the identity of the individual systems is unimportant. The first thing to notice is that both BLEU and F-score show little correlation between the FT and BT scores (Pearson’s coefficient r = −0.04). We should note that the difference in scores for some of these systems is really quite small, and that for two of the texts, the system with the top-ranking score for FT is actually ranked fourth or fifth for the BT. Our first conclusion then is that RTT is not a particularly good way to identify which system is better: if anything, a high-scoring BT indicates either the best or the worst system, but even this is not systematic.
Fig. 11.4. BLEU scores for the 20 forward and back translations
Fig. 11.5. F-scores for the 20 forward and back translations
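Assuming the per-system, per-text FT and BT scores have been collected into two parallel lists, the correlation just cited can be computed directly as in the short sketch below; the numbers shown are placeholders, not the actual scores behind Figures 11.4 and 11.5.

```python
from scipy.stats import pearsonr

# Placeholder FT and BT scores for the 20 system/text pairs (invented values)
ft_scores = [0.31, 0.28, 0.25, 0.22, 0.19, 0.35, 0.30, 0.27, 0.24, 0.21,
             0.18, 0.16, 0.15, 0.14, 0.12, 0.40, 0.37, 0.33, 0.29, 0.26]
bt_scores = [0.45, 0.52, 0.48, 0.50, 0.55, 0.42, 0.47, 0.44, 0.51, 0.49,
             0.53, 0.46, 0.50, 0.54, 0.52, 0.41, 0.43, 0.48, 0.45, 0.47]

r, p = pearsonr(ft_scores, bt_scores)
print(f"Pearson's r = {r:.2f} (p = {p:.2f})")
```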
What is also striking is that the BT score is often better than the FT score, and the difference is greatest when the FT score is low. Although the results do not show a consistent pattern, what is clear is that a good score for the BT generally does not necessarily “predict” a good score for the FT; rather more often the opposite. The reason for this is fairly easy to explain, considering how these MT systems in general work. Although systems perform source-text analysis to a certain extent, when all else fails they resort to word-for-word translation, and where there is a choice of target word they will go for the most general translation. Clearly, when the input to the process is difficult to analyse, the word-for-word translation will deliver pretty much the same words in the BT as featured in the original text. 11.4.3 Can RTT Predict General Translation Quality? A second experiment reported by Somers (2005) wanted to see if the scores for the BT would correlate with scores for FT when texts that the MT systems translate well are compared with texts that prove difficult. Based on the BLEU and F-scores, three of the texts from the first experiment were taken, and the scores computed for their BTs using Freetranslation, neither the best nor the worst of the MT systems. These “hard” texts were the Marseilles web-page and the two Europarl examples. These were compared to three “easy” texts: a children’s story (Goldilocks and the Three Bears), some text from Canadian weather forecasts,5 and some typical entries from a tourist’s phrase-book. Like the “hard” texts, the “easy” texts were all roughly 100 lines long. Reference translations for Goldilocks and the weather forecasts were done by a French translator. In (5) we see some examples of BTs that show that the easy texts were indeed generally well translated back and forth. (5)
a. Therefore she went in top in the bedroom where the three Bears slept, and there was the three beds.
b. Today. Cloudy with the clear periods and some snow. High close to −9. The winds of the west 15 to 30 km/h. Tonight. Cloudy with the clear periods and 30% probability of flurries.
c. Do you speak the English? I do not speak the French. I do not understand. Please to speak slowly. I hope that you understand my English.
The comparison of the BLEU scores for the FT and BT of these six texts is shown in Figure 11.6. The figure shows quite dramatically that, at least as far as the BLEU scores go, the easy texts are somewhat easier to translate than the hard texts; and it shows equally clearly that the score for the BT does not reflect this at all: in fact according to the BT score, all the texts are of about the same difficulty. The
5 From RALI’s Météo website, http://rali.iro.umontreal.ca/meteo, as described in Langlais et al. (2005). The texts were lightly pre-edited, converting the all-uppercase text to mixed case, inserting accents, and also changing moins and minus in temperature read-outs to a minus-sign.
Fig. 11.6. BLEU scores for the forward and back translations (FT and BT) of three “easy” texts (3bears, meteo, tourist) and three “hard” texts (bareges, marseille, euro-F-E)
correlation between BLEU scores for the FT and BT is r = −0.31, while for the F-scores (not shown) it is r = 0.59. Somers (2005) concludes that, as MT experts would assume, RTT is not really a good indicator of anything. However, the article ends on a note of caution, recognizing that the evaluation measures chosen (BLEU and F-score) tend to favour translations that are lexically close to the oracle translation, without taking into account whether they are grammatical or make sense. To be really sure of the results, it would be good to replicate the experiments evaluating the translations using a more old-fashioned method involving human ratings of intelligibility.
11.5 Detecting Misuse of MT by Language Students Our final perspective looks at a problem in the world of language teaching that has arisen from the ready availability of FOMT, where students – especially weaker ones – use FOMT to do their translation homework. Apart from the pedagogic implications, one question of interest is whether we can devise any techniques for automatically detecting such use. In this section we describe some pilot experiments using techniques derived from computational stylometry, plagiarism detection, information retrieval, and text reuse, partly influenced by the METER project (Clough et al. 2002). 11.5.1 MT in the Classroom There is a growing literature on the impact of MT in general on the language classroom (see Somers (2001) for an overview). Much of the focus is on what
trainee translators (or language learners as potential professional translators) should learn about MT, and how MT can be taught to computational linguists. There are also contributions suggesting how MT can be used as a kind of computer-assisted language learning tool. Of interest are approaches which seek to exploit the weaknesses of MT to illustrate the differences between languages, or to heighten learners’ appreciation of matters of grammar and style in both languages (Richmond, 1994; Anderson, 1995; McCarthy, 2004; Niño, 2004). However, such uses carry with them the danger that students, particularly beginners, cannot readily identify examples of bad usage, and have a not necessarily justified trust in the accuracy of computer output. Our focus is somewhat different: we are interested in the “illicit” use of FOMT by students seeking a quick way of completing their translation assignments. It is shocking to consider that the standard of translation achieved by FOMT might, according to one teacher and examiner, be worthy of a C grade – a moderate pass – at “A” level. Coupled with the fact that there is a growing move towards coursework-based syllabuses in “A” levels, it is clear that we need some way of detecting improper use of FOMT by students. That the availability of FOMT could pose a problem for language teachers is recognized in a thoughtful article by Brian McCarthy (2004). As he suggests, FOMT “impacts negatively on the teaching of translation when students simply feed the [ ] passage they have been given as an assignment through the translation service and submit the [ ] output for assessment. Motivation for this course of action can vary.” Among the causes are “lack of time, lack of energy, or lack of imagination, coupled with a lack of scruples or a lack of linguistic insight”. Submitting output from FOMT for assessment is bad for a number of reasons: it is unfair to students who have invested the intellectual effort and time into producing an original translation; a translation produced with no intellectual input has no instructional value; and it is a waste of the teacher’s time to correct it.
11.5.2 Detecting Plagiarism There is a considerable literature on plagiarism detection, which seems, with the growth of the Internet over the last 10 years, to have become a major industry (see Clough (2003) for a good overview). Educators are concerned that students can now too easily complete assignments making inappropriate use of resources found on the Web, whether it be submitting a term paper wholly copied from the Web (perhaps from one of the growing number of “paper mills” and “essay banks”), or more subtle cutting, pasting, combining and editing of several sources without due acknowledgement. There are now numerous services and software packages available which will search the Internet to try to find sources that have been plagiarised, using a number of text similarity measures, to which we will return below. Others will compare sets of documents with each other in order to
detect collusion, a type of plagiarism where students submit essentially identical assignments because they have worked together on them. Plagiarism detection has some affinities with, and shares some of the techniques of, several other branches of computational linguistics and linguistic computing: stylometry and authorship attribution, forensic linguistics, document classification, information retrieval, corpus linguistics. Our particular interest has two characteristics which make the standard approaches to plagiarism detection less relevant. First, we know beforehand the text (or small group of texts) which we want to check the students’ work against (henceforth, the “source text”). Second, when students do a translation assignment, it is reasonable to expect that there will be textual overlap between their work, corresponding to the range of acceptable translations. So we need to find a way of measuring excessive similarity to the source text, and/or perhaps similarity to specific portions of it. For this reason, we find the related work on legitimate reuse of text to be of more relevance, typified by the METER project (Clough et al., 2002), concerned with journalists’ use of news agency text. Plagiarism detection methods are mostly based on string similarity measures, ranging from simple vocabulary profiling measures, through string sequence similarity measures, to attempts to profile the semantic similarity of texts. In the experiments to be described here, we concentrate on a range of word-counting measures which can be easily implemented and are more or less language-independent. 11.5.3 Pilot Experiment This section describes one of a number of experiments with students at various levels and a variety of languages (the completed experiment is described in Somers et al., 2006). In the present case, we worked with a group of ten students studying beginners’ Italian at university.6 For our experiments we need some examples of legitimate “honest” translations, and some examples of lightly post-edited FOMT output. For obvious reasons it would be difficult to get genuine examples of the latter, so we devised a means of generating parallel sets of translations done with and without the “help” of FOMT. For this experiment we took a short English text (224 words in 14 sentences) from a website7 and, using the AltaVista Babelfish service, translated it into Italian. The students were asked to perform one of two tasks with the text. One task was to translate it into Italian using conventional resources (dictionaries, grammar reference books); the other was to take the Babelfish translation and “tidy it up” as much as possible in a strictly limited timeframe. Half the students did the task in this order, while the other half worked on the Babelfish
6 I am extremely grateful to Federico Gaspari and of course to his students at Salford University.
7 “Stars of Singapore visit east London school”, www.london2012.org/en/news/archive/2005/October/2005-10-05-13-42.htm
translation first. This is an attempt to control for Babelfish influence on the honest translations, although it must be said that their level of Italian is so elementary that any such effect would be hard to detect. Because of the shortage of time available, for the translation task, the text was split into two halves, and students worked on one or other portion. In what follows we will make a distinction between “derived” translations resulting from post-editing the Babelfish output, and “honest” translations which are done in the traditional manner. Unfortunately, not all the students completed both tasks, so that in the results presented below we have ten derived and eight honest translations. Some examples of the students’ work are shown in (6): (6a) shows the original English, (6b) the Babelfish translation, (6c–d) two derived translations, and (6e–f) two honest translations. (6)
a. I want to run at the 2012 Olympics for South Africa.
b. desidero funzionare alle 2012 Olimpiadi per la Sudafrica.
c. desidero correre alle 2012 Olimpiadi per la Sudafrica.
d. desidero correre nelle Olimpiadi 2012 per la Sudafrica.
e. Voglio correre ai giochi Olimpici per Sudafrica.
f. Voglio correre ai giochi Olimpiachi per Sudafrica.
11.5.4 Results

We have experimented with a number of measures based on techniques used in stylometry and plagiarism detection to see if we can distinguish, by comparing them with the Babelfish translation, a derived translation from an honest, albeit flawed, offering.

11.5.4.1 Simple Word Counts

Early attempts at computational stylometry focused on simple statistics based on word frequency counts (cf. Baayen, 2001). Measures such as type–token ratio, and other measures of vocabulary richness are not appropriate for our task, as the texts are too short. However, from this field comes the idea of counting hapax legomena (HL; lit. "once said", i.e. words occurring once in the text, also termed "singletons"): the idea is that a significant overlap in use of infrequent words might suggest copying. This is the basis of the CopyCatch program (Woolls and Coulthard, 1998), in which it is claimed that an overlap of 70% is suspicious. We measure HL overlap by counting the percentage of singletons in the source text which are also singletons in the target. Words occurring exactly twice – dis legomena (DL; "doubletons") – also have a distinctive distribution, so we measure DL overlap too. By extension, we propose a measure taking into account all "n-letons": if we consider the different totals for all the frequencies, we can calculate a Euclidian distance measure F as in (7),

(7)    F(s, t) = \sqrt{ \sum_{i=1}^{n} ( f_i^s - f_i^t )^2 }
where f_i^x is the number of words occurring with frequency i in text x, n being the frequency of the most frequent word. Looking at the frequencies of individual types in the two texts, two further measures of text similarity suggest themselves. The first is the percentage of words that have exactly the same frequency (SF) in the two texts. The second is again a slightly different Euclidian distance E based on the total frequencies of the words in each text, as in (8),

(8)    E(s, t) = \sqrt{ \sum_{w \in s \cup t} ( f_w^s - f_w^t )^2 }

where f_w^x is the frequency of occurrence of the word w in text x. The intuition behind this measure is that if one text is derived from another, the distribution of the individual words used in the two texts will be similar.
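To make these word-count measures concrete, the sketch below (ours, not the implementation used in the study; the function name and the simple whitespace tokenisation are assumptions) computes HL and DL overlap, SF and the two Euclidian distances F and E for a pair of tokenised texts. The chapter does not pin down which vocabulary SF is taken over, so the sketch uses the joint vocabulary of the two texts.

```python
from collections import Counter
from math import sqrt

def word_count_measures(source_tokens, target_tokens):
    """HL, DL and SF as percentages, plus the Euclidian distances F and E,
    for two tokenised texts (an illustrative sketch only)."""
    src, tgt = Counter(source_tokens), Counter(target_tokens)
    vocab = set(src) | set(tgt)

    def overlap(k):
        # Percentage of source words occurring exactly k times that also occur
        # exactly k times in the target (k=1 gives HL, k=2 gives DL).
        k_words = [w for w, c in src.items() if c == k]
        return 100.0 * sum(1 for w in k_words if tgt[w] == k) / max(len(k_words), 1)

    hl, dl = overlap(1), overlap(2)

    # SF: percentage of words (here, over the joint vocabulary) that have
    # exactly the same frequency in both texts.
    sf = 100.0 * sum(1 for w in vocab if src[w] == tgt[w]) / len(vocab)

    # F: Euclidian distance over the frequency spectrum, where spectrum[i] is
    # the number of words occurring exactly i times, up to the highest frequency n.
    n = max(max(src.values()), max(tgt.values()))
    def spectrum(counts):
        return [sum(1 for v in counts.values() if v == i) for i in range(1, n + 1)]
    f_dist = sqrt(sum((a - b) ** 2 for a, b in zip(spectrum(src), spectrum(tgt))))

    # E: Euclidian distance over the per-word frequencies themselves.
    e_dist = sqrt(sum((src[w] - tgt[w]) ** 2 for w in vocab))

    return hl, dl, sf, f_dist, e_dist
```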
Table 11.1. Word-count measures for comparison of derived and honest translations with Babelfish text

Case     HL        DL        F         SF        E
i1d      80.1724   61.9048    8.7750   75.9740   10.5830
i2d      81.8966   76.1905    6.1644   77.2727    9.5394
i3d      81.8966   66.6667    8.3066   79.8701    5.4772
i4d      87.0690   90.4762    5.0990   84.4156    6.0828
i5d      95.6897   85.7143    4.8990   92.8571    4.5826
i6d      86.2069   80.9524    4.3589   85.7143    6.3246
i7d      87.0690   80.9524    3.3166   82.4675    7.7460
i8d      75.0000   61.9048   14.2478   71.4286    7.6158
i9d      76.7241   57.1429   10.3441   70.7792   13.8564
i10d     84.4828   85.7143    3.6056   81.8182    9.1652
i11h     33.3333   30.0000   13.3041   32.2581   14.2829
i12h     30.7692   30.0000   22.0681   31.1828    8.9443
i13h     32.0513   40.0000   17.9722   32.2581   10.5830
i14h     34.6667   55.5556   26.1916   36.2637    7.0711
i15h     17.3333   11.1111   37.2424   15.3846    9.6437
i16h     34.6667   22.2222   15.3623   35.1648    6.7823
i17h     36.0000   44.4444   17.8606   38.4615    7.6158
i18h     43.9655   23.8095    7.0000   38.3117   15.7162
Fig. 11.7. Graphic display of data from Table 11.1. White squares show derived translations, black triangles honest translations
Table 11.1 shows the scores for these five measures. Derived and honest translations are identified as "d" and "h". Where students did not complete the whole text, comparisons are with the equivalent portion of the Babelfish text. Figure 11.7 shows the same results in graphic form. Since we are not interested in comparing students' performance on the two tasks, the cases are simply numbered 1 to 18, with no connection made between the "d" and "h" translations by the same student. The figure shows rather clearly that of our proposed measures, HL and SF seem to work very well. DL goes some way towards distinguishing derived and honest translations, though not as decisively. The two Euclidian distance measures do not appear to be able to distinguish at all.

11.5.4.2 Comparing Word Sequences

An intuitive way of detecting plagiarism is to look for common sequences of words, and indeed this has been the basis of several approaches. The UNIX diff function is based on longest common subsequence matching, while searching for overlapping n-grams has been used for example by Brin et al. (1995), Heintze (1996), Shivakumar and Garcia-Molina (1996) and Lyon et al. (2001). A use of n-gram matching that is very familiar in the world of MT evaluation is in the BLEU measure (Papineni et al., 2002), already mentioned in Section 11.4, along with Doddington's (2002) derived NIST algorithm. Both these measures essentially give a weighted precision score based on the number of n-grams common to both source and target texts. In the mteval implementation8 n-grams up to n = 9 are included. While the idea of n-gram matching against an oracle translation is somewhat controversial for MT evaluation, it seems to offer a good platform for evaluating the similarity of two texts.
8 Downloadable from www.nist.gov/speech/tests/mt/resources/scoring.htm.
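By way of illustration only (BLEU proper, as defined by Papineni et al. (2002), uses clipped n-gram counts, a brevity penalty and a geometric mean over n, and NIST weights n-grams by their informativeness), a bare-bones n-gram overlap precision of the kind being appealed to here can be sketched as follows; the function names are ours.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap_precision(candidate, reference, max_n=4):
    """Fraction of the candidate's n-grams (n = 1..max_n, pooled together) that
    also occur in the reference: a much simplified, BLEU-flavoured similarity
    score, not BLEU itself."""
    matched = total = 0
    for n in range(1, max_n + 1):
        ref_set = set(ngrams(reference, n))
        for g in ngrams(candidate, n):
            total += 1
            matched += g in ref_set
    return matched / total if total else 0.0
```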
A simpler measure of text similarity based on word sequences is of course Levenshtein (or string-edit) distance (LD) (Levenshtein, 1965). Implementations can vary as to whether they count only substitutions, insertions and deletions ("indels"), or also count transpositions, mergers and expansions. Also, segments can be treated as strings of words or strings of characters. For our application, we calculate the LD in its simplest form (substitutions and indels) for each segment taken as a string of words, and provide an average over the individual segment scores. Both the BLEU/NIST algorithms and the LD rely on the two texts being sentence-aligned. Fortunately, student translations typically follow the structure of the source text fairly closely, and students generally translate sentence by sentence. MT systems certainly do so too.

Table 11.2 shows the LD, BLEU and NIST scores for our data, which are reproduced graphically in Figure 11.8. All three measures show a clear separation of the honest texts from the derived translations. LD has the additional advantage that it can show us exactly where the texts differ. Table 11.3 shows the individual LD scores on a sentence-by-sentence basis. The column headers show the number of words in each sentence in the Babelfish text. LD is obviously a good indicator of possible plagiarism, but the raw scores can be misleading. Obviously a higher score indicates greater differences between two texts, but it is not obvious what constitutes a "high" score. The actual lengths of the sentences being compared are important in interpreting the scores.
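Before turning to the scores, here is a minimal sketch of the LD computation described at the start of this subsection, assuming the two texts have already been split into corresponding sentences; the code and names are ours. It is the standard dynamic-programming word-level edit distance, counting substitutions and indels only, averaged over the sentence pairs.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists, counting only
    substitutions, insertions and deletions ("indels")."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete wa
                            curr[j - 1] + 1,             # insert wb
                            prev[j - 1] + (wa != wb)))   # substitute
        prev = curr
    return prev[-1]

def average_ld(source_sentences, target_sentences):
    """Mean per-sentence word-level edit distance over aligned sentence pairs."""
    pairs = list(zip(source_sentences, target_sentences))
    return sum(word_edit_distance(s.split(), t.split()) for s, t in pairs) / len(pairs)
```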
Table 11.2. LD, BLEU and NIST scores for comparison of derived and honest translations with Babelfish text

Case     LD        BLEU      NIST
i1d       5.0714   0.5921    6.3421
i2d       2.8571   0.7771    6.9152
i3d       2.7143   0.7905    7.0818
i4d       2.5000   0.7950    6.9880
i5d       2.1429   0.8600    7.5150
i6d       2.1429   0.8262    7.1519
i7d       2.6429   0.7724    6.8960
i8d       4.5714   0.6327    6.4551
i9d       6.0714   0.5402    5.8597
i10d      2.7143   0.7796    6.9190
i11h     12.7143   0.1867    2.7729
i12h     11.7143   0.1478    3.1680
i13h     12.1429   0.1901    3.3477
i14h     10.4286   0.1806    3.1405
i15h     13.5714   0.0518    0.5997
i16h      9.7143   0.2077    3.7568
i17h      9.1429   0.2700    3.9524
i18h     11.5000   0.2193    4.1347
Fig. 11.8. Graphic display of data from Table 11.2. Again, white squares show derived translations, black triangles honest translations. NIST scores are doubled so as to share the same scale as LD
Figure 11.9 shows the individual LD scores on a sentence-by-sentence basis, expressed as a percentage of the number of words in each sentence in the original text. For example, even the honest translations have an LD of 0 for the final 3-word sentence (the date), and both honest and derived translations show a high percentage change for the first sentence, the 7-word title. Note that scores over 100% are possible.

11.5.5 Conclusion

Although these results are based on only one small experiment with a few students, they do suggest that there are a number of measures that can indicate that a translation is suspiciously similar to a free online version: for our purposes this is sufficient, as it will signal to the teacher that the work should be looked at. Somers et al. (2006) report on further experiments with German and Spanish students that largely confirm these preliminary findings. Of the five word-count measures, HL consistently works best, the difference between scores for "h" and "d" translations being statistically significant at p < 0.02. A similar result is obtained for LD and BLEU.
Table 11.3. Sentence-by-sentence LD scores

Case   10  22  37  15  11   8  10  20  11  10  25  26  21   4   Av'ge
i1d     7   3  17  15   3   1   4   4   0   0   8   4   5   0    5.0714
i2d    10   4   1   8   0   1   5   5   0   0   2   3   1   0    2.8571
i3d     8   1   0   4   0   1   5   6   0   0   8   4   1   0    2.7143
i4d     7   0   3   7   0   2   3   2   0   0   4   7   0   0    2.5000
i5d     8   7   9   0   0   1   4   0   0   0   0   0   1   0    2.1429
i6d     7   2   0   3   1   3   6   3   1   1   3   0   0   0    2.1429
i7d     8   2  11   5   1   0   2   2   0   0   2   4   0   0    2.6429
i8d     9   4  11   4   2   4   6   2   2   1  13   5   1   0    4.5714
i9d     7   6  17  15   4   4   4   4   3   0  15   5   1   0    6.0714
i10d    9   5   0   7   0   1   3   4   2   0   2   4   1   0    2.7143
i11h    9  14  25  13   9   6   6   –   –   –   –   –   –   –   12.7143
i12h    8  14  33  15   5   8   0   –   –   –   –   –   –   –   11.7143
i13h    9  13  24  14   8   6  11   –   –   –   –   –   –   –   12.1429
i14h    –   –   –   –   –   –   –   8   7   5  18  16  19   0   10.4286
i15h    –   –   –   –   –   –   –  14   9   5  23  24  20   0   13.5714
i16h    –   –   –   –   –   –   –  10   8   3  21  11  15   0    9.7143
i17h    –   –   –   –   –   –   –  10   5   3  20  12  14   0    9.1429
i18h   10  15  23  22  10   5  12  10   2   6  18  13  15   0   11.5000
Fig. 11.9. Sentence-by-sentence LD scores expressed as a percentage of sentence length
Important for our purposes is how these results translate into possible measures for detecting misuse of FOMT by students: in the practical situation, teachers will be faced with just one text – the student’s work – not two. Assuming that the teacher has obtained the MT text with which to compare the student’s work, then if we focus on the three best measures, results suggest that for HL, a text with 50% or higher HL coincidence is suspicious. For LD, we should measure LD divided by sentence length, in which case scores below 50% indicate probable plagiarism. And a BLEU score of 0.4 or higher would seem to indicate a similarity level likely to be the result of misuse.
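Pulling the three recommended indicators together, a teacher-facing check might look like the sketch below. It reuses the illustrative helpers defined earlier in this section, and the thresholds are simply the ones suggested in the preceding paragraph (HL coincidence of 50% or more, length-normalised LD below 50%, an n-gram score of 0.4 or more); since the n-gram helper is only BLEU-flavoured rather than BLEU proper, the last threshold carries over only loosely.

```python
def looks_like_fomt_misuse(student_sents, mt_sents):
    """Flag a student text as suspiciously close to the MT output if any of
    the three indicators crosses its suggested threshold (illustrative only)."""
    student_tokens = [w for s in student_sents for w in s.split()]
    mt_tokens = [w for m in mt_sents for w in m.split()]

    # HL overlap, taking the MT output as the "source" text (a choice of
    # direction not spelled out in the chapter).
    hl, _, _, _, _ = word_count_measures(mt_tokens, student_tokens)

    # Per-sentence edit distance as a percentage of the MT sentence length,
    # averaged over the aligned sentence pairs.
    ratios = [100.0 * word_edit_distance(s.split(), m.split()) / len(m.split())
              for s, m in zip(student_sents, mt_sents)]
    norm_ld = sum(ratios) / len(ratios)

    bleu_like = ngram_overlap_precision(student_tokens, mt_tokens)

    return hl >= 50.0 or norm_ld < 50.0 or bleu_like >= 0.4
```

Such a flag would only ever be a prompt for the teacher to look at the work, in line with the point made above that signalling suspicion is all that is required.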
11.6 Summary and Conclusions

The name of Yorick Wilks is certainly associated with MT, and this chapter has explored a number of aspects of MT in its latest guise, freely available online. We have suggested that web-page designers should be better educated about what MT will do to their web pages when they invite readers to click on the link to FOMT, and it would be a good idea if there were general guidelines, written in lay users' terms, about how to improve the translatability of one's web pages.

Lay users might, quite reasonably, want to evaluate FOMT systems, either to see which one is best, or to see if their web page is being translated well enough. Various automatic metrics for evaluation have been suggested and have gained popularity in the MT world, but most of these require "model" translations against which the MT output is compared. This is not helpful for the lay user however, who often cannot lay their hands on a model translation (and if they could, why would they want to use MT then?). Intuitively, and following the advice of some commentators, many users resort to translating their text into and then back out of the target language in what we termed above round-trip translation. MT researchers have always assumed that RTT was a flawed evaluation method, for various reasons, and we have reported here two experiments to demonstrate this. The results quite clearly show that indeed RTT is unreliable, since a very badly translated text can lead to a remarkably good back-translation, probably due to the fact that MT systems resort to literal translation when faced with input that they cannot analyse. It has to be admitted that evaluating RTT with an automatic measure such as BLEU might be missing some of the story, since we noticed, during the second experiment reported above, that BTs for easy texts were significantly more readable than those for hard texts, so a new evaluation based perhaps on human judgments of readability might be called for.

If BLEU is not entirely appropriate for judging RTTs, it is nevertheless one of a set of three measures that we can recommend for identifying the special case of plagiarism which is the misuse of FOMT by language learners. In a reflection of the work by Wilks and colleagues on plagiarism detection, we confirmed that BLEU score, Levenshtein Distance and measures of relative incidence of hapax legomena form a good basis for identifying when a student has used FOMT to do their homework. The interesting philosophical issue of how (or indeed whether) a student should be penalised for so doing is another matter.
But suppose MT quality was of a sufficiently high level that we could confidently recommend its use for a wide range of tasks. If this were the case, should we not then be teaching the use of MT to language learners, and encouraging them to use this tool where appropriate? While high-school students of Yorick Wilks’s and my generations were routinely taught in maths lessons how to use logarithms and a slide rule (even if as computational linguists we later had little use for either), these devices are no more than a curio to today’s students, for whom a quite sophisticated calculator, and knowledge of how and when to use it, is essential. Perhaps one day soon the same will be true of language students and MT.
References Anderson, D.D. (1995) Machine translation as a tool in second language learning. CALICO Journal 13(1):68–97. Anon. (2003) More machine translation: Fun with computer generated translation! Biomedical Translations, News, October 2003. www.biomedical.com/news.html. Anon. (2005) Gotcha!: Translation software. Software that translates text from one language to another may be a big help—or hindrance—to businesses and relief agencies alike. Baseline, May 2, 2005. www.baselinemag.com/article2/0,1397,1791588,00.asp. Baayen, R.H. (2001) Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers. Babych, B. and A. Hartley (2004) Selecting translation strategies in MT using automatic named entity recognition. In 9th EAMT Workshop “Broadening Horizons of Machine Translation and its Applications”, Valletta, Malta, 18–25. Bernth, A. (1999a) EasyEnglish: A confidence index for MT. In Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation, TMI ’99, Chester, England, 120–127. Bernth, A. (1999b) Controlling input and output of MT for greater user acceptance. In Translating and the Computer 21, London, [pages not numbered]. Bernth, A. and C. Gdaniec (2002) MTranslatability. Machine Translation 16, 175–218. Bernth, A. and M. McCord (2000) The effect of source analysis on translation confidence. In J.S. White (ed.), Envisioning Machine Translation in the Information Future: 4th Conference of the Association for Machine Translation in the Americas, AMTA 2000, Cuernavaca, Mexico, , Berlin: Springer, 89–99. Brin, S., J. Davis and H. Garcia-Molina (1995) Copy detection mechanisms for digital documents. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, 398–409. Clough, P. (2003) Old and new challenges in automatic plagiarism detection. JISC National Plagiarism Advisory Service, Newcastle-upon-Tyne, Available online at http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf. Clough, P., R. Gaizauskas, S.L. Piao and Y. Wilks (2002) METER: MEasuring TExt Reuse. In ACL-02: 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 152–159. Doddington, G. (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In HLT 2002 Human Language Technology Conference, San Diego, CA.
Farwell, D. and Y. Wilks (1991) ULTRA: A multilingual machine translator. In Machine Translation Summit III Proceedings, Washington, DC, 19–24.
Flanagan, M. (1996) Two years online: experiences, challenges and trends. In Expanding MT Horizons: Proceedings of the Second Conference of the Association for Machine Translation in the Americas, Montreal, Canada, 192–197.
Gaspari, F. (2004a) Integrating on-line MT services into monolingual web-sites for dissemination purposes: An evaluation perspective. In 9th EAMT Workshop "Broadening horizons of machine translation and its applications", Valletta, Malta, 62–72.
Gaspari, F. (2004b) On-line MT services and real users' needs: An empirical usability evaluation. In R.E. Frederking and K.B. Taylor (eds.), Machine Translation: From Real Users to Research, LNAI 3265, Berlin: Springer, 74–85.
Gdaniec, C. (1994) The Logos translatability index. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, 97–105.
Heintze, N. (1996) Scalable document fingerprinting. In Proceedings of the Second USENIX Workshop on Electronic Commerce, Oakland, California.
Helmreich, S., L. Guthrie and Y. Wilks (1993) The use of machine readable dictionaries in the Pangloss project. In Building Lexicons for Machine Translation: Papers from the AAAI Spring Symposium, Stanford University, CA, 63–68.
Krug, S. (2000) Don't Make Me Think: A Common Sense Approach to Web Usability. Indianapolis, IN: New Riders.
Langlais, P., S. Gandrabur, T. Leplus and G. Lapalme (2005) The long-term forecast for weather bulletin translation. Machine Translation 19:83–112.
Levenshtein, V.I. (1965) Dvoichnye kody s ispravleniem vypadenii, vstavok i zameshchenii simvolov. Doklady Akademii Nauk SSSR 163(4):845–848. Appeared as: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 (1966):707–710.
Lyon, C., J. Malcolm and B. Dickerson (2001) Detecting short passages of similar text in large document collections. In 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Pittsburgh, PA, 118–125.
Macklovitch, E. (2001) Recent trends in translation technology. In Proceedings of the 2nd International Conference, The Translation Industry Today: Multilingual Documentation, Technology, Market, Bologna, Italy, 23–47.
McCarthy, B. (2004) Does online machine translation spell the end of take-home translation assignments? CALL-EJ Online 6.1. Available at www.clec.ritsumei.ac.jp/english/callejonline/9-1/mccarthy.html.
Miyazawa, S., S. Yokoyama, M. Matsudaira, A. Kumano, S. Kodama, H. Kashioka, Y. Shirokizawa and Y. Nakajima (1999) Study on evaluation of WWW MT systems. In Proceedings of MT Summit VII "MT in the Great Translation Era", Singapore, 290–298.
Nielsen, J. (2000) Designing Web Usability: The Practice of Simplicity. Indianapolis, IN: New Riders.
Niño, A. (2004) Recycling MT: A course on foreign language writing via MT post-editing. In 7th Annual CLUK Research Colloquium, Birmingham. [pages not numbered]
Nunberg, G. (2005) Letting the Net speak for itself: Fears of an 'anglo-saxon' takeover of the online world are unfounded. San Jose Mercury News, April 17, 2005, available at http://www-csli.stanford.edu/~nunberg/weblg.html.
O'Brien, S. (2005) Methodologies for measuring the correlations between post-editing effort and machine translatability. Machine Translation 19:37–58.
O’Connell, T.A. (2001) Preparing your web site for machine translation: How to avoid losing (or gaining) something in the translation. IBM website, www-128.ibm.com/ developerworks/web/library/us-mt/. Papineni, K., S. Roukos, T. Ward and W. Zhu (2002) BLEU: A method for automatic evaluation of machine translation. In ACL-02: 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 311–318. Richmond, I.M. (1994) Doing it backwards: Using translation software to teach targetlanguage grammaticality. Computer Assisted Language Learning 7:65–78. Shivakumar, N. and H. Garcia-Molina (1996) Building a scalable and accurate copy detection mechanism. In DL’96: First ACM Conference on Digital Libraries, Bethesda, MD. Somers, H. (2001) Three perspectives on MT in the classroom. In MT Summit VIII Workshop on Teaching Machine Translation, Santiago de Compostela, 25–29. Somers, H. (2005) Round-trip translation: What is it good for? In Australasian Language Technology Workshop 2005, Sydney, Australia, 127–133. Somers, H., F. Gaspari and A. Niño (2006) Detecting inappropriate use of free online machine translation by language students – A special case of plagiarism detection. In 11th Annual Conference of the European Association for Machine Translation – Proceedings, Oslo, 41–48. Turian, J.P., L. Shen and I.D. Melamed (2003) Evaluation of machine translation and its evaluation. In MT Summit IX: Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 23–28. Underwood, N. and B. Jongejan. (2001) Translatability checker: A tool to help decide whether to use MT. In Proceedings of MT Summit VIII: Machine Translation in the Information Age, Santiago de Compostela, Spain, 363–368. Wilks, Y. (1973a) An artificial intelligence approach to machine translation. In R.C. Schank and K.M. Colby (eds.), Computer Models of Thought and Language, San Francisco: Freeman, 114–151. Wilks, Y. (1973b) The Stanford machine translation project. In R. Rustin (ed.), Natural Language Processing, New York: Algorithmics Press, 243–290; repr. in S. Nirenburg, H. Somers and Y. Wilks (eds.) (2003) Readings in Machine Translation, Cambridge. MA: MIT Press, 371–390. Wilks, Y. (1975a) An intelligent analyzer and understander of English. Communications of the ACM 18:264–274. Wilks, Y. (1975b) Preference semantics. In E. Keenan (ed.), Formal Semantics of Natural Language, Cambridge: Cambridge University Press, 329–348. Wilks, Y. (1978) Comparative translation quality analysis (Final report F-33657-77-C-0695), Latsec Inc., La Jolla, CA. Wilks, Y. (1990a) Form and content in semantics. Synthese 82:329–351. Wilks, Y. (1990b) Themes in the work of Margaret Masterman. In P. Mayorcas (ed.), Translating and the Computer 10: The Translation Environment 10 years on, London: Aslib, 148–160. Wilks, Y. (1992a) SYSTRAN: It obviously works, but how much can it be improved? In J. Newton (ed.), Computers and Translation: A Practical Appraisal, London: Routledge, 166–188. Wilks, Y. (1992b) Pangloss: A knowledge-based machine assisted translation research project – Site 2, In Speech and Natural Language: Proceedings of a Workshop, Harriman, New York, 280.
Wilks, Y. (1992c) Stone soup and the French room: The empiricist-rationalist debate about machine translation. Talk given at TMI 92, Montreal. First published (1993) as Memorandum in Computer and Cognitive Science MCCS-93-255, Computing Research Laboratory, New Mexico State University, Las Cruces, NM; repr. (1994) in A. Zampolli, N. Calzolari and M. Palmer (eds.), Current Issues in Computational Linguistics: In Honor of Don Walker, Pisa/Dordrecht: Giardini/Kluwer, 585–595.
Wilks, Y. (1994) Developments in machine translation in the US. Aslib Proceedings 46:111–116.
Wilks, Y. (2000) Margaret Masterman. In W. J. Hutchins (ed.), Early Years in Machine Translation: Memoirs and Biographies of Pioneers, Amsterdam: John Benjamins, 279–297.
Woolls, D. and M. Coulthard (1998) Tools for the trade. Forensic Linguistics 5:33–57.
Yang, J. and Lange, E.D. (1998) Systran on AltaVista: A user study on real-time machine translation on the Internet. In D. Farwell, L. Gerber and E. Hovy (eds.), Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas, AMTA'98, Langhorne, PA, Berlin: Springer, 275–285.
12 Semantic Primitives: The Tip of the Iceberg

Karen Spärck Jones
Computer Laboratory, University of Cambridge, Cambridge, UK∗

Abstract:
Semantic primitives have been central to Yorick's approach to language processing. In this paper I review the development of his ideas on the nature and role of primitives, considering them both from the narrower system point of view and in the larger context to which Yorick himself always referred.
12.1 Introduction: Semantic Primitives

((∗ ANI SUBJ)(((FLOW STUFF)OBJE)((SELF IN)(((WRAP THING)FROM)(((MAN PART)TO)(MOVE CAUSE)))))

I want to revisit Yorick's 1983 question: "Does anyone really still believe this kind of thing?" (Wilks 1983a). Yorick in 1983 argued for the kind of thing illustrated by the formula above as a semantic tool for resolving what others would describe as awkward syntactic problems. Here, to see what a contemporary answer might be, I will look again at how Yorick's ideas about "this kind of thing" developed and what they may say to us now. My focus is thus on semantic primitives and their intimate relationship with word sense disambiguation, as well as ambiguity resolution more generally. But semantic primitives are the tip of a large iceberg with natural language processing and its tasks in the upper layers, and the philosophy of language and its ramifications in the lower ones, a continuity illustrated by the connection between primitives, interlinguas for translation, and the language of thought. The iceberg has many other component lumps: discourse structure, and metaphor, for example, and Yorick has always sought to relate his work on automatic language processing with linguistic theory on the one hand and philosophy on the other. Thus Wittgenstein and Quine are invoked as philosophical supports for his position, and language processing, its needs, and his strategies for it are brandished as tools to attack theoretical linguists of the Chomskyan or Generative Semanticist schools. I will concentrate on Yorick as a language processor, so to speak, but also comment more briefly on these other facets of his work.
∗ I am grateful to Ted Briscoe for comments.
I will begin with Yorick’s first substantial paper, “Computable semantic derivations” of 1965. This presented many of his basic ideas, which were developed, especially for computational experiments, in a phase of research cumulatively represented in Wilks (1972a), with philosophical and theoretical amplifications also illustrated by Wilks (1971) and Wilks (1975a). Then, though much remained the same, there were some changes, especially under the growing influence of research on artificial intelligence, rather than the earlier machine translation: this phase can be roughly taken up to Wilks (1977a), which offers some theoretical discussion to amplify system-centred papers like Wilks (1975b). Some further shifts are signalled in Wilks (1978), and amplifications in Fass and Wilks (1983) and Wilks (1983b), followed by a much more substantial change of direction as signalled by Wilks et al. (1987), and developed in Wilks et al. (1996). My focus will be on Yorick’s work up to 1983, and I will consider his later machine dictionary work, as represented by Wilks et al. (1987, 1989) and Wilks et al. (1996), primarily as a comment on the earlier ideas.
12.2 Computable Semantic Derivations

It is important to put this work of Yorick's in its historical context. Research in automated natural language processing at the Cambridge Language Research Unit (CLRU) (see especially Masterman 2005) began in the 1950s and focused from its very beginning on semantic issues, specifically as presented by machine translation though it also sought to relate translation to the use of language in general. Thus for translation, how were the senses of words in the source language to be identified and characterised so as to guide the choice of words (senses) for the target language? The initial model used a semantic classification – a thesaurus like Roget's – to categorise word senses, and the notion of class recurrence or recurrent class association over a text as a device for selecting individual word senses. Syntax was taken for granted, rather than dismissed out of hand. The thesaurus could be seen as an interlingua, so the selected class characterisations of the words of the source text constituted the device transmitting source word meanings, or at least the essential elements of source word meanings, to the target language generator. Each thesaurus class could be taken to represent, or embody, a semantic primitive. Such a simple model, however engaging its simplicity, was also manifestly inadequate as a strict interlingual model, i.e. one in which there is no presumption that source word senses, once identified, point straight to specific target language equivalents. It was succeeded, for experimental purposes, by an interlingual language, NUDE (Spärck Jones 2000), which made use of a much smaller set of primitives and characterised word meanings not by individual primitives or, in principle, sets of primitives, but as syntactically structured formulae. The members of the CLRU were sufficiently serious to engage in a non-trivial lexicographic exercise, providing dictionary entries for a general vocabulary using NUDE. But the all-important issues of how exactly the dictionary entries were to be applied in conjunction with syntactic information and, as would certainly be required, to intersentence relationships, had
been only partly addressed before Yorick tackled them, and the latter much more, by Masterman, than the former. Yorick’s key contribution was to address these crucial, and interrelated issues much more thoroughly and to show, through computational experiments, that he could get something to work (see, e.g. Chapter 3 in Wilks 1972a). This work led, in turn, to further development of the structure of the ground-level lexical formulae. Yorick’s strategy was to enhance the scope and role of semantic patterns that defined message or discourse structures. The basic patterns, templates, exploited fundamental “actor/action/object” relationships with actor, action and object each defined by semantic primitives that figured as the dominant or head primitive in corresponding lexical formulae. This may seem an obvious idea now, but within the then context of machine translation research, with its primary emphasis on syntactic analysis, was extremely revolutionary. The initial application of these ideas in Wilks (1965) was nevertheless very simple: text was fragmented e.g. at prepositions, and fragments were annotated by template primitive triples that could be mapped on to them. Sequencing and compression rules then helped to select templates and template chains characterising more extended text. These rules embodied obvious ideas about repetitive cohesion, and exploited the full formulae for lexical items dependent on the triples. The general model was of gradual convergence on the sense of a text over successive fragments. The paper describes, in extremely opaque detail, actual codings using such familiar primitives as MAN, THING, WHERE, WHEN, BE, MOVE along with other perhaps less familiar ones like TRUE and SIGN, templates like MAN WHERE BE and MAN KIND BE, and individual word formulae like ((NOT TRUE : LIFE) : (MAN : KIND)) for “ill” and ((LIFE : STUFF)/(MAN/IN) : DO)) for “eat”, along with some computational experiments which show, realistically, how much apparatus is required to deal even with tiny texts. Some general ideas characteristic of Yorick’s subsequent work already appear in this early paper. First, working with minimal syntax: template slots were associated with conventional syntactic categories, and the fragmentation made implicit use of natural constituents and their internal word order, but this was far from fullblown syntax. Second, templates characterised the form of expressions of facts, not facts themselves. Third, templates were devices for resolving word, and higher unit, ambiguity in context. Fourth, meaning representations are what remains as the semantic structure of the resolved text as a whole. Fifth, there is no notion of correctness with respect to a text, just as there is no notion of “the” legitimate senses of a word. Rather, there is what suffices for the task to hand, e.g. machine translation. Finally, and most importantly, that any text processing has to deal, as a matter of course and with its core mechanisms, with new uses of words: there has to be a way of responding to new, extended word uses, and to novel variations on familiar message forms. This is not however illustrated, but stated as an immediate research goal. The unifying notion linking these various features of Yorick’s approach is that of preference: there are no absolutes about the semantic relations between text components, only preferences, which may be more or less fully satisfied.
12.3 Preference Semantics: Design, Development and Defence I have elaborated on Yorick’s initial paper, because his work for the next two decades is a response to the exigencies of making it work and, also, an expansion on its theoretical positions, for example in oppposition to the Chomskian primacy of syntax. Wilks (1967), for instance, describing experiments in processing questionanswer pairs as mini-texts, again emphasises, on the one hand, the notion that his apparatus is essentially heuristic but also, on the other, the two theoretical points, first that one should talk of achieving a text interpretation, rather than identifying the text’s correct meaning and second, and significantly, that the basic device for doing this, the semantic primitives, are themselves only words and behave like words: thus they can be individually ambiguous, though the complex formulae built from them are not. Wilks (1972a) describes further computational experiments using the apparatus just summarised to process paragraph-length texts (including ones from major philosophical works), and also to deal with new or extended senses of words, essentially by projecting sense characterisations from “old” word senses in the text to the new ones so that template preferences can be satisfied. 12.3.1 Formula and Templates, and Paraplates and Inference Rules However the main development of Yorick’s initial apparatus was to allow for much richer discourse structure, both by dealing with the attachment of subordinate text units, like adjectives, to their nominals and also, much more importantly, by introducing higher-level patterns designed to combine templates. These paraplates characterised template pairs connected by case-type relations as signalled by, e.g., prepositions. The paraplates offered a much tighter and more explicit account of the semantic relations between words and word groups. The basic templates of course covered the primary agent/object relations but the paraplates extended case ties to, for instance, TO, LOCA, POSS and GOAL. Deploying all these interpretative components depended on the word formulae which had a complex internal structure with sub-formulae signalling semantic properties of sought case “relatives”, on the use of classes of semantic primitives with shared behaviours, and so forth. Thus we now have a formula like ((∗ ANI SUBJ)(SIGN OBJE) (((MAN SUBJ)SENSE)CAUSE))))) for “singing”. Yorick’s use of case ideas, quite familiar now, and figuring for linguists in Fillmore (1968), were innovative in the computational context. This extended range of semantic resources was nevertheless still insufficient to resolve ambiguity, and was therefore further supplemented by common sense inference rules, essentially rule-based ways of unpacking formulae to make implicit pattern information explicit in template form and hence available for further matching. Common sense inference does refer to (familiar and recurrent features of) the world but does not operate directly on a specific world model and is primarily expressed in conventional linguistic forms. The new linguistic objects that the inference rules produce help to lead to the interpretation of a text as that given by
the densest network of preferences linking one text component with another. This interpretation is manifest in the sequence of selected (and presumably implicitly or explicitly filled out) word sense formulae in the primitive-based meaning representation language. The details of Yorick’s approach to text interpretation, along with the fact of its computational implementation and some successful or at least plausible processing outputs, were used to buttress Yorick’s theoretical arguments against linguists and philosophers of language in, for example, Wilks (1972b) and Wilks and Schank (1974) as well as Wilks (1971) and (1975a). His theoretical position was equally an up-market endorsement, indeed ground for, his computational practice. Thus in Wilks (1971) he makes an attack on grammaticality as not a valid independent property of texts: all that counts is meaningfulness, which in turn is having one of several possible interpretations, implying sense resolution in context and hence meaning for a (coherent) text as a whole. Such an interpretation is a linguistic interpretation, in the sense of being in the same or another language, so text objects are alternatives for one another. Thus one might substitute a dictionary definition as a paraphrase for a word in a text, or a translation of a text. This is the name of the primary or routine meaning game in the Wittgensteinian and Quinean tradition, anti denotational semantics, with even referential “bottoming out” never certain. Similarly Wilks and Schank (1974) argues that linguistic (and we may say specifically semantic) theory is not to do with grammaticality (or decidability) as properties of sentences, and about competence in being correct on this. Because meaningfulness springs from a dynamic process of text interpretation, and in particular because meanings for new word senses have to be constructed, a semantic theory has to be grounded on a substantive notion of performance. In Wilks (1975b, c) Yorick gives further detailed, and computationally oriented, accounts of his system. The detail, especially on how the various components individually and collectively work, are nothing like as crude as my summary might suggest. The lexicon, with 600 entries, was far from negligible for the time (or indeed some time after). There is a more detailed treatment of translation, into French (see also Herskovits 1973), guided by a surprisingly specific generation apparatus using stereotypes for French words attached to paraplates. However there does not appear, in general, to be any guidance in output from the specific input words, i.e. the formulae etc. are a true interlingua rather than only a vehicle for selecting from alternative output choices for particular input lexical items. One of the puzzles of these accounts, and the relatively limited and repeated examples given in the papers of the period, is how the interpretive system can actually capture some of the finer grain of the input, for example the specific determiners used. One must suppose, since everything is lexically driven, but primitives like THIS may not be sufficient for unequivocal surface determiner selection, that there has to be a perfectly good conventional bilingual dictionary showing equivalent word forms which, together with the final primitive representation, does lead to appropriate translations including, for French, the choice of “le” or “la” as article. 
In Wilks (1975c), Yorick also indicates how his basic apparatus of formulae, templates and paraplates, by characterising stretches of text and showing semantic
relations, can help to resolve anaphors, for example by unpacking formulae to exhibit case ties. The suggestion that Yorick’s basic system can be extended to deal with such demanding phenomena is all the more surprising given what appears to have been a concurrent non-trivial change to the system underpinnings, namely the disappearance of the conventional syntactic category information that figured in the initial version. Yorick’s claim is that syntax can be done on the fly and (apart from the rather limited syntactic notions marked by SUBJ and OBJE) is, as it were, finessed by the direct semantic relationships given by the templates and so forth. This is all also, as Yorick emphasises in Wilks (1975c), within a philosophical framework that not only, in terms of a controversy of the time, makes him a proceduralist rather than a declarativist about meaning, but more specifically bases processing on an “inertial” or “laziness” principle. In these accounts of Yorick’s approach, semantic primitives embody generic concepts shared by many words and as such facilitate key text processing operations. Here, and previously, he maintains that his vocabulary of primitives is stable and, while not claiming that his sets of templates, paraplates or common sense inference rules are equally stable and limited, he does claim, while recognising that his actual testing has been rather modest, that the sets that are required to serve language interpretation and generation tasks will be small. Then again, in Wilks (1975d and 1977a), he explores the wider theoretical context of his work, with particular reference to primitives. Thus in (1975d) he returns to the fact that primitives are only words, albeit in a “small” natural language, and that there is therefore nothing improper about the idea that for convenience any actual (e.g. English) words can figure in text representations, as long as they do have formulae. Thus one might elaborate a text representation with a rather particular word like “aeroplane”, apparently far more specific in meaning than, say, THING, as long as “aeroplane” has its own THINGy formula. Yorick argues that there is no more circularity about this than about any dictionary, and notes that a detailed study of Webster’s dictionary showed some rather generic words recurring in many defintions, functioning like primitives. This position is different from the Katz and Fodor one with markers subsuming distinguishers. Wilks (1977a) is a more substantial discussion of the pros and cons of semantic primitives, or rather of what Yorick believes is their essential character and hence rational justification, contra others including Katz and Fodor, and Putnam, for example. Thus he offers criteria for (a set of) primitives: being finite, comprehensive, independent, non-circular and non-reducible. Further, a set of primitives (with an application syntax) is a reduction device yielding a semantic representation for a natural language via a translation algorithm which is not plausibly explicated by other entities of the same type (p. 184). But aren’t they really things of another type? Mental forms, for example? How could one know? Or terms in a model-theoretic formal language, Markerese? But model theorising is not what natural language processing is all about. Guaranteeing, say, that “seek” (or “seek1”) is equivalent to “TRY TO FIND” is a lost cause for natural language. 
Yorick’s position is essentially that trying to capture natural language semantics via meaning postulates is trying to bag a black cat: when you open the bag there is only a Cheshire grin.
Semantic primitives are nothing to do with analyticity, or with stereotypical facts: the former is too restrictive and the latter too demanding for a necessarily reduced representation language. Primitives constitute a language for the description of meaning (p. 191) and are thus, as mentioned earlier, no more univocal than any other language. Yorick’s position is that (both non-case and case) primitives are a useful organising principle for a natural language processing system because they allow helpful generalisations to be made, as operationally convenient: he endorses Sampson’s analogy of primitives as like English pound notes with promises to pay that mean the notes can be turned into something else but never into actual gold. Just as there is no gold that English pound notes can be turned into, there is no conceptual substance that primitives can be turned into. As Goodman had earlier claimed, a primitive representation language is just that, a language, among many others, with no absolute status. Of course the machine has to know the language, or rather we as its program writers do, but this is no different from dealing with any other language, formal or informal. Yorick further argues that when studies of large conventional dictionaries show that some words feature conspicuously as recurrent defining terms for other words, they are organisational devices because, though they themselves have dictionary definitions, these definitions are mere empty gestures, not substantive ones. It is worth noting that where I have just referred to language processing systems, Yorick during the 1970s referred to language understanding systems: this is somewhat at variance with his general stance on primitives, and is perhaps attributable to the fashionable terminology of the period, or at any rate should be interpreted as “understanding sufficiently for the task in hand (e.g. translation)”. The ultimate justification for some language of primitives is thus whether it works for some language processing purpose, on some suitable test of working. In Wilks (1977b) and (1978), Yorick further extended his apparatus to deal with two problems of practical language processing, and in so doing further amplifies the points just made. In (1977b) he grapples with the need to exploit different underlying causaltype relations, and specifically to distinguish between “cause-of” and “reasons-for”, to link different parts of a text. He maintains that this can be done, within the overall preference framework, by classifying inference rules so that, essentially, they operate in one semantic direction for CAUSE and the other for GOAL. The result is rather complicated and it is far from clear that what is illustrated for a few examples will scale up satisfactorily. 12.3.2 Pseudo-texts, and Metaphors Wilks (1978) is a more direct development of the line Yorick took in Wilks (1977a) because it enriches the language-like characteristics of his approach, and of the mechanisms it supports, to deal with the key fact of ordinary language use, namely the continuous appearance of new word senses. It is also Yorick’s respose to the contemporary interest in organised bodies of knowledge about the world (frames etc.). Thus Yorick’s proposal is to import another class of primitive-based
resource, pseudo-texts, that are frame-like objects that encapsulate detailed world knowledge and can be invoked to enlarge any existing text representation using templates, paraplates etc. and, specifically, make possible the connective inferences that provide an interpretation for new word senses. The particular point of interest is the way that Yorick’s existing apparatus, relying wholly on primitives, can be connected with pseudo-texts, which may need to be much more specific: many words for types of weapon will all have the same formula, for example, but a particular text may require some more specific characterisation of guns to be understood which the “gun” pseudo-text supplies. Yorick’s way of bridging this gap is to exploit the thesaurus idea. Thus he notes that primitive formulae may be taken as imposing a structure on a thesaurus in that the individual primitives within a formula context can point to word classes in a thesaurus. This imposed structure is richer than the normal simple hierarchy in a thesaurus like Roget’s (just as Yorick’s formulae are themselves more complex than sets of class labels for word senses). However the thesaurus class hierarchy is also exploited to enhance the formula-based apparatus. Thus the classes pointed to may be whole head classes, or subheads, or fine-grained bottom level synonymous word sense sets, or rows. Any individual word (sense) can be invoked as long as it has a formula, and so implicitly substituted in a template etc. Word classes can also be invoked since there are formulae, with inclusion relations, for the word sets at different levels within a thesaurus head, bottoming out at row level. (It is not clear whether these formula are inferred from shared parts of bottom-level formulae or are specifcally constructed.) The pseudo-texts embodying word-related facts associated with particular words or word classes are the same sort of primitive-based pattern structures as ordinary text representations, so when they are invoked through the words or word classes that can fill slots in the initial text representation, they provide more template, etc. patterns across which inferences can be made. The presumption is that all this invocation will occur only when existing representation patterns clash with the available preferences. The operation is envisaged as an inferential projection from the pseudo-text onto the actual text representation, supplying new patterns which are near enough those sought to do as interpretations: thus if we cannot interpret “drink” in “My car drinks gasoline” directly, we can get a good enough interpretation by importing “use” from the pseudo text for “car”. As this suggests, Yorick’s apparatus is becoming increasingly baroque. He admits that there is no implementation, and that there are important issues like control to tackle. His claim is that his approach combines coherence in the use of the same kind of primitive-based representation across many different components with flexibility through need-driven and preference-based use of these components, but the evidence that this strategy actually works is missing. Yorick’s last major paper on the line just laid out was Fass and Wilks (1983). Much of the apparatus, and illustrated application, is as before, with a discussion of “My car drinks gasoline” as a metaphor and comparison between the way it is treated in Preference Semantics and in other approaches to metaphor. However there are some differences. Thus in considering how to handle metaphor as a normal feature
of language use, the paper again argues for the relative preferences approach as a better way of accommodating “ill-formed” input than fixed semantic constraints. But the paper flags a more explicit emphasis on the forms of semantic information used as types of dictionary information. Thus it argues that dealing with metaphors is better done by allowing some weakening (by generalisation or partial matching) of either the core elements of dictionary definitions, or their relational conditions, than by invoking the frame-like pseudo-texts referring to world knowledge of Wilks (1978). The notion is that the character of the metaphor is displayed by the nature of the modification. 12.3.3 Preference Semantics’ Claims Fass and Wilks (1983) also marks another shift, and is of particular significance here, apart from its concern with metaphor, for two reasons. First, it states, quite specifically, that Preference Semantics is not a set of programs, but a set of principles or claims. It is thus a system only in an abstract sense, to be justified less by actual implementational achievements, than by the merit of its general claims and by concrete illustrations of the way it works. Yorick comments, more than once (e.g. in Wilks 1975c), that there was an implementation, with a 600-word dictionary, that did process a range of English paragraphs successfully, by translating them into French. There is thus more support for his illustrations than what might be called the usual theoretical linguists’ style of illustration. But there is no commitment to making a serious application system work, come what may. Second, Fass and Wilks lists the claims that Preference Semantics makes: summarising, these are 1. there is no syntactic module; 2. the semantics are not model theoretic, and quantification just needs some special procedures; 3. everything is procedural, and generally so, operating under a least-effort principle; 4. there is some privileged set of semantic primitives; 5. text representation is linear and primarily surface-text sequenced; 6. the representation of a text is the best fit among competitors; 7. hence ill-formedness is only relative not absolute, and does not preclude interpretation. There is some irony in the fact that these claims appear at what was in fact the end of Yorick’s work on Preference Semantics as we knew it, given not only that some claims are not supported by much evidence, e.g. that on quantification, and that, as we shall see, some concurrent attempts to exploit Preference Semantics ran into difficulties. But as against this, a fairer view is that Yorick’s move in the 1980s to concentrate on automating lexicon construction was the correct strategy: even without a commitment to building lexicons for practical natural language
processing systems and to cost-effective methods of doing this, it is necessary, for intellectual credibility, to show that it is possible to build a non-trivial lexicon so that a lexically-based approach to text interpretation can be independently tested. From this point of view, Yorick’s move was reculer pour mieux sauter. But Yorick’s 1965–1985 version of Preference Semantics then vanished from sight. I will return to its reappearance, or reincarnation, later. Its retirement, at least into the wings, in the mid 1980s is not wholly surprising: quite apart from the scaling up challenge, and the fundamental question of whether Preference Semantics was indeed the right way to approach semantic interpretation, there were other factors in play. One was the enthusiasm, from the late 1970s through the 1980s, for the logico-grammaticist approach to language processing and meaning representation, with its emphasis on logical form, its more limited interpretation of “semantic”, and its sharper distinction between semantic, in this sense, and pragmatic. Thus there was both more concern with issues like the treatment of quantifier structures, which Yorick in practice ignored, and a harder line about what constitutes the genuinely linguistic information about word meanings to be embodied in dictionary, i,e, system lexicon, entries. Yorick’s apparatus indeed depended on the lexical formulae for words, but also on the other pattern sets. In the more lexicalist approach of the 1980s, the narrowly linguistic share of the pattern information tended to be dispersed in the form of constraints on individual words contained in their lexicon entries. These entries, expressed in feature set form, indeed made use of general semantic categories and were typically supported by subsumption hierarchies, and so implicitly involved general patterns. But even if there were elements of common linguistic description behind such thoroughly different “notations”, there was also a much more important and radical difference between Yorick’s approach and the then dominant ones in the key role of syntax. Processing within the grammaticist approach was dependent on and driven from syntax, even if the eventual meaning representations were not conventional syntactic parse trees. The grammatico-logicist approach also led to “deep” representations with structures rather far removed from the surface text. As Yorick commented in Wilks (1983b), “deep” and “superficial” (aka surface) are complicated notions. In his case the primitive formulae are deep but his text representations, with their sequences of templates, are actually shallow.
12.4 Applications

These competing developments were reinforced by the attempts others made to exploit Yorick's approach. His own experiments and implementations were not so compelling as to lead many others with natural language processing interests to follow him, or to supply exportable technology; but it is worth commenting on Boguraev's work, first on ambiguity resolution per se and then on the database query task, as applications and developments of Yorick's key ideas (Boguraev 1979, Boguraev and Spärck Jones 1983).
The main features of Boguraev’s initial work were a return to conventional syntactic parsing as an essential component of processing, and the derivation of a semantic representation as a case-labelled dependency tree over what would nowadays be called predicate-argument or proposition-like constituents. Boguraev’s work was with individual sentences, rather than extended texts. His system used ATN parsing that exploited dictionary entries combining conventional syntactic information with Wilks-type semantic formulae. His concern, like Yorick’s, was with both lexical and structural ambiguity (especially associated with prepositions). Processing was driven off contextual verb frames, with additional semantic patterns much like paraplates, called preplates, for dealing with modifier attachment. Over time Boguraev’s work came to involve a larger set of case labels that Wilks’, which could be supplied from prepositional lexical entries or in other ways. Boguraev’s basic mechanism was, however, the same as Yorick’s, namely the use of semantic preferences. Boguraev’s semantic sentence representations were explored for a more autonomous generation process than Wilks and Herskovits’ translation into French: thus he demonstrated successful ambiguity resolution via paraphrase. At the same time, in the application to database query, the representation could also be used as input for further transformations of an English question into a formal database query. In particular the detailed dependency tree could be used to handle quantified structures in the precise way required by database query, but very doubtfully possible with Yorick’s original system. For both the ways of exploiting his semantic representations that Boguraev studied, the dependency tree structure provided significant leverage. This line of work, which appears to be the most substantial attempt to apply Yorick’s ideas outside his own group, eventually petered out, partly for external reasons, but partly for internal ones. The database frontend work, like other such work elsewhere, came up against the challenge of supplying domain models and the need for robustness in connecting human users with hidden formal data structures. However, as with Yorick’s own work, the business of enlarging the lexicon in an adequate and consistent way led naturally to Boguraev’s own work on ways of exploiting machine-readable conventional dictionaries as sources for processing system lexicons. The difficulties of supplying Wilks-style lexison entries had also led, in another project on text-retrieval requests (Spärck Jones and Tait 1984), to a radical simplification of the formulae, making their semantic content and their relation to syntactic information closer to the conventional one.
12.5 Building Machine Lexicons

Yorick's work on automated lexicon construction was more than a simple response to the data and data processing capabilities that were becoming available at the time. Exploiting existing machine-readable dictionaries as the base for new ones designed for natural language processing systems, or exploiting corpora to extract word behaviour data, is entirely in the spirit of his long-standing views about the
nature of his approach to semantic interpretation. But it also marked at least an apparent change in his detailed approach to semantics. This is fully evident in Wilks et al. (1996). Electric words also provides a retrospective overview, in the context of theories of meaning, of the general (though not specific) approach to semantic representation that Yorick adopted in the work I have described. This approach is offered both as a sound approach to linguistic meaning representation in its own right, and as an appropriate basis for the strategy of building language processing lexicons by bootstrapping from machine-readable dictionaries (MRDs) that is now proposed, to meet the scaling up challenge, as the way forward for the field.

The message in all of this is, again, 1) that word meanings can (in general) only be conveyed by other words, i.e. through some other language which necessarily has the properties of any natural language, like lexical ambiguity; 2) that semantic primitives provide anchoring pegs, or an organising apparatus, for this semantic characterisation; 3) that any set of semantic primitives is the right set only because they work, as a specialised sublanguage, in enabling some language processing task, like translation; 4) that such useful primitives emerge, as their motivating ontology, when an existing dictionary is analysed as a text.

These are very general statements. The point of interest here is precisely what form these semantic primitive-based entries in the language processing lexicon derived from a conventional dictionary are like, along with what this implies for the other system contributors to text processing and for the form of text representation these deliver, say for translation. In fact, of course, the language processing lexicon will not simply rise out of the MRD like Venus from the waves: the process will be more like fishing with some carefully chosen bait; and it will also gain, taking it yet further from the distributional purist's approach to pulling all linguistic units and structures out of running text, by starting from the "preprocessed" text corpus that an existing dictionary text in itself provides.

This is not the place to recapitulate the detail of Yorick's group's work with MRDs: I want only to consider its key points, especially as illustrated in Wilks et al. (1987) and in Electric words, and how these consort with Yorick's earlier Preference Semantic system. Wilks et al. (1987) describes several independently-pursued lines of work, with LDOCE (Procter 1978), but these can all be related to different aspects of Preference Semantics.

First, and most important, semantic primitives survive. But there are far more of them, around 1000 terms identified (through frequency and simplicity) as central in the 2000+ terms of the basic vocabulary used for LDOCE definitions. This is far more than Yorick's original set of less than 100, though the suggestion is that the 1000 can be further reduced. The claim, on the basis of experiment, in Wilks et al. (1987) is that provided these terms themselves are properly, de facto manually, defined, the definitions for the much larger word set in LDOCE can be
automatically rewritten, in bootstrapping cycles, to obtain a derived dictionary with primitive-based definitions. This in turn provides the material for machine lexicon entries which are in frame form, and can in principle be extracted by automatically parsing the dictionary definitions. These frames combine both linguistic knowledge and world knowledge, and also encode case relationships, as well as conventional grammatical information. These frame structures, again experimentally investigated, combine the types of information that in the earlier Preference Semantics apparatus was spread across formulae, templates and paraplates, and pseudo-texts. However the primitives function much more as simple category labels, and the distinctive syntax of Yorick's original word sense formulae appears to have vanished. At the same time the core lexical data could be enhanced in two ways. First, by processing dictionary entries to extract genus hierarchies which would allow generalisation and inference in the way that classes of primitive did in Yorick's original Preference Semantics; and second by processing the dictionary as a corpus to extract word cooccurrence relations and, in principle, classes of words with similar dictionary-text behaviour: these would function in the same way as rows in Yorick's earlier system, as links between individual words with their own distinctive properties and the primitive formula characterisations of word sense classes.1

Similar ideas figured in other research on MRDs in the late 1980s. But quite apart from the political intellectual property obstructions this work ran into, there were far more substantial barriers to progress. The most salient was correct word sense identification in the dictionary entries, i.e. sufficiently reliable sense selection to drive the whole bootstrapping process. But there was also, especially for the present context, the awkward fact that a "good" (even if not perfect) set of semantic primitives has not been found to emerge. Yorick's view that you cut your language processing primitive cloth for your application purpose suit is a hard dialectic taskmaster.

Current approaches to the language processing lexicon illustrate different responses to these problems, but collectively diverge from Yorick's primitive-based lexical centre. In one strategy, as implemented for the LKB (LKB 2005), lexical entries have a rigorous formal (typed feature) structure, but have only very general semantic category features, like ANI, and relational features, like "telic", alongside syntactic data. In a complementary approach, as manifest in WordNet (WN 2005), there is a rich, fine-grained descriptive word classification, in the same ball park as Roget's, enhanced with some simple syntactic category and verb frame information. Both strategies have more limited aims than the earlier ones, but have more chance of being able to automate at least some of the lexicon building work, as with EuroWordNet (EWN 2005). FrameNet (FN 2005) is somewhat closer in spirit to the kind of lexicon envisaged in, or rather implied by, Wilks et al. (1987). It offers both conventional syntactic data and row/synset word sense class data, thus (roughly) combining LKB and WordNet-style approaches. But in addition, and most importantly, it is organised by semantic frames, like Activity or Ingestion, that are
1. Yorick comments on the relation between this work and some of my earlier work, but I am ignoring this here.
effectively a large number of low-level semantic category primitives, and by a set of semantic case primitives, like Manner or Place, that define frame slots. However FrameNet also conveys much of its lexical information by the most straightforward use of ordinary, detailed English without any concession to limited defining vocabularies, e.g. "Means of Ingestion: an act performed by the Ingestor that enables them to accomplish the whole act of ingestion". The detail involved also implies that, as with conventional lexicography, building FrameNet is a primarily human activity. A direct semantic comparison between an early Wilks lexical entry and a FrameNet one is hardly fair, because Yorick relied on explicit, separate pattern sets, notably the paraplates, to supplement individual word information, where FrameNet
Wilks "grasp"
FORMULA: (ANI SUBJ) (SIGN OBJE) ((THIS (MAN PART)) INST) (((SAME SIGN) (TRUE BE)) THINK)

FrameNet "grasp" [syntax detail]
FRAME: GRASP
Definition: A Cogniser possesses knowledge about the working, significance, or meaning of an idea or object, which we call Phenomenon, and is able to make predictions about the behaviour or occurrence of the Phenomenon. The Phenomenon may be incorporated into the wider knowledge structure via categorisation, which can be indicated by the mention of a Category. The Cogniser ...
Frame Elements:
Core:
  Cogniser (Semantic Type: Sentient): The sentient animate being who acquires new knowledge.
  Faculty: A part of a person's cognitive-emotional faculties that is said to acquire knowledge.
  Phenomenon: A state of affairs or dynamic system whose internal makeup and working the Cogniser comes to assimilate into their knowledge structure.
Non-core:
  Category: This expresses a general type or class of which the Phenomenon is considered an instance by the Cogniser, allowing them to make predictions about the qualities, occurrence or behaviour of the Phenomenon.
  Completeness: The extent to which the Cogniser has incorporated the workings and significance of a Phenomenon into their knowledge structure.
  Evidence, Manner, Reference-point, Time
Fig. 12.1. Wilks’ entry for “grasp” and the corresponding entry in FrameNet
supplies this information directly in its lexical entries. But even so, just considering Yorick's entry for "grasp", in the sense of grasp an idea, and the corresponding word sense one in FrameNet, as given in Figure 12.1, shows how different Yorick's original and one major modern view about the nature and role of primitives in a general-purpose lexicon are. What is less clear is how different the outcome of Yorick's ideas about the lexicon in what we may call his MRD-based phase, as illustrated by Wilks et al. (1987), and this modern product would be. Nor, of course, do we have any idea about how effective any of these lexicons would be for real, tough natural language processing tasks.

A great deal of use is made of WordNet, but much of this is because it is all there is and it has been found to be of some use. No one would say this implies it is the optimal semantic lexical resource. The very limited and most general semantic categories, like ANI or PHYSOBJ, recur in many dictionaries, with the same motivation as Yorick's formula head primitives. The question is whether there is value in additional modifying category primitives (not case ones) as well for semantic processing: grasp is not merely THINK, but a recognising TRUE BE sort of THINK. Again, corpus-based lexical sets and relations have been found practically useful. But this cannot be taken to demonstrate that all the linguistic reality that semantic primitives in some strong sense do appear to embody is wholly captured, even if left implicit rather than explicitly labelled, by current statistical operations.
12.6 The Iceberg

Returning now to my starting point, and the iceberg with which I began: the review of theories of meaning in Wilks et al. (1996) reminds us that the underwater part, below the semantic primitives tip, is very large and, spreading wide as well as deep, often very deadly. Much natural language processing (though not computational linguistic) work is pursued with a purely practical attitude, on the "go for it and if it works, fine" principle. Who cares about the underwater berg of theories of meaning if, as long as you are careful about data detail, you can get your system to work? But as Yorick's papers over the long period considered here show, you cannot build practical language processing systems without adopting some position about the base on which you are building, and foundations are not only theories, they are even metaphysics. This applies just as much to currently fashionable statistical approaches, where language models are only another form of dictionary and computing mutual information is only an unconventional form of parsing. These statistical approaches are appealingly austere and apparently without any metaphysical baggage. They nevertheless rest on a theory about how meaning is recognised, represented and manipulated, just like other more obviously theory-laden approaches do. Here Yorick's "language" account of meaning shows the iceberg with a less dangerous bulk than some other theories have.

But theories of meaning in general are not all there is to the underwater iceberg. There is also the closely related question of linguistic creativity, and notably of metaphor, where the kind of interpretive strategy that Yorick advocated is one way
of not getting wrecked. Again, though Yorick's position is implicit rather than explicit, his use and view of primitives offers one account of language universals. Since his semantic primitives started out as tools for conventional, bilingual translation they must, if effective, have some degree of universality. More generally, his account of meaning representation through the use of a meaning representation language which, however limited it is, has some crucial natural language properties like word sense ambiguity, implies some universality property for his primitives. But this is an Aristotelian, not a Platonic, account of universals.

Equally, Yorick's view of primitives and its grounding in the larger contextual-procedural approach to meaning determination sits on top of a whole iceberg mass of theories about language and the form and role of computational linguistics, as opposed to applications-oriented language processing. Here Yorick's emphasis on language process and action rather than simply description is entirely right, though there is no evidence that mainstream linguistics has taken any notice of this defining methodological and substantive contribution from computational linguistics to linguistics in general. However, though Yorick has emphasised the claim that his approach embodies syntactic as well as semantic conditions, so one does not need a distinct syntactic processor, one does not have to be a formal semanticist obsessed with quantification to feel that there is more to the part of the iceberg to do with the modules and architecture of an abstract language processor than this; and this applies both to the computational case and the cognitive one. Here modern approaches, where some (though perhaps weak) form of semantic primitive provides the bridge between lexical, syntactic and semantic components, of the kind illustrated earlier, are more convincing.

One other chunk of the iceberg below Yorick's earlier work also deserves comment. His approach to disambiguation relied not merely on the wider text context but on this larger context having a particular form of representation: his form of discourse representation was a shallow one, ordered and hence parallel with the surface text. This puts it in the same general class as Rhetorical Structure Theory as opposed to Schankian scripts, say, but the underwater iceberg here is vast. From this point of view it is a pity that Yorick's move to machine lexicons diverted him from what became a major area of research.
12.7 Primitive Preferences: Where are We Now?

In the final review, what does Yorick's use of semantic primitives offer us? The crux here is the one I touched on earlier, in discussing machine lexicons drawn from MRDs, namely how similar Yorick's and, insofar as they use semantic primitives, modern meaning representation languages actually are. This is not just a matter of whether their primitives have the same names or whether, when statistical derivation is in question, one can find primitive names: as against Yorick's claim in Wilks et al. (1996) that one cannot, one can point to obvious strategies like taking the most frequent word from a set of grouped rows. In such a case one might get "act" -> ACT, but this is near enough Yorick's DO, especially when on Yorick's
own principles one does not suppose that there is any one intrinsically correct primitive set. One of the engaging features of Richens' original NUDE, and of Yorick's development of it, was just how natural it was as a paraphrasing language, and even how joyous it was to use as a language of communication: what is a surprise but a BANG DO? One might imagine re-forming modern feature-structured lexical entries to give them a Wilksian shape, but the result is far from a stimulating pidgin: developing the illustration for "ammeter" and "measure" in Wilks et al. (1996) we might get (measure PURPOSE) (Solid/Movable PHYSOBJ) and (Human/Sex-Unspecified SUBJE) (Abstract OBJE) ACTION which, though perfectly reasonable, lack zap.

Of course this is not exactly what these things are for. WordNet, FrameNet, MindNet (MN 2005), the annotation schemes used in such banks as PropBank (PB 2005) are all present players in the semantic lexicon space, and WordNet in particular is applied in practical task systems, for example for question answering. But as mentioned earlier, WordNet has been applied primarily because it is available, in system-usable form. It, and these other resources, have been built as general-purpose resources (as Roget's original Thesaurus was), and have not always been of value for particular tasks. They have been seen more as descriptive than as task and process-oriented, but though descriptions they can still be exploited predictively when amplified with appropriate application rules. There is thus, I maintain, no generic difference between them and Yorick's apparatus as primitive-exploiting semantic tools, however large the detailed differences are.

There are real questions, currently being rerun in the Semantic Web and ontologies world, about the relations between, and values of, domain-independent and domain-dependent semantic structures and about the relations between linguistic and world knowledge. There can be no unequivocal answers to these questions, as Yorick always recognised: his position was that while you can't make a language processor without semantic primitives somewhere, you choose your semantic primitive cloth, and tailor it, to suit your processor climate. So we should bear in mind, as possibly suggestive for the future, Yorick's Preference Semantics approach to text interpretation and the reasons he advanced for it, perhaps acknowledging his message thus:

(POINT : MAN [yw]) / (CAUSE / (FOLK/ ((THINK : SIGN) : FEEL)))
References

This list of references has two parts. The first contains items by other authors than Wilks that are cited in the text. The second contains items by Wilks that are cited in the text, in their temporal order. This second part is not exhaustive, but consists only of those items relevant to, and discussed in, the text.

Boguraev, B.K. (1979) Automatic Resolution of Linguistic Ambiguities. PhD Thesis, University of Cambridge, 1979; Technical Report 11, Computer Laboratory, University of Cambridge, 1979.
Boguraev, B.K. and Spärck Jones, K. (1983) How to Drive a Database Front End Using General Semantics. Proceedings of the Conference on Applied Language Analysis, Association for Computational Linguistics, 1983, 81–88.
EWN (2005): EuroWordNet, see http://www.illc.uva.nl/EuroWordNet/ (visited 2005).
Fillmore, C.J. (1968) The Case for Case. In Bach, E. and Harms, R. (eds.) Universals in Linguistic Theory. New York: Holt, Rinehart and Winston, 1968, 1–88.
FN (2005): FrameNet, see http://framenet.icsi.berkeley.edu/ (visited October 2005).
Herskovits, A. (1973) The Generation of French from a Semantic Representation. Memo STAN-CS-73-212, Stanford Artificial Intelligence Laboratory, Stanford University, 1973.
LKB (2005): LKB, see http://wiki.delph-in.net/moin/LkbTop (visited October 2005).
Masterman, M. (2005) Language, Cohesion and Form. (ed. Y. Wilks), Cambridge: Cambridge University Press, 2005.
MN (2005): MindNet, see http://research.microsoft.com/nlp/Projects/MindNet.aspx (visited October 2005).
PB (2005): PropBank, see http://www.cis.upenn.edu/~ace (visited October 2005).
Procter, P. (1978) (ed.) Longman Dictionary of Contemporary English. Harlow, Essex, UK: Longman.
Spärck Jones, K. (2000) R.H. Richens: Translation in the NUDE. In Hutchins, W.J. (ed.) Early Years in Machine Translation. Amsterdam: John Benjamins, 2000, 263–278.
Spärck Jones, K. and Tait, J.I. (1984) Automatic Search Term Variant Generation. Journal of Documentation, 40, 1984, 55–66.
WN (2005): WordNet, see http://wordnet.princeton.edu (visited October 2005).
Wilks, Y. (1965) Computable Semantic Derivations. ML 176, Cambridge Language Research Unit, 1965 (see also Computable Semantic Derivations, Report SP-3017, System Development Corporation, Santa Monica CA, 1968, and Chapter 2 in Wilks (1972)).
Wilks, Y. (1967) Semantic Consistency in Text – an Experiment. Report SP-2758/000/00, System Development Corporation, Santa Monica CA, 1967.
Wilks, Y. (1971) Decidability and Natural Language. Mind, LXXX, 497–520.
Wilks, Y. (1972a) Grammar, Meaning and the Machine Analysis of Language. London and Boston: Routledge, 1972.
Wilks, Y. (1972b) Lakoff on Linguistics and Natural Logic. Memo STAN-CS-73-457, Stanford Artificial Intelligence Laboratory, Stanford University, 1972.
Wilks, Y. and Herskovits, A. (1973) An Intelligent Analyser and Generator for Natural Language. Proceedings of the International Conference on Computational Linguistics, Pisa, 1973, (ed. A. Zampolli). Florence: Olschki.
Wilks, Y. and Schank, R. (1974) The Goals of Linguistic Theory Revisited. Lingua, 34, 1974, —. (originally Memo STAN-CS-73-368, Stanford Artificial Intelligence Laboratory, Stanford University, 1973.)
Wilks, Y. (1975a) Preference Semantics. In Keenan, E.L. (ed.) Formal Semantics of Natural Language. Cambridge: Cambridge University Press, 1975, 329–348. (originally Memo STAN-CS-73-377, Stanford Artificial Intelligence Laboratory, Stanford University, 1973.)
Wilks, Y. (1975b) An Intelligent Analyzer and Understander of English. Communications of the ACM, 18 (2), 1975, 264–274.
Wilks, Y. (1975c) A Preferential, Pattern Seeking, Semantics for Natural Language Inference. Artificial Intelligence, 6, 1975, 53–74.
Wilks, Y. (1975d) Primitives and Words. Proceedings of the Workshop on Theoretical Issues in Natural Language Processing (Tinlap), (ed. R. Schank and B. Nash-Webber). Association for Computational Linguistics, 1975, 38–41.
Wilks, Y. (1977a) Good and Bad Arguments About Semantic Primitives. Communication and Cognition, 10 (3/4), 181–221.
Wilks, Y. (1977b) What Sort of Taxonomy of Causation Do We Need for Language Understanding? Cognitive Science, 1, 1977, 235–264.
Wilks, Y. (1978) Making Preferences More Active. Artificial Intelligence, 11, 1978, 197–223, and in Findler, N.V. (ed.) Associative Networks. New York: Academic Press, 1979, 239–266.
Wilks, Y. (1982) Some Thoughts on Procedural Semantics. In Lehnert, W.G. and Ringle, M.D. (eds.) Strategies for Natural Language Processing. Norwood NJ: Lawrence Erlbaum Associates, 1982, 495–516.
Wilks, Y. (1983a) Does Anyone Really Still Believe This Sort Of Thing? In Spärck Jones, K. and Wilks, Y. (eds.) Automatic Natural Language Parsing. Chichester: Ellis Horwood, 1983, 182–189.
Wilks, Y. (1983b) Deep and Superficial Parsing. In King, M. (ed.) Parsing Natural Language, London: Academic Press, 1983, 216–246. Reprinted in Woods, W.A. and Fallside, F. (eds.) Computer Speech Processing. New York: Prentice Hall, 1985, 335–362.
Fass, D. and Wilks, Y. (1983) Preference Semantics, Ill-Formedness, and Metaphor. American Journal of Computational Linguistics, 9(3/4), 1983, 178–187.
Wilks, Y. et al. (1987) A Tractable Machine Dictionary as a Resource for Computational Semantics. Memo MCCS-87-105, Computing Research Laboratory, New Mexico State University, Las Cruces, 1987.
Wilks, Y. et al. (1989) A Tractable Machine Dictionary as a Resource for Computational Semantics. In Boguraev, B. and Briscoe, T. (eds.) Computational Lexicography for Natural Language Processing. London: Longman, 1989, 193–228.
Wilks, Y.A., Slator, B.M. and Guthrie, L.M. (1996) Electric Words. Cambridge MA: MIT Press, 1996.
13 Molecules, Meaning and Post-Modernist Semantics

John Tait and Michael Oakes
Department of Computer Science, University of Sunderland, Sunderland, UK

Abstract:
Wilks' early English/French Machine Translation system was based on a notion called Preference Semantics. There were two key components of Preference Semantics. First was the notion of combining elementary meaning units of some kind (in Wilks' case effectively surrogates for the categories of Roget's thesaurus) in structures of arbitrary complexity and fineness of description. Second was the notion of meaning selection (in this case, choice of translation term) being one of preferential or balanced ranking rather than absolute selection. While Wilks' system was driven by a dictionary hand crafted in much the manner of conventional lexicographic work, Wilks' colleagues (and specifically Spärck Jones) were very interested in what would now be called supervised and unsupervised learning of these lexical structures. Such learning is probably needed to build a practical language processing system based on these ideas. The paper looks at these notions of molecular word meaning definitions and their acquisition in terms of modern developments in supervised and unsupervised learning. It will go on to look further at the notion of preference in the light of post-modernist notions of semantics developed by Zuidervaart amongst others, and then look briefly at how one would go about constructing a Wilks-like Machine Translation system using today's state of knowledge.
13.1 Introduction

This paper is a critical retrospective of Yorick Wilks' work in the 1960s and early 1970s based on an approach to the semantics of natural language called preference semantics. We will mainly focus on word meanings and their selection: in other words on lexical semantics. However Wilks' work was undertaken with the intention of producing working Machine Translation (MT) systems and in doing so inevitably had to deal with many tasks other than word meaning selection. Although we touch on these we make no attempt to cover them in any depth, not least because of the difficulty of doing so in a paper of this length. We do, however, attempt to relate this early work of Wilks to more recent developments: especially in machine learning. We also make some effort to relate Wilks' work to other views of lexical semantics including post-modernist ones. Some of these other views might be characterised as atomist in character; a view Wilks then (and now) seems to reject: hence our title.

In particular we move from a consideration of the dictionary entries and preference semantics approach used in Wilks' early system, to a more detailed consideration of Word Sense Disambiguation (a task undertaken by the Wilksian system and revisited more recently by Wilks and colleagues). We then look at the problem of Out Of Vocabulary (OOV)
words and review how this and other phenomena of real language can be dealt with by automatic thesaurus generation. Towards the end we also attempt to sketch a component of what might be described as a rational reconstruction of Wilks’ MT system from the 1970s.
13.2 Wilks' Dictionary Entries

Wilks' early English-French Machine Translation (MT) systems might reasonably be described as being lexically focussed, in the sense that the primary repository of linguistic knowledge was the dictionary and not, for example, a separate grammar or ontology. Wilks (1975a) describes a language of 70 or so primitive semantic units plus a few classes or groupings of more primitive units (such as animate entities). The primitive elements may be combined in formulas to "express the senses of English words; one formula to each sense" (p. 266). Wilks (1975a) contains several example dictionary entries for both nouns and verbs. One example is for drink:

"drink" (action) → ((∗ ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((∗ ANI (THRU PART)) TO) (BE CAUSE))))))

This representation language will be discussed in more detail in Section 13.3 in the context of the overall preference semantics scheme, but first we will make some observations about Wilks' dictionary entries. First, they importantly distinguish word senses. In other words they are based on a position that word senses have a reality rather than, for example, the view that when a writer selects a word they do so taking account of ALL the senses of a word, and their relationship with the other elements of the language. Second, they exist independently of the task: they are about the source or target language, but not about the relationship between particular source or target languages.

Word sense is of course a notoriously difficult and slippery notion, as Kilgarriff (1997), and indeed Wilks et al. (1996) (p. 68ff) and Wilks (1997), for example have pointed out. The commitment to word senses bogs Wilks down (unnecessarily in our view) in dealing with the ill-specified problem of word sense disambiguation. Kilgarriff's arguments are monolingual and task independent in character. They are based on the need to decide to split, merge and distinguish word sense in a monolingual dictionary without consideration of other languages or of task. Wilks could avoid this problem by focussing on the task of MT (his real aim), in which the distinction between word senses could be drawn by the need to use different words in the target translation. In practice, however, Wilks' objective was not the relatively well defined task of English-French translation, but the much less well defined task of translation between arbitrary language pairs via an interlingua.

Now Wilks' verb definitions are inevitably complex and somewhat daunting at first sight. They incorporate case frames and the preferred entries for those case frames. The noun entries are equally complex, in general giving enough
information to select a word sense from some sort of deeper representation: consider the generation of a description of a scene from a representation of the results of running an object recognition system. However Wilks' dictionary entries are not as complex as equivalent entries in full scale wide coverage dictionaries, like the Longmans Dictionary of Contemporary English (LDOCE) or Wordnet (Fellbaum, 1998). Sticking with the word "drink"1 used in the example from Wilks above, Wordnet identifies five senses of drink:

Sample Senses of the verb drink from Wordnet:
Sense 1
drink, imbibe
  EX: They drink
  EX: The animals drink
Sense 2
drink, booze, fuddle
  EX: They drink
Sense 3
toast, drink, pledge, salute, wassail
  ∗ > Somebody ----s something
  ∗ > Somebody ----s somebody
  ⇒ Somebody ----s PP
Sense 4
drink in, drink
  ∗ > Somebody ----s something
Sense 5
drink, tope
  ∗ > Somebody ----s

Consider further a sample definition for one sense of the verb "drink" from the LISPIFIED version of LDOCE (LDOCE has four verbal senses for "drink" rather than five):

((drink)
 (1 D0195300 !< drink)
 (2 1 !< !<)
 (3 drINk)
 (5 v !< drank ∗ CC/dr ∗ 67 Nk/∗ 44 !, ∗ 45 drunk ∗ CC/drVNk/)
 (7 100 !< T1 !; I ∗ DE : (∗ CA DOWN !, OFF !, UP ∗ CB) !<----!<----Q----L)
 (8 to swallow (liquid))
 (7 200 !< T1 !< ----!<----Q----G)
 (8 to take in or suck up : ∗ 46 drinking air into his lungs)
 (7 300 !< T1 (∗ 46 to ∗ 44) !< EN– !<----H---YT)
 (8 to give or join in (a ∗ CA TOAST ∗ CB ∗ 8B (2)) ∗ 63 see ∗ CA HEALTH ∗ CB (3))
 (7 400 !< I ∗ DE !< BW– !<----H)
1. In case the reader thinks "drink" is an especially difficult case, consider, for example, the entry for "announce" presented in Wilks et al. (1996) p. 113.
 (8 to use alcohol !, esp !. too much : ∗ 46 He doesn!'t smoke or drink !. ! He drinks like a fish)
 (7 500 !< X9 !< BW– !<----H---YX)
 (8 to bring to a stated condition by taking alcohol: ∗ 46 He drank himself into unconsciousness !. ! He drank his troubles away ∗ 44 ∗ 63 see also ∗ 45 drink someone under the ∗ CA TABLE))

For our purpose the precise details stored in these large scale dictionaries are not relevant. The point is that Wilks' entries, although complex at first sight, are significantly less complex than real dictionary entries. If one were to build a bilingual machine translation system one would need to include in the dictionary only the possible translations for a given word or phrase and the means of selecting them. So the dictionary need only contain the target word plus sufficient information to allow it to be selected from amongst a set of possible targets and to allow correct morphological form and word order to be generated. It might be that this would allow one to operate with simpler dictionary entries than these. However, as noted above, Wilks' system is intended ultimately to support multiple languages and the dictionary entries need (in effect) to support an interlingua.

Wilks' work clearly began with work committed to MT via standardised interlingual representation. In particular it derives from the NUDE system developed at the Cambridge Language Research Unit (CLRU) (Spärck Jones, 2000). True interlingual representations need to support all possible surface distinctions and mappings in both the source language and target language sides. The route taken in CLRU and so in Wilks' work was to provide a means of deriving formulas constructed from elements of Roget's (1852) thesaurus to distinguish different senses of a word: hence the terminology of Wilks (1975a). It is difficult at this distance in time to determine whether there was a development strategy at CLRU to deal with the problems this would pose if at some point a wide range of source and target languages were supported. On the one hand, the infinitely extensible formulas could provide ever more fine grained distinctions as new distinctions required by new languages became apparent. On the other hand, it is difficult to see how the work of splitting source language word senses could be effectively undertaken, certainly by a reasonable manual effort.

Wilks' own position, even in early years, seems ambiguous on this point. Wilks (1971), in section 2.3, during a discussion of sense, notes: " the senses of the English words may equally well be explained and distinguished by means of their French equivalents ", thereby rather tying them to this language pair. However, to be fair, in the introduction to that paper Wilks makes it clear the project is focussed on producing a "working artefact" (an English-French MT system), not on settling general questions.

It is interesting to compare the 70 or so primitives used by Wilks with the number of equivalent words in a well constructed wide coverage dictionary, like
LDOCE, presented above. Wilks et al. (1989) (p. 201) identifies 2166 words used as primitives or controlled vocabulary in the Longmans Dictionary of Contemporary English (LDOCE). There is an interesting slide here of course between primitives (which one might take as something formal and somehow separate from actual human language) and primitive words. In particular Wilks et al. (1989) note the primitive words from LDOCE are more ambiguous than general words in the vocabulary, whereas Wilks (1975a) tends to imply the primitives are unambiguous. Wilks' early work was conducted with a respectable (for its day) vocabulary of around 600 words: 20 years later results were being reported with vocabularies less than one tenth that size (see Wilks et al. 1996, p. 2). However, this is very much less than the 27,000 words with 74,000 senses in LDOCE (Wilks et al. 1989). It would be an impossibly daunting task to produce full, consistent Wilksian formulae for a vocabulary on this scale by hand.
13.3 Preference Semantics

Wilks' system of preference semantics was designed to work on the problems of word-sense ambiguity, case ambiguity of prepositions and pronoun reference resolution, as part of the MT system described in Wilks (1975a, 1975b). "Preference" means that utterances are not either grammatical or wrong, but have varying degrees of likelihood or acceptability. Ambiguity resolution is then a case of finding the preferred reading of an utterance rather than a single right answer – this requires the introduction of a scoring mechanism (Shann, 1984) (p. 76) or at least an ordering system. Wilks' scoring method was to find the interpretation of the sentence which enabled the greatest number of individual semantic preferences to be fulfilled. As noted in Section 13.2, the English input text is processed to derive an interlingua which can be used as the basis of generating the target language text (Wilks, 1973). Wilks and Schank both made essential developments of the basic idea of using semantic primitives, enabling the exploitation of semantic as well as syntactic information (Spärck Jones, 1986). Wilks' primitive elements include entities such as (MAN, THING, ACT, STATE), deep cases such as (SUBJ, GOAL, CAUSE, LOCATION), actions (MOVE DROP FLOW COMPEL), and others (KIND, HOW, CONTAINER, GOOD). There is a certain amount of ambiguity about the exact number of elements, and the way they may be combined in formulae, but Wilks (1975a) makes it clear the language is seen as closed rather than arbitrarily extensible.

Shann (1984) (p. 77) gives the example "grasp" as follows:

[∗ ANI SUBJ] [∗ PHYSOB OBJE] [[THIS [MAN PART]] INST] [TOUCH SENSE]

For each subformula, the right hand component is the case or act (syntactic), while the left hand component is its value (semantic). Specifically, the four subformulae can be interpreted as follows:

[∗ ANI SUBJ] the preferred agent is animate
[∗ PHYSOB OBJE] the preferred object is a physical object
[[THIS [MAN PART]] INST] the preferred instrument is a human part (the hand)
[TOUCH SENSE] the action is of physical contact – the rightmost subformula gives the overall sense of the word.

Shann gives the caveat that when using semantic primitives, the codings have a certain amount of vagueness, and it might be that such a system cannot remain stable with large vocabularies of several thousand entries. To overcome this, in later versions of the system, Wilks (1980) suggested underpinning the whole vocabulary by a thesaurus-like structure to impose more consistency. The meaning of the input text is represented in a semantic block that consists of a sequence of templates. Templates are patterns with an agent-action-object structure, e.g. [man cause thing] (triple). The templates are linked together by case ties, and by linking anaphora with their references. These templates are discussed at some length in Wilks (1975b) and derive from much earlier work (Wilks, 1964). Even if not all the preferences are fulfilled, an interpretation will be accepted so long as it is not possible to interpret the text in a better or higher scoring way. Wilks (1976) sums up his scoring method as follows: The template finally chosen for a fragment of text is the one in which most formulas have their preferences satisfied.

This is an important idea, since it provides Wilks with a mechanism which allows the system to not only resolve lexical ambiguity but also referential ambiguity, or more to the point (some) anaphora and therefore coreference chains. In many ways one of the most interesting ideas in Wilks (1975a, 1975b) is the approach to anaphora resolution combining initial processing using only linguistic knowledge (template and preferences) combined with locality or perhaps focus. If this leaves anaphora unresolved resort is made to common sense inference rules. The outline of this approach finds resonances in much later work like Carter (1987), Kennedy and Boguraev (1996) and has led to the present generally accepted position that anaphora and coreference resolution systems need to use different kinds of knowledge in concert (see Mitkov, 2003, for a review). This contrasts with much work undertaken contemporaneously with Wilks' efforts in the 1960s and 1970s, in which world knowledge and inference were given prominence to the exclusion of linguistic knowledge. However a full analysis of the contributions of this early work of Wilks to automatic coreference chaining and anaphora resolution is really beyond the scope of this paper.
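To make the scoring mechanism concrete, here is a minimal sketch in Python (our own illustration, not Wilks' implementation): the sense "formulae", the preference slots and the toy nouns are all invented stand-ins, and only agent and object preferences are counted.

from itertools import product

# Invented verb sense "formulae": each states the semantic class it prefers
# for its agent and for its object.
VERB_SENSES = {
    "drink1": {"agent": "ANI", "object": "FLOW-STUFF"},   # swallow (liquid)
    "drink2": {"agent": "MAN", "object": "SIGN"},         # take in (information)
}

# Invented noun senses, each tagged with the semantic classes it belongs to.
NOUN_SENSES = {
    "soldier": [("soldier1", {"MAN", "ANI"})],
    "beer":    [("beer1", {"FLOW-STUFF", "PHYSOB"}), ("beer2", {"ABSTRACT"})],
}

def score(verb_sense, agent_classes, object_classes):
    """Count how many of the verb sense's preferences are satisfied."""
    prefs = VERB_SENSES[verb_sense]
    return int(prefs["agent"] in agent_classes) + int(prefs["object"] in object_classes)

def best_reading(agent_word, verb_senses, object_word):
    """Return the reading that satisfies the most preferences (it need not satisfy all)."""
    readings = []
    for (a_sense, a_cls), v_sense, (o_sense, o_cls) in product(
            NOUN_SENSES[agent_word], verb_senses, NOUN_SENSES[object_word]):
        readings.append((score(v_sense, a_cls, o_cls), a_sense, v_sense, o_sense))
    return max(readings)

print(best_reading("soldier", ["drink1", "drink2"], "beer"))
# -> (2, 'soldier1', 'drink1', 'beer1'): the "swallow liquid" reading wins.

The point of the sketch is only that the winning reading is the highest-scoring one, not one that has passed an absolute grammaticality test.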
13.4 Word Sense Disambiguation

13.4.1 Introduction to Word Sense Disambiguation

A word is semantically ambiguous if it has more than one sense. Word Sense Disambiguation (WSD) is the process of deciding which sense is correct in a given context. WSD is important in machine translation, such as for the translation of queries in Cross Language Information Retrieval. It is also necessary for automatic thesaurus generation, a task we will return to later, as different senses of a word should be placed in different parts of the thesaurus.
It should be noted that the general WSD task is not well defined. It is intimately connected with more specific (or perhaps grounded) kinds of task, whether they are operational (MT, IR or whatever) or evaluative. For example in machine translation different senses of a word are distinguished by having different translations into a target language, so the number and range of different senses will depend on the target language under consideration. Furthermore, as noted earlier, the very reality of distinguished and separate word senses is controversial (Kilgarriff, 1997). However, it is clear that the idea of Word Sense Disambiguation at some level is a useful one for many language processing tasks, and that Wilks' early work has been influential on this subfield of natural language processing. In this section we review developments in WSD since Wilks' early work as a foundation for our sketch of a modern Wilksian system presented in Section 13.8.

SENSEVAL was an open evaluation exercise for WSD programs (Kilgarriff and Palmer, 2000). A corpus manually annotated with the correct sense of each word is used as a "gold standard", against which the output of each of the competing programs is compared. The SENSEVAL systems can make use of the rich WordNet thesaurus rather than a simple lexicon. For each homograph, there is a separate entry for each sense distinction, including fields for word sense definition, POS information and examples of usage.
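As a minimal illustration of the gold-standard methodology (our own sketch: the sense labels and texts are invented, and real SENSEVAL scoring also allows for fine- versus coarse-grained matches and partial credit), a system's output can be scored against the annotated corpus as follows:

def wsd_accuracy(system_tags, gold_tags):
    """Proportion of word instances tagged with the gold-standard sense."""
    assert len(system_tags) == len(gold_tags)
    correct = sum(1 for s, g in zip(system_tags, gold_tags) if s == g)
    return correct / len(gold_tags)

gold   = ["bank%river", "bank%finance", "bank%finance"]
system = ["bank%river", "bank%river",   "bank%finance"]
print(wsd_accuracy(system, gold))   # 0.666...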
13.4.2 Word Sense Disambiguation: A Modern Example

The process of WSD is illustrated by Rayson and Wilson's SEMSTAT (Thomas and Wilson, 1996), a semantic tagger which reads in a text and assigns a code number standing for a particular word sense to each word in that text. Each word is first looked up in a lexicon to see what senses that word could possibly take. Many words are unambiguous, but if more than one sense is possible for a given word, WSD techniques come into play, making use of the following types of information:
1. The part-of-speech (POS) tag assigned by the CLAWS POS tagger (Garside et al. 1987). For example, if "book" is a verb, we know it must mean "make a reservation". Wilks and Stevenson (1996) have shown that POS tagging greatly assists in the problem of word sense disambiguation. They found that 95% of word types in LDOCE could potentially be disambiguated in this way to the homograph level. In addition, using POS filtering of word senses is a safe method which is unlikely to reject the correct sense (Stevenson and Wilks, 2001).
2. The general likelihood of a word taking a particular meaning, as found in certain frequency dictionaries. For example, if a corpus has 9730 occurrences of "green" in the sense of colour, and only 64 occurrences of "green" in the sense of a village green, then the simplest technique (used as a baseline for the evaluation of WSD systems) is to assume that the more common sense of "green colour"
is always the correct one. According to Allen (1995), this simple technique is about 70% accurate over English as a whole.
3. Manually extracted idiom lists are kept. Whenever one of the stored idioms is found in the text, it is assumed that the idiomatic meaning of the phrase as a whole (such as "shake in one's boots/seat/shoes") is more likely than individual interpretations of the words.
4. The topic of the text can be an important indicator. For example, if the text is concerned with computers, then "Java" is unlikely to refer to the island. This accords with Yarowsky's (1995) principle of "one sense per discourse".
5. Special rules have been developed for the auxiliary verbs "be" and "have".
6. Collocations are pairs or groups of words that frequently appear in the same context. The technique of proximity disambiguation is to scan the immediate vicinity of each ambiguous word to look for collocates of the word which suggest a particular interpretation. For example, if we find the ambiguous words "rock" and "roll" close by each other, they probably both refer to music. The amount of text on either side of a word in which we look for collocates is called the window. One statistical measure of collocation strength is Mutual Information (Church and Hanks, 1990); a small illustrative sketch of this measure follows the list.
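As a rough illustration (ours, with invented counts; Church and Hanks work from corpus frequencies and a fixed window size), pointwise Mutual Information can be computed directly from co-occurrence counts:

import math

def pmi(pair_count, count_x, count_y, total_windows):
    """Pointwise mutual information, log2 of P(x,y) / (P(x)P(y)), from raw counts."""
    p_xy = pair_count / total_windows
    p_x = count_x / total_windows
    p_y = count_y / total_windows
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: "rock" and "roll" co-occur in 120 of 1,000,000 text windows.
print(round(pmi(120, 1500, 900, 1_000_000), 2))   # about 6.47: a strong association

A large positive value indicates that the pair co-occurs far more often than chance would predict, which is exactly the signal proximity disambiguation exploits.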
13.4.3 Use of Machine Readable Dictionaries for WSD

Cowie et al. (1992) employ the rationale that "word senses which belong together in a sentence will have more words in common in their definitions than word senses which do not belong together". Consider the overlap in the following dictionary definitions (Lesk, 1986), which strongly suggests that sense 1 of "pine" corresponds with sense 2 of "cone":

Pine 1 kinds of evergreen tree with needle-shaped leaves
     2 waste away through sorrow or illness
Cone 1 solid body which narrows to a point
     2 fruit of certain evergreen trees

The earliest example of this approach was described by Lesk (1986), and it was enhanced by Cowie et al. (1992) with a machine learning technique called simulated annealing. Similarly, a score can be given for the number of identical words occurring in each of the dictionary examples of an ambiguous word and a text window containing that word. The sense with the highest scoring example is chosen. Various measures of similarity between text sequences and dictionary examples (glosses) have been suggested. Interestingly, Cowie et al. (1992) contrast numerical techniques such as these with semantic techniques, which are based on linguistic information such as semantic preferences (Wilks, 1975a) as discussed in Section 13.3. Semantic techniques are noted to require extensive hand-crafting, e.g. assigning semantic categories to nouns, preferences to verbs and adjectives: an implicit criticism echoing the concerns about the feasibility of this approach we originally touched upon in Section 13.2.
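The Lesk overlap idea itself fits in a few lines. The sketch below is our own simplification (simple whitespace tokenisation, a tiny hand-picked stop list, no stemming), using the pine/cone glosses quoted above:

SENSES = {
    ("pine", 1): "kinds of evergreen tree with needle-shaped leaves",
    ("pine", 2): "waste away through sorrow or illness",
    ("cone", 1): "solid body which narrows to a point",
    ("cone", 2): "fruit of certain evergreen trees",
}
STOP = {"of", "to", "a", "with", "which", "through", "or"}

def overlap(sense_a, sense_b):
    """Number of (non-stop) words shared by two sense definitions."""
    words = lambda s: set(SENSES[s].split()) - STOP
    return len(words(sense_a) & words(sense_b))

best = max((overlap(("pine", i), ("cone", j)), i, j)
           for i in (1, 2) for j in (1, 2))
print(best)   # (1, 1, 2): pine sense 1 pairs with cone sense 2 via "evergreen"
              # (with stemming, "tree"/"trees" would also match)

Cowie et al.'s contribution was to make this kind of overlap maximisation tractable over whole sentences, where the number of sense combinations explodes, by searching the space with simulated annealing.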
13.4.4 Use of Thesauri in WSD

Sussna (1993) makes use of the semantic distance between the nodes representing concepts in WordNet, which he takes as the sum of the weightings of the edges connecting them by the shortest possible route. These edge weightings take into account the type of word relations represented by the edges (synonymy has an edge weighting of 0, antonymy 2.5, while other relations have intermediate weightings) and the density or fan-out of the tree at that point. For example, the node corresponding to the tree sense of "pine" would be close to the tree sense of "cone" (perhaps separated only by the part_of relation), while the pine_away sense of "pine" would be quite distant. Voorhees (1993) also used WordNet for WSD, making use of the is_a relations. The use of a thesaurus enables the use of expert system-like inference, such as spreading activation (Sussna, 1993; Voorhees, 1993). It also enables the notion of conceptual density. All possible senses of all content words in the input sentence are marked in a hierarchy such as WordNet (Fellbaum, 1998). The portion of the hierarchy with the greatest concentration of marked nodes (including one for the word being tested) will reflect the sense of the test word.

A thesaurus can be used to overcome the problem of data sparseness. A recurring problem with WSD, compared with POS tagging, is that there are more word senses than syntactic categories, meaning a much larger amount of training data is required. Use of a thesaurus helps overcome this problem, as frequencies of word classes are studied rather than those of individual words. One can thus count matches which involve semantically related words (such as words with the same Roget's thesaurus categories) in the matching score, rather than insisting on exact word matches. This is, of course, especially of interest here because of the interest in Roget at CLRU from which Wilks' work originated.

13.4.5 Use of Bilingual Corpora for WSD

Gale, Church and Yarowsky (1992) used machine readable texts and their translations, noting for example that the sense of "drugs" which translates into French as "médicaments" collocates with "prescription", "patent" and "generic", while the sense which translates as "drogues" collocates with "abuse", "paraphernalia" and "illicit". Other authors who have used bilingual corpora in WSD are Dagan et al. (1991) and Brown et al. (1991). One way in which bilingual (especially parallel) corpora may be used is as training sets for supervised machine learning (see below).

13.4.6 Machine Learning Techniques for WSD

Supervised learning approaches are trained on manually sense-tagged text, where each instance of a word is assigned to one of a set of established sense definitions, such as dictionary entries. With unsupervised learning the classification of the data is not known beforehand (Argaw, 2005). Unsupervised learning by clustering was
done by Yarowsky (1995), who used the term "sense induction" for the use of distributional similarity to partition word instances into clusters that may have no relation to established sense definitions. As an example of unsupervised learning, Biber (1993) used the multivariate statistical technique of factor analysis to discover four basic senses of the word "right" in a corpus, according to their various collocates.

Machine learning approaches to WSD have always been hampered by the problem of data sparseness – since manual sense tagging is laborious, there is simply not enough accurately annotated training data available to train the parameters. One solution is to use pseudowords, where all occurrences of two quite distinct words are artificially conflated, to see whether their original meanings can be recovered by the WSD algorithm (Sanderson, 2000). Another solution is the use of bootstrapping approaches, which are able to start with just small amounts of humanly annotated training data and much larger amounts of untagged data. The training data is enough to partially train the algorithms, which are then used to tentatively tag the remaining data. We then retain only those tags which were assigned with most confidence, and add them to the training data, while the less confidently assigned tags are disregarded. The whole process is repeated, and at each iteration the pool of tagged training data becomes greater, and the remaining untagged data becomes less, until eventually all the data becomes tagged. Mihalcea (2004) refers to self-training when a single algorithm is used throughout, and to co-training when different algorithms are used for each iteration, each one tagging a few more of the untagged examples.

A further solution is to discover correlations among the various features which might be used in a WSD algorithm, and to collapse similar features (i.e. those having similar distributions) into fewer dimensions. One approach to this is to use a matrix decomposition technique called Singular Value Decomposition (SVD) (Agirre et al., 2005). As noted in Section 13.4.4 another potential solution to the problem of data sparseness is to use thesauri to augment the data, an approach foreshadowed by Spärck Jones (1986) but as yet (so far as we can determine) untested.
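The bootstrapping loop just described can be sketched very compactly. The following is our own schematic version, with an invented classifier interface (fit/predict returning a sense label and a confidence); a real system would plug in an actual learner, feature extraction and a stopping policy tuned on held-out data:

def self_train(classifier, labelled, unlabelled, threshold=0.9, max_iters=10):
    """Repeatedly add the classifier's most confident predictions on
    untagged examples to the labelled pool, then retrain."""
    for _ in range(max_iters):
        classifier.fit(labelled)                       # retrain on the current pool
        confident, still_untagged = [], []
        for example in unlabelled:
            sense, prob = classifier.predict(example)  # (sense label, confidence)
            if prob >= threshold:
                confident.append((example, sense))
            else:
                still_untagged.append(example)
        if not confident:                              # nothing new tagged confidently
            break
        labelled = labelled + confident
        unlabelled = still_untagged
    return classifier, labelled, unlabelled

Co-training in Mihalcea's sense would alternate two such classifiers, each one labelling examples for the other.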
13.4.7 Combination of Knowledge Sources

Stevenson and Wilks (2001) used a suite of techniques to filter out incorrect senses of words. They used part-of-speech filtering, degree of dictionary definition overlap, and collocation-based filtering. They also used selectional preferences in the sense described by Wilks (1975a). LDOCE word senses are labelled with selectional restrictions expressed by 36 semantic codes. Named entities such as the names of persons and organisations can also be mapped onto these codes (human and abstract respectively). Relations between nearby words such as adjective-noun, subject-verb and verb-object are identified, and all senses of each word in the relation are considered in turn. According to the semantic codes of the word senses, restrictions are placed on the sentence, and combinations of word senses which do not fulfil these restrictions are filtered out. In the sentence "John ran the hilly course", "hilly" has only one
word sense (undulating terrain) while "course" has two LDOCE senses: route and programme of study. The only word sense of "hilly" comes with the semantic restriction that it must modify a nonmovable solid. The route sense of course has the restriction that it must be of type nonmovable solid, which is consistent with the semantic restriction on hilly. However, "course" in the sense of a programme of study is restricted in that it must be of type abstract, which cannot be modified by "hilly", so this second interpretation is rejected.

Use is also made of the LDOCE codes in the context of 50 words on either side of the ambiguous word, using an algorithm developed by Yarowsky (1992). For each possible sense of each word in the window, the following quantity is maximised, where Pr(w|SCat) means how likely it is to be word w given that its sense is SCat:

Σ_{w ∈ context} log ( Pr(w|SCat) / Pr(w) )
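As a worked illustration (ours, with invented probabilities and no smoothing), the code below scores the two LDOCE-style codes for "course" against a small context window and picks the higher-scoring one:

import math

def category_score(context_words, scat, p_word_given_cat, p_word):
    """Sum of log Pr(w | SCat) / Pr(w) over the words of the context window."""
    return sum(math.log(p_word_given_cat[scat][w] / p_word[w])
               for w in context_words)

# Invented probability estimates.
p_word = {"ran": 0.002, "hilly": 0.0001}
p_word_given_cat = {
    "route": {"ran": 0.01,  "hilly": 0.002},
    "study": {"ran": 0.001, "hilly": 0.00005},
}

context = ["ran", "hilly"]
best = max(("route", "study"),
           key=lambda c: category_score(context, c, p_word_given_cat, p_word))
print(best)   # "route": its context words are much likelier under that category

In practice Yarowsky estimates these probabilities from a large corpus and applies smoothing, since unseen words would otherwise give zero or undefined scores.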
The results from the above filters are combined using a machine learning algorithm called TiMBL memory-based learning, originally developed by Daelemans et al. (1996).

13.4.8 Conclusions on WSD

There are three main observations which are worth making for our purpose concerning current WSD efforts:
1. Most rely on existing manually constructed dictionaries and the word senses recorded in them;
2. They do not take account of longer range coherence phenomena (including coreference) let alone cohesion;
3. In style almost all the systems are to some extent the intellectual descendants of Wilks' (1973) system in the sense that they have a predefined dictionary and use some interword relations to perform disambiguation.
It is also worth noting that almost all the successful systems use some form of machine learning or optimisation.
13.5 The Problems of Out of Vocabulary Words and of Language Synchronicity

In this section we want to look at two problems which were all but ignored by early language work like Wilks (1973) and which are all pervasive in practical large scale language processing. The first is the problem of language change, or synchronicity. Word meanings do not remain the same through time. New meanings arise: others fade from common use.
New words defined in this year's edition of Merriam-Webster's Collegiate Dictionary include "metadata", "WiFi" and "hazmat" (material that would endanger life if released without precautions).2 Two words which have fallen into disuse are "deasil" and "widdershins", which were the words for clockwise and anticlockwise before clocks were commonplace. Semantic drift is the term used by linguists to describe a change in meaning over time. Examples of this are found in the King James Bible, where "charity" means "love" (1 Corinthians 13:4), "superstitious" means "religious" (Acts 17:22), and "addicted" means "devoted" (1 Corinthians 16:15).3 New words are coined as new concepts are invented, and may be borrowed from other languages. One of the new phrases in Merriam-Webster's dictionary is "amuse-bouche", meaning a small complimentary appetizer offered by some restaurants. Words disappear from language when concepts become obsolete, when two near synonyms compete for acceptance, or when concepts (such as swear words and terms of racial abuse) become taboo. The way language changes over time means that we need to have systems which take account of changes in the use of words as the language evolves.

Secondly, if one examines any piece of extended real text (like a newspaper or scientific paper) it is littered with words which will not occur in a conventional dictionary. These include proper names, including place names, personal names and organisational names; abbreviations, technical terms, ages, monetary values, web addresses, telephone numbers, symbols, times of day and so on. This of course is nothing new: the stimulus to deal with these sorts of real world phenomena was a major focus of the Message Understanding Conferences (MUC)4 if nothing else. One of the shortcomings of Wilks' (1973) system, for all its strengths for its time, was the lack of any obvious way to deal with these varied phenomena. There is an implicit assumption that the system dictionary, once constructed, will require manual intervention to change it. There is an assumption that all the source text words will be contained in it.

Therefore if we are to build a Wilks-like system to do machine translation or another practical language processing task we must provide solutions to the problems posed by language change. Further these solutions must be able to be integrated with a large pre-existing dictionary.
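A first step towards handling such material is simply detecting it. The sketch below is our own toy illustration (the dictionary and the tokenisation rule are invented): it flags the tokens of a text that are absent from the system dictionary, so that names, numbers and technical terms can be routed to special-purpose handlers rather than silently failing lexical lookup:

import re

DICTIONARY = {"the", "soldier", "drank", "a", "beer", "at", "cost"}   # invented toy lexicon

def out_of_vocabulary(text):
    """Return the tokens of text that are not in the system dictionary."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*|\d[\w.:/-]*", text)
    return [t for t in tokens if t.lower() not in DICTIONARY]

print(out_of_vocabulary("Dr. Smith drank a Pils at 6pm, cost EUR 4.50"))
# -> ['Dr', 'Smith', 'Pils', '6pm', 'EUR', '4.50']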
13.6 Solutions: Automatic Thesaurus Generation

One possible solution both to the problem of scalability of the dictionary and to the issues posed by OOV terms and semantic drift is to learn a dictionary (or a series of versions of dictionaries). This section reviews what is known about this task. Interestingly, the oldest citation is from 1971, and so is contemporaneous with Wilks' work becoming more widely known in the early 1970s.
2. www.brownielocks.com/words.html, checked 26 October 2005.
3. John Mark Ministries, jmm.aaa.net.au/articles/9267.htm, checked 26 October 2005.
4. See http://www.itl.nist.gov/iaui/894.02/related_projects/muc/, checked 26 October 2005.
13.6.1 Introduction

For the manual production of thesauri there are ANSI/NISO standard guidelines for the Construction, Format, and Management of Monolingual Thesauri (ANSI/NISO 1993).5 However, Grefenstette (1994) writes that manual thesaurus generation is often prohibitively costly, and thus there is a need for automatic thesaurus generation (ATG). Chen et al. (1995) state that the advantage of an automated method is that we can have a periodic, automatic update of the vocabulary and relationships in the thesaurus. In some respects the holy grail of automatic thesaurus generation is the deduction of semantic relationships exclusively from free text corpora (Grefenstette, 1994, p. 114): in effect the ultimate knowledge-poor approach. An example of a more knowledge-rich approach is Scannell's (2003) production of an Irish thesaurus derived from existing English resources. Chen et al. (1995), in their automatically generated thesaurus of nematode worm biology, use keyword lists such as gazetteers of researchers' names, subject indexes from a textbook, gene lists, and a list of experimental methods from the appendix of a textbook. Knowledge-rich approaches include all those which rely on sophisticated natural language processing tools, such as parsers (Grefenstette, 1994, p. 114). An example of this is Pereira et al. (1993), who employ a divisive clustering algorithm to group words according to their particular grammatical relations with other words. The approach is limited to specific grammatical relations, requiring a pre-processor to parse the corpus and tag the parts of speech. Chen et al. (1995) describe ATG as an example of knowledge discovery in databases or text data mining.

5. See http://www.niso.org/committees/MT_info.html
13.6.2 Looking for the Strength of Association Between Words

The main approach to automatic, language-independent thesaurus generation is based on term-term co-occurrence (Salton, 1971). First, decisions are made as to which terms should appear in the thesaurus. To select content words, Salton employed stoplisting, word stemming and term phrase recognition (sequences of adjacent keywords once stop words have been removed). Remaining words can be given a weight, such as the term-frequency/inverse document-frequency score (TF ∗ IDF), to represent their "descriptive power" within a particular document, and only the most descriptive words will be retained. The semantic similarity between two terms (t and j) is assumed to be a function of a (total occurrences of term t in the collection), b (total occurrences of term j in the collection) and c (total number of documents/context windows that contain the co-occurrence of both term t and term j). One such measure is "primed" mutual information, w = log(Nc/ab + 1), where N is the number of documents (or context windows) in the collection; a high positive weight w means that the terms co-occur more often than you would expect by chance. Using Salton's own cosine similarity measure, we ran a small experiment using a month's supply
of "Scrip" (an electronic newspaper circulated around the pharmaceutical industry) as the raw corpus, and the words most closely related to "arthritis" were as follows:
Stemmed Word    Cosine Similarity Measure
arthriti-       1
rheumatoid      0.903
Ra              0.470
centocor        0.373
idec            0.321
suspend         0.252
roussel         0.248
marion          0.248
biolog-         0.238
hoecht          0.238
Using the cosine similarity measure, a word which never appears in the same document as “arthritis” will have a similarity of 0, while “arthritis” has a similarity of 1 with itself. The high similarity scores reveal a variety of ways in which other words can be related to “arthritis”: “rheumatoid” is a type of arthritis, “RA” is an acronym for rheumatoid arthritis, “Centocor”, “Idec” and “Hoechst Marion Roussel” are companies which make drugs for arthritis, and trials of drugs for arthritis can be suspended. “Biolog-” is the stemmed form of some very broad concepts. A matrix is produced, containing a dissimilarity score for every pair of words in the vocabulary. In the fictitious example of five words below, the dissimilarity scores could have been derived by taking one minus the cosine similarity coefficient.
            dingo   goldfish   lion   tiger   wolf
dingo         –        .9       .5     .4      .2
goldfish     .9        –        .9     .9      .8
lion         .5        .9       –      .1      .5
tiger        .4        .8       .1     –       .5
wolf         .2        .9       .5     .5      –
By means of hierarchical agglomerative clustering, this matrix can be transformed into a dendrogram, which is an upside-down tree-like structure with the words of the thesaurus at the leaves, pairs of closely related words connected by short branches, and less closely related pairs only indirectly connected through branches higher up in the tree. Only the leaves correspond exactly to terms in the thesaurus, while branch points higher up the tree correspond to concepts which are a combination of the concepts represented by the leaves connected to that branch point. Using the nearest neighbour or single linkage technique, the matrix above produces the dendrogram shown in Figure 13.1.
Fig. 13.1. Word similarity dendrogram (leaves, from left to right: lion, tiger, dingo, wolf, goldfish; vertical axis: dissimilarity, 0.1 to 0.8)
Techniques of this sort have been shown to be feasible on a larger scale by Schütze and Pederson (1997), for example.
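The following sketch pulls the pieces of this subsection together: the "primed" mutual information weight, cosine similarity over co-occurrence vectors, and single-linkage (nearest neighbour) clustering of the fictitious five-term dissimilarity matrix above. The counts are invented, the matrix has been symmetrised for the clustering step, and the use of NumPy and SciPy is simply a convenient assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def primed_mi(a, b, c, n):
    """'Primed' mutual information weight w = log(n*c / (a*b) + 1)."""
    return np.log(n * c / (a * b) + 1)

def cosine(u, v):
    """Cosine similarity between two term co-occurrence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented counts: term t occurs in 40 documents, term j in 25, both together in 10 of 1000.
print(primed_mi(a=40, b=25, c=10, n=1000))

# Cosine similarity of two invented co-occurrence vectors (cf. the "arthritis" example).
print(cosine(np.array([3.0, 0.0, 2.0, 1.0]), np.array([2.0, 0.0, 1.0, 1.0])))

# The fictitious five-term dissimilarity matrix from the text (1 - cosine similarity),
# symmetrised here because distance-based clustering needs a symmetric matrix.
terms = ["dingo", "goldfish", "lion", "tiger", "wolf"]
dissim = np.array([
    [0.0, 0.9, 0.5, 0.4, 0.2],
    [0.9, 0.0, 0.9, 0.9, 0.8],
    [0.5, 0.9, 0.0, 0.1, 0.5],
    [0.4, 0.9, 0.1, 0.0, 0.5],
    [0.2, 0.8, 0.5, 0.5, 0.0],
])

# Single-linkage ("nearest neighbour") agglomerative clustering; Z encodes the dendrogram.
Z = linkage(squareform(dissim), method="single")
print(Z)  # each row: the two clusters merged and the dissimilarity at which they merge
```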
13.6.3 Looking for Semantic Relations Between Words

There are a number of problems with these purely numeric approaches. One is that synonyms (or variants such as "tumour/tumor") tend never to occur together, but tend to co-occur with an overlapping set of related words (for example, "tumour" and "tumor" will both co-occur with "brain" and "cancer"). There is thus a need to examine second-order co-occurrence, where we would see that "tumour" tends to co-occur with "brain", and in other documents "brain" tends to co-occur with "tumor", and so deduce that "tumour" and "tumor" are related (Peat and Willett, 1991). A much harder problem mathematically is how to capture (quantitatively) the thesaurus concepts "broader terms" and "narrower terms" (Hazewinkel, 1996). The WordNet thesaurus shows a variety of relations between words: synonyms, antonyms (opposites), hyponyms (is_a, i.e. a narrower term), meronyms (part-of), entailment (one entails the other, e.g. buy, pay) and troponyms (two words related by entailment which must occur at the same time, e.g. limp, walk). Purely numeric approaches generally do not distinguish between these types of word relations, since they produce only a generic distance between two "related" terms.
However, Ryu and Choi (2005) describe a method for deciding which of two related terms is the hypernym using the statistical measure of mutual information. We therefore need a method of identifying semantic (rather than merely numeric co-occurrence) relations. Using an altogether different approach, Hearst (1992) produced an automatic lexical discovery technique that uses lexico-syntactic patterns to find instances of hyponymy relations between noun phrases in the raw corpus. One such pattern is "NP1 such as NP2 (and|or) NP3", which matches the "Scrip" text at "cities such as Beijing and Guangzhou" to reveal the relations hyponym(cities, Beijing) and hyponym(cities, Guangzhou). This technique shows promise for other types of relations between terms, where for example "NP1 for the treatment of NP2" would match "Tacalcitol for the treatment of psoriasis" to reveal the predicate treats(tacalcitol, psoriasis). Hearst's technique is easy to implement, and reveals the nature of the relationships between words rather than just a measure of association strength. Semantic relations between verbs and the words they govern can also be found by induction of verb case frames (Schulte im Walde, 2007). A major difficulty with thesaurus construction, whether manual or automatic, as suggested by Atkins and Levin (1991), is that meanings may not be neat little packages attached to a word, but a blending from one dictionary sense to another. Further, "Generating a thesaurus in one go for a large area such as Mathematics is not feasible … Thus the problem arises of constructing several thesauri and to match them, i.e. to describe the degree of overlap between them" (Hazewinkel, 1996). Some authors have experimented with neural network architectures with unsupervised learning, such as the Self-Organising Map (SOM) (Roussinov and Chen, 1998; Hodge and Austin, 2002).
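The ease of implementation can be seen from the following deliberately simplified, regex-level sketch of the "such as" pattern; real implementations match over noun-phrase chunks rather than single tokens, so treating single words as stand-in noun phrases is an assumption made only to keep the example short.

```python
import re

# A deliberately simplified, regex-level version of one Hearst pattern
# ("NP1 such as NP2 (and|or) NP3"), treating single tokens as stand-ins for noun phrases.
HEARST = re.compile(r"(\w+) such as (\w+(?:(?:, | and | or )\w+)*)")

def hyponym_pairs(text):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in HEARST.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(r", | and | or ", match.group(2)):
            pairs.append((hypernym, hyponym))
    return pairs

print(hyponym_pairs("We visited cities such as Beijing and Guangzhou last year."))
# [('cities', 'Beijing'), ('cities', 'Guangzhou')]
```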
13.6.4 Conclusions on Thesauri

Although there remain a number of important research challenges, it is clear that there is much successful work on learning thesauri. Much recent work within the semantic web community (see for example Sabou et al., 2005) has focussed on narrow domains; nonetheless there is every reason to believe that this is an area where significant progress has been made since Wilks' work in the 1960s and 1970s.
13.7 Other Views of Lexical Semantics

One of the assumptions of Wilks' early work is that there is a common, agreed meaning for language (words) at some level of specificity and accuracy. One can hedge this around with constraints in terms of communities of language users, place, task, time, and so on, but there remains this assumption that pretty much what language means to me is what language means to you.
Winograd and Flores, in their extremely influential (1986) book, essentially developed a view of language as a tool of communication rather than an artefact in itself. This leads them to a position (in contrast to at least early Wilks) in which
for example the role of language in coordinated action means language is not really meaningful without communicants. To quote them somewhat out of context: "As we work within the domain we have defined we are blind to the context from which it was carved and open to the new possibilities it generates" (p. 178). This is an important view of the limitations of semantics, so important that we felt bound to mention it, but it is beyond the scope of this paper to reconcile it with the rationalist tradition from which Wilks comes. However, in this section we do want to set up some bridges to other viewpoints on lexical semantics.
We have characterised Wilks as a molecularist. This is in contrast to atomists, who believe words cannot be decomposed semantically. In particular, we wish to characterise Wilks' long-standing position as being one in which the meaning of words can be made up from combinations of simpler or more common elements, and we have christened these combinations molecules. This is but one aspect of Wilks' position; another is a commitment to word sense: individual words have different meanings, presumably distinguished by differing combinations of elements. Let us begin by outlining the atomist position.
13.7.1 The Atomist Approach – Fodor

One view of lexical semantics is that word meanings ("concepts" in Fodor's terminology) are not decomposable in some sense (see, for example, Fodor, 1998). This view leads Fodor to a rather odd position concerning our shared understanding of language and its inferential implications. For example, Pulman (2005) argues that Fodor's position leads us to a position where sequences like:
A: X killed Y, you know.
B: But Y is alive!
A: Yes, I know that.
are acceptable. A major reason for raising this here is that it presents a very interesting problem for machine translation. Consider the title of a page on the French web site www.maif.fr/site2/magazine/vie2.htm, "Drogues, médicaments, et vigilance en conduite automobile" ("drugs, medicines, and alertness when driving"), which is translated by Google as "Drugs, drugs and vigilance in automobile control". If monolingual (non-morphemic) word senses are really undecomposable, there is an implicit claim that there should be only one translation for a given word sense in any language: a position which appears to be contradicted by this example.
13.7.2 The Idiosyncratic Approach: Humpty Dumpty

In Lewis Carroll's "Through the Looking Glass" (Gardner, 1970), there is an interesting passage for our current purpose:
"But 'glory' doesn't mean 'a nice knock down argument'," Alice objected. "When I use a word," Humpty Dumpty said, in a rather scornful tone, "it means just what I choose it to mean – neither more nor less." (p. 269)
In one sense, of course, a writer in selecting a word has complete freedom to do so: but if that selection is too eccentric then there is a likelihood that the text fails to communicate effectively. Gardner analyses this in terms both of Carroll's serious philosophical writings and of its broader impact. From our point of view it does create one interesting challenge, though perhaps one for the distant future: if a writer's use of language is in the Humpty Dumpty style, how can we ensure our inferred lexicon is not polluted or led astray by this eccentric usage? In fact, part of Fodor's problem (mentioned in the last section) is the desire to avoid this extreme form of relativism.

13.7.3 A Post-modern Viewpoint – Zuidervaart

Both Fodor and Dodgson (aka Lewis Carroll) come from a philosophical tradition which includes Kant and Leibniz, and runs back to Plato and forward to Wittgenstein amongst others. Wittgenstein's philosophy of language was, of course, very influential in Cambridge during the period Wilks first studied the approaches to semantics embodied in his early work. A contrasting approach to meaning and truth has recently been put forward by Lambert Zuidervaart (2004). Zuidervaart's position rests on the notions of authenticity, significance and integrity, and although it is directed towards notions of the visual arts and music being "true", it also, interestingly, addresses a notion of lexical and propositional truth in the context of literature. This is relevant to our purpose here, because in a discussion of "understanding" Wilks (1975c) writes: "… there cannot be a main sense of 'understand' which 95% of the population never attain, for that is not what the word means" (p. 113). There might be a danger in equating "truth" and "understanding", but it seems a reasonable leap here. The idea we want to extract from Zuidervaart is that language conveys more than propositional correspondence. In restricting ourselves to modelling the formalist, mechanistic view of language, perhaps we are missing something critical in language needed to support endeavours like Machine Translation.
13.8 Back to the Future: A Modern Wilksian Approach to MT

We focus here on issues of word selection for the target language, not because this is the most important problem, but because a machine translation system clearly needs a target language word or term selection system, and limitations of space mean we cannot adequately treat issues of grammaticality, word order, or issues of textual cohesion like anaphora analysis or generation. Furthermore it is, perhaps, on word selection that the early work of Wilks has the most to say. For simplicity, we are going to consider primarily one pair of languages. We will assume there is a more or less infinite quantity of text in each language available, and
some text, but comparatively little, available which is in parallel in both languages. We also assume that in practice one would run many similar and compatible translation systems on numerous language pairs.
How would one go about building a modern Wilksian Machine Translation system? Seen with modern eyes, the primary problem for a Wilksian system is how to build a lexicon with adequate coverage. Wilks' dictionary entries are expressed in formulae, which are really a combination of thesaurus heads and case frame information. As indicated earlier, one possible approach is to use machine learning of one sort or another to identify potential heads. The review of Section 13.6 concludes that we have made significant progress with this problem, despite difficulties, for example, with identifying broader/narrower term relations between meaningfully labelled terms. This lack of meaningful labels may in fact not matter. In Lakoff (1972), for example, primitives very like the elements in Wilks' formulae are arbitrary names and any distinct symbol will do. This would allow techniques like Schütze and Pederson's (1995) use of SVD to agglomerate terms and so derive suitable monolingual thesauri from the monolingual corpora. This would probably provide a suitable mechanism for inferring Wilksian noun entries. Verb entries, especially the case frame information, might prove harder to learn mechanically. However, co-occurrence data might suffice here, although clearly resort to this would severely restrict the kinds of deeper, inferential, semantic processing which could be undertaken. There remains the problem of reconciling the two monolingual thesauri in a way which would allow translation. Since the number of derived broader terms is likely to be much smaller than the original number of baseline terms, there is at least hope that an algorithm like EM (Dempster et al., 1977) could derive a mapping between the two thesauri. We also imagine that the system would have access to the thesauri and translation dictionaries of all the other language pairs produced by the system. We assume that, run on a large scale (both in terms of numbers of texts and numbers of languages), the systems would converge on sets of underlying inferred nodes which would be equivalent to common semantic primitives: but of course this is unknown (Figure 13.2).
Fig. 13.2. A Modern Wilksian MT System (source and target bilingual dictionaries, monolingual thesauri and OOV handling around a core system)
Boguraev and Spärck Jones's (1981) work on natural language access to databases has shown that much of the power of a Wilksian system is obtained from the heads of the formulae, not the detail. This provides an interesting bridge to Fodor's position on lexical semantics, as it is essentially neutral between the atomist position and the contrasting position we have described as molecularist. Further, it makes the proposal tractable, since effectively the broader-term sets in the monolingual thesauri correspond to the heads of formulae. So in effect what is proposed is a means of inducing a simple Wilks-like dictionary. While there are many elements of a modern reconstruction of a Wilksian machine translation system which could not adequately be reconstructed using this simple machine learning approach, it could provide bag-of-words or word-for-word translations. This allows it to be tested via word-overlap measures like BLEU (Papineni et al., 2001) and the use of either new test parallel corpora or held-back data from the training set. So one might envisage a semi-supervised process which proceeds as follows:
1. Scan through the incoming source language text, disambiguating and translating each token which occurs in the dictionary for the language pair;
2. For each OOV word, identify whether any other dictionary for the source language contains that word.
   a. If so, see whether there is any dictionary for that target which contains a translation to the target language, and use that;
      i. Store the putative translation in the source/target dictionary;
   b. If not, repeat the process with the broader terms of the original OOV word, repeating from step 2a;
   c. Finally, if no translation can be found, leave it as it stands.
Step 2c would probably be the subject of some sort of manual post-editing process, which could subsequently be used to expand the existing dictionary.
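A minimal sketch of the lookup cascade in steps 1 and 2 is given below. The dictionaries, the pivot languages and the broader-term table are invented for illustration, and real entries would of course carry sense distinctions rather than single equivalents.

```python
# Toy resources, invented for illustration: bilingual dictionaries keyed by
# (source, target) language pair, and a broader-term lookup from a monolingual thesaurus.
DICTIONARIES = {
    ("en", "fr"): {"drug": "médicament", "treatment": "traitement"},
    ("en", "de"): {"psoriasis": "Schuppenflechte"},
    ("de", "fr"): {"Schuppenflechte": "psoriasis"},
}
BROADER = {"tacalcitol": "drug"}  # thesaurus: OOV term -> broader term

def translate_token(token, source, target):
    """Translate one token, falling back to pivot dictionaries and broader terms."""
    direct = DICTIONARIES.get((source, target), {})
    if token in direct:                                   # step 1: direct lookup
        return direct[token]
    for (src, pivot), entries in DICTIONARIES.items():    # step 2a: other source dictionaries
        if src == source and token in entries:
            via = DICTIONARIES.get((pivot, target), {}).get(entries[token])
            if via:
                direct[token] = via                       # step 2a.i: store putative translation
                return via
    if token in BROADER:                                  # step 2b: retry with a broader term
        return translate_token(BROADER[token], source, target)
    return token                                          # step 2c: leave untranslated

print(translate_token("psoriasis", "en", "fr"))   # via the en->de and de->fr dictionaries
print(translate_token("tacalcitol", "en", "fr"))  # via its broader term "drug"
```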
Of course, inferred dictionary entries would need to be quality checked and tuned, presumably by some combination of comparison and reconciliation with parallel corpora and manual intervention. In order to recognise the effects of language change over time, the dictionary entries and thesauri would also need to track the times for which particular entries apply; this information might come mainly from the dates of the source texts. It is the gradual induction of changing and growing thesauri and bilingual dictionaries which distinguishes this proposal from standard models of statistical machine translation.
Of course this omits several of the key elements of a true high-quality machine translation system: word order or syntax and the related morphology. It goes beyond
the scope of this paper to try to consider the contribution Wilks made to these elements, and how one might go about constructing a modern Wilksian system capable of translating running text. In addition it relies on a model of language which looks no further than the tokens of the written language itself and not really to the world of human discourse and understanding: it does not embody a response to Zuidervaart’s critique. The real question is the extent to which one needs this level of sophistication to translate instruction manuals or bureaucratic regulations.
13.9 Conclusions

If one casts one's mind back to the environment in which Wilks did his PhD, one of the most striking things about it is the extraordinary ambition of the work, given, for example, the puny computers then available. One is also struck by the farsightedness of much of the work. For example, the Theory of Clumps (originated by Roger Needham and Frederick Parker-Rhodes at the Cambridge Language Research Unit where Wilks did his PhD work) is an unsupervised machine learning algorithm in all but name, and it is clear from Karen Spärck Jones (1986) that the need for what we would now call machine learning was appreciated by the early 1960s by Wilks and his colleagues in the CLRU (see Wilks and Tait, 2005, for a review). One rather suspects, however, that they did not imagine, working as they were less than 20 years after the invention of the programmable digital computer, that it would take a further half century before we had the power and technology necessary to contemplate building the kind of machine translation system they were working towards.
Acknowledgements

We would like to thank Eric Atwell of the University of Leeds for supplying a fragment of the Lispified version of LDOCE and Chris Stokoe of the University of Sunderland for advice on extracting information from WordNet, both for the examples in Section 13.2. We would also like to thank Karen Spärck Jones for pointing us towards information about work on the NUDE interlingua at CLRU.
References

Agirre, E., O.L. de Lacalle and D. Martinez. 2005. Exploring Feature Spaces with SVD and Unlabeled Data for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP'05), Borovets, Bulgaria.
Allen, J. 1995. Natural Language Understanding. Redwood City, CA: Benjamin Cummings.
Argaw, A.A. 2005. Word Sense Discrimination in Query Translation. www.dsv.su.se/~atelach/Stat/termpaperSTAT.pdf
Atkins, B. and B. Levin. 1991. Admitting Impediments. In Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, ed. U. Zernik, Hillsdale, New Jersey: Lawrence Erlbaum, 233–262.
Biber, D. 1993. Co-occurrence Patterns Among Collocations: A Tool for Corpus-Based Lexical Knowledge Acquisition. Computational Linguistics 19(3): 531–538.
Boguraev, B. and T. Briscoe (eds.) 1989. Computational Lexicography for Natural Language Processing. Longman.
Boguraev, B.K. and K. Spärck Jones. 1981. A Natural Language Analyser for Database Access. Information Technology: Research and Development 1: 23–39.
Brown, P.F., S.A. Della Pietra, V.J. Della Pietra and R.L. Mercer. 1991. Word-Sense Disambiguation Using Statistical Methods. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL 91), 18–21 June 1991, University of California, Berkeley, California, 264–270.
Carter, D. 1987. Interpreting Anaphors in Natural Language Text. Chichester, UK: Ellis Horwood.
Chen, H.-C., T. Yim, D. Fye and B. Schatz. 1995. Automatic Thesaurus Generation for an Electronic Community System. Journal of the American Society for Information Science 46(3): 175–193.
Church, K.W. and P. Hanks. 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16(1): 22–29.
Cowie, J., J. Guthrie and L. Guthrie. 1992. Lexical Disambiguation Using Simulated Annealing. COLING 92: 359–365, Nantes, France.
Daelemans, W., J. Zavrel, P. Berck and S. Gillis. 1996. MBT: A Memory-Based Part of Speech Tagger-Generator. Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, 14–27.
Dagan, I., A. Itai and U. Schwall. 1991. Two Languages are More Informative than One. In Proceedings of the 29th Annual Meeting of the ACL, 130–137.
Dempster, A.P., N.M. Laird and D.B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B 39: 1–38.
Fellbaum, C. 1998. WordNet, an Electronic Lexical Database. Cambridge, MA: MIT Press.
Fodor, J.A. 1998. Concepts. Oxford University Press.
Gale, W., K.W. Church and D. Yarowsky. 1992. A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities 26(5–6): 415–439.
Gardner, M. (ed.) 1970. The Annotated Alice: Lewis Carroll. 2nd Edition. Harmondsworth, Middlesex, UK: Penguin.
Garside, R., G. Leech and G. Sampson (eds.) 1987. The Computational Analysis of English, a Corpus-Based Approach. London: Longman.
Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Boston, MA: Kluwer Academic Publishers.
Hazewinkel, M. 1996. Bipartite Graphs and Automatic Generation of Thesauri. ERCIM News.
Hearst, M.A. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of COLING, 539–545.
Hodge, V.J. and J. Austin. 2002. Hierarchical Word Clustering – Automatic Thesaurus Generation. Neurocomputing 48: 819–846.
Kennedy, C. and B. Boguraev. 1996. Anaphora for Everyone: Pronominal Anaphora Resolution Without a Parser. Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), Copenhagen, 113–118.
Kilgarriff, A. 1997. I Don't Believe in Word Senses. Computers and the Humanities 31(2): 91–113.
Kilgarriff, A. and M. Palmer (guest eds.). 2000. Special Issue on SENSEVAL. Computers and the Humanities 34: 127–134.
Lakoff, G. 1972. Linguistics and Natural Logic. In D. Davidson and G. Harman (eds.), Semantics and Natural Language. Dordrecht: D. Reidel.
Lesk, M. 1986. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. Proceedings of the 1986 SIGDOC Conference, 24–26.
Mihalcea, R. 2004. Co-training and Self-training for Word Sense Disambiguation. Proceedings of the Conference on Natural Language Learning (CoNLL 2004), Boston.
Mitkov, R. 2003. Anaphora Resolution. In: R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford, UK: Oxford University Press.
Papineni, K., S. Roukos, T. Ward and W.-J. Zhu. 2001. BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), Computer Science.
Peat, H. and P. Willett. 1991. The Limitations of Co-occurrence Data for Query Expansion in Document Retrieval Systems. JASIS, 378–383.
Pereira, F., N. Tishby and L. Lee. 1993. Distributional Clustering of English Words. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 93), Columbus, OH.
Pulman, S.G. 2005. Lexical Decomposition. In J.I. Tait (ed.), Charting a New Course: Natural Language Processing and Information Retrieval – Essays in Honour of Karen Spärck Jones. Springer.
Roget, P.M. 1852. Introduction to Thesaurus of English Words and Phrases.
Roussinov, D.G. and H. Chen. 1998. A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation. Communication, Cognition and Artificial Intelligence, Spring 1998. http://dlist.sir.arizona.edu/460/01/A%5FScalable-98.htm
Ryu, P.-M. and K.-S. Choi. 2005. An Information Theoretic Approach to Taxonomy Extraction for Ontology Learning. In: P. Buitelaar, P. Cimiano and B. Magnini (eds.), Ontology Learning from Text: Methods, Evaluation and Applications. Amsterdam: IOS Press, 15–28.
Sabou, M., C. Wroe, C. Goble and G. Mishne. 2005. Learning Domain Ontologies for Web Service Descriptions: An Experiment in Bioinformatics. Proceedings of the 14th International Conference on the World Wide Web (WWW '05), Chiba, Japan, May 2005, 190–198.
Salton, G. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall.
Sanderson, M. 2000. Retrieving with Good Sense. Information Retrieval 2(1): 45–65.
Scannell, K.P. 2003. Automatic Thesaurus Generation for Minority Languages: An Irish Example. TALN 2003, Batz-sur-Mer. Saint Louis University.
Schulte im Walde, S. 2007. The Induction of Verb Frames and Verb Classes from Corpora. To appear as Chapter 61 in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter.
Schütze, H. and J.O. Pederson. 1995. IR Based on Word Senses. In Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, 161–175.
Schütze, H. and J.O. Pederson. 1997. A Co-occurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management 33(3): 307–318.
Shann, P. Machine Translation: A Problem of Linguistic Engineering or of Cognitive Modelling? In: Margaret King (ed.), Machine Translation Today: The State of the Art. Proceedings of the 3rd Lugano Tutorial, Lugano, Switzerland, 2–7 April 1984. Edinburgh University Press, 71–90.
Spärck Jones, K. 1986. Synonymy and Semantic Classification. Edinburgh University Press.
Spärck Jones, K. 2000. R.H. Richens: Translation in the NUDE. In: W.J. Hutchins (ed.), Early Years in Machine Translation. Amsterdam: John Benjamins, 263–278.
Stevenson, M. and Y. Wilks. 2001. Interaction of Knowledge Sources in Word Sense Disambiguation. Computational Linguistics 27(3): 321–349.
Sussna, M. 1993. Word Sense Disambiguation for Free-Text Indexing Using a Massive Semantic Network. Proceedings of the Second International Conference on Information and Knowledge Management (CIKM '93), Arlington, Virginia, November 1993.
Tait, J.I. (ed.) 2005. Charting a New Course: Natural Language Processing and Information Retrieval: Essays in Honour of Karen Spärck Jones. Springer.
Thomas, J. and A. Wilson. 1996. Methodologies for Studying a Corpus of Doctor-Patient Interaction. In: J. Thomas and M. Short (eds.), Using Corpora for Language Research. Harlow: Longman, 92–109.
Voorhees, E. 1993. Using WordNet to Disambiguate Word Senses for Text Retrieval. ACM SIGIR '93, Pittsburgh, PA, USA, 171–180.
Wilks, Y. 1964. Text Searching with Templates. Cambridge Language Research Unit Memo ML156. Reproduced in the companion volume to this book.
Wilks, Y. 1971. The Stanford Machine Translation Project. In: R. Rushton (ed.), Natural Language Processing. New York, NY, USA: Algorithmics Press.
Wilks, Y. 1973. An Artificial Intelligence Approach to Machine Translation. Chapter 2 in R. Schank and K.M. Colby (eds.), Computer Models of Thought and Language. San Francisco: W.H. Freeman and Co., 114–151.
Wilks, Y. 1975a. An Intelligent Analyzer and Understander of English. Communications of the ACM 18(5). Reproduced in the companion volume to this book.
Wilks, Y. 1975b. A Preferential, Pattern-Seeking, Semantics for Natural Language Inference. Artificial Intelligence 6. Reproduced in the companion volume to this book.
Wilks, Y. 1975c. Seven Theses on Artificial Intelligence and Natural Language. Working Paper No. 17, Fondazione Dalle Molle per gli studi linguistici e di comunicazione internazionale.
Wilks, Y. 1976. Parsing English II. In: E. Charniak and Y. Wilks (eds.), Computational Semantics. Amsterdam: North-Holland, 155–185.
Wilks, Y.A. 1980. Frames, Semantics and Novelty. In Metzing (ed.), Parsing Natural Language, 219–246.
Wilks, Y., D. Fass, C.-M. Guo, J. McDonald, T. Plate and B. Slator. 1989. A Tractable Machine Dictionary as a Resource for Computational Semantics. In: Boguraev and Briscoe (eds.).
Wilks, Y.A., B.M. Slator and L.M. Guthrie. 1996. Electric Words: Dictionaries, Computers and Meanings. Cambridge, MA, USA: MIT Press.
Wilks, Y. 1997. Senses and Texts. Computers and the Humanities 31: 77–90. Reproduced in the companion volume to this book.
Wilks, Y. and M. Stevenson. 1996. The Grammar of Sense: Is Word-Sense Tagging Much More than Part-of-Speech Tagging? Technical Report CS-96-05, University of Sheffield.
Wilks, Y.A. and J.I. Tait. 2005. A Retrospective View of Synonymy and Semantic Classification. In Tait (2005), 1–11.
Winograd, T. and F. Flores. 1986. Understanding Computers and Cognition: A New Foundation for Design. Norwood, NJ, USA: Ablex Publishing.
Yarowsky, D. 1992. Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings of COLING, Nantes, France, 454–460.
Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivalling Supervised Methods. Proceedings of the 33rd Annual Meeting of the ACL (ACL-95), Cambridge, MA, 189–196.
Zuidervaart, L. 2004. Artistic Truth: Aesthetics, Discourse and Imaginative Disclosure. Cambridge University Press.