Human-Computer Interaction. New Trends: 13th International Conference, HCI International 2009, San Diego, CA, USA, July 19-24, 2009, Proceedings, Part I (Lecture Notes in Computer Science: Information Systems and Applications, incl. Internet/Web, and HCI)
Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5610
Julie A. Jacko (Ed.)
Human-Computer Interaction New Trends 13th International Conference, HCI International 2009 San Diego, CA, USA, July 19-24, 2009 Proceedings, Part I
Volume Editor
Julie A. Jacko
University of Minnesota, Institute of Health Informatics
MMC 912, 420 Delaware Street S.E., Minneapolis, MN 55455, USA
E-mail: [email protected]
Library of Congress Control Number: 2009929048
CR Subject Classification (1998): H.5, I.3, I.7.5, I.5, I.2.10
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-02573-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-02573-0 Springer Berlin Heidelberg New York
Foreword

The 13th International Conference on Human–Computer Interaction, HCI International 2009, was held in San Diego, California, USA, July 19–24, 2009, jointly with the Symposium on Human Interface (Japan) 2009, the 8th International Conference on Engineering Psychology and Cognitive Ergonomics, the 5th International Conference on Universal Access in Human–Computer Interaction, the Third International Conference on Virtual and Mixed Reality, the Third International Conference on Internationalization, Design and Global Development, the Third International Conference on Online Communities and Social Computing, the 5th International Conference on Augmented Cognition, the Second International Conference on Digital Human Modeling, and the First International Conference on Human Centered Design. A total of 4,348 individuals from academia, research institutes, industry and governmental agencies from 73 countries submitted contributions, and 1,397 papers that were judged to be of high scientific quality were included in the program. These papers address the latest research and development efforts and highlight the human aspects of design and use of computing systems. The papers accepted for presentation thoroughly cover the entire field of human–computer interaction, addressing major advances in the knowledge and effective use of computers in a variety of application areas. This volume, edited by Julie A. Jacko, contains papers in the thematic area of Human–Computer Interaction, addressing the following major topics:
• Novel Techniques for Measuring and Monitoring
• Evaluation Methods, Techniques and Tools
• User Studies
• User Interface Design
• Development Approaches, Methods and Tools
The remaining volumes of the HCI International 2009 proceedings are:
• Volume 2, LNCS 5611, Human–Computer Interaction––Novel Interaction Methods and Techniques (Part II), edited by Julie A. Jacko
• Volume 3, LNCS 5612, Human–Computer Interaction––Ambient, Ubiquitous and Intelligent Interaction (Part III), edited by Julie A. Jacko
• Volume 4, LNCS 5613, Human–Computer Interaction––Interacting in Various Application Domains (Part IV), edited by Julie A. Jacko
• Volume 5, LNCS 5614, Universal Access in Human–Computer Interaction––Addressing Diversity (Part I), edited by Constantine Stephanidis
• Volume 6, LNCS 5615, Universal Access in Human–Computer Interaction––Intelligent and Ubiquitous Interaction Environments (Part II), edited by Constantine Stephanidis
• Volume 7, LNCS 5616, Universal Access in Human–Computer Interaction––Applications and Services (Part III), edited by Constantine Stephanidis
• Volume 8, LNCS 5617, Human Interface and the Management of Information––Designing Information Environments (Part I), edited by Michael J. Smith and Gavriel Salvendy
• Volume 9, LNCS 5618, Human Interface and the Management of Information––Information and Interaction (Part II), edited by Gavriel Salvendy and Michael J. Smith
• Volume 10, LNCS 5619, Human Centered Design, edited by Masaaki Kurosu
• Volume 11, LNCS 5620, Digital Human Modeling, edited by Vincent G. Duffy
• Volume 12, LNCS 5621, Online Communities and Social Computing, edited by A. Ant Ozok and Panayiotis Zaphiris
• Volume 13, LNCS 5622, Virtual and Mixed Reality, edited by Randall Shumaker
• Volume 14, LNCS 5623, Internationalization, Design and Global Development, edited by Nuray Aykin
• Volume 15, LNCS 5624, Ergonomics and Health Aspects of Work with Computers, edited by Ben-Tzion Karsh
• Volume 16, LNAI 5638, The Foundations of Augmented Cognition: Neuroergonomics and Operational Neuroscience, edited by Dylan Schmorrow, Ivy Estabrooke and Marc Grootjen
• Volume 17, LNAI 5639, Engineering Psychology and Cognitive Ergonomics, edited by Don Harris
I would like to thank the Program Chairs and the members of the Program Boards of all thematic areas, listed below, for their contribution to the highest scientific quality and the overall success of HCI International 2009.
Ergonomics and Health Aspects of Work with Computers
Program Chair: Ben-Tzion Karsh
Arne Aarås, Norway; Pascale Carayon, USA; Barbara G.F. Cohen, USA; Wolfgang Friesdorf, Germany; John Gosbee, USA; Martin Helander, Singapore; Ed Israelski, USA; Waldemar Karwowski, USA; Peter Kern, Germany; Danuta Koradecka, Poland; Kari Lindström, Finland; Holger Luczak, Germany; Aura C. Matias, Philippines; Kyung (Ken) Park, Korea; Michelle M. Robertson, USA; Michelle L. Rogers, USA; Steven L. Sauter, USA; Dominique L. Scapin, France; Naomi Swanson, USA; Peter Vink, The Netherlands; John Wilson, UK; Teresa Zayas-Cabán, USA
Human Interface and the Management of Information
Program Chair: Michael J. Smith
Gunilla Bradley, Sweden; Hans-Jörg Bullinger, Germany; Alan Chan, Hong Kong; Klaus-Peter Fähnrich, Germany; Michitaka Hirose, Japan; Jhilmil Jain, USA; Yasufumi Kume, Japan; Mark Lehto, USA; Fiona Fui-Hoon Nah, USA; Shogo Nishida, Japan; Robert Proctor, USA; Youngho Rhee, Korea; Anxo Cereijo Roibás, UK; Katsunori Shimohara, Japan; Dieter Spath, Germany; Tsutomu Tabe, Japan; Alvaro D. Taveira, USA; Kim-Phuong L. Vu, USA; Tomio Watanabe, Japan; Sakae Yamamoto, Japan; Hidekazu Yoshikawa, Japan; Li Zheng, P.R. China; Bernhard Zimolong, Germany
Human–Computer Interaction
Program Chair: Julie A. Jacko
Sebastiano Bagnara, Italy; Sherry Y. Chen, UK; Marvin J. Dainoff, USA; Jianming Dong, USA; John Eklund, Australia; Xiaowen Fang, USA; Ayse Gurses, USA; Vicki L. Hanson, UK; Sheue-Ling Hwang, Taiwan; Wonil Hwang, Korea; Yong Gu Ji, Korea; Steven Landry, USA; Gitte Lindgaard, Canada; Chen Ling, USA; Yan Liu, USA; Chang S. Nam, USA; Celestine A. Ntuen, USA; Philippe Palanque, France; P.L. Patrick Rau, P.R. China; Ling Rothrock, USA; Guangfeng Song, USA; Steffen Staab, Germany; Wan Chul Yoon, Korea; Wenli Zhu, P.R. China
Engineering Psychology and Cognitive Ergonomics
Program Chair: Don Harris
Guy A. Boy, USA; John Huddlestone, UK; Kenji Itoh, Japan; Hung-Sying Jing, Taiwan; Ron Laughery, USA; Wen-Chin Li, Taiwan; James T. Luxhøj, USA; Nicolas Marmaras, Greece; Sundaram Narayanan, USA; Mark A. Neerincx, The Netherlands; Jan M. Noyes, UK; Kjell Ohlsson, Sweden; Axel Schulte, Germany; Sarah C. Sharples, UK; Neville A. Stanton, UK; Xianghong Sun, P.R. China; Andrew Thatcher, South Africa; Matthew J.W. Thomas, Australia; Mark Young, UK
Universal Access in Human–Computer Interaction
Program Chair: Constantine Stephanidis
Julio Abascal, Spain; Ray Adams, UK; Elisabeth André, Germany; Margherita Antona, Greece; Chieko Asakawa, Japan; Christian Bühler, Germany; Noelle Carbonell, France; Jerzy Charytonowicz, Poland; Pier Luigi Emiliani, Italy; Michael Fairhurst, UK; Dimitris Grammenos, Greece; Andreas Holzinger, Austria; Arthur I. Karshmer, USA; Simeon Keates, Denmark; Georgios Kouroupetroglou, Greece; Sri Kurniawan, USA; Patrick M. Langdon, UK; Seongil Lee, Korea; Zhengjie Liu, P.R. China; Klaus Miesenberger, Austria; Helen Petrie, UK; Michael Pieper, Germany; Anthony Savidis, Greece; Andrew Sears, USA; Christian Stary, Austria; Hirotada Ueda, Japan; Jean Vanderdonckt, Belgium; Gregg C. Vanderheiden, USA; Gerhard Weber, Germany; Harald Weber, Germany; Toshiki Yamaoka, Japan; Panayiotis Zaphiris, UK
Virtual and Mixed Reality
Program Chair: Randall Shumaker
Pat Banerjee, USA; Mark Billinghurst, New Zealand; Charles E. Hughes, USA; David Kaber, USA; Hirokazu Kato, Japan; Robert S. Kennedy, USA; Young J. Kim, Korea; Ben Lawson, USA; Gordon M. Mair, UK; Miguel A. Otaduy, Switzerland; David Pratt, UK; Albert "Skip" Rizzo, USA; Lawrence Rosenblum, USA; Dieter Schmalstieg, Austria; Dylan Schmorrow, USA; Mark Wiederhold, USA
Internationalization, Design and Global Development
Program Chair: Nuray Aykin
Michael L. Best, USA; Ram Bishu, USA; Alan Chan, Hong Kong; Andy M. Dearden, UK; Susan M. Dray, USA; Vanessa Evers, The Netherlands; Paul Fu, USA; Emilie Gould, USA; Sung H. Han, Korea; Veikko Ikonen, Finland; Esin Kiris, USA; Masaaki Kurosu, Japan; Apala Lahiri Chavan, USA; James R. Lewis, USA; Ann Light, UK; James J.W. Lin, USA; Rungtai Lin, Taiwan; Zhengjie Liu, P.R. China; Aaron Marcus, USA; Allen E. Milewski, USA; Elizabeth D. Mynatt, USA; Oguzhan Ozcan, Turkey; Girish Prabhu, India; Kerstin Röse, Germany; Eunice Ratna Sari, Indonesia; Supriya Singh, Australia; Christian Sturm, Spain; Adi Tedjasaputra, Singapore; Kentaro Toyama, India; Alvin W. Yeo, Malaysia; Chen Zhao, P.R. China; Wei Zhou, P.R. China
Online Communities and Social Computing
Program Chairs: A. Ant Ozok, Panayiotis Zaphiris
Chadia N. Abras, USA; Chee Siang Ang, UK; Amy Bruckman, USA; Peter Day, UK; Fiorella De Cindio, Italy; Michael Gurstein, Canada; Tom Horan, USA; Anita Komlodi, USA; Piet A.M. Kommers, The Netherlands; Jonathan Lazar, USA; Stefanie Lindstaedt, Austria; Gabriele Meiselwitz, USA; Hideyuki Nakanishi, Japan; Anthony F. Norcio, USA; Jennifer Preece, USA; Elaine M. Raybourn, USA; Douglas Schuler, USA; Gilson Schwartz, Brazil; Sergei Stafeev, Russia; Charalambos Vrasidas, Cyprus; Cheng-Yen Wang, Taiwan
Augmented Cognition
Program Chair: Dylan D. Schmorrow
Andy Bellenkes, USA; Andrew Belyavin, UK; Joseph Cohn, USA; Martha E. Crosby, USA; Tjerk de Greef, The Netherlands; Blair Dickson, UK; Traci Downs, USA; Julie Drexler, USA; Ivy Estabrooke, USA; Cali Fidopiastis, USA; Chris Forsythe, USA; Wai Tat Fu, USA; Henry Girolamo, USA; Marc Grootjen, The Netherlands; Taro Kanno, Japan; Wilhelm E. Kincses, Germany; David Kobus, USA; Santosh Mathan, USA; Rob Matthews, Australia; Dennis McBride, USA; Robert McCann, USA; Jeff Morrison, USA; Eric Muth, USA; Mark A. Neerincx, The Netherlands; Denise Nicholson, USA; Glenn Osga, USA; Dennis Proffitt, USA; Leah Reeves, USA; Mike Russo, USA; Kay Stanney, USA; Roy Stripling, USA; Mike Swetnam, USA; Rob Taylor, UK; Maria L. Thomas, USA; Peter-Paul van Maanen, The Netherlands; Karl van Orden, USA; Roman Vilimek, Germany; Glenn Wilson, USA; Thorsten Zander, Germany
Digital Human Modeling
Program Chair: Vincent G. Duffy
Karim Abdel-Malek, USA; Thomas J. Armstrong, USA; Norm Badler, USA; Kathryn Cormican, Ireland; Afzal Godil, USA; Ravindra Goonetilleke, Hong Kong; Anand Gramopadhye, USA; Sung H. Han, Korea; Lars Hanson, Sweden; Pheng Ann Heng, Hong Kong; Tianzi Jiang, P.R. China; Kang Li, USA; Zhizhong Li, P.R. China; Timo J. Määttä, Finland; Woojin Park, USA; Matthew Parkinson, USA; Jim Potvin, Canada; Rajesh Subramanian, USA; Xuguang Wang, France; John F. Wiechel, USA; Jingzhou (James) Yang, USA; Xiu-gan Yuan, P.R. China
Human Centered Design
Program Chair: Masaaki Kurosu
Gerhard Fischer, USA; Tom Gross, Germany; Naotake Hirasawa, Japan; Yasuhiro Horibe, Japan; Minna Isomursu, Finland; Mitsuhiko Karashima, Japan; Tadashi Kobayashi, Japan; Kun-Pyo Lee, Korea; Loïc Martínez-Normand, Spain; Dominique L. Scapin, France; Haruhiko Urokohara, Japan; Gerrit C. van der Veer, The Netherlands; Kazuhiko Yamazaki, Japan
In addition to the members of the Program Boards above, I also wish to thank the following volunteer external reviewers: Gavin Lew from the USA, Daniel Su from the UK, and Ilia Adami, Ioannis Basdekis, Yannis Georgalis, Panagiotis Karampelas, Iosif Klironomos, Alexandros Mourouzis, and Stavroula Ntoa from Greece. This conference would not have been possible without the continuous support and advice of the Conference Scientific Advisor, Prof. Gavriel Salvendy, as well as the dedicated work and outstanding efforts of the Communications Chair and Editor of HCI International News, Abbas Moallem.
I would also like to thank the members of the Human–Computer Interaction Laboratory of ICS-FORTH, and in particular Margherita Antona, George Paparoulis, Maria Pitsoulaki, Stavroula Ntoa, and Maria Bouhli, for their contribution toward the organization of the HCI International 2009 conference.

Constantine Stephanidis
HCI International 2011
The 14th International Conference on Human–Computer Interaction, HCI International 2011, will be held jointly with the affiliated conferences in the summer of 2011. It will cover a broad spectrum of themes related to human–computer interaction, including theoretical issues, methods, tools, processes and case studies in HCI design, as well as novel interaction techniques, interfaces and applications. The proceedings will be published by Springer. More information about the topics, as well as the venue and dates of the conference, will be announced through the HCI International Conference series website: http://www.hci-international.org/
General Chair
Professor Constantine Stephanidis
University of Crete and ICS-FORTH
Heraklion, Crete, Greece
Email: [email protected]
Table of Contents

Toward EEG Sensing of Imagined Speech . . . 40
   Michael D'Zmura, Siyi Deng, Tom Lappas, Samuel Thorpe, and Ramesh Srinivasan
Monitoring and Processing of the Pupil Diameter Signal for Affective Assessment of a Computer User . . . 49
   Ying Gao, Armando Barreto, and Malek Adjouadi
Usability Evaluation by Monitoring Physiological and Other Data Simultaneously with a Time-Resolution of Only a Few Seconds . . . 59
   Károly Hercegfi, Márton Pászti, Sarolta Tóvölgyi, and Lajos Izsó
Study of Human Anxiety on the Internet . . . 69
   Santosh Kumar Kalwar and Kari Heikkinen
The Research on Adaptive Process for Emotion Recognition by Using Time-Dependent Parameters of Autonomic Nervous Response
   Jonghwa Kim, Mincheol Whang, and Jincheol Woo
Automated Analysis of Eye-Tracking Data for the Evaluation of Driver Information Systems According to ISO/TS 15007-2:2001 . . . 105
   Christian Lange, Martin Wohlfarter, and Heiner Bubb
Brain Response to Good and Bad Design . . . 111
   Haeinn Lee, Jungtae Lee, and Ssanghee Seo
An Analysis of Eye Movements during Browsing Multiple Search Results Pages . . . 121
   Yuko Matsuda, Hidetake Uwano, Masao Ohira, and Ken-ichi Matsumoto
Development of Estimation System for Concentrate Situation Using Acceleration Sensor . . . 131
   Masashi Okubo and Aya Fujimura
Psychophysiology as a Tool for HCI Research: Promises and Pitfalls . . . 141
   Byungho Park
Assessing NeuroSky's Usability to Detect Attention Levels in an Assessment Exercise . . . 149
   Genaro Rebolledo-Mendez, Ian Dunwell, Erika A. Martínez-Mirón, María Dolores Vargas-Cerdán, Sara de Freitas, Fotis Liarokapis, and Alma R. García-Gaona
Effect of Body Movement on Music Expressivity in Jazz Performances . . . 159
   Mamiko Sakata, Sayaka Wakamiya, Naoki Odaka, and Kozaburo Hachimura
A Method to Monitor Operator Overloading . . . 169
   Dvijesh Shastri, Ioannis Pavlidis, and Avinash Wesley
Decoding Attentional Orientation from EEG Spectra . . . 176
   Ramesh Srinivasan, Samuel Thorpe, Siyi Deng, Tom Lappas, and Michael D'Zmura
On the Possibility about Performance Estimation Just before Beginning a Voluntary Motion Using Movement Related Cortical Potential
   Satoshi Suzuki, Takemi Matsui, Yusuke Sakaguchi, Kazuhiro Ando, Nobuyuki Nishiuchi, Toshimasa Yamazaki, and Shin'ichi Fukuzumi
Evaluation of User-Interfaces for Mobile Application Development Environments . . . 204
   Florence Balagtas-Fernandez and Heinrich Hussmann
User-Centered Design and Evaluation – The Big Picture . . . 214
   Victoria Bellotti, Shin'ichi Fukuzumi, Toshiyuki Asahi, and Shunsuke Suzuki
Web-Based System Development for Usability Evaluation of Ubiquitous Computing Device . . . 224
   Jong Kyu Choi, Han Joon Kim, Beom Suk Jin, and Yonggu Ji
Evaluating Mobile Usability: The Role of Fidelity in Full-Scale Laboratory Simulations with Mobile ICT for Hospitals . . . 232
   Yngve Dahl, Ole Andreas Alsos, and Dag Svanæs
A Multidimensional Approach for the Evaluation of Mobile Application User Interfaces . . . 242
   José Eustáquio Rangel de Queiroz and Danilo de Sousa Ferreira
Development of Quantitative Usability Evaluation Method . . . 252
   Shin'ichi Fukuzumi, Teruya Ikegami, and Hidehiko Okada
Reference Model for Quality Assurance of Speech Applications . . . 259
   Cornelia Hipp and Matthias Peissner
Toward Cognitive Modeling for Predicting Usability . . . 267
   Bonnie E. John and Shunsuke Suzuki
Webjig: An Automated User Data Collection System for Website Usability Evaluation . . . 277
   Mikio Kiura, Masao Ohira, and Ken-ichi Matsumoto
ADiEU: Toward Domain-Based Evaluation of Spoken Dialog Systems . . . 287
   Jan Kleindienst, Jan Cuřín, and Martin Labský
Interpretation of User Evaluation for Emotional Speech Synthesis System . . . 295
   Ho-Joon Lee and Jong C. Park
Multi-level Validation of the ISOmetrics Questionnaire Based on Qualitative and Quantitative Data Obtained from a Conventional Usability Test . . . 304
   Jan-Paul Leuteritz, Harald Widlroither, and Michael Klüh
What Do Users Really Do? Experience Sampling in the 21st Century . . . 314
   Gavin S. Lew
Evaluating Usability-Supporting Architecture Patterns: Reactions from Usability Professionals . . . 320
   Edgardo Luzcando, Davide Bolchini, and Anthony Faiola
Heuristic Evaluations of Bioinformatics Tools: A Development Case
   Barbara Mirel and Zach Wright
A Prototype to Validate ErgoCoIn: A Web Site Ergonomic Inspection Technique
   Marcelo Morandini, Walter de Abreu Cybis, and Dominique L. Scapin
What Do Users Want to See? A Content Preparation Study for Consumer Electronics . . . 413
   Yinni Guo, Robert W. Proctor, and Gavriel Salvendy
"I Love My iPhone... But There Are Certain Things That 'Niggle' Me" . . . 421
   Anna Haywood and Gemma Boguslawski
Acceptance of Future Technologies Using Personal Data: A Focus Group with Young Internet Users . . . 431
   Fabian Hermann, Doris Janssen, Daniel Schipke, and Andreas Schuller
Analysis of Breakdowns in Menu-Based Interaction Based on Information Scent Model . . . 438
   Yukio Horiguchi, Hiroaki Nakanishi, Tetsuo Sawaragi, and Yuji Kuroda
E-Shopping Behavior and User-Web Interaction for Developing a Useful Green Website . . . 446
   Fei-Hui Huang, Ying-Lien Lee, and Sheue-Ling Hwang
Interaction Comparison among Media Internet Genre . . . 455
   Sang Hee Kweon, Eun Joung Cho, and Ae Jin Cho
Comparing the Usability of the Icons and Functions between IE6.0 and IE7.0 . . . 465
   Chiuhsiang Joe Lin, Min-Chih Hsieh, Hui-Chi Yu, Ping-Jung Tsai, and Wei-Jung Shiang
Goods-Finding and Orientation in the Elderly on 3D Virtual Store Interface: The Impact of Classification and Landmarks
   Cheng-Li Liu, Shiaw-Tsyr Uang, and Chen-Hao Chang
Designing for Change: Engineering Adaptable and Adaptive User Interaction by Focusing on User Goals
   Bruno S. da Silva, Ariane M. Bueno, and Simone D.J. Barbosa
Enabling Interactive Access to Web Tables . . . 760
   Xin Yang, Wenchang Xu, and Yuanchun Shi
Integration of Creativity into Website Design . . . 769
   Liang Zeng, Robert W. Proctor, and Gavriel Salvendy
Part V: Development Approaches, Methods and Tools

YVision: A General Purpose Software Composition Framework . . . 779
   Antão Almada, Gonçalo Lopes, André Almeida, João Frazão, and Nuno Cardoso
Collaborative Development and New Devices for Human-Computer Interaction . . . 789
   Hans-Jörg Bullinger and Gunnar Brink
Orchestration Modeling of Interactive Systems . . . 796
   Bertrand David and René Chalon
An Exploration of Perspective Changes within MBD . . . 806
   Anke Dittmar and Peter Forbrig
Rapid Development of Scoped User Interfaces . . . 816
   Denis Dubé, Jacob Beard, and Hans Vangheluwe
PaMGIS: A Framework for Pattern-Based Modeling and Generation of Interactive Systems . . . 826
   Jürgen Engel and Christian Märtin
People-Oriented Programming: From Agent-Oriented Analysis to the Design of Interactive Systems . . . 836
   Steve Goschnick
Visualization of Software and Systems as Support Mechanism for Integrated Software Project Control . . . 846
   Peter Liggesmeyer, Jens Heidrich, Jürgen Münch, Robert Kalcklösch, Henning Barthel, and Dirk Zeckzer
Collage: A Declarative Programming Model for Compositional Development of Web Applications . . . 856
   Bruce Lucas, Rahul Akolkar, and Charlie Wiecha
Hypernetwork Model to Represent Similarity Details Applied to Musical Instrument Performance . . . 866
   Tetsuya Maeshiro, Midori Maeshiro, Katsunori Shimohara, and Shin-ichi Nakayama
Open Collaborative Development: Trends, Tools, and Tactics . . . 874
   Kathrin M. Moeslein, Angelika C. Bullinger, and Jens Soeldner
Investigating the Run Time Behavior of Distributed Applications by Using Tiny Java Virtual Machines with Wireless Communications
   Tsuyoshi Miyazaki, Takayuki Suzuki, and Fujio Yamamoto
Automatic Method for Measuring Eye Blinks Using Split-Interlaced Images

Kiyohiko Abe¹, Shoichi Ohi², and Minoru Ohyama³

¹ College of Engineering, Kanto Gakuin University, 1-50-1 Mutsuura-higashi, Kanazawa-ku, Yokohama, Kanagawa 236-8501, Japan
² School of Engineering, Tokyo Denki University, 2-2 Kandanishiki-cho, Chiyoda-ku, Tokyo 101-8457, Japan
³ School of Information Environment, Tokyo Denki University, 2-1200 Muzaigakuendai, Inzai-shi, Chiba 270-1382, Japan
[email protected], [email protected], [email protected]
Abstract. We propose a new eye blink detection method that uses NTSC video cameras. This method utilizes split-interlaced images of the eye. These split images are the odd- and even-field images in the NTSC format and are generated from NTSC frames (interlaced images). The proposed method yields a time resolution that is double that of the NTSC format; that is, the detailed temporal change that occurs during the process of eye blinking can be measured. To verify the accuracy of the proposed method, experiments were performed using a high-speed digital video camera, and the results obtained using the NTSC camera were compared with those obtained using the high-speed digital video camera.

Keywords: Eye Blink, Interlaced Image, Natural Light, Image Analysis, High-Speed Camera.
1 Introduction

Our proposed method uses NTSC video cameras. It utilizes split-interlaced images of the eye captured by an NTSC video camera. These split images are the odd- and even-field images in the NTSC format and are generated from NTSC frames (interlaced images). The proposed method yields a time resolution that is twice that of the NTSC format; therefore, the detailed temporal change that occurs during the process of eye blinking can be measured. To verify the accuracy of the proposed method, we performed experiments using a high-speed digital video camera and compared the results obtained using the NTSC camera with those obtained using the high-speed digital video camera. This paper also presents experiments that evaluate the proposed automatic method for measuring eye blinks.
2 Open-Eye Area Extraction Method by Image Analysis

In general, eye blinks are estimated by measuring the open-eye area [2] or on the basis of characteristics of specific moving points between the upper and lower eyelids [3]. Many of these methods utilize image analysis. It is possible to measure the wave pattern of an eye blink if the entire process of the blink is captured [3]. Furthermore, the type of eye blink and/or its velocity can be estimated on the basis of this wave pattern. However, it is difficult to measure the wave patterns of eye blinks using the video cameras commonly employed for this purpose, because the resulting eye images include high noise content owing to changes in lighting conditions.

We have developed a new method for measuring the wave pattern of an eye blink. This method can be used with common indoor lighting sources such as fluorescent lights, and it measures the wave pattern automatically. Hence, our proposed measurement method can be used under a variety of experimental conditions. In this method, the wave pattern is obtained by counting the number of pixels in the open-eye area of the image captured by a video camera. This image is enlarged to capture the detailed eye image.

We proposed an algorithm for extracting the open-eye area in a previous study [4]. It utilizes color information from eye images. We have adapted this algorithm to our proposed method for measuring the wave pattern of eye blinks. The algorithm was originally developed for our eye-gaze input system, in which it compensates for and traces head movement [5]. Furthermore, the algorithm has been used under common indoor sources of light for prolonged periods. Hereafter, we describe in detail our image-processing algorithm for extracting the open-eye area.

2.1 Binarization Using Color Information on Image

Many methods have been developed for skin-color extraction; these methods are primarily focused on facial image processing, including those that utilize color information from a facial image. They mostly determine threshold skin-color values statistically or empirically [6]. We have developed an algorithm for estimating skin-color thresholds automatically. Our algorithm can extract the open-eye area from the eye image on the basis of skin color.
Using our algorithm, the skin-color threshold is determined from the histogram of the color-difference signal ratio Cr/Cb of each pixel, calculated from the YCbCr image transformed from the RGB image. The histogram of the Cr/Cb values has two peaks, indicating the skin area and the open-eye area. The Cr/Cb value at the minimum between the two peaks is designated as the threshold for open-eye area extraction.
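A minimal sketch of this thresholding step follows, assuming OpenCV for the color conversion and NumPy for the histogram. The paper does not specify its peak-finding procedure, so the two largest local maxima of the Cr/Cb histogram are taken here as the skin and open-eye peaks; this is an illustrative assumption, not the authors' exact implementation.

```python
import cv2
import numpy as np

def estimate_cr_cb_threshold(bgr_image, bins=256):
    """Estimate the skin/open-eye threshold from the Cr/Cb histogram.

    The per-pixel Cr/Cb ratio is assumed to form a bimodal histogram;
    the ratio at the minimum between the two peaks is the threshold.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    ratio = cr / np.maximum(cb, 1.0)              # guard against division by zero
    hist, edges = np.histogram(ratio, bins=bins)

    # Two largest local maxima, assumed to be the skin and open-eye peaks.
    peaks = [i for i in range(1, bins - 1)
             if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]]
    if len(peaks) < 2:
        raise ValueError("expected a bimodal Cr/Cb histogram")
    p_lo, p_hi = sorted(sorted(peaks, key=lambda i: hist[i])[-2:])

    # Histogram minimum (valley) between the two peaks.
    valley = p_lo + int(np.argmin(hist[p_lo:p_hi + 1]))
    return 0.5 * (edges[valley] + edges[valley + 1])
```

Pixels whose Cr/Cb ratio falls on the eye side of the returned value would then be labeled as the open-eye area.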
2.2 Binarization by Pattern Matching Method
The method described in Subsection 2.1 can extract the open-eye area almost completely. However, the extraction results sometimes leave deficits around the corner of the eye, because in certain subjects the Cr/Cb value around the eye corner is similar to that of the skin. To resolve this problem, we have developed a method for open-eye extraction without deficits by combining two extraction results. One is the binarized image obtained using color information, as described in Subsection 2.1. The other is a binarized image obtained using light intensity information, which includes the area around the corner of the eye in the extraction result. Binarization using light intensity information utilizes a threshold estimated by a pattern matching method, which determines the matching point by using the color-based binarized image as reference data. Hence, the threshold level is estimated automatically. The original image and the extracted open-eye area are shown in Fig. 1(a) and Fig. 1(b).
Fig. 1. Original eye image (a) and extracted open-eye area (b)
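The paper does not detail how the pattern matching scores candidate intensity thresholds; the sketch below uses pixel-wise agreement with the color-based mask as a plausible matching criterion (an assumption), then takes the union of the two masks to fill the deficits around the eye corner.

```python
import numpy as np

def combined_open_eye_mask(gray, color_mask):
    """Combine color- and intensity-based binarization (Section 2.2 sketch).

    `gray` is an 8-bit grayscale eye image; `color_mask` is the boolean
    open-eye mask obtained from the Cr/Cb threshold of Subsection 2.1.
    """
    best_t, best_score = 0, -1.0
    for t in range(256):
        candidate = gray < t                      # dark pixels taken as eye
        score = np.mean(candidate == color_mask)  # agreement with reference mask
        if score > best_score:
            best_t, best_score = t, score

    intensity_mask = gray < best_t
    return intensity_mask | color_mask            # union fills corner deficits
```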
3 Measurement Method of Wave Patterns of Eye Blinks Using Split-Interlaced Images

Commonly used NTSC video cameras output interlaced images. One interlaced image contains two field images, designated as the odd and even fields. If an NTSC camera captures a fast movement such as an eye blink, there is a great divergence between the captured odd- and even-field images. Therefore, the area around the eyelids in the captured image shows comb-like noise. This phenomenon occurs because the two field images capture different instants of the fast eyelid movement. An example of an interlaced image during eye blinking is shown in Fig. 2. To show this phenomenon clearly, Fig. 2 was captured at low resolution (145 × 80 pixels).
If one interlaced image is split by scanning the even- and odd-numbered lines separately, two field images are generated. Thus, the time resolution of the motion images doubles, but the amount of information in the vertical direction decreases by half. These field images are captured at 60 fields/s, whereas NTSC interlaced moving images are captured at 30 fps; therefore, this method yields a time resolution that is double that available in the NTSC format. The duration of a conscious blink is a few hundred milliseconds; therefore, it is difficult to measure the wave pattern of an eye blink accurately using NTSC cameras directly. However, the detailed wave pattern of an eye blink can be measured using our proposed method. The split-interlaced images are shown in Fig. 3. The two eye images shown in Fig. 3 are enlarged in the vertical direction and were generated from the interlaced image shown in Fig. 2. Our proposed method measures the wave patterns of eye blinks from these images.
Fig. 2. Blinking eye image (interlaced)
Fig. 3. Split-interlaced image generated from Fig. 2
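Splitting a frame into its two fields is a simple line-slicing operation; a sketch with NumPy follows. Which field is temporally first depends on the camera's field order, which is an assumption left to the caller here.

```python
import numpy as np

def split_fields(frame):
    """Split one interlaced frame into its two field images (Section 3).

    Even-numbered scan lines form one field and odd-numbered lines the
    other, giving 60 fields/s from 30 frames/s at half the vertical
    resolution.
    """
    return frame[0::2], frame[1::2]   # lines 0,2,4,... and lines 1,3,5,...

def enlarge_vertically(field):
    """Repeat each scan line so a field matches the frame height (cf. Fig. 3)."""
    return np.repeat(field, 2, axis=0)
```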
4 Evaluation Experiment for Proposed Method

Five subjects participated in the experiment described in Subsection 4.1, and four subjects in the experiment described in Subsection 4.2. The experimental setup includes an NTSC DV camera (for home use), a high-speed digital video camera, and a personal computer (PC). The PC analyzes sequenced eye images captured by the video cameras. The DV camera captures interlaced images at 30 fps, and the high-speed digital video camera captures non-interlaced images at 300 fps. In the experiments performed using these video cameras, the wave pattern of eye blinks is measured from sequenced eye images. The experimental setup is shown in Fig. 4.
Fig. 4. Hardware configuration of experimental system (PC, display, user, and NTSC or high-speed digital video camera)
4.1 Experiment for Eye Blink Measurement Using NTSC Camera

In this experiment, sequenced eye images were captured using the DV camera at 30 fps in NTSC format. In addition, split-interlaced images were generated from these interlaced NTSC images. These split-interlaced images have a time resolution of 60 fields/s. The wave pattern of eye blinks was measured from both the interlaced NTSC images and the split-interlaced images. The binarization threshold for open-eye area extraction was determined automatically from the first field image of the experimental moving images. This threshold was estimated by the method described in Section 2. A typical result from this experiment is shown in Fig. 5.
Fig. 5. Wave patterns of eye blinks measured by DV (30 fps and 60 fps); vertical axis: normalized pixels of open-eye area, horizontal axis: sampling point (1/60 sec)
In Fig. 5, the vertical and horizontal axes indicate the pixels of the open-eye area and the sampling point (interval: 1/60 sec), respectively. To compare the two wave patterns of eye blinks, the plots are normalized using the pixels of the open-eye area in the first field image. The bottoms of the plots indicate the eye-closed condition. Our proposed algorithm classifies the eyelid outline and cilia into the open-eye area; therefore, the pixel counts at the bottoms of the plots are not reduced to zero. From Fig. 5, it is evident that sequenced images at 60 fields/s can be used to estimate the detailed wave pattern of an eye blink. During the eye blink, there is a great difference between the two plots of open-eye area pixels; this difference does not depend on the individual subject.
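Putting the pieces together, the blink waveform is the per-field open-eye pixel count, normalized to the first field as in Fig. 5. A minimal sketch, with the binarization of Section 2 passed in as a function:

```python
import numpy as np

def blink_waveform(fields, open_eye_mask_fn):
    """Normalized open-eye pixel counts over a 60 fields/s sequence.

    `fields` is the sequence produced by splitting each interlaced frame;
    `open_eye_mask_fn` binarizes one field into a boolean open-eye mask.
    """
    counts = np.array([open_eye_mask_fn(f).sum() for f in fields], dtype=float)
    return counts / counts[0]   # 1.0 corresponds to the fully open eye
```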
Results of the wave pattern measurements of eye blinks for five subjects are shown in Fig. 6, where the vertical and horizontal axes again show pixels of the open-eye area and the sampling point, respectively. These plots are normalized in the same manner as those in Fig. 5. From Fig. 6, it is evident that there are great differences between the results for individual subjects.

Fig. 6. Wave patterns of eye blinks of 5 subjects (A–E) measured by DV (60 fps); vertical axis: normalized pixels of open-eye area, horizontal axis: sampling point (1/60 sec)
4.2 Experiment for Eye Blink Measurement Using High-Speed Video Camera

To verify the accuracy of the proposed method that utilizes split-interlaced images, experiments were conducted with four subjects; this experiment and the one described in Subsection 4.1 were conducted separately. Subjects A and E (shown in Fig. 6) also participated in this experiment. Sequenced images at three different frame rates (30, 60, and 150 fps) were generated from moving images captured by the high-speed digital video camera. These sequenced images were then analyzed to measure the wave pattern of eye blinks. The results of eye blink measurements performed using the sequenced images at the three lower frame rates were compared with those taken at 300 fps. Typical examples of the measurement results are shown in Fig. 7, Fig. 8, and Fig. 9, which display results at 30, 60, and 150 fps, respectively. From Fig. 7 and Fig. 8, it is evident that the accuracy of measurement at 60 fps is higher than that at 30 fps. The minimum of the wave pattern (the bottom of the curve) is highly characteristic of when an eye blink occurs. The results at 60 fps show that the bottom of the plot is measured with a high degree of accuracy. Therefore, sequenced images at this frame rate are suitable for measuring eyelid movement velocity. Moreover, our proposed method using split-interlaced images (described in Section 3) utilizes two field images generated from one interlaced image; that is, the
spatial information of these field images is decreased by half. We have confirmed, via an experiment using sequenced images at 60 fps, that this decrease in spatial information does not affect measurement accuracy. The sequenced images at 60 fps were generated from moving images captured by the high-speed digital video camera. In this experiment, we generated half-sized eye images by extracting the odd-numbered scan lines from the sequenced images at 60 fps. We then estimated the wave patterns of eye blinks using these half-sized images. Our results show that the measured open-eye area decreases by half, in agreement with the results shown in Fig. 8.
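A sketch of how the lower-rate and half-sized sequences could be derived from the 300 fps master recording; the exact decimation used in the experiment is not described, so uniform frame skipping is assumed here.

```python
def decimate(frames_300fps, target_fps):
    """Derive a 30, 60, or 150 fps sequence by uniform frame skipping."""
    step = 300 // target_fps          # 10, 5, or 2
    return frames_300fps[::step]

def half_size(frame):
    """Keep odd-numbered scan lines only, mimicking a split field image."""
    return frame[1::2]
```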
Fig. 7. Wave pattern of eye blinks measured by high-speed video camera (30 fps vs. 300 fps); vertical axis: pixels of open-eye area, horizontal axis: sampling point (1/300 sec)
Fig. 8. Wave pattern of eye blinks measured by high-speed video camera (60 fps vs. 300 fps); vertical axis: pixels of open-eye area, horizontal axis: sampling point (1/300 sec)
Fig. 9. Wave pattern of eye blinks measured by high-speed video camera (150 fps vs. 300 fps); vertical axis: pixels of open-eye area, horizontal axis: sampling point (1/300 sec)
4.3 Discussion

On the basis of Fig. 5, it is evident that by using split-interlaced images, the time resolution of measurement is double that obtained in previous studies. These split images are the odd- and even-numbered field images in the NTSC format, generated from NTSC frames. The method can be applied to any subject under common indoor lighting sources, such as fluorescent lights; the wave patterns of eye blinks for five subjects are shown in Fig. 6. From the results shown in Fig. 7, Fig. 8, and Fig. 9, it is evident that measurement accuracy increases with increasing frame rate. A closer estimate of eye blinking velocity can be achieved if the wave pattern of an eye blink is measured with higher accuracy; in other words, the type of eye blink can then be classified with a high degree of accuracy. In addition, our proposed method can measure the wave patterns of eye blinks efficiently even using half-sized eye images. As shown by the experimental results presented above, we have verified the reliability of our proposed method described in Section 3. Thus, detailed wave patterns of eye blinks can be measured using our proposed method.
5 Conclusions

We have presented a new automatic method for measuring eye blinks. Our method utilizes split-interlaced images of the eye captured by an NTSC video camera. These split images are the odd- and even-numbered field images in the NTSC format, generated from NTSC moving images. Using this method, the time resolution of measurement increases to 60 fields/s, double that of conventional methods. Besides measuring eye blinks automatically, our method can be used under common indoor lighting sources, such as fluorescent lights. In the evaluation experiments, we measured the eye blinks of all subjects without problems.
To verify the accuracy of our proposed method, we performed experiments using a high-speed digital video camera. Comparing the results obtained using NTSC cameras with those obtained using the high-speed digital video camera shows that measurement accuracy increases with increased time resolution. Additionally, the decreased area of the split-interlaced image has no adverse effect on the results of eye blink measurements. We confirmed that our proposed method is capable of measuring the wave pattern of eye blinks with high accuracy using an NTSC video camera. In the future, we plan to develop a new method for classifying types of eye blinks using the measurement method reported above. That method will profile eye blinks according to the velocity of open-eye area changes. We also plan to apply it to more general ergonomic measurements.
References
1. Grauman, K., Betke, M., Gips, J., Bradski, G.R.: Communication via Eye Blinks – Detection and Duration Analysis in Real Time. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, Lihue, HI, pp. 1010–1017 (2001)
2. Morris, T., Blenkhorn, P., Zaidi, F.: Blink Detection for Real-Time Eye Tracking. J. Network and Computer Applications 25(2), 129–143 (2002)
3. Ohzeki, K., Ryo, B.: Video Analysis for Detecting Eye Blinking Using a High-Speed Camera. In: Proc. of Fortieth Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, CA, pp. 1081–1085 (2006)
4. Abe, K., Ohyama, M., Ohi, S.: Eye-Gaze Input System with Multi-Indicators Based on Image Analysis under Natural Light. J. The Institute of Image Information and Television Engineers 58(11), 1656–1664 (2004) (in Japanese)
5. Abe, K., Ohi, S., Ohyama, M.: An Eye-Gaze Input System Using Information on Eye Movement History. In: Proc. of 12th International Conference on Human-Computer Interaction, HCI International 2007, Beijing, vol. 6, pp. 721–729 (2007)
6. Garcia, C., Tziritas, G.: Face Detection Using Quantized Skin Color Regions Merging and Wavelet Packet Analysis. IEEE Trans. on Multimedia 1(3), 264–277 (1999)
A Usability Study of WebMaps with Eye Tracking Tool: The Effects of Iconic Representation of Information

Özge Alaçam and Mustafa Dalcı

Human Computer Interaction Research and Application Laboratory, Computer Center, Middle East Technical University, 06531 Ankara, Turkey
{ozge,mdalci}@metu.edu.tr
Abstract. In this study, we conducted usability tests on different WebMap sites with eye movement analysis. Overall task performance, the effects of the iconic representation of information, and the efficiency of pop-up usage were evaluated. Eye tracking technology was used in this study to follow the position of the users' eye gaze. The results show that there are remarkable differences in task performance between WebMaps. In addition, the WebMaps differ in their use of iconic representations according to the results of the users' evaluations. It was also found that the efficiency of pop-up window usage has an effect on task performance.

Keywords: Web mapping, usability, eye tracking, cognitive processes, iconic representations, efficiency of pop-ups.
processes [5]. The use of eye tracking in the usability field started in the 1950s [6]. However, due to the difficulty of analyzing the huge amount of data obtained from eye tracking tools, it lost its popularity in the 1970s. With improvements in eye tracking technology, eye tracking tools have regained their impact on the usability field [10], and nowadays they are accepted as a tool for improving computer interfaces. In one of the studies on WebMap usability, conducted by Nivala et al. [13], the severity of usability problems was investigated. In our study, we aim to conduct additional analyses to find the reasons for these usability problems and to make them clearer by analyzing the eye movements of the users. The focus of this study is to analyze the effects of the iconic representation of information and to investigate whether pop-ups are used efficiently by the users. An eye tracking tool is used in this study to follow the position of the users' eye movements, which helps to measure the attended location on the map. It is known that eye movements provide information about cognitive processes such as perception, thinking, decision making, and memory [1, 3, 4, 12, 14]. Evaluating eye movements gave us the opportunity to focus on iconic representations, the efficiency of pop-up windows, and their effects on map comprehension in different WebMaps.
2 Method and Materials

Twenty-six subjects (12 female, 14 male), either university students or graduates, between 18 and 32 years of age, participated in this study. A questionnaire was administered to gather information about their prior knowledge of WebMap usage, their opinions on the comprehensibility of the icons, and their preferences regarding the WebMaps. Each subject evaluated two different WebMaps for different places in the US, in random order. The six tasks shown in Table 1 were used in the experiment. Users were told that they could give up a task, or the experiment, whenever they wanted to. The tasks included finding a given address, finding specific places represented by icons (such as an airport, metro station, or hospital), and showing the route to specific locations. The experiments were conducted at the Human-Computer Interaction Research and Application Laboratory at Middle East Technical University. The eye movements of users were collected with a Tobii 1750 Eye Tracker and analyzed with Tobii Studio.

Table 1. Task Description

Task No  Task Description
Instruction: Welcome to X City/State. You are planning to look at the city map to specify the locations that you want to visit before starting your trip.
1  Point to the nearest highway intersection to X International Airport.
2  You want to go from X International Airport to X University. Could you describe how to arrive at that location?
3  Find the address of the hospital nearest to X Park.
4  Now, you are in X Location. Show the nearest metro/railway station to this place.
5  You are searching for the library nearest to X place. Find and point to it on the map.
6  Show the intersection point of the given address with X street.
In Nivala et al.'s study [13], four different web mapping sites were evaluated: Google Maps, MSN Maps & Directions, MapQuest, and Multimap. However, since MSN Maps & Directions and Multimap are based on Microsoft Virtual Earth, we replaced these sites with Live Search Maps, which is also based on MS Virtual Earth. Since these sites are well known and all have zooming and panning options in their 2D map applications, they are very good candidates for usability testing. Despite the common properties mentioned above, they differ in terms of their use of icon representation and their pop-up window properties. We conducted usability testing of Google Maps, Live Search Maps, MapQuest, and Yahoo Maps, and investigated the effect of the iconic representation of information and of pop-up windows by analyzing eye movements. We use the term "iconic representation of information" to denote the relationship between an icon's semantics and its appearance. In addition to task completion performance (task completion score and time), eye tracking data such as fixation length, fixation count, and observation length were collected.
3 Results

The results are presented under three categories: task performance, analysis of the iconic representations, and analysis of pop-up windows.

3.1 Task Performance

Users were grouped into two categories according to their WebMap usage experience: experienced users (14 users) with a high usage frequency and inexperienced users (12 users) with a low usage frequency. A one-way ANOVA was conducted to compare task completion time across experience levels. The result shows that the users' experience level has a significant effect on task completion time, F(1,52) = 5.30, p < .05. One of the criteria for comparing the usability of WebMaps is the users' task completion scores. Task completion was evaluated under three categories: accomplished tasks, unaccomplished tasks, and partially accomplished tasks, in which the users thought they had accomplished a task when they actually had not. Table 2 provides the percentage of users who accomplished, partially accomplished, and did not accomplish each task; an overall score was also calculated for each WebMap site. Fig. 1 shows the overall completion score for each map. One-way ANOVA results show that the task completion score of Google Maps is significantly different from those of MapQuest and Yahoo Maps, F(3,48) = 8.629, p < .05. It is also worth noting that the significance value of the difference between Live Search Maps and Yahoo Maps is .05. In addition to the analysis of task completion scores, the mean fixation length for each task was analyzed individually (see Fig. 2 for a comparison). Only fixation length on accomplished and partially accomplished tasks was counted. The results show that, for the first task, there is no significant difference in fixation length according to map type. For the fixation length during task two, a significant difference was found between Live Search Maps and MapQuest, Google Maps and MapQuest, and Yahoo Maps and MapQuest, F(3,43) = 12.538, p < .05. For the third task, Google Maps is significantly
different from MapQuest, F(3,33) = 3.768, p < .05. For the fourth task, Google Maps is significantly different from MapQuest and Live Search Maps, F(3,23) = 5.398, p < .05. For fixation length on the fifth task, Google Maps is significantly different from MapQuest, Yahoo Maps, and Live Search Maps, F(3,35) = 12.058, p < .05. For task six, only the difference between Live Search Maps and MapQuest is significant, F(3,41) = 2.444, p < .05. Statistical analysis of the mean fixation count for each task also shows that there are significant differences between the pairs given above.

Table 2. Percentage of users' task completion scores
Fig. 2. Fixation Length (sec.) on each task according to WebMaps
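The one-way ANOVA comparisons reported above can be reproduced with SciPy. A sketch follows; the per-participant scores below are placeholders, since only the summary statistics (F values and p levels) are given in the paper.

```python
from scipy import stats

# Hypothetical per-participant task completion scores for each WebMap;
# the study's actual data are summarized in Table 2 and Fig. 1.
google   = [1.00, 0.83, 1.00, 0.83, 0.67, 1.00]
live     = [0.83, 0.67, 0.83, 0.50, 0.67, 0.83]
mapquest = [0.50, 0.67, 0.50, 0.33, 0.67, 0.50]
yahoo    = [0.67, 0.50, 0.50, 0.67, 0.33, 0.50]

# One-way ANOVA across the four WebMaps.
f_stat, p_value = stats.f_oneway(google, live, mapquest, yahoo)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```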
3.2 Analysis of the Iconic Representations

In order to investigate the efficiency of the iconic representations in these WebMaps, the observation length on icons was counted frame by frame for specific tasks. To analyze icon usage, the first, third, and fourth tasks were selected, since these tasks involve specific places that can be represented by icons (the airport icon for task one, the hospital icon for task three, the metro/railway icon for task four, and the pointers that appear after the users' searches for all three tasks). Other icons displayed on the map during these tasks were also investigated. The fixation length on each icon and the time for which it was displayed on the map were counted, and the percentage of looking time on the iconic representation was then calculated. One-way ANOVA results show that there is no significant difference in observation length on icons across the WebMaps, F(3,48) = 1.859, p > .05. In addition, no correlation between the icon looking time percentage and the completion score was found (Spearman's rho = .37, p > .05). Icons were divided into two categories: task-related icons and task-unrelated icons. However, since not every map has a representation for each icon, these were investigated individually, without statistical analysis. Since MapQuest has only pointer icons, these were not compared with the specific icons in the other maps. Table 3 provides the percentage of looking time for each icon in each WebMap. The observation length on the airport icon in Yahoo Maps is 13.3% of the time that it appears on the map; however, since the other WebMaps (Google Maps, MapQuest, and Live Search Maps) have no icon for the airport, no comparative analysis is possible. The pointers that represent the searched location have approximately the same looking time for all maps. For the metro and railway icons, the Google Maps icon has the largest looking time percentage, followed by Live Search Maps and Yahoo Maps, respectively. Since MapQuest and Live Search Maps do not have hospital icons, the looking times of these icons were investigated only on Yahoo Maps and Google Maps. Even though Google Maps and Yahoo Maps contain an icon for "Hospital", the users are expected to zoom close
enough for that icon to be visible. However, none of the users zoomed to that distance while performing their tasks. The eye movement analysis of the users who performed the experiment on Live Search Maps shows an interesting outcome: some task-unrelated icons (such as the park, hotel, and sponsored link icons) were fixated on remarkably often by the users.

Table 3. The percentage of looking time for each icon

Icon Type     Icon            Google Maps  Live Search  MapQuest  Yahoo Maps
Task-Related  Airport         Na*          Na           Na        13.3
              Pointers        10.6         10.4         10.9      10.5
              Metro/railway   13.5         10.8         Na        7.2
Unrelated     University      3.5          Na           Na        Na
              Hospital        0.0          Na           Na        0.0
              Park            1.4          Na           Na        Na
              Hotel           Na           5.5          Na        Na
              Sponsored link  Na           1.7          Na        Na

*Not applicable.
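The looking time percentages in Table 3 follow from the frame-by-frame counts described above. A minimal sketch, assuming boolean per-frame indicators for gaze-on-icon and icon visibility (the study's actual AOI definitions come from Tobii Studio and are not reproduced here):

```python
import numpy as np

def looking_time_percentage(gaze_on_icon, icon_visible):
    """Percentage of an icon's on-screen time spent fixating on it."""
    gaze_on_icon = np.asarray(gaze_on_icon, dtype=bool)
    icon_visible = np.asarray(icon_visible, dtype=bool)
    shown_frames = icon_visible.sum()
    if shown_frames == 0:
        return float("nan")   # icon never displayed ("Na" in Table 3)
    return 100.0 * (gaze_on_icon & icon_visible).sum() / shown_frames
```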
A task-based investigation was also made of the iconic representations. Correlations between the icon looking time percentage, the completion score, and the observation length for Tasks 1, 3, and 4 were examined individually. For Task 1, no correlation between these parameters was found. In the analysis conducted for the third task, correlations were found between the icon looking time percentage and the completion score (Pearson's r = .523; p < .05) and between the icon looking time and the observation length (Pearson's r = -.368; p < .05). For Task 4, a correlation was found between the icon looking time and the observation length (Pearson's r = -.289; p < .05). On the other hand, no correlation was found between the icon looking time percentage and the completion score (Pearson's r = .165; p > .05). In addition to the eye movement analysis, a user evaluation questionnaire was carried out. The users were asked to predict the meaning of the icons and rate their comprehensibility on a scale from 1 to 10 (the results are given in Table 4). This gave us the opportunity to evaluate the efficiency of the relationship between the icons' semantics and their appearance and to show whether there is any ambiguity. The comprehensibility ratings were counted only for correct predictions. It can be claimed that an icon's appearance and semantics are consistent when both the correct prediction rate and the comprehensibility rating of the icon are high. Although the comprehensibility rating given by the users is high for the hospital icon in Google Maps, only 6.3% of the users predicted the meaning of the icon correctly. The comparison of the results for the metro/railway icons indicates that the iconic representations in Google Maps and Yahoo Maps are comprehended more easily by the users than those in Live Search Maps. None of the users predicted the meaning of the sponsored link and hotel icons in Live Search Maps. Looking at the looking time percentages of these icons in Table 3, it can be concluded that the users notice them without attaching a meaning to them.
Table 4. Users' ratings on iconic representations
Moreover, the users were asked to specify their icon preferences and to rate the usability of the maps they used during the experiment. The icons can be grouped into three main categories: pictorial, textual, and numerical/alphabetical icons. Textual icons are the most preferred (37.5%); 31.3% of the users prefer pictorial icons, and another 31.3% prefer numerical/alphabetical icons. Additionally, the users' usability ratings of the WebMaps parallel the task completion scores given in Table 2. Their ratings are 8.3 for Google Maps, 5.1 for Live Search Maps, 4.6 for Yahoo Maps, and 3.4 for MapQuest.

3.3 Analysis of Pop-Up Windows

Additional analysis was conducted to investigate the usage of pop-up windows. The aim of a pop-up is to direct the user's focus to something else, where they are provided with additional information, whether related to their task or not. Moreover, these pop-ups can facilitate additional searches on locations that the task requires. Therefore, pop-up windows are very important parts of WebMaps, and they frequently appear during map usage. In order to investigate the efficiency of pop-up usage in these WebMaps, the sections that contain pop-up windows in the map area were extracted for each task. The fixation length and fixation count on pop-up windows and their display time in the map area were counted. The analysis of the looking time on pop-up windows when they appear on the map gives us an idea of whether the users prefer to use them. The results of this analysis showed that Google Maps uses pop-up windows most efficiently, since 64.8% of the fixations on the map are in the pop-up area. It is followed by MapQuest, Live Search Maps, and Yahoo Maps, respectively (see Fig. 3). One-way ANOVA results also indicate that there is a significant difference between Google Maps and Live Search Maps in pop-up usage percentage.
Fig. 3. Observation Length on Pop-up / Map for each WebMap
Google Maps also differs significantly from Yahoo Maps, and there is likewise a difference between Yahoo Maps and MapQuest (F(3,51) = 5.939, p < .05). Additionally, correlations were found between the fixation count on pop-ups and the overall completion score (Spearman's rho = .396; p < .05) and between the fixation length on pop-ups and the overall completion score (Spearman's rho = .423; p < .05).
4 Conclusion

By analyzing eye movements, which are indicators of cognitive processes, we examined iconic representations and pop-up window usage and their roles in the usability of these mapping sites. Since map comprehension involves very complicated cognitive processes, decreasing the cognitive load by making the icons more comprehensible and the pop-ups more usable will improve the effectiveness and efficiency of web mapping sites. The task performance evaluation shows a significant difference between these WebMaps. As the tasks become more complicated (e.g., finding the address of a specific location near another location), the differences between the WebMaps in terms of task completion time and score become apparent. Beyond differences in overall display organization, at a micro level we examined the effect of the iconic representation of information and the efficiency of pop-up window usage. The analysis indicates a gap between the users' ratings of icons and their looking time percentages: even icons that were correctly predicted and rated highly comprehensible by the users received little looking time. It is also worth noting that the looking time percentage is low even for task-related icons, meaning that users have difficulty detecting them while performing their tasks although the icons are highly relevant to those tasks. Making the icons more visible could help users notice them and increase task completion performance. Moreover, the analysis of the efficiency of pop-up windows, which are other
widely used elements of WebMaps, shows that users' pop-up usage differs significantly according to WebMap type and that this parameter is positively correlated with task completion score. In addition, as expected, experience level plays a significant role in WebMap usage performance.
5 Further Studies

Icons for local traffic signs (e.g., highway numbers) were highly fixated during the tasks; however, these findings were disregarded because of the users' lack of familiarity with the locations. Follow-up studies investigating these icons could be conducted with users native to the locations. In addition, evaluating the interaction between particular areas of the site (e.g., search bars, menus, the map area, and the information area that shows the results of a location search) would give additional information about the efficiency and effectiveness of WebMap usability.

Acknowledgements. We thank the Computer Center of Middle East Technical University and TÜBİTAK (for support under grant SOBAG 104K098) for providing the eye tracking system. We also thank Yasemin Saatiçioğlu Oran and our colleagues in the METU Computer Center for their valuable support.
Feature Extraction and Selection for Inferring User Engagement in an HCI Environment Stylianos Asteriadis, Kostas Karpouzis, and Stefanos Kollias National Technical University of Athens, School of Electrical and Computer Engineering, Image, Video and Multimedia Systems Laboratory, GR-157 80 Zographou, Greece [email protected], {kkarpou,stefanos}@cs.ntua.gr
Abstract. In this paper, we present our work toward estimating a person's engagement with the information displayed on a computer monitor. Deciding whether a user is attentive or not, and frustrated or not, helps adapt the displayed information in special environments such as e-learning. The aim of the current work is the development of a method that works user-independently, without requiring special lighting conditions, and with minimal hardware requirements: a computer and a web camera. Keywords: User engagement, Head Pose, Eye Gaze, Facial Feature tracking.
reported in [12], where facial symmetry and Gabor filters are used to estimate head pose and eye gaze, respectively; a look-up table then maps the resulting eye gaze and head pose to the final focus-of-attention estimate. Here, we propose a method that can be summarized as follows: face detection [11] followed by facial feature detection is the first step, and tracking follows. Based on the facial features' motion, a series of biometric measurements are extracted and their appropriateness is evaluated for inferring a user's level of frustration or attentiveness in a human-computer interaction scenario. Our algorithm is able to recover and re-initialize in cases of occlusion or tracking failure.
2 Facial Points Detection and Tracking

The method reported in [1] is used to localize the face, the eye centres, and the mouth corners (here enhanced with upper and lower lip points). For the detection of the eye corners (left, right, upper and lower eyelids), a technique similar to that described in [13] is used. In the current work, the point between the nostrils and two points on each eyebrow are also used, as discussed later. For nostril detection, a search area is extended along a segment of the perpendicular to the inter-ocular line, starting from the midpoint between the eyes. The darkest row of this area is taken as the vertical position of the nostrils, and the middle point of this row is used in our experiments. In a similar manner, two points on each eyebrow are extracted as the darkest points in a neighborhood above the eye corners. These steps are illustrated in Fig. 1, where the luminance values of two search areas have been projected onto the vertical axis; the minimum of each projection corresponds to the feature in search (a sketch of this projection step follows Fig. 1). Tracking is done using a three-pyramid Lucas-Kanade algorithm. Geometrical face models and prototypes of natural human motion are employed for recovering from erroneous tracking (see subsection 3.1).
Fig. 1. Eyebrow and nose detection search regions
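For concreteness, the projection-minimum step described above can be outlined as follows; this is a minimal sketch of the idea, and the function name and array layout are our own, not part of the cited method.

```python
import numpy as np

def darkest_row(search_region):
    """Project the luminance of a grayscale search region onto the
    vertical axis and return the index of the darkest row; the minimum
    of the projection corresponds to the feature in search (nostril
    line, eyebrow points) as described above."""
    projection = search_region.mean(axis=1)   # one mean luminance per row
    return int(np.argmin(projection))
```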
3 Feature Extraction

The features extracted in our method are the following: head pose, eye gaze, eyebrow movements, horizontal and vertical components of head speed, horizontal and vertical mouth opening, and relative movements of the user back and forth.
3.1 Head Pose Estimation

Head rotation is calculated by examining the translation of the midpoint of the inter-ocular line with regard to its position when the user faced the camera frontally (see Fig. 2). This provides the Head Pose Vector p = [px py], where px and py are the horizontal and vertical components of the eye midpoint's translation, respectively, normalized by the inter-ocular distance calculated at start-up to cater for scale variations. The ratio of the inter-ocular distance to the vertical distance between the eyes and the mouth is monitored, and if it remains within certain limits of its value at the frontal position, no rotation is decided. Since tracking often fails, giving false estimates of head pose (as well as of the other features), a series of rules were integrated. After large rotations, some features are occluded and cannot be recovered. In this case, when the user returns to a frontal position after nt1 frames, the pose vector becomes shorter but its length remains above a certain threshold, because one of the eyes is not well tracked and its centre is no longer in the same neighborhood as at start-up. The algorithm can then re-initialize. The above can be modeled as in equations (1) and (2):
‖p(n)‖ < thr1 · ‖p(n − nt1)‖                                  (1)

var(‖p(n − nt2 : n)‖) < thr2                                  (2)
where ‖·‖ denotes a vector length metric. Equations (1)-(2) are interpreted as follows: if the Head Pose Vector length at the current frame n is smaller than a fraction thr1 of its value at frame n − nt1, and its variance over the last nt2 frames (with nt2 < nt1) is smaller than thr2, then the algorithm re-initializes.
Fig. 2. Pose changes during a video of a person in front of a monitor
In our experiments we used nt1 = 10, nt2 = 7, thr1 = 0.7, and thr2 = 0.05. If the above conditions are met but the user has not turned frontally, face detection fails and frontal rotation is not decided. Under general conditions, however, the algorithm re-initializes by re-detecting the face and facial features and restarting tracking. Further constraints concern the displacement of features in subsequent frames: assuming an orthographic projection over the interval between two subsequent frames, features are expected to shift in a uniform way. Finding outliers and re-calculating the mean shift from the remaining features allows erroneous points to be moved to positions that agree with the other features' shift. As experiments showed, this refinement converges after 7-10 iterations per frame.
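A minimal sketch of the re-initialization test in equations (1)-(2), assuming the per-frame Head Pose Vectors are kept in a history array (the function and variable names are illustrative):

```python
import numpy as np

def should_reinitialize(pose_history, n_t1=10, n_t2=7, thr1=0.7, thr2=0.05):
    """Test re-initialization conditions (1)-(2) on a history of
    Head Pose Vectors (one 2-D vector per frame, most recent last)."""
    if len(pose_history) <= n_t1:
        return False
    lengths = np.linalg.norm(np.asarray(pose_history, dtype=float), axis=1)
    # (1): current pose length fell below a fraction of its value
    # n_t1 frames ago, suggesting a return toward a frontal position.
    cond1 = lengths[-1] < thr1 * lengths[-1 - n_t1]
    # (2): pose length has been stable over the last n_t2 frames.
    cond2 = np.var(lengths[-n_t2:]) < thr2
    return bool(cond1 and cond2)
```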
Fig. 3. Gaze changes during a video of a person not moving his head in front of a monitor
3.2 Eye Gaze Estimation

Eye gaze is extracted by monitoring the eye centre movements with regard to a coordinate system defined by the positions of the eye corners and eyelids at each frame (see Fig. 3). The resulting displacement provides the Eye Gaze Vector g = [gx gy], where gx and gy are the horizontal and vertical components, respectively.
3.3 Extraction of Further Features

The vertical movements of the eyebrows with regard to the upper eyelids are also extracted, and the horizontal and vertical components of the head movement speed (in pixels per frame) are calculated. Furthermore, mouth opening is calculated with reference to the initial distance between the mouth corners and the lip distance at start-up. Finally, changes in the inter-ocular distance are monitored and
calculated as fractions of the eye centres' distance in the first frame of each initialization of the system. In this way, when changes in inter-ocular distance are not due to head rotations, qualitative measurements of user movement back and forth are obtained.
4 Feature Selection

The experiments were conducted on a database of children with learning difficulties between the ages of 8 and 10. The recorded videos were 720×576 pixels with a frame rate of 25 fps. A total of about 10,000 and 12,250 frames were used for the attention/non-attention and frustration/non-frustration problems, respectively. The videos were annotated by experts. One of the difficulties of the dataset was that the positive instances (frustration, non-attentiveness) were very few compared to the negative ones, which limited the training prototypes. To evaluate the appropriateness of each feature, Fisher's exact test [4] was used: the 3-bin histogram of each feature was calculated, and the resulting distribution for positive instances was compared against the distribution of the same feature throughout all videos, regardless of state. We chose Fisher's exact test rather than another method (e.g., the chi-square test) because it is well suited to small samples; in many cases (for example, the horizontal head speed when the user is frustrated), only a few instances fall into the low- or high-value histogram bins, and Fisher's exact test is ideal for such small counts. For the event of non-attentiveness, the tests showed that for Head Pose, Eye Gaze, Inter-Ocular Distance Changes, and Head Speed the null hypothesis (that the observed and expected distributions do not differ) should be rejected with higher confidence than for the remaining features. For the event of frustration, Head Pose, Horizontal and Vertical Head Speed, and Eye Gaze depart from the null hypothesis more than the remaining features do, as expected. Figure 4 justifies the rejection of some features on account of high p-values, while Figures 5 and 6 illustrate examples of data for each class.
Fig. 4. p-values for feature selection in attention/non-attention and frustration/non-frustration scenarios
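The construction of the contingency tables is not spelled out above; one plausible reading, sketched below, tests each histogram bin with a 2×2 table of in-bin versus out-of-bin counts for the positive instances against all instances. The function name and table layout are our assumptions, not the study's exact procedure.

```python
import numpy as np
from scipy.stats import fisher_exact

def bin_p_values(feature_positive, feature_all, bin_edges):
    """Compare a feature's 3-bin histogram over positive instances
    (e.g., frustration frames) against its histogram over all frames.
    Each bin is tested with a 2x2 table -- (in bin, out of bin) for the
    positive group versus the overall group -- returning one p-value
    per bin; low values flag bins where the distributions differ."""
    pos, _ = np.histogram(feature_positive, bins=bin_edges)
    overall, _ = np.histogram(feature_all, bins=bin_edges)
    p_values = []
    for k in range(len(pos)):
        table = [[pos[k], pos.sum() - pos[k]],
                 [overall[k], overall.sum() - overall[k]]]
        _, p = fisher_exact(table)
        p_values.append(p)
    return p_values
```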
Fig. 5. Features used for attention/non-attention classification
Fig. 6. Features used for frustration/non-frustration classification
5 Experimental Results

To test the accuracy of our system, a Sugeno-type fuzzy inference system [9] was built for each case. The motivation for fuzzy systems is that behavioral states do not necessarily belong to crisp classes; rather, they are fuzzy concepts. For example, frustration or distraction can be given confidence values, and the outputs of fuzzy systems are ideal for this. Prior to training, our data were clustered using the subtractive clustering algorithm described in [2]. Instead of using a grid partition of the data, this algorithm clusters them and thus leads to fuzzy systems free of the curse of dimensionality; the number of clusters it creates determines the optimal number of fuzzy rules. After defining the fuzzy inference system architecture, its parameters (membership function centers and widths) were acquired by applying a least-squares and back-propagation gradient descent method [5]. Tables 1 and 2 summarize the overall accuracy of our system in estimating the behavior of a user in the attention/non-attention and frustration/non-frustration experiments, using different sets of low-p-value features as inputs and 1 or 0 as the target states. Testing was done using a leave-one-out protocol.
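For reference, the clustering step of [2] admits a compact sketch along the following lines, with the usual simplified stopping rule; the radius and threshold defaults are illustrative, not the settings used in this study. Each returned centre corresponds to one fuzzy rule.

```python
import numpy as np

def subtractive_clustering(X, r_a=0.5, r_b=0.75, eps=0.15):
    """Subtractive clustering after Chiu [2], simplified stopping rule.
    Each point's 'potential' sums Gaussian contributions from all other
    points; the highest-potential point becomes a cluster centre, its
    influence is subtracted, and the process repeats. Assumes features
    normalized to [0, 1]; each centre yields one fuzzy rule."""
    X = np.asarray(X, dtype=float)
    alpha, beta = 4.0 / r_a**2, 4.0 / r_b**2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    potential = np.exp(-alpha * d2).sum(axis=1)
    centers, p_first = [], potential.max()
    while potential.max() > eps * p_first:
        k = int(potential.argmax())
        centers.append(X[k])
        potential -= potential[k] * np.exp(-beta * d2[k])  # squash neighborhood
    return np.array(centers)
```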
From Tables 1 and 2 it can be seen that, in the case of frustration, although eye gaze has a low p-value (see Fig. 4), excluding it from the experiments does not degrade the results; they are in fact marginally higher. This is because, in our dataset, the eye gaze vector length was strongly correlated with the head pose vector length in cases of frustration. Similarly, although head speed (horizontal and vertical) has low p-values in the attention tests, results showed that excluding these parameters from our decision systems improves the results. Closer observation of our data and the corresponding annotation suggested the following explanation: head speed is only large at the beginning of those time segments in which a person is turning his or her head away from the camera. At those moments the head pose vector has small values but head speed is high; however, such movements can also occur during attention, as a reader very often makes small rapid movements without changing head pose much. For this reason, head speed was excluded from our experiments in the case of attention estimation. The database we used was acquired under normal lighting conditions with very challenging subjects: children with learning difficulties. Testing our system on such a dataset is demanding, not only because of its nature but also because the annotation is subjective. Nevertheless, the results obtained are extremely promising.

Table 1. Neuro-Fuzzy System decision accuracy for two different sets of low p-value features for detecting User Attention

Features                                              Overall success rates
Head Pose, Eye gaze, Distance changes, Head speed     84.00%
Head Pose, Eye gaze, Distance changes                 88.00%
Table 2. Neuro-Fuzzy System decision accuracy for two different sets of low p-value features for detecting User Frustration

Features                                                  Overall success rates
Head Pose, Horizontal and Vertical Head speed             82.00%
Head Pose, Horizontal and Vertical Head speed, Eye Gaze   80.63%
6 Conclusions and Future Work

We presented a method for automatically estimating the behavior of a person in front of an HCI environment. Our system is non-intrusive, leaving room for spontaneous behavior, and it does not depend on controlled lighting conditions, which makes it suitable for a variety of settings. Furthermore, since the system requires no a priori knowledge of the user or the camera, it needs no prior training or calibration. Future extensions of our work will include a common framework for discriminating among a set of states simultaneously. To this end, we will build a database suitable for our research and work on developing a facial feature tracker highly specialized for such applications.
Acknowledgments This work has been funded by the FP6 IP CALLAS (Conveying Affectiveness in Leading-edge Living Adaptive Systems), Contract number IST-34800 and by the IST Project ’FEELIX’, (under contract FP6 IST-045169).
References

1. Asteriadis, S., Nikolaidis, N., Pitas, I., Pardàs, M.: Detection of facial characteristics based on edge information. In: 2nd International Conference on Computer Vision Theory and Applications (VISAPP), Barcelona, Spain, pp. 247–252 (2007)
2. Chiu, S.L.: Fuzzy Model Identification Based on Cluster Estimation. Journal of Intelligent and Fuzzy Systems 2(3), 267–278 (1994)
3. D'Orazio, T., Leo, M., Guaragnella, C., Distante, A.: A visual approach for driver inattention detection. Pattern Recognition 40(8), 2341–2355 (2007)
4. Fisher, R.A.: Statistical Methods for Research Workers. Hafner Publishing (1970)
5. Jang, J.S.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Transactions on Systems, Man, and Cybernetics 23, 665–684 (1993)
6. Matsumoto, Y., Ogasawara, T., Zelinsky, A.: Behavior recognition based on head pose and gaze direction measurement. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Takamatsu, Japan, pp. 2127–2132 (2000)
7. Otsuka, K., Takemae, Y., Yamato, J.: A probabilistic inference of multiparty-conversation structure based on Markov-switching models of gaze patterns, head directions, and utterances. In: ICMI, pp. 191–198 (2005)
8. Smith, P., Shah, M., Lobo, N.D.V.: Determining driver visual attention with one camera. IEEE Transactions on Intelligent Transportation Systems 4, 205–218 (2003)
9. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics 15(1), 116–132 (1985)
10. Victor, T., Blomberg, O., Zelinsky, A.: Automating driver visual behavior measurement. In: 9th Vision in Vehicles Conference, Australia (2001)
11. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 511–518 (2001)
12. Weidenbacher, U., Layher, G., Bayerl, P., Neumann, H.: Detection of head pose and gaze direction for human-computer interaction. In: André, E., Dybkjær, L., Minker, W., Neumann, H., Weber, M. (eds.) PIT 2006. LNCS, vol. 4021, pp. 9–19. Springer, Heidelberg (2006)
13. Zhou, Z.H., Geng, X.: Projection functions for eye detection. Pattern Recognition 37(5), 1049–1056 (2004)
Informative or Misleading? Heatmaps Deconstructed Agnieszka (Aga) Bojko User Centric, Inc. 2 Trans Am Plaza Dr, Ste 100, Oakbrook Terrace, IL 60181 USA [email protected]
Abstract. Eye tracking heatmaps have become very popular and easy to create over the last few years. They are very compelling and can be effective in summarizing and communicating data. However, heatmaps are often used incorrectly and for the wrong reasons. In addition, many do not include all the information that is necessary for proper interpretation. This paper describes several types of heatmaps as representations of different aspects of visual attention, and provides guidance on when to use and how to interpret heatmaps. It explains how heatmaps are created and how their appearance can be modified by manipulating different display settings. Guidelines for proper use of heatmaps are also proposed. Keywords: Heatmaps, attention maps, eye tracking.
EyeTools. Anyone with an eye tracker and the right software can create a heatmap. No knowledge of eye movements or of how heatmaps are created is required. As a result, heatmaps are often generated unnecessarily or are misinterpreted by those who do not understand what the visualizations are really showing or, perhaps even more importantly, not showing. Heatmaps can be deceptive because they look so intuitive that we often do not realize how much we actually do not understand. This paper describes different heatmap types and their limitations, as well as settings used to manipulate the appearance of heatmaps. It also discusses when heatmaps should and should not be used. Proposed guidelines for using heatmaps correctly conclude this work.
2 Types of Attention Heatmaps

Heatmaps are often shown with little, if any, description of what it is they are representing. The assumption is that they are showing "attention" or "eye movements," but knowing that is certainly not enough to be able to truly understand a heatmap. There are different aspects of eye movements that heatmaps can represent. Examples include fixation count, absolute or relative gaze duration, and the percentage/proportion of participants who fixated on each area of the stimulus. Choosing the right heatmap to present depends on the study objectives and the eye movement measures that address these objectives. For example, if search efficiency was of interest to the researchers, one of the measures collected and analyzed might be the number of fixations prior to acquiring the target [1]. Therefore, assuming that the analysis would benefit from data visualization, a fixation count heatmap should be presented. A fixation count heatmap would also be appropriate if the study goal was to determine the amount of interest generated by various elements of the stimulus during a free-view task (i.e., a task with no specific instructions) [1]. However, if the noticeability of a particular element was of interest, the percentage of participants who fixated on the element could be used as a measure (in addition to, for example, time to first fixation), which would warrant a participant percentage heatmap. Because each heatmap type has different limitations that impact its interpretation, it is important not only to be aware of these limitations, but also to know the types of all heatmaps included in papers, reports, and presentations.

2.1 Fixation Count Heatmap

A visual fixation can be loosely defined as a relatively stationary eye position focused on a particular location of the stimulus (a more precise definition is discussed in section 3.1). Fixations are important events that provide insight into human cognition because during each fixation we extract visual information that we process [2]. A fixation count heatmap (see Fig. 1) shows the accumulated number of fixations across participants. Each fixation made by each participant adds a value to the color map at the location of the fixation [3]. This value is the same for each fixation regardless of its duration, so a 100 ms fixation is represented in the same way as a 900 ms fixation. Thus, when looking at a fixation count heatmap, we cannot assume that areas of the same color received similar total gaze time.
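As an illustration of this accumulation, a fixation count map might be built as below; the Gaussian "splat" and its spread are our own illustrative choices, not necessarily the kernel used by any particular eye tracking package [3].

```python
import numpy as np

def fixation_count_heatmap(fixations, width, height, sigma=30.0):
    """Accumulate a fixation count map: every fixation adds the same
    Gaussian 'splat' at its (x, y) location, regardless of duration.
    `fixations` pools (x, y) pixel coordinates across participants."""
    yy, xx = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for x, y in fixations:
        heat += np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
    return heat

# Weighting each splat by the fixation's duration instead would give an
# absolute gaze duration map; normalizing each participant's durations
# by his or her total viewing time would give a relative one.
```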
Fixation count heatmaps can also be biased towards individuals who show high interest in elements that others do not. For example, two elements can be the same color although one attracted ten fixations from a single participant, while the other attracted the attention of ten participants, one fixation from each. Therefore, we cannot assume that areas that appear similar in terms of "heat" are equivalent in terms of the number of participants who looked at them. Another limitation of fixation count heatmaps is that they can be skewed towards individuals who had a longer exposure to the stimulus and thus an opportunity to produce more fixations. For example, if participant A spent twice as much time on the stimulus as participant B, participant A's data would impact the heatmap twice as much as participant B's data. This is something to always keep in mind when viewing heatmaps created from unequal exposure times.

2.2 Absolute Gaze Duration Heatmap

An absolute gaze duration heatmap (see Fig. 2) shows the accumulated time participants spent looking at the different areas of the stimulus. Each fixation made by each participant adds a value to the color map that is proportional to its duration [3]. For example, a 900 ms fixation will be nine times higher in color value than a 100 ms fixation. Because fixation duration is an indicator of cognitive processing [4], a heatmap that is scaled by fixation duration not only shows which areas were attended to but also represents the level of cognitive processing that the areas required.
An absolute gaze duration heatmap can be misleading because it displays different phenomena in the exact same way. For example, this type of heatmap will make one 900 ms fixation look the same as nine 100 ms fixations. A 900 ms fixation on an element indicates that one person looked at it for a while, while nine 100 ms fixations could mean, for example, that one person made nine brief fixations on the element or that nine people made one brief fixation each. In addition, like fixation count heatmaps, absolute gaze duration heatmaps can be biased towards individuals who spent more time looking at the stimulus. To eliminate any bias due to unequal exposure times, the gaze duration data can be normalized to create relative gaze duration heatmaps.

2.3 Relative Gaze Duration Heatmap

A relative gaze duration heatmap shows the accumulated time each participant spent fixating on the different areas of the stimulus relative to the total time the participant spent looking at the stimulus [3]. In other words, if participant A spent 6 seconds on a web page, including 2 seconds on the navigation, and participant B spent 60 seconds on the same page, including 20 seconds on the navigation, this type of heatmap gives their data, as it relates to the navigation, the same weight. Like the absolute gaze duration heatmap, this heatmap shows one individual's long gaze time (proportional to his or her total viewing time) the same way as several individuals' short gaze times (proportional to their total viewing times). If the exposure time is equal across participants (e.g., all participants saw the page for 12 s), which is the case in the examples presented in this paper, the relative and absolute gaze duration heatmaps will be identical.

2.4 Participant Percentage Heatmap

A participant percentage heatmap (see Fig. 3) shows the percentage of participants who fixated on the different areas of the stimulus. Each participant who looked at any given location adds a value to the color map. This value is the same for each participant regardless of the number of fixations he or she made or their durations. Thus, an area that was briefly fixated once by each participant will be presented in the same color as an area that each participant fixated multiple times with much longer fixations.
3 Display Settings for Creating Heatmaps

The appearance of heatmaps can be modified by manipulating various display settings. While adjusting the settings cannot change the relative distribution of attention, certain areas can be made to appear "hotter" or "colder." This can be done, for example, by changing the fixation criteria used for the analysis, showing raw data instead of fixation data, changing the upper threshold definition of the color scale, or modifying the time segment for the presented data. Since heatmap display settings can have a great impact on the appearance of heatmaps, they must be properly selected and communicated to ensure accurate interpretation.

3.1 Changing Fixation Criteria

There are several algorithms that can be used to define a fixation. A common algorithm used in commercial eye tracking software is based on duration and dispersion threshold identification. To define a fixation using this algorithm, two parameters need to be specified: minimum fixation duration (e.g., 80 ms) and maximum dispersion threshold (e.g., 0.5 degree of visual angle) [5]. A fixation defined by 80 ms and 0.5° will encompass all consecutive eye movements that occurred within 0.5° of each other for at least 80 ms. Unless noted otherwise, the heatmaps in this paper were created using these settings. Manipulating the duration and dispersion thresholds will change the number of fixations in the data. For example, increasing the minimum fixation duration from 80 ms to 200 ms will decrease the number of fixations because it will exclude all the fixations between 80 ms and 200 ms. Increasing the maximum dispersion threshold from 0.5 degree to 1 degree of visual angle will also decrease the number of fixations because some of the fixations that are closer together will be combined. Conversely, reducing the minimum fixation duration and maximum dispersion threshold will increase the number of fixations. More fixations in the data will increase the amount of "heat" in the heatmap, as shown in Figure 4.
Fig. 4. Fixation count heatmaps based on the same data (n = 13; 0 – 12 s; free-view task). On the left: minimum fixation duration = 200 ms; on the right: minimum fixation duration = 80 ms.
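For readers who want the mechanics, the duration-and-dispersion idea of [5] can be sketched as follows; the parameter defaults echo the 80 ms / 0.5° settings used in this paper, and the exact windowing details are our simplification rather than any tool's implementation.

```python
import numpy as np

def idt_fixations(points, t, max_disp=0.5, min_dur=80):
    """Dispersion-threshold identification in the spirit of [5]: a
    fixation is a run of gaze samples lasting at least `min_dur` ms
    whose dispersion (x-range + y-range) stays within `max_disp`
    (here, degrees of visual angle). Returns (start, end, centroid)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)

    def dispersion(a, b):                        # x-range + y-range of a..b
        w = pts[a:b + 1]
        return np.ptp(w[:, 0]) + np.ptp(w[:, 1])

    fixations, i = [], 0
    while i < n:
        j = i
        while j < n and t[j] - t[i] < min_dur:   # fill the minimum duration
            j += 1
        if j >= n:
            break
        if dispersion(i, j) <= max_disp:
            while j + 1 < n and dispersion(i, j + 1) <= max_disp:
                j += 1                           # grow while compact enough
            fixations.append((t[i], t[j], tuple(pts[i:j + 1].mean(axis=0))))
            i = j + 1
        else:
            i += 1                               # slide the window forward
    return fixations
```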
The lack of explicit fixation definition is an issue that does not pertain specifically to heatmaps but to entire papers and reports. Many user experience studies that analyze fixation data never mention how these fixations were defined. However, this information is very important for two reasons. First, the results of different studies are
not comparable unless it is clear that fixations were defined in the same way. Second, if the definition is not provided, it cannot be verified whether or not the fixation criteria were appropriate for the stimuli used in the study. For example, the fixation duration threshold for image viewing should be higher than for reading, because image viewing tends to produce longer fixations than reading, due to the fact that more information is being processed in a single fixation [6]. The fixation definition used to create heatmaps should match the definition used for data analysis. Changing the fixation duration to obtain a visualization of a particular intensity is not a good practice.

3.2 Displaying Raw Data Instead of Fixation Data

One step further from decreasing the fixation duration and dispersion thresholds is presenting raw data instead of fixation data. Raw data consists of meaningful eye movements (raw fixation points) and "noise" – eye movements that have little meaning in most user experience research. The noise includes eye movements that take place during saccades (rapid eye movements between fixations) as well as drifts, tremors, and flicks that occur during fixations [5]. Adding all the noise intensifies the heatmap, increasing the area covered in red (see Fig. 5).
Fig. 5. Fixation count heatmaps based on the same data (n = 13; 0 – 12 s; free-view task). On the left: fixation data; on the right: raw data.
As a general rule, fixation data rather than raw data should be used when creating visualizations, unless the stimulus has moving elements and the heatmap has to show smooth pursuit eye movements. We can assume that fixation data was used if the paper or report specifies how the researchers defined a fixation.

3.3 Changing the Definition of the Color Scale Upper Threshold

Another way to manipulate the amount of heat in heatmaps is by changing the definition of the upper threshold on the scale, which is usually indicated by the red color. If the requirements for an area to be red are lowered, the amount of red in the heatmap will increase. Lowering the upper threshold of the scale can be achieved by decreasing the minimum number of fixations in fixation count heatmaps (see Fig. 6) or by decreasing the minimum gaze length in absolute gaze duration heatmaps.
There is no set process for choosing the right upper threshold. The rule of thumb is to make sure that the heatmap properly captures the range of values that are of interest to the study. Threshold selection can be compared to setting the maximum value on the Y axis of a graph, where the Y axis indicates values of the dependent variable. If the Y axis is too short, the data points that exceed the maximum on the axis are cut off. As a result, we only know that these data points are higher than the maximum, but we do not know what they are exactly or how they differ from one another. On the other hand, if the Y axis is too high, the graph data will appear compressed and the differences between the data points will look smaller. Similarly, if a heatmap's upper threshold is set too low, many areas will be covered in red with no differentiation between the amount of attention each attracted. Conversely, setting a heatmap's upper threshold too high will limit the range of colors (e.g., constricting it to yellow or orange as the maximum) and no areas will be covered in red. Regardless of what criteria have been selected, they should be explicitly communicated in the figure legend or caption (e.g., "red = 10+ fixations" or "red color indicates areas that accumulated 10 s or more of gaze time"). It is also useful to put this value in context by providing the average number of fixations each participant made on the stimulus or the average time each participant spent looking at it. The heatmaps presented in this paper were created based on data with an average of 42 fixations on the page per participant and a consistent exposure time of 12 seconds. If heatmaps are generated for different experimental conditions or participant groups, their upper threshold definition should be identical so that the heatmaps can be compared. If the display settings are not the same, the differences between the heatmaps may be due to factors other than the data itself.
Fig. 6. Fixation count heatmaps based on the same data (n = 13; 0 – 12s; free-view task). On the left: red = 10+ fixations; on the right: red = 3+ fixations.
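The thresholding itself is simple to express in code. The sketch below clamps the color scale at an explicit upper value so that a stated convention such as "red = 10+ fixations" is honored; the plotting choices are illustrative, not those of any specific eye tracking package.

```python
import matplotlib.pyplot as plt

def render_heatmap(heat, upper=10, label="fixation count"):
    """Render an accumulated map with an explicit color-scale upper
    threshold: every value at or above `upper` maps to the top (red)
    color. Reusing the same `upper` across conditions keeps the
    resulting heatmaps comparable."""
    plt.imshow(heat, cmap="jet", vmin=0, vmax=upper)
    plt.colorbar(label=f"{label} (red = {upper}+)")
    plt.show()
```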
3.4 Modifying the Time Segment

Sometimes it may be appropriate to present data based on a shorter time segment than the total time during which a stimulus was shown to the participants. This will obviously decrease the amount of data, thus reducing the size of red areas in all heatmap types mentioned in this paper except for the relative gaze duration heatmap (see Fig. 7). Therefore, if a heatmap presents data from a time segment that is shorter than the total viewing time, this needs to be specified in the figure legend or caption.
In addition to the time segment, it should also be clear what participants were trying to do when the data presented in the heatmap was being collected. All too often heatmaps are shown without any context of the task. Eye movements are very task-dependent [7]: participants trying to log in to a website will produce a very different attention distribution than participants trying to find a product. Even if there was no specific task, it should still be noted that the data was collected in a free-view situation, which is how the heatmaps included in this paper were obtained.
Fig. 7. Fixation count heatmaps of the same free-view task (n = 13). On the left: data from 0 to 6 seconds; on the right: data from 0 to 12 seconds.
4 Heatmap Usage

4.1 Common Mistakes

There are a few common mistakes when it comes to using heatmaps. The biggest of them is a belief that heatmaps are appropriate for just about any user experience study and for any research question. Sometimes creating heatmaps even becomes the objective of the study (e.g., "we just wanted to see where people look"). This popular "let's-track-and-see-what-happens" approach has limited value because of its lack of focus (i.e., the study design does not target specific questions) and frequent lack of proper data analysis. Researchers often draw conclusions or make recommendations based on the results of those studies, which is inappropriate. For example, we cannot say that the reason why an element did not get much attention was its suboptimal placement or insufficient size unless we have tested other conditions (i.e., alternative placements or sizes) and compared the data using appropriate statistics. Even if different conditions were tested, sometimes conclusions regarding differences between conditions are made just by looking at the heatmaps. However, heatmaps do not lend themselves to any systematic comparison. Without any data analysis, it is impossible to tell if there are real differences between heatmaps, even if the heatmaps appear to be different.

4.2 Proper Usage Examples

Heatmaps should be used in a purposeful way and only when they add value. They can serve as illustrations of participants' viewing behavior and distribution of attention.
While they can communicate data, they cannot explain it or help analyze it. Therefore, heatmaps can rarely stand on their own. To maximize their usefulness and reduce ambiguity, heatmaps should accompany a quantitative analysis. One of our studies investigated a new standardized label template for prescription drug labels [8]. The goal was to determine the impact of the template on pharmacists’ drug selection speed and accuracy as compared to the existing label designs. The eye tracking measures included the number of fixations prior to target selection as an indicator of search efficiency, average fixation duration as a measure of information processing difficulty, and pupil diameter as a measure of cognitive workload. The results were presented in the form of statistical analyses but no heatmaps were included in the report. The study was a quantitative assessment of the effectiveness of the new labels, and heatmaps showing attention distribution were simply of no value. In another study, we evaluated a new homepage design for a professional organization against the original homepage [9]. Our objective was to identify which design was better and why based on a series of tasks during which participants attempted to locate the correct entry point on the homepage. Measures of search efficiency such as the number of fixations and the number of eye visits to the target prior to target selection were analyzed. The analysis was supplemented with heatmaps to show the distribution of attention on the page and to help account for any inefficiencies that occurred. For example, several tasks were more efficient using the new design which had a more centralized navigation. The heatmaps of the original design showed scattered fixations covering multiple navigation areas, while heatmaps of the new design revealed fixations focused mostly around the targets.
5 Guidelines for Using Heatmaps

Even though heatmaps are very compelling and seemingly easy to understand, they should be used with caution and according to the following guidelines, summarized based on the discussion in the previous sections of this paper:

A. Generate heatmaps only if they add value to the research.
B. Use heatmaps for data visualization instead of data analysis.
C. Use heatmaps to support quantitative analysis rather than on their own.
D. Understand the different heatmap types and only use the ones that represent measures which address your study objectives (e.g., when analyzing gaze time, use a gaze duration heatmap).
E. Specify the type of data the heatmap is representing (e.g., fixation count or absolute fixation duration).
F. Know the limitations of each heatmap type to avoid incorrect interpretation.
G. When creating heatmaps, use fixation data rather than raw data.
H. Provide fixation definition and keep it consistent for analyses and visualizations within a study (e.g., min fixation duration = 100 ms and max dispersion threshold = 0.5°).
I. Provide the definition for the upper threshold of the heatmap color scale (e.g., red = 10+ fixations).
J. Put the upper threshold value in context (e.g., average number of fixations on the stimulus per participant).
K. Specify the time segment based on which the heatmap was created (e.g., the first 10 seconds of exposure).
L. Provide task context for each heatmap (e.g., data obtained from participants during the checkout task).
M. Use the same heatmap settings (e.g., upper threshold and time segment) for conditions that you are comparing.
N. If a paper or report does not provide important information about its heatmaps (e.g., type, fixation definition, upper threshold definition, time segment), ask the authors for clarification before making any assumptions.
6 Conclusion

Blinded by the attractiveness and apparent intuitiveness of heatmaps, we often do not realize how much information, in addition to the visualization itself, is necessary to fully understand a heatmap and properly interpret the data it represents. In other words, the biggest danger involved in creating and reading heatmaps is that we are often unaware of what we do not know, and thus we do not look or ask for the missing information. This paper has exposed some of these gaps in our meta-knowledge in the hope of encouraging more critical thinking about the usage of heatmaps.
References

1. Jacob, R.J.K., Karn, K.S.: Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. In: Hyona, J., Radach, R., Deubel, H. (eds.) The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, pp. 573–605. Elsevier Science, Amsterdam (2003)
2. Liversedge, S.P., Findlay, J.M.: Saccadic Eye Movements and Cognition. Trends in Cognitive Sciences 4, 6–14 (2000)
3. Tobii: Tobii Studio 1.X User Manual (2008)
4. Duchowski, A.: Eye Tracking Methodology: Theory and Practice. Springer, Heidelberg (2003)
5. Salvucci, D.D., Goldberg, J.H.: Identifying Fixations and Saccades in Eye-Tracking Protocols. In: Proceedings of the Eye Tracking Research and Applications Symposium (2000)
6. Castelhano, M.S., Rayner, K.: Eye Movements During Reading, Visual Search, and Scene Perception: An Overview. In: Rayner, K., Shen, D., Bai, X., Yan, G. (eds.) Cognitive and Cultural Influences on Eye Movements, pp. 3–33. Psychology Press (2008)
7. Yarbus, A.L.: Eye Movements and Vision. Plenum Press (1967)
8. Bojko, A.: Measuring the Effects of Drug Label Design and Similarity on Pharmacists' Performance. In: Tullis, T., Albert, W. (eds.) Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics, pp. 271–280. Morgan Kaufmann, San Francisco (2008)
9. Bojko, A.: Using Eye Tracking to Compare Web Page Designs: A Case Study. Journal of Usability Studies 1, 112–120 (2006)
Toward EEG Sensing of Imagined Speech Michael D’Zmura, Siyi Deng, Tom Lappas, Samuel Thorpe, and Ramesh Srinivasan Department of Cognitive Sciences, UC Irvine, SSPB 3219, Irvine, CA 92697-5100 {mdzmura,sdeng,tlappas,sthorpe,r.srinivasan}@uci.edu
Abstract. Might EEG measured while one imagines words or sentences provide enough information for one to identify what is being thought? Analysis of EEG data from an experiment in which two syllables are spoken in imagination in one of three rhythms shows that information is present in EEG alpha, beta and theta bands. Envelopes are used to compute filters matched to a particular experimental condition; the filters' action on data from a particular trial lets one determine the experimental condition used for that trial with appreciably greater-than-chance performance. Informative spectral features within bands lead us to current work with EEG spectrograms. Keywords: EEG, imagined speech, covert speech, classification.
2 Methods

Four subjects participated in an experiment with six conditions determined factorially through combination of two syllables and three rhythms (see Figure 1). A single experimental session comprised 20 trials for each of the six conditions; conditions were presented in block-randomized order. Each subject participated in six such sessions for a total of 120 trials per condition. EEG was recorded using a 128-channel Sensor Net (Electrical Geodesics) in combination with an amplifier and acquisition software (Advanced Neuro Technology). The EEG was sampled at 1024 Hz and average-referenced on-line. Subjects were instructed to keep their eyes open in the dimly-lit recording room and to avoid eye and other movements during the six seconds following the cue, during which speech was imagined without any vocalization whatsoever.
Fig. 1. Timelines for the six conditions, labeled at left, in the imagined speech experiment. Durations are indicated in eighths of a second. The syllable /ba/ was imagined in one of three different rhythms in conditions 1-3; the syllable /ku/ was imagined in conditions 4-6. The condition for each trial was cued during an initial period of duration 4.5 sec (12/8 sec = 1.5 sec; 3 x 1.5 sec = 4.5 sec). During this initial period, subjects heard through Stax electrostatic earphones either a spoken "ba" or a spoken "ku" followed by a train of clicks (arrows) indicating the rhythm to be reproduced. Subjects were instructed to speak in their imagination the cued syllable, illustrated for condition 1 by {ba}, using the cued rhythm and tempo. Desired imagined syllable onset times reproduce the cued rhythm and are indicated by asterisks.
3 Analysis and Results

Our aims were to classify single trials offline according to condition and to discern the condition signatures needed to create online filters. EEG waveform envelopes provide encouraging results when used to classify trials according to experimental condition.
These envelopes were computed for each electrode in the theta (3-8 Hz), alpha (8-13 Hz) and beta (13-18 Hz) frequency bands and used to construct matched filters. These filters work well to classify trials according to condition, yet further results with spectra lead us to explore a finer-grained analysis using spectrograms.

3.1 Matched-Filter Classification Using Envelopes

We used waveform envelopes to compute matched filters, one for each of the six conditions (see Figure 2). The inner product of each matched filter with a particular trial's envelope provides six numbers. The matched filter which gives rise to the largest inner product is the best match, and the condition to which that matched filter corresponds is declared the best guess as to the experimental condition for that trial. Offline preprocessing steps include:

• segment the EEG data to provide time-varying waveforms for each condition, electrode and trial;
• remove from further consideration the 18 electrodes most sensitive to electromyographic artifact: those with the lowest positions about the head, spanning locations close to the eyes, low on the temple, at or below the ear, and at or below the external occipital protuberance;
• remove the mean and linear trend from each segmented waveform;
• low-pass filter the detrended, segmented waveforms to remove 60 Hz line noise;
• use thresholds to identify and remove from further consideration filtered waveforms likely contaminated with electromyographic artifact.

Alpha-, beta- and theta-band activity were computed for each electrode and trial using band-pass elliptic filters. These band-pass filtered waveforms (we[t]; bottom right plot in Fig. 2) were Hilbert-transformed to provide the corresponding envelopes (ve[t]; middle right plot). The envelopes serve both as input to the matched-filter classification and as the data used to construct the matched filters. An electrode's average envelopes, found by averaging across trials for each of the six conditions, serve as matched filters after one further step: for each electrode, the six conditions' average envelopes are pseudoinverted to provide six filters (Fe,c[t]; top right plot). This inversion yields filters which return an inner product of one for the corresponding condition's average envelope and an inner product of zero for all other conditions' average envelopes. Information may be integrated across electrodes by summing the measures pe,c across electrodes and determining the maximum of the resulting six numbers. Information may be integrated across bands by weighting the individual response measures for the three bands by band reliability estimates or by a voting procedure.

Classification results shown in Table 1 indicate that the beta band (13-18 Hz) is, for all but one subject, the most informative frequency band; theta (3-8 Hz) and alpha (8-13 Hz) are comparable for all subjects but S1. The per-condition classification performance is shown for subject S2 in Fig. 3. Rows refer to a trial's actual condition, while columns refer to the condition of the matched filter for which the maximal response was obtained. Darker values indicate higher percentages of trials. These dark values describe diagonals, which indicates a match between a trial's actual condition and the condition whose matched filter provides the greatest response.
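A minimal sketch of the envelope and matched-filter computations, assuming data sampled at 1024 Hz; the filter order and ripple values are illustrative, and the classification step is the argmax over the six inner products described above.

```python
import numpy as np
from scipy.signal import ellip, filtfilt, hilbert

def band_envelope(x, fs=1024, lo=13.0, hi=18.0):
    """Band-pass one electrode's waveform w_e[t] with an elliptic filter
    (defaults: beta band) and return its Hilbert envelope v_e[t].
    The filter order and ripple settings here are illustrative."""
    b, a = ellip(4, 0.5, 40.0, [lo, hi], btype="band", fs=fs)
    w = filtfilt(b, a, x)            # zero-phase band-limited waveform
    return np.abs(hilbert(w))        # instantaneous amplitude envelope

def matched_filters(avg_envelopes):
    """Pseudoinvert the 6 x T matrix of per-condition average envelopes
    so that filter c returns an inner product of 1 for its own
    condition's average envelope and 0 for the other five."""
    return np.linalg.pinv(avg_envelopes)   # T x 6; column c is F_{e,c}[t]

# Classifying one trial at one electrode:
#   scores = band_envelope(trial) @ matched_filters(avg_envelopes)
#   predicted_condition = int(np.argmax(scores))
```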
Fig. 2. Preprocessed and band-pass-filtered waveform we[t], recorded by electrode e, is Hilbert-transformed to provide envelope ve[t]. Shown at bottom right is the preprocessed waveform from an electrode with mid-parietal location from subject S1 for a trial during which {ba} is spoken in imagination at times 1.5, 3.0 and 4.5 sec (condition 1). Shown at middle right is the corresponding envelope. The inner product <,> of this envelope with each matched filter Fe,c[t], one filter per condition per electrode, provides six numbers pe,c. These six numbers measure how well the particular trial's envelope matches the filter for each condition. The maximum of these six numbers is used to determine the most likely condition c̃e. Shown at top right is a matched filter for condition c = 1.
The distributions of classification performance across the scalp are similar across the four subjects; those for subject S4 are shown in Figure 4. The most informative electrodes lie largely near the top of the head (vertex) where electromyographic artifacts have their least influence. These distributions of information differ significantly from the distributions of envelope amplitude, which follow well-known patterns for alpha activity (parietal) and theta activity (frontal).
Table 1. Classification performance using matched filters for envelopes in three frequency bands. The fraction of correctly-classified trials (720 trials per subject, identified in the left column) is indicated. The chance performance level in this classification among six conditions is 1/6 (0.17).

        alpha   beta   theta
S1      0.38    0.80   0.63
S2      0.63    0.87   0.59
S3      0.44    0.68   0.46
S4      0.64    0.62   0.59
Fig. 3. Classification matrices for subject S2. Values along the diagonals indicate correct classification, while non-white values off the diagonal indicate errors. The black gray-value in the middle panel (beta-band) for actual condition 1 and matched filter condition 1 (top left square) represents 91% of the trials; lighter shades indicate smaller values (through white indicating 0%). Perfect performance would be indicated by all off-diagonal entries set to white.
Fig. 4. Distributions of classification performance across the scalp for alpha (α), beta (β) and theta (θ) bands for subject S4. Darker values indicate the positions of electrodes providing better classification performance.
3.2 Spectral Features
Analysis of trials’ power spectral densities shows that differences in power within single frequency bands can provide information concerning trial condition. Spectral
differences within a single band are completely invisible to the previous analysis, which grouped together activity at all frequencies within a single band. Data for subjects S1 and S2 show that, for condition 3, there is a peak in the power spectrum in the range 5-6Hz, relative to baseline power measured at 3.5-4.5Hz. This peak is largely absent for condition 1. This difference between the average power spectra for conditions 3 and 1 is localized to electrodes with front central locations on the scalp (see Fig. 5, leftmost panels). This spectral difference alone, within the theta band, provides ~75% correct two-way classification between these two conditions. The peak and the localized spatial distribution do not exist for subjects S3 and S4 (see Fig. 5, rightmost panels).
Fig. 5. Difference between condition 3 and condition 1 theta sub-band spectral power localized primarily to electrodes with front, center locations for subjects S1 and S2; this pattern is not evident for S3 and S4
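One simple way to express such a sub-band feature, assuming a Welch estimate of the power spectral density (the estimator and its window length are our assumptions, not necessarily the analysis actually used):

```python
import numpy as np
from scipy.signal import welch

def theta_peak_ratio(x, fs=1024):
    """Ratio of 5-6 Hz power to the 3.5-4.5 Hz baseline for one
    electrode's trial waveform; values well above 1 mark the
    condition-3 peak described above for subjects S1 and S2."""
    f, pxx = welch(x, fs=fs, nperseg=2 * fs)    # 0.5 Hz resolution
    peak = pxx[(f >= 5.0) & (f <= 6.0)].mean()
    base = pxx[(f >= 3.5) & (f <= 4.5)].mean()
    return peak / base
```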
3.3 Spectrographic Analysis
The spectral data suggest several things. The first is that one would do well to represent EEG data spectrographically: record power as a fine-grained function of both frequency and time. The second and third are more cautionary. There is a tremendous amount of trial-by-trial variability in EEG recordings. Adding channels, like further frequency bands of narrow range as in spectrograms, may have the effect of masking signals more effectively. Furthermore, fine subdivision of the frequency domain may reveal individual differences (as in Fig. 5), so that averages across subjects become less meaningful. These caveats notwithstanding, our current work focuses on spectrographic means of identifying imagined speech.
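A spectrographic representation of the sort described here can be computed directly; the window length and overlap below are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 1024
x = np.random.randn(6 * fs)          # placeholder for one 6 s trial
# Power as a fine-grained function of both frequency and time:
f, t, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=448)
# Sxx[i, j] is the power near frequency f[i] (2 Hz resolution) at time t[j]
```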
4 Discussion

The syllables /ba/ and /ku/ have little semantic content in and of themselves, so that differences in the EEG records underlying classification performance are unlikely to reflect semantic contributions to imagined speech production. Furthermore, each experimental trial starts with a period during which the cue to the condition is presented; cue recognition and response planning occur during the cue period and are likely absent during the following period, in which the EEG recording of imagined speech is made. The EEG recordings analyzed here most likely concern immediate aspects of imagined speech production based on readout from working memory.
EEG electrodes record not only signals from the brain's cortex but also electromyographic activity. Common sources of the latter are muscles responsible for eye and head movements and other muscles near recording electrodes, like those in the temples and neck. Electrodes atop or close to such muscles are most sensitive to electromyographic activity. The most flagrant such offenders were excluded from analysis in the work reported here. Furthermore, recordings for which EEG signals (detrended and filtered to exclude 60 Hz line noise) exceeded 30 μV in absolute value were excluded from consideration. This threshold eliminated many (but not all) recordings contaminated with artifact. The most informative electrodes are those positioned on the scalp near the top of the head (Fig. 4), and this suggests that the information is due more to cortical than to electromyographic activity.

Subjects were instructed to think the imagined speech without any vocal or subvocal activity: without moving any muscles involved in producing overt speech. This contrasts with what is the most successful method yet for recognizing speech produced without any acoustic signal: surface electromyographic (SEMG) recordings from face and throat muscles. Substantial progress has been made in automatic speech recognition from surface EMG recordings of vocal tract articulators made during both spoken speech [5-10] and subvocalized speech [11-17].

Further non-acoustic methods for extracting imagined speech information potentially include magnetoencephalography (MEG) and invasive methods, like electrocorticography (ECoG), which involve implanted electrodes. While there are no published positive results yet for classification of imagined speech using MEG, its high temporal resolution and moderate spatial resolution (comparable to high-density EEG) suggest its value, as does success in classifying heard speech [18-21]. MEG is most sensitive to cortical signals from surfaces perpendicular to the scalp (in sulci), while EEG is most sensitive to signals from parallel surfaces (gyri) [22], so that the two methods are complementary. While the use of electrodes implanted in the brain is limited to small clinical populations for which these electrodes are indicated in combination with neurosurgery, it can provide very useful information concerning imagined speech [23]. Functional magnetic resonance imaging (fMRI) studies have contributed significantly to our neuroscientific knowledge of language perception and production [24-27]. Yet the poor temporal resolution of fMRI and of functional near-infrared imaging (fNIR) makes these unlikely candidates for speech recognition.

EEG is a non-invasive, non-injurious method for probing cortical activity which has high temporal resolution, moderate spatial resolution (~2 cm), relatively low cost, and increasing portability. We feel that its greatest contribution as a brain-computer interface is likely to be found in circumstances where a human subject is trained to produce activity discernible by EEG [28]. Such training can reduce the variability associated with imagined speech production and also help one emphasize aspects of imagined speech which are most easily discerned by EEG. Our next step is to close the loop by providing real-time feedback concerning imagined speech signals.
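The artifact screening described at the start of this section (detrending, 60 Hz line-noise removal, and the 30 μV rejection threshold) might be sketched as follows; the notch-filter design parameters are assumptions, not the authors' settings.

import numpy as np
from scipy.signal import detrend, iirnotch, filtfilt

def passes_artifact_screen(x, fs, limit_uv=30.0):
    """x: one-channel EEG in microvolts; returns False if contaminated."""
    x = detrend(x)                            # remove linear trend
    b, a = iirnotch(w0=60.0, Q=30.0, fs=fs)   # 60 Hz line-noise notch
    x = filtfilt(b, a, x)                     # zero-phase filtering
    return np.max(np.abs(x)) <= limit_uv      # 30 uV absolute-value test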
Acknowledgements We thank David Poeppel for suggesting the value of an experiment like this. This work was supported by ARO 54228-LS-MUR.
References

1. Dewan, E.M.: Occipital alpha rhythm, eye position and lens accommodation. Nature 214, 975–977 (1967)
2. Farwell, L.A., Donchin, E.: Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology 70, 510–523 (1988)
3. Suppes, P., Lu, Z.-L., Han, B.: Brain wave recognition of words. Proceedings of the National Academy of Sciences USA 94, 14965–14969 (1997)
4. Suppes, P., Han, B., Lu, Z.-L.: Brain wave recognition of sentences. Proceedings of the National Academy of Sciences USA 95, 15861–15866 (1998)
5. Morse, M.S., O'Brien, E.M.: Research summary of a scheme to ascertain the availability of speech information in the myoelectric signals of neck and head muscles using surface electrodes. Computers in Biology and Medicine 16, 399–410 (1986)
6. Chan, A.D.C., Englehart, K., Hudgins, B., Lovely, D.F.: Hidden Markov model classification of myoelectric signals in speech. IEEE Engineering in Medicine and Biology 21, 143–146 (2002a)
7. Chan, A.D.C., Englehart, K., Hudgins, B., Lovely, D.F.: Multiexpert automatic speech recognition using acoustic and myoelectric signals. IEEE Transactions on Biomedical Engineering 53, 676–685 (2002b)
8. Bu, N., Tsuji, T., Arita, J., Ohga, M.: Phoneme classification for speech synthesizer using differential EMG signals between muscles. In: Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, pp. 5962–5966 (2005)
9. Jou, S.-C., Maier-Hein, L., Schultz, T., Waibel, A.: Articulatory feature classification using surface electromyography. In: Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings, pp. I-605–I-608 (2006)
10. Jou, S.-C., Schultz, T., Walliczek, M., Kraft, F., Waibel, A.: Towards continuous speech recognition using surface electromyography. In: Interspeech 2006 – ICSLP, pp. 573–576 (2006)
11. Jorgensen, C., Lee, D.D., Agabon, S.: Sub-auditory speech recognition based on EMG signals. In: Proceedings of the International Joint Conference on Neural Networks, Portland, Oregon (July 2003)
12. Betts, B.J., Jorgensen, C.: Small vocabulary recognition using surface electromyography in an acoustically harsh environment. NASA/TM-2005-213471 (2005)
13. Maier-Hein, L.: Speech recognition using surface electromyography. Diplomarbeit, Universität Karlsruhe (2005)
14. Maier-Hein, L., Metze, F., Schultz, T., Waibel, A.: Session independent non-audible speech recognition using surface electromyography. In: 2005 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 331–336 (2005)
15. Jorgensen, C., Binsted, K.: Web browser control using EMG based sub vocal speech recognition. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS 2005), pp. 294–301 (2005)
16. Binsted, K., Jorgensen, C.: Sub-auditory speech recognition. In: HICSS (2006)
17. Walliczek, M., Kraft, F., Jou, S.-C., Schultz, T., Waibel, A.: Sub-word unit based non-audible speech recognition using surface electromyography. In: Interspeech 2006 – ICSLP, pp. 1487–1490 (2006)
18. Numminen, J., Curio, G.: Differential effects of overt, covert and replayed speech on vowel-evoked responses of the human auditory cortex. Neuroscience Letters 272(1), 29–32 (1999)
19. Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., Merzenich, M.M.: Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences 98, 13367–13372 (2001)
20. Houde, J.F., Nagarajan, S.S., Sekihara, K., Merzenich, M.M.: Modulation of the auditory cortex during speech: an MEG study. Journal of Cognitive Neuroscience 15, 1125–1138 (2002)
21. Luo, H., Poeppel, D.: Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010 (2007)
22. Nunez, P.L., Srinivasan, R.: Electric Fields of the Brain: The Neurophysics of EEG, 2nd edn. Oxford University Press, New York (2006)
23. Barras, C.: Brain implant helps stroke victim speak again. New Scientist (July 2008)
24. Indefrey, P., Levelt, W.J.M.: The spatial and temporal signatures of word production components. Cognition 92, 101–144 (2004)
25. Hickok, G., Poeppel, D.: Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92, 67–99 (2004)
26. Hickok, G., Poeppel, D.: The cortical organization of speech processing. Nature Reviews Neuroscience 8, 393–402 (2007)
27. Poeppel, D., Idsardi, W.M., van Wassenhove, V.: Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society B 363, 1071–1086 (2007)
28. Fetz, E.: Volitional control of neural activity: implications for brain-computer interfaces. Journal of Physiology 579, 571–579 (2007)
Monitoring and Processing of the Pupil Diameter Signal for Affective Assessment of a Computer User
Ying Gao, Armando Barreto, and Malek Adjouadi
Department of Electrical and Computer Engineering, Florida International University, Miami, FL 33174, USA
{ygao002,barretoa,adjouadi}@fiu.edu
Abstract. The pupil diameter (PD) has been found to respond to cognitive and emotional processes. However, the pupillary light reflex (PLR) is known to be the dominant factor in determining pupil size. In this paper, we attempt to minimize the PLR-driven component in the measured PD signal through an Adaptive Interference Canceller (AIC) with the H∞ time-varying (HITV) adaptive algorithm, so that the output of the AIC, the Modified Pupil Diameter (MPD), can be used as an indication of the pupillary affective response (PAR) after some post-processing. The results of this study confirm that the AIC with the HITV adaptive algorithm is able to minimize the PD changes caused by the PLR to an acceptable level, facilitating the affective assessment of a computer user through the resulting MPD signal.

Keywords: Pupil diameter (PD), Pupillary light reflex (PLR), Pupillary affective response (PAR), Adaptive Interference Canceller (AIC), H∞ time-varying (HITV) adaptive algorithm.
psychological factors controlling pupil size, such as emotional processes, have recently been investigated. For example, in 2003, Partala and Surakka found, using auditory emotional stimulation, that pupil size variation can also be seen as an indication of affective processing during human-computer interaction [3]. Therefore, in this paper, we focus our study on monitoring and processing the pupil diameter (PD) signal for the affective assessment of a computer user. To achieve the required separation of the PLR-driven component from PD changes due to Pupillary Affective Responses (PAR), we propose an Adaptive Interference Canceller (AIC), which is known to be able to remove an unwanted interference component z(k) that pollutes a measured signal s(k), using an independent measurement of the interference n(k) [4] (see Figure 1). The core portion of this AIC system is an Adaptive Transversal Filter (ATF), which performs the adaptive algorithm that implements the interference-cancelling function. To ensure successful noise removal by the AIC, it is important to select an adaptive algorithm that possesses good robustness properties. The H∞ adaptive algorithm, introduced in robust control theory, is an attempt at addressing this need, with features that safeguard against the worst case of model uncertainties and make no assumption about the (statistical) nature of the signals [5].
Thus, we propose to use an H∞-based adaptive technique, namely the H∞ time-varying (HITV) algorithm, in the Adaptive Interference Canceller for the removal of the PLR-driven component from the pupil diameter variation. Our intent is to use the output of the AIC, the Modified Pupil Diameter (MPD), further refined by additional processing (median evaluation over a sliding window), to indicate the occurrence of affective changes (e.g., the onset of stress) in the computer user.
2 Methods

2.1 AIC Overview

A basic AIC block diagram is illustrated in Figure 1, above. The signal of interest, polluted with an uncorrelated noise signal, is transmitted over the top channel and constitutes the primary input to the AIC. The bottom channel receives a signal which is uncorrelated with the signal of interest but correlated with the interference, constituting the reference input. The output of the AIC is expected to provide a signal from which the correlated interference has been removed. As described in the previous paragraph, the key equations describing the AIC are:

Primary input:    d(k) = s(k) + z(k)    (1)
Reference input:  r(k) = n(k) + u(k)    (2)
Output:           e(k) = d(k) − y(k) = d(k) − ẑ(k) = ŝ(k)    (3)
where:
• s(k) is the signal of interest (PD driven by PAR);
• z(k) is the interference in the primary sensor (PD driven by PLR);
• n(k) is the actual source of the interference (illumination changes);
• u(k) is the measurement noise (in this study, u(k) is assumed to be zero).
In the AIC system, the core element is an adaptive transversal filter (ATF), in which the reference signal r(k) is processed to produce an output y(k) = ẑ(k) that is an approximation of z(k). The state-space model of the ATF is given by [5]:

w(k+1) = w(k) + Δw(k)    (4)
d(k) = rᵀ(k) w(k) + v(k)    (5)
z(k) = rᵀ(k) w(k) + υ(k)    (6)
v(k) = s(k) + υ(k)    (7)

In these equations:
• w(k), the system state vector, is the ATF coefficient vector of size m × 1 (m is the order of the ATF);
• d(k), the measurement sequence, is the observed pupil diameter (PD) signal;
• z(k) is the sequence to be estimated;
• Δw(k), the process noise vector, represents the time variation of the ATF weights w(k);
• v(k), the measurement noise vector, includes s(k) (PD driven by affective changes) and the model uncertainties υ(k);
• r(k) = [r(k), r(k−1), ..., r(k−m+1)]ᵀ is the interference vector of size m × 1.
As shown in Figure 1, for this study we define the primary input of the AIC as the recorded pupil diameter signal, which is composed of the signal of interest s(k) (PD driven by PAR) and the interference z(k) (PD driven by PLR). Although the reference input comprises both the actual source of the interference n(k) (actual illumination changes) and the measurement noise u(k), under the assumption that the measurement noise is negligible, an independent measurement of illumination in the neighborhood of the eye of the subject (IL) is used as the reference input.
We expect the adaptive transversal filter (ATF) to emulate the transformation of the illumination variations into pupil diameter changes, which would convert the noise n(k) into a close-enough replication of the PLR-driven components of PD (the output y(k)). Therefore, the error e(k) (the Modified Pupil Diameter) would be the estimate of the desired signal s(k), i.e., the PD variations due exclusively to affective changes (PAR).

2.2 H∞ Time-Varying Adaptive Algorithm
In this study, the adaptive algorithm we applied in the Adaptive Interference Canceller system is the H∞ time-varying (HITV) adaptive algorithm, which aims to remove the noise from the recorded signal by adaptively adjusting the impulse response of the ATF. The robustness of this HITV algorithm is derived from its minimization of the maximum energy gain from the disturbances to the estimation errors, with the following solutions [6]:

P̃⁻¹(k) = P⁻¹(k) − εg⁻² r(k) rᵀ(k)
Here, g(k) is the gain factor; εg, η and ρ are positive constants. Note that ρ reflects a priori knowledge of how rapidly the state vector w(k) varies with time, and η reflects a priori knowledge of how reliable the initial estimate of the state vector w(0) is.
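Since only part of the HITV update survives in the text above, the following sketch substitutes a standard normalized-LMS (NLMS) adaptive transversal filter to illustrate the canceller's structure; it is not the authors' H∞ algorithm, and the step size and regularization constant are arbitrary assumptions.

import numpy as np

def adaptive_interference_canceller(d, r, m=120, mu=0.1, eps=1e-6):
    """d: primary input (measured PD); r: reference input (illumination).
    Returns e, the interference-cancelled output (the MPD estimate)."""
    w = np.zeros(m)               # ATF coefficient vector w(k)
    e = np.zeros(len(d))
    for k in range(m, len(d)):
        rk = r[k - m:k][::-1]     # reference vector r(k), most recent first
        y = w @ rk                # y(k) = estimate of the interference z(k)
        e[k] = d[k] - y           # e(k) = d(k) - y(k), approximates s(k)
        w += mu * e[k] * rk / (rk @ rk + eps)  # NLMS weight update
    return e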
3 Experiments

3.1 Subjects

We have collected data from twenty-two volunteer students who completed the protocol described below. All the subjects reported having normal color vision and had experience using computers. This paper presents results achieved by applying the proposed processing methods to a subset of the subject pool recorded to date.

3.2 Task
In order to observe the response of pupil diameter changes to affective stimuli, the "Stroop Color-Word Interference Test", implemented as a Flash program [7], was used to elicit mild mental stress in the participating subjects during the experiment.
Fig. 2. Sample of Stroop Test Interface
Fig. 3. Stimuli schedule of the Stroop Test
In the test, a word presented to the subject designates a color that may or may not match the font color of the word. The subjects are instructed automatically by the program to select (out of 5 possible choices) the screen button that indicates the font color of the word presented, by clicking on it (an example showing the appearance of the test interface is given in Figure 2). The stimuli schedule of the test in the experiment is shown in Figure 3. The complete protocol is composed of three consecutive sections, each containing four segments:

• 'IS' – the Introductory Segment, to let the subject get used to the task environment;
• 'C' – the Congruent segment of the Stroop Test, in which the subject was asked to click the on-screen button naming the font color of a word that correctly spelled the font color being displayed;
• 'IC' – the Incongruent segment of the Stroop Test, in which the subject was asked to click the on-screen button naming the font color of a word that spelled the name of a different color;
• 'RS' – a Resting Segment, to let the subject relax for some time.
The incongruent Stroop segments (IC) were expected to elicit mild mental stress in the subject, according to previous research in the psychophysiological literature [8]. In contrast, the congruent Stroop segments (C) were expected to allow the subject to continue in a relaxed state. The binary numbers shown in Figure 3 represent the demultiplexed output of the stimulus generator, which is used to insert the corresponding values (1, 2, 3) into the event channel of the PD data file, and the corresponding time marks into the illumination measurement data, recorded simultaneously.

3.3 Instruments
In this study, a desk-mounted eye tracking system (TOBII T60) was used to measure the pupil diameter signals from both eyes of the subjects at 60 samples/sec. The average ((L+R)/2) was recorded as the "Measured PD" signal, which corresponds to d(k) in Figure 1. Simultaneously, the illumination intensity level present in the area around the eyes of the subjects was recorded by a system built for that purpose. This system is composed of a BS500B0F photo-diode (Sharp), placed on the forehead of the subject and connected to an amplification circuit that provides an analog output voltage proportional (~0.0043 V/lux) to the illumination intensity level [9]. The "Measured IL" signal, r(k), shown in Figure 1, was finally obtained by sampling the analog
output of the luminance meter at a frequency of 360 Hz with a Data Acquisition (DAQ) system (PCI-DAS6023 board from Measurement Computing Co.).

3.4 Procedure
In our experiments, participants were asked to remain seated in front of the TOBII screen, interacting with the "Stroop Test" for about 30 minutes, while wearing a headband with the photo-diode. During that time, all the normal lights in the room were kept on, but an additional level of illumination provided by a desk lamp placed above the eye level of the subject was switched ON and OFF alternately, at intervals not previously known by the subject, using a dimmer. This was done to repeatedly introduce passages of high and low illumination into the experiment, which would trigger the pupillary light reflex.
4 Results

Before its application to the adaptive interference canceller, the recorded pupil diameter signal is pre-processed by a blink-removal algorithm implemented in MATLAB, which is able to:

1. Detect the PD data interruptions due to eye blinks (identified as a value of "4" in the validity code provided by the TOBII system);
2. Compensate for the missing data by linear interpolation;
3. Filter out the blink responses through a low-pass, 512th-order FIR filter designed for a cutoff frequency of 0.13 Hz.

Figure 4 illustrates the stages of the blink-removal process on data collected from subject 13.
Fig. 4. PD signal before and after blink removal. (Panels, top to bottom: the measured PD signal; the validity codes from both eyes; the PD signal after blink compensation; and the PD signal after the low-pass filter, with segments C1, IC1, C2, IC2, C3 and IC3 marked.)
The PD signal obtained after blink removal is applied to the AIC system as the primary input signal d(k), and the reference input r(k) is the simultaneously measured illumination intensity level signal, which is down-sampled from 360 Hz to 60 Hz to share the same sampling rate as the PD signal. A MATLAB program was created to apply the HITV adaptive algorithm for the ATF with 120 weights and the parameter settings η = 0.001 and εg = 2.0, as well as a time-varying parameter ρ, changed according to the IL value to give the AIC system a quicker response when there is a sudden increase in IL. The output of the AIC, the MPD, is shown in the bottom plot of Figure 5. This signal is further processed as described below to become a useful indicator of pupillary affective response in the subject.
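The down-sampling step alone might look like this minimal sketch; the variable names are hypothetical and the use of an anti-aliasing decimator is our assumption.

import numpy as np
from scipy.signal import decimate

il_360hz = np.random.rand(360 * 60)  # stand-in: 60 s of IL at 360 Hz
il_60hz = decimate(il_360hz, q=6)    # anti-aliased downsampling to 60 Hz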
Fig. 5. AIC implementation with the HITV algorithm. (Panels, top to bottom: the AIC primary input, the PD signal; the AIC reference input, the IL signal; the AIC time-varying parameter ρ; and the AIC output, the MPD signal, with segments C1, IC1, C2, IC2, C3 and IC3 marked.)
The affective state of stress is expected to cause a dilation of the pupil [2]. Therefore, the negative portions of the MPD signal are zeroed to isolate significant MPD increases, which indicate the emergence of stress in the subject. The result of applying this non-negative restriction to the MPD signal is shown in the bottom panel of Figure 6.
A sliding window with a width of 1200 samples is applied throughout the non-negative MPD signal to calculate the median value within each window (a sketch of this post-processing is given after the figure captions below). The effect of this process on both the original PD signal and the non-negative MPD signal is compared in Figure 7. In this figure, it is clear that the significant increases isolated in the processed MPD signal correlate closely with the occurrence of IC segments, regardless of the presence of higher-illumination passages during segments IC2 and C3. It should also be noted that the same post-processing applied to the PD signal obtained directly from the eye gaze tracking system does not set apart the IC segments as clearly.
Fig. 6. MPD signal non-negative processing for stress indication. (Panels: the AIC output, the MPD signal; and the non-negative MPD signal, with segments C1, IC1, C2, IC2, C3 and IC3 marked.)

Fig. 7. Comparison of the sliding-window median analysis on PD and MPD. (Panels: the sliding-window median analysis of PD; and the sliding-window median analysis of MPD, with segments C1, IC1, C2, IC2 and C3 marked.)
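A sketch of the post-processing just described, i.e., the non-negative restriction followed by a 1200-sample sliding-window median (20 s at 60 samples/s); the edge handling at the ends of the record is an assumption.

import numpy as np

def stress_indicator(mpd, window=1200):
    nn = np.maximum(mpd, 0.0)                  # non-negative restriction
    out = np.empty_like(nn)
    half = window // 2
    for k in range(len(nn)):
        lo, hi = max(0, k - half), min(len(nn), k + half)
        out[k] = np.median(nn[lo:hi])          # sliding-window median
    return out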
Fig. 8. Signal processing results of Subject 11. (Traces, top to bottom: GSR, MPD, non-negative MPD, and the sliding-window median of MPD, with segments C1, IC1, C2, IC2, C3 and IC3 marked.)
A similar result is observed for the data collected from Subject 11, as shown in Figure 8, where the three top signals (GSR, which was recorded simultaneously, MPD and non-negative MPD) have been shifted up to make the display clear. In this case, also, the most significant increases of the bottom trace, after the initial adaptation that precedes segment C1, occur during the incongruent segments (IC1, IC2 and IC3). The figure shows the appearance of Skin Conductance Responses (SCRs) and an overall elevation of the GSR signal during the incongruent segments. Furthermore, the bottom plot (the result of the proposed post-processing) also shows significant increases during IC1, IC2 and IC3, with minimal output in the other segments after the initial adaptation.
5 Conclusions

This study implemented an H∞ time-varying adaptive algorithm in an Adaptive Interference Canceller to discount the influence of the Pupillary Light Reflex from a measured PD signal, so that the output of the AIC, the MPD, with the application of a non-negative constraint and a sliding-window median analysis, can be used as an indicator of Pupillary Affective Responses due to, for example, subject stress. This indicates that the approach might be useful for the affective assessment of a computer user, even in the presence of illumination changes. A comparison of this result with the outcome obtained by applying the same post-processing directly to the recorded pupil diameter signal points out the advantage of the adaptive implementation, whose output showed relatively more distinctive increases when the incongruent Stroop segments occurred. Data from other subjects who participated in our experiment have revealed similar results. These outcomes, therefore, encourage the continued exploration of adaptive processing algorithms applied to pupil diameter signals as a non-invasive mechanism to achieve affective assessment of computer users in ordinary environments.

Acknowledgements. This work was sponsored by NSF grants CNS-0520811, HRD-0833093 and CNS-0426125. Ms. Ying Gao is the recipient of a Dissertation Year Fellowship from Florida International University.
References

1. Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997)
2. Beatty, J., Lucero-Wagoner, B.: The Pupillary System. In: Cacioppo, Tassinary, Berntson (eds.) Handbook of Psychophysiology, 2nd edn., pp. 142–162. Cambridge University Press, Cambridge (2000)
3. Partala, T., Surakka, V.: Pupil size variation as an indication of affective processing. International Journal of Human-Computer Studies 59, 185–198 (2003)
4. Widrow, B., Stearns, S.D.: Adaptive Signal Processing, pp. 337–339. Prentice-Hall, Englewood Cliffs (1985)
5. Hassibi, B., Kailath, T.: H∞ adaptive filtering. In: Proc. Int. Conf. Acoustics, Speech, Signal Processing, pp. 945–952 (1995)
6. Puthusserypady, S.: H∞ adaptive filters for eye blink artifact minimization from electroencephalogram. IEEE Signal Processing Letters 12, 816–819 (2005)
7. Zhai, J., Barreto, A.: Significance of Pupil Diameter Measurements for the Assessment of Affective State in Computer Users. Biomedical Sciences Instrumentation 42, 495–500 (2006)
8. Renaud, P., Blondin, J.-P.: The stress of Stroop performance: physiological and emotional responses to color-word interference, task pacing, and pacing speed. International Journal of Psychophysiology 27, 87–97 (1997)
9. Todoroki, A., Hana, N.: Luminance change method for cure monitoring of GFRP. Key Engineering Materials 321–323, 1316–1321 (2006)
Usability Evaluation by Monitoring Physiological and Other Data Simultaneously with a Time-Resolution of Only a Few Seconds
Károly Hercegfi, Márton Pászti, Sarolta Tóvölgyi, and Lajos Izsó
Budapest University of Technology and Economics, Department of Ergonomics and Psychology, Egry J. u. 1, 1111 Budapest, Hungary
{hercegfi,tim,ebediyen,izsolajos}@erg.bme.hu
Abstract. This paper outlines the INTERFACE methodology developed by researchers of our department. It is based on the simultaneous assessment of Heart Period Variability (HPV), Skin Conductance (SC), and other data. The objective and significance of this paper are (1) showing its capability of identifying quality attributes of software elements with a time-resolution of only a few seconds and (2) presenting its practical applicability in the evaluation phase of a real software development process. The Department of Ergonomics and Psychology at the Budapest University of Technology and Economics carried out a contract-based applied research project for the Generali-Providencia Insurance Co. Ltd. The Company was in the process of further developing the software used in its customer centers, and our Department contracted to assess the user interface. Both analytical and empirical usability evaluation methods were applied. In this paper, we highlight the new experiences gained with the INTERFACE testing methodology.

Keywords: Usability testing and evaluation, empirical methods, case study, Heart Period Variability (HPV), Skin Conductance (SC).
This software is used by approximately 1500 users, 300 of whom use it daily. Most of the users work in personal customer centers; a smaller part work as call center operators. Although the latter is a smaller group, the time pressure of their job underlines the importance of usability factors. The Department of Ergonomics and Psychology at the Budapest University of Technology and Economics joined the software development process to carry out the usability assessment. The main goal of our project was to collect problems of the user interface (UI). We compiled these findings into a list that was given to the developers; developing solutions for the problems found can be a subsequent project. A further goal of the Company was to support a scientifically interesting research project, transferring the results of foundational research into a methodology genuinely applicable in a real software development process.
2 Applied Methods

The preparation of the project started in November 2007. The contract was signed in March 2008. We got in touch with the developers at the beginning of April 2008. We started our project by studying the software and collecting data and facts. While we were studying it, the system was updated, so it was sometimes difficult to stay up to date. In the second half of April we started to observe the users of the software, and we also interviewed them. We performed the observations and interviews at three locations: the firm's biggest customer center, a smaller customer center of a different type, and a call center. We obtained some logfile data to establish objective initial data for the subsequent assessment. We applied analytical methods (usability inspection methods) in May and June 2008. The backbone of the analytical evaluation was a Guideline Review, supported by Cognitive Walkthrough elements. A GOMS model-based analysis was also carried out. The main part of the assessment was a carefully planned, deep empirical series of experiments applying the INTERFACE methodology described in the following section. The series of experiments was carried out in July 2008; the analysis of the collected records was performed in August 2008.
3 Description of Our Main Methodology: The INTERFACE

Figure 1 shows the conceptual arrangement of the INTERFACE (INTegrated Evaluation and Research Facilities for Assessing Computer-users' Efficiency) workstation.
Fig. 1. Conceptual arrangement of the INTERFACE user interface testing workstation. (The data collecting and processing frame system receives the observable behavior, the current screen content, keystrokes and mouse clicks, and the physiological signals recorded by ISAX.)
The advantage of the methodology applied in our study lies in its capability of recording continuous on-line data characterizing the user's current mental effort (derived from Heart Period Variability, HPV) and the user's emotional state (indicated by Skin Conductance, SC, parameters), simultaneously and synchronized with other characteristics of Human-Computer Interaction (HCI). This way, a very detailed picture can be obtained, which serves as a reliable basis for a deeper understanding and interpretation of the psychological mechanisms underlying HCI. Elementary steps of HCI, like the different mental actions of users followed by a series of keystrokes and mouse clicks, are the basic and usually critical components of using software. These steps can be modeled and analyzed by experts, but empirical studies of real users' interactions often highlight new HCI issues or give more objective results than expert analyses. One of the key aspects of the empirical methods is measuring mental effort, as laid down, e.g., in the earlier international standard of software product evaluation (ISO/IEC 9126:1991). Hence we need methods capable of monitoring users' current mental effort during these elementary steps. To attain the above, a complex methodology was developed earlier at the Budapest University of Technology and Economics by Prof. Lajos Izsó and his team [3, 4, 5]. This study presents an improved methodology and a new case study. The INTERFACE simultaneously investigates the following:

• Users' observable actions and behavior
  − keystroke and mouse events;
  − video record of the current screen content;
  − video records of users' behavior: (1) mimics, (2) posture and gestures.
• Psycho-physiological parameters
  − the power spectrum of Heart Period Variability (HPV), regarded as an objective measure of current mental effort – we have applied this signal successfully for more than 15 years [1, 2, 3, 4];
  − Skin Conductance (SC) parameters, indicating mainly emotional reactions – recently integrated into our system.
A number of studies [1, 2, 3, 5, 7, 8, 9, 10] have shown that an increase in mental effort causes a decrease in the mid-frequency (MF) peak of the HPV power spectrum. The main advantage of the assessment method of the spectral components integrated into our system, over previously existing HPV-based methods, is that the MF component of HPV shows changes in mental effort in the time range of several seconds (as opposed to the earlier methods, with a resolution of tens of seconds at best). This feature was achieved by an appropriate windowing data-processing technique and the application of an all-pole auto-regressive model with built-in recursive Akaike's Final Prediction Error criteria and a modified Burg's algorithm. We monitor the Alternating Current (AC) component of the Skin Conductance (SC) responses, focusing mainly on the emotional aspects of the HCI, in addition to our well-tried approach to mental effort. An interesting series of experiments analyzing SC responses has been completed by one of our colleagues [6]. It is a good example of a promising way to use data-mining techniques in empirical usability studies.
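The system's HPV profile uses the all-pole autoregressive estimator with a modified Burg algorithm described above, which we do not reproduce here. As a simpler stand-in, the following sketch estimates mid-frequency HPV power with Welch's method; the MF band edges (~0.07–0.14 Hz) and the uniform resampling rate are our assumptions, not the authors' settings.

import numpy as np
from scipy.signal import welch

def mf_power(rr_ms, fs_resampled=4.0):
    """rr_ms: RR intervals in ms, resampled to a uniform rate (Hz).
    Returns the mid-frequency HPV power in ms^2."""
    f, pxx = welch(rr_ms - np.mean(rr_ms), fs=fs_resampled,
                   nperseg=min(len(rr_ms), int(60 * fs_resampled)))
    band = (f >= 0.07) & (f <= 0.14)   # assumed MF band
    return np.trapz(pxx[band], f[band])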
Fig. 2. The experimental arrangement applied during the sessions of the INTERFACE usability testing, installed on a standard workstation of the call center. (Annotated elements: the participant, a call center operator with the IP headset regularly used in the call center, seated at a standard workstation running the currently tested software, with the Skin Conductance (SC) electrodes visible on the left hand and the ECG electrodes on the torso; a motorized, zoomable camera recording the facial expressions; a camera recording the body posture; the ISAX equipment recording the physiological signals; and the computer of the experimenter, on which the online curves of the physiological signals, the video images of the cameras and the editor window for the comments can be seen during the session.)
4 Applying INTERFACE in the Current Series of Experiments

The empirical sequence of experiments applying the previously introduced INTERFACE assessment was performed with the call center operators in the middle of July 2008. Due to the impersonality caused by the phone calls, on the one hand, the simulation was more authentic than it could have been in the personal customer centers. On the other hand, because of the specifics of the call center, we could count on concentrated, quick problem solving. What is more, usability problems are the most critical in call centers due to the time pressure. The tasks arising in the call center are usually not solved by means of Genesys alone, but the other software packages were not examined in this project.
Fig. 3. The INTERFACE Viewer screen with a record of the empirical test of the Genesys software. (Annotated elements: the screen just seen by the user; two cameras showing facial expression and body posture; keyboard and mouse actions; the experimenter's comments; the upper, blue curve is the AC component of the Skin Conductance (SC), where higher deviation means a more emotional event; the remaining signals are derived from the ECG and relate to mental effort: the red RR curve in the middle shows the periods between subsequent heart beats in ms, and the last, green profile curve shows the Mid-Frequency (MF) power of the variability of the RR curve, whose low values mean significant mental effort and whose peaks mean relief and relaxation.) As can clearly be seen, the user is currently making a significant mental effort, shown by the facial expression and gesture and by the low value of the green MF profile curve at the cross-hair.
However, we had to give the users controlled, simulated tasks so that they could solve them within the Genesys system. These simulated tasks enabled us to compare the 12 sessions. We used the version of Genesys from the test server, which is substantially equal to the version actually in use. 12 real operators were involved as participants, and a one-hour-long session was recorded with each of them. The quantity of data gained from these sessions is really significant, given the depth of the enquiry. Due to the real-life situation, the users participating in the series of experiments were more or less disturbed by their colleagues' calls and talks. Since these are the employees' real conditions, they are used to them, and so they could work at typical, real workstations of their workplace. These advantages gave reason not to hold this usability testing in a laboratory. Nevertheless, we chose a workstation located in the corner of the operators' room, in order not to disturb the others. Behind this workstation is the team leader's glass wall, so our staff could sit and make simulated phone calls from behind this pane. As mentioned before, during the recorded sessions the users were given tasks from real life, with quasi-real data, names, problems, questions, etc.; the main difference was that the customers were not real customers but members of our staff. Three ECG electrodes were put on the users' torso and one electrode on the left hand (in the case of a right-handed person) for measuring Skin Conductance. After that, the users put on their headsets and adjusted their seats. Figure 2 shows the experimental arrangement applied during the sessions of the INTERFACE usability testing installed on a standard workstation of the call center. Figure 3 shows the INTERFACE Viewer screen with a record of the empirical test of the Genesys software. At the beginning of the session, we asked the users to relax for two minutes. We told them the aim and the details of the assessment. We always emphasized that we were assessing not the participants themselves but the Genesys software, with their help. Relaxation was followed by two minutes of mental effort: mental arithmetic. The result of the counting was not important; we only wanted to generate mental effort. These periods were planned for "calibrating" the physiological curves. After that, the hard work started. The first customer (from our staff) rang the phone and asked some questions. To answer these questions, the operator had to use the software under testing. Later, four more similar calls came. Each call contained 2 to 4 questions. Some of the questions were just to "warm up"; others were really difficult. The subtasks were based on the interviews, observations and expert analyses performed earlier. The last part of the INTERFACE session was an interview.
5 Validation

As mentioned, the periods of relaxation and mental arithmetic were planned for "calibrating" the physiological curves. The curves shown in the upper part of Figure 4 were recorded during session #11; the ones shown at the bottom were recorded during session #10.
Fig. 4. The typical pattern of the relaxation and the mental arithmetic in the cases of two participants. (Each record is annotated with relaxation, mental arithmetic, and relief sections.)
In both cases, the three curves are the blue curve of the AC of Skin Conductance (SC), the red RR curve (heart periods), and the green profile curve of the Mid-Frequency (MF) power of Heart Period Variability (HPV). The blue curve of the AC of SC is relatively smooth during both the relaxation and the mental arithmetic. During these sections there are no emotional peaks, and these two participants can be characterized as the "stable" type according to the typology of physiology. However, the beginnings and the ends of the sections are followed by peaks. During relaxation, the MF component of the HPV increases: the red RR curve shows zigzags, and the green profile curve is relatively high. (In the case of perfect relaxation, the profile curve should be consistently high. However, this is not expected in this experimental situation. The curve can be considered high, especially in comparison with the next section.) During mental arithmetic, the red RR curve gets smoother, and the green profile curve is significantly low. After the "calibration" tasks, the participants experience genuine relief. During this short period of relief, the participants get more relaxed than during the conscious, intended relaxation: the green curves have their highest peaks here.
These "calibration" tasks provide a validation of our method. The values of the MF power of HPV were significantly higher during relaxation than during mental arithmetic. A non-parametric statistical method, the Wilcoxon Signed Ranks Test, confirms the difference (sig. 0.037; Figure 5).
Fig. 5. Validation of measuring the Mid-Frequency (MF) power of Heart Period Variability (HPV) as an indicator of mental effort: the mean MF power of HPV [ms²] was significantly higher during relaxation than during mental arithmetic (sig. 0.037).
It is a significant difference, in spite of the imperfect relaxation. However, the mental arithmetic task works even better: the difference between the values of the MF power of HPV during mental arithmetic and those during the whole software usage section in general is more significant, with the Wilcoxon test giving sig. 0.002. The values of the deviation of the AC component of Skin Conductance (SC) do not differ significantly between the relaxation and the mental arithmetic. As described earlier, this is the expected result. However, the deviations of the AC of SC during the relaxation and the mental arithmetic are significantly lower than during the whole software usage section in general: the Wilcoxon tests give sig. 0.009 and sig. 0.017. Given these results, we can say that a low value of the curve of the MF power of HPV really means mental effort, and a high deviation of the AC of SC probably means heightened emotions. Then, in the software usage section, we look for moments with relatively high (and unwanted) mental effort and high (unwanted, and not positive) emotions. This method gives us the key to finding the problems of the UI.
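The validation test reported above might be reproduced along these lines; the twelve per-session values below are placeholders matching the number of sessions, not the study's data.

from scipy.stats import wilcoxon

# Hypothetical per-session mean MF power of HPV [ms^2].
mf_relax = [95.0, 60.2, 71.4, 55.1, 80.3, 66.7,
            73.9, 58.8, 90.1, 62.5, 77.0, 69.3]
mf_arith = [40.1, 35.7, 50.2, 30.9, 45.6, 38.4,
            42.0, 33.3, 48.8, 36.1, 44.5, 39.7]

# Paired, non-parametric comparison of the two "calibration" periods.
stat, p = wilcoxon(mf_relax, mf_arith)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.3f}")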
6 A Sample UI Problem Identified by the INTERFACE Methodology

Commercial sensitivities prevent publication of most of the details of the particular software problems found. However, Figure 6 gives an illustration.
Fig. 6. The 11th participant during the first task of the second call. The mental effort can clearly be seen: it is shown by the facial expression and gesture, and by the low value of the last, green profile curve of the Mid-Frequency (MF) power of the Heart Period Variability (HPV) at the cross-hair. In this case, the problem was caused by a poor design solution for choosing a time period for a list view. This problem can also be found by analytical methods, but the INTERFACE highlighted it and gave objective evidence for it.
7 Conclusion

Based on the results presented here as well as in related papers, it can be stated that the INTERFACE methodology in its present form is capable of identifying the relative weak points of the HCI. With this methodology and the related workstation, it was possible to study events occurring during the HCI with a level of detail and objectivity that would not have been possible using other methods presently known to us. The sophisticated Heart Period Variability (HPV) profile function integrated into the INTERFACE system is a powerful tool for monitoring events in such a narrow time frame that it can practically be considered a time-continuous recording of the relevant elementary events. Measuring the Skin Conductance (SC) is a new opportunity to complement the results.

Acknowledgements. The authors would like to thank Dr. Eszter Láng for the earlier developments, the Generali-Providencia Insurance Co. Ltd. for the support of the deep research, and the participants of the series of experiments for their valuable contribution.
References

1. Chen, D., Vertegaal, R.: Using Mental Load for Managing Interruptions in Physiologically Attentive User Interfaces. In: Proc. CHI 2004, pp. 1513–1516. ACM Press, New York (2004)
2. Hercegfi, K., Kiss, O.E., Bali, K., Izsó, L.: INTERFACE: Assessment of Human-Computer Interaction by Monitoring Physiological and Other Data with a Time-Resolution of Only a Few Seconds. In: Proc. ECIS 2006, ECIS Standing Comm., pp. 2288–2299 (2006)
3. Izsó, L.: Developing Evaluation Methodologies for Human-Computer Interaction. Delft University Press, Delft (2001)
4. Izsó, L., Hercegfi, K.: HCI Group of the Department of Ergonomics and Psychology at the Budapest University of Technology and Economics. In: Ext. Abstracts CHI 2004, pp. 1077–1078. ACM Press, New York (2004)
5. Izsó, L., Láng, E.: Heart Period Variability as Mental Effort Monitor in Human Computer Interaction. Behaviour & Information Technology 19(4), 297–306 (2000)
6. Laufer, L., Németh, B.: Predicting User Action from Skin Conductance. In: Proc. IUI 2008, pp. 357–360. ACM Press, New York (2008)
7. Lin, T., Imamiya, A.: Evaluating Usability Based on Multimodal Information: An Empirical Study. In: Proc. ICMI 2006, pp. 364–371. ACM Press, New York (2006)
8. Mulder, G., Mulder-Hajonides van der Meulen, W.R.E.H.: Mental Load and the Measurement of Heart Rate Variability. Ergonomics 16, 69–83 (1973)
9. Orsilia, R., Virtanen, M., Luukkaala, T., Tarvainen, M., Karjalainen, P., Viik, J., Savinainen, M., Nygard, C.-H.: Perceived Mental Stress and Reactions in Heart Rate Variability – A Pilot Study Among Employees of an Electronics Company. International Journal of Occupational Safety and Ergonomics (JOSE) 14(3), 275–283 (2008)
10. Rowe, D.W., Sibert, J., Irwin, D.: Heart Rate Variability: Indicator of User State as an Aid to Human-Computer Interaction. In: Proc. CHI 1998, pp. 18–23. ACM Press, New York (1998)
Study of Human Anxiety on the Internet
Santosh Kumar Kalwar and Kari Heikkinen
Lappeenranta University of Technology, Department of Information Technology, Lappeenranta, Finland
[email protected], [email protected]
Abstract. In this paper a conceptualization of human anxiety on the Internet is introduced; it is built on an understanding of human behavior with regard to technology. The objective of this paper is to conceptualize human anxiety. An integral part of this understanding is an inter-disciplinary (psychological science, cognitive science, behavioral science and communication technology) literature review, of which an overall summary is presented. The understanding is conceptualized by designing, implementing and evaluating a developed user study model. The preliminary results of utilizing the developed user study found seven particular anxiety areas which need further study.

Keywords: Human, study, anxiety, internet.
skyrocketed to 541 million [2]. Facts and digital data from various sources confirm that the number of humans accessing the Internet is growing at a rapid pace. As the Internet evolves in terms of the number of humans online, it appears to be evolving into a social community. Worldwide, the Internet population is growing at a rapid pace. The number of people getting access to information, learning, and going online is booming like never before. It should be remembered that, as late as 1988, only a few countries were connected to the Internet. According to the 2004 CIA World Factbook, over 50 countries have at least one million humans using the Internet [3]. Humans once found it difficult and expensive to communicate in the times of voice telephones, but with rapid technological development communication has improved drastically. The distance to family, friends and sought-after information has been shortened, which can be seen as a replacement for very important daily interactions. Boundaries of time, distance and identity are broken, from simple applications like e-mail to the complex world of virtual communities. Together with this positive growth, negative effects are growing too. According to the U.S. Department of Justice, the Internet is an anonymous and effective way for many predators to find and groom children for illegal activities [4]. The fear of using the Internet is further amplified by social disintegration and by psychological and cognitive implications.

1.1 Objective and Scope

The main research question is: "how can we address the challenges such as Internet addiction, psychology and human computer interaction it is currently facing now?" To understand the objectives, some of the questions were researched in detail. The paper aims to find answers to the following hypotheses:
• Do users show an increased or reduced anxiety level when using the Internet?
• What kinds of behaviors are shown when using the Internet?
• What is the role of the content?
• What types of anxiety behaviors can be found?
• How do humans process information at the Internet interface?

The scope of the work includes designing, implementing and evaluating methods used in understanding the behavior of humans using Internet technology. The scope is wide, ranging from psychological perspectives to cognitive science, behavioral science and communication technology, and it is complex. However, we investigate how to study humans and their tasks, and how to relate this information to design style, human behavior theories, standards, procedures or guidelines, in order to build an appropriate model of interaction with the help of some existing methods.

1.2 Internet Anxiety
The scope of the work includes design, implement and evaluate methods used in understanding the behavior of humans using the Internet technology. It has wide range of scope from the field of psychological perspective to cognitive science, behavioral science and communication technology. The scope is complex. However, To study human and their tasks and how to relate information to design style, human behavior theories, standards, procedures or guidelines in order to build an appropriate model of interaction with the help of some existing methods is investigated. 1.2 Internet Anxiety A lot has been written in past about the negative use of the Internet anxiety, Internet addiction and full dependence on the internet is welling up [5] [6] [7]. The service disruption because of network faults, software bugs, administrator mistakes and version upgrade could seem less tolerable. Millions of human around the world use Internet to search, inform, find, communicate, work and play. Internet should not be
viewed only as negative, in terms such as addiction and pathology, nor should it be vilified. One must be aware of the negative consequences of overuse of the Internet by understanding one's own behavior and that of others. Four types of Internet anxiety were identified by Presno, C., using a qualitative study method [8]:

• Internet terminology anxiety: anxiety produced by an introduction to a host of new vocabulary words and acronyms.
• Net search anxiety: anxiety produced by searching for information in a maze-like cyberspace.
• Internet time delay anxiety: anxiety produced by busy signals, time delays, and more and more people clogging the Internet.
• General fear of Internet failure: a generalized anxiety produced by fear that one will be unable to negotiate the Internet, or complete required work on the Internet.

Three additional areas of Internet anxiety were found in the qualitative study conducted in this research:

• Experience anxiety: anxiety produced by lack of concentration or focus.
• Usage anxiety: a generalized anxiety produced by excessive usage of the Internet.
• Environment and attraction anxiety: anxiety produced by content on the Internet, for example interactive games, pornography and a large number of colorful applications.

Table 1. Types of anxiety recorded for Subject I and Subject II. (Rows: terminology, search, time delay, general fear, experience, usage, and environment & attraction anxiety; columns: participants p1–p5 for Subject I and p1–p5 for Subject II, with an x marking each anxiety type observed for a participant.)
2 Inter-disciplinary Literature Review Sociability is important as well as usability of applications in the Internet. While usability is concerned with making sure that the application, software and system is consistent, predictable, and easy and satisfying to use, sociability and the social aspect of building and maintaining an online community focuses on processes and styles of social aspect in interaction that support human behavior on the internet to some extent. Research has shown certain social groups to be under-represented on the Internet [9] [10] not simply because of a lack of access, but more because of cognitive, motivational and affective factors [11]. Psychology therefore has an important role in advancing the understanding of why humans choose to use or not to choose the usage of the Internet [12]. There will always be an argument to model psychology with technology or technology with psychology however; combining psychology with technology will give rise to new technology called psycho technology. To understand the Internet technology in broader ways, interaction between human and the technology through the Human Computer Interaction becomes essential. Brain Computer
Interface (BCI) techniques were only studied in this research work; no practical implementation was carried out due to the limitation of resources. In BCI, the skill developed by a human involves proper control of electrophysiological signals, which are easily adapted and modulated by the brain for better feedback. In understanding human anxiety, BCI techniques could be used for testing and analyzing different activities on the Internet. The ethnographic study of the Internet can be divided into two categories: user-based and content-based. User-based analysis is the investigation, examination and study of humans using the Internet, whereas content-based analysis is mainly focused on text. Humans are capable of providing reasons to support their points of view if asked a question such as "What color is my shirt?", and are capable of knowing, without explicit deduction or reasoning, answers to questions like "If I were you, I would hate myself", whereas computers cannot function without specific programming instructions.
3 Methods, Design and Implementation

The user study model used both qualitative and quantitative research methods. The qualitative research was conducted using interviews and observational analysis. The quantitative research was conducted in three iterations by using questionnaires and surveys. The questions were analyzed with the help of the different types of questionnaires constructed. Participants were intentionally given very general types of questions to answer. Two questionnaire modes were used: via the Internet, and via the pen-and-paper method. Task calculation was carried out by dividing each task into subtasks. Two types of subjects were categorized based on skill level (novice, intermediate and expert): Subject I and Subject II. A usability test recording form was used for each participant, to record both the verbal and non-verbal behaviors of humans using the Internet.
Fig. 1. The user study model implemented to test behavior of humans using the Internet
The chosen task was based on the Internet. The task contained three modules, divided into task 1, task 2 and task 3 based on the level of difficulty. It was found from the task analysis that humans using the Internet took less than a second to complete all of these tasks. The participants' overt behavior was divided into two major general categories: verbal and nonverbal. Verbal behavior includes anything a participant says, and nonverbal behaviors include the various activities that participants actually do. The nonverbal behavior shown by the participants consisted mostly of facial expressions, such as smiling, looks of surprise or furrowing of the brow, and body gestures, such as leaning close to the screen or rubbing the head; measurable nonverbal channels include facial expression, eye tracking, pupil diameter and skin conductance, among others.
4 Results and Discussion

The goal of the study was to answer the question: “How can we address challenges such as Internet addiction that psychology and human-computer interaction are currently facing?” To evaluate it, the main research question was broken down into hypotheses; five hypotheses were formulated at the beginning of the research. Let us now discuss these hypotheses to see whether our method, design, evaluation, and analysis supported each of them (fully supported, partially supported, or not supported).

H1: Do users show an increased or reduced anxiety level when using the Internet? Hypothesis 1 was fully supported. Humans showed signs of an increased level of anxiety when using the Internet; it appears that, with any given task on the Internet, anxiety in humans increases. Most participants said “yes” to five or more items from QS 2, which indicates problematic Internet usage. Using the HADS and PHQ-9, it was observed that one participant appeared to present a case of a higher depression score. Therefore, in this particular case, users showed an increased anxiety level when using the Internet.

H2: What kinds of behaviors are shown when using the Internet? Hypothesis 2 was fully supported. The literature review revealed that humans using the Internet show two types of behavior: verbal and non-verbal. During the observation of the participants' behavior, most of the time the participants were laughing, smiling, drumming their fingers on the table, and looking aimlessly around. These behavior patterns were both verbal and non-verbal and were observed among a narrowly selected group of participants.

H3: What is the role of the content? Hypothesis 3 was partially supported. The role of content could determine predicted or unpredicted human behavior on the Internet, such as addiction, anxiety, and stress from using the Internet. Since the participants were able to complete the task with ease, it could be predicted that any given task is very easy for humans to perform on the Internet. Therefore, the role of content has a principal impact on how humans behave on the Internet.
H4: What types of anxiety behaviors can be found? Hypothesis 4 was fully supported. It was found that, in this particular case, there are seven main areas of anxiety in humans: Internet terminology anxiety, Internet search anxiety, Internet time delay anxiety, general fear of Internet failure anxiety, experience anxiety, usage anxiety, and environment and attraction anxiety. Using the observation methodology and comparing the two types of subjects, Subject I and Subject II, we conclude that all the participants showed the anxieties cited above.

H5: How do humans process information at the Internet interface? Hypothesis 5 was partially supported. When a human interacts with the Internet interface, it appears that everything the human senses, such as sight, hearing, touch, smell, and taste, is processed as information in the mind. This information can result in verbal and non-verbal behavior. Even if a behavior initially disappears, it may partially return as undamaged parts of the brain reorganize their linkages. Human information processing, in its totality, includes internal cognitive processes that can result in observable behavior. A realistic way to think of information processing at the Internet interface is to picture mental processes as several railroad lines that all feed into the same terminal.

Two schools of thought have emerged that agree with the hypothesis that there are two types of behavior while using the Internet: verbal and non-verbal. In more general terms, humans can use the content available on the Internet in two different ways: positively or negatively. The gestures or types of behavior shown can lead to anxiety. Seven major types of anxiety were studied and validated: Internet terminology anxiety, Internet search anxiety, Internet time delay anxiety, general fear of Internet failure anxiety, experience anxiety, usage anxiety, and environment and attraction anxiety. Two types of behavior (verbal and non-verbal) were formulated from the relevant literature, empirical analysis, and evaluation. The review of brain-computer interfaces suggests that signals could be sent to the human brain physically to control and observe human behavior; however, BCI techniques were not used in this study, and even without them, the study found a sample of humans showing an increased level of anxiety when using the Internet. The task completion behaviors of humans were measured. By the end of this discussion, we can conclude that, to reduce Internet anxiety, addiction, and depression scores, it is important to have many multicultural experiences and control over one's own behavior in order to accumulate successful behavioral experiences.
5 Conclusions and Future Work

Taking the results and discussion to their logical conclusion, it appears, to the authors' knowledge, that the Internet has lulled humans into a sense of dependency to a great extent. Five hypothetical questions were answered in this study: Do users show an increased or reduced anxiety level when using the Internet? What kinds of behaviors are shown when using the Internet? What is the role of the content? What types of anxiety behaviors can be found? And how do humans process information at the Internet interface? Seven major types of anxiety were studied
and validated: Internet terminology anxiety, Internet search anxiety, Internet time delay anxiety, general fear of Internet failure anxiety, experience anxiety, usage anxiety, and environment and attraction anxiety. Two types of behavior (verbal and non-verbal) were formulated from the relevant literature, empirical analysis, and evaluation. The review of brain-computer interfaces suggests that signals could be sent to the human brain physically to control and observe human behavior; however, BCI techniques were not used in this study, and even without them, the study found a sample of humans showing an increased level of anxiety when using the Internet. The first passage of Sir Tony Hoare's book Communicating Sequential Processes reads, “Forget for a while about computers and computer programming, and think instead about objects in the world around us, which act and interact with us and with each other in accordance with some characteristic pattern of behavior.” The same idea is followed in this study of human anxiety on the Internet.

• A larger sample size, a different demographic structure, and the discovery of a refined user study model are needed for larger impact and generalization.
• In contrast to the several findings of negative effects in the Internet addiction, anxiety, and depression group, some positive effects could be identified in the future by building a framework for learning through imagination, investigation, and innovation.
• An in-depth analysis and comparison of the human brain and the network Open Systems Interconnection (OSI) model could be performed.

Despite the above limitations, the Internet has undoubtedly provided a collection of applications that is having a profound effect on mankind. Like the wheel, the plow, and steam power before it, it is proving to be a truly transformative tool in our world, changing the very ways in which we interact with each other. Progress is relatively easy to recognize if we follow technology exploration; a greater challenge is to find the technology with which we want to change ourselves and our civilization. Understanding the human factors in the design and development of technology, systems, and services, so as to ensure a successful application environment, is a major concern. The forms of anxiety identified suggest areas for future Internet development and research.
References

1. ISOC.org (2008), http://www.isoc.org/internet/history/brief.shtml
2. ISC.org: Millions of hosts on the internet (2008), https://www.isc.org/
3. Central Intelligence Agency: The DigiWorld in the global economy. In: DigiWorld 2008 (2008), https://www.cia.gov/library/publications/the-world-factbook/geos/xx.html; https://www.cia.gov/library/publications/the-world-factbook/fields/2184.html
4. Golden, S.M.J.: Protecting children in the internet age (2008), http://www.senate.state.ny.us/sws/Protecting%20Children%20in%20the%20Internet%20Age.pdf
5. Kraut, R.E. (2008), http://www.cs.cmu.edu/~kraut/RKraut.site.files/articles/Bessiere06-Internet-SocialResource-DepressionL.pdf
6. Chou, C.: Incidence and correlates of internet anxiety among high school teachers in Taiwan. Computers in Human Behavior 19, 731–749 (2003)
7. Skinner, B.F. (2008), http://www.bfskinner.org/aboutfoundation.html; http://www.bfskinner.org/f/Science_and_Human_Behavior.pdf
8. Presno, C.: Taking the byte out of internet anxiety: Instructional techniques that reduce computer/internet anxiety in the classroom. J. Educ. Comput. Res. 18, 147–161 (1998)
9. Jackson, L.A.: Social psychology and the digital divide. In: The 1999 Conference of the Society for Experimental Social Psychology (1999)
10. Sax, L.J., Ceja, M., Teranishi, R.T.: Technological preparedness among entering freshmen: the role of race, class and gender. Journal of Educational Computing Research 24, 363–383
11. Jackson, L.A., Ervin, K.S., Gardner, P.D., Schmitt, N.: Gender and the internet: Women communicating and men searching. Sex Roles 44(5), 363–379 (2001)
12. Jackson, L.A., Ervin, K.S., Gardner, P.D., Schmitt, N.: The racial digital divide: Motivational, affective and cognitive correlates of internet use. Journal of Applied Social Psychology 31, 2019–2046 (2001)
The Research on Adaptive Process for Emotion Recognition by Using Time-Dependent Parameters of Autonomic Nervous Response

Jonghwa Kim1, Mincheol Whang2, and Jincheol Woo1

1 Dept. of Computer Science, Sangmyung University, 7 Hongji-dong, Jongno-Gu, Seoul, Korea
{rmx2003,mcun}@naver.com
2 Dept. of Digital Media Technology, Sangmyung University, 7 Hongji-dong, Jongno-Gu, Seoul, Korea
[email protected]
Abstract. This study proposes a new method of physiological signal processing for emotion recognition, called TDP (time-dependent parameter) analysis. The TDPs consist of delay, activation, half recovery, and full recovery. The TDPs were determined from the running average and normalization of physiological signals, in order to identify the tonic and phasic responses to emotion over the entire time range from emotion stimulation to recovery. The results of this study show that TDP analysis and adaptive TDP analysis enhanced the accuracy of emotion recognition in comparison with tonic analysis. Specifically, TDP analysis enhanced the accuracy, while adaptive TDP analysis reduced the individual difference in accuracy.

Keywords: Physiological signal, GSR, ECG, PPG, Skin temperature, emotion recognition, accuracy.
Physiological responses are characterized by individual differences: the same emotion may be expressed through different regulation of physiological responses. Therefore, a strategy or rule for emotion recognition should consider individual characteristics. Some findings have shown that an emotion recognition algorithm can set physiological variation automatically based on verification of subjective emotion, and that this process enhances the accuracy of emotion recognition [9]. Therefore, emotion can be recognized well when noise reduction, discrimination between the tonic level and phasic response of physiological signals, and individualization are taken into account. Considering these issues, this study suggests a new analysis method for physiological signals, called TDP (time-dependent parameter) analysis, and attempts to show its effectiveness for emotion recognition.
2 Method

2.1 Research Purpose

This study proposes a new analysis method for physiological responses and aims to prove that the method is effective for emotion recognition. The research compared the accuracies of emotion recognition obtained from three different methods: tonic analysis, TDP analysis, and adaptive TDP analysis.

2.2 Definition of TDP

The TDPs (time-dependent parameters) of a physiological response are defined in this study as shown in Fig. 1. The delay is the time difference between the stimulation and the activation. The activation is the time of the peak, measured from the beginning. The half recovery is the time at half peak, and the full recovery is the time at which the signal returns to the base state. The full recovery can be inferred from the half recovery when it cannot be measured directly. In this study, ECG (electrocardiogram), RSP (respiration), PPG (photoplethysmogram), GSR (galvanic skin resistance), and SKT (skin temperature) signals were processed to construct the TDP curve shown in Fig. 1.
Fig. 1. TDP (time dependent parameter) of physiological measurement
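As an illustration of how the TDPs of Fig. 1 might be computed, the sketch below extracts delay, activation, half recovery, and full recovery from a baseline-normalized signal. The paper does not publish its extraction code; the 10% onset threshold and the 5% near-baseline criterion for full recovery are our assumptions, not values from the study.

```python
import numpy as np

def time_dependent_parameters(signal, fs, onset_idx):
    """Extract the TDPs (delay, activation, half recovery, full recovery)
    from a baseline-normalized physiological signal (neutral level ~ 0).
    fs is the sampling rate in Hz; onset_idx is the stimulus onset sample."""
    post = np.asarray(signal, dtype=float)[onset_idx:]
    peak_idx = int(np.argmax(np.abs(post)))        # strongest deviation
    peak = abs(post[peak_idx])

    # Delay: onset of the response, taken here as the first sample exceeding
    # 10% of the peak amplitude (the 10% threshold is an assumed choice).
    rising = np.nonzero(np.abs(post[:peak_idx + 1]) >= 0.1 * peak)[0]
    delay = rising[0] / fs if rising.size else None

    activation = peak_idx / fs                     # time from onset to peak

    # Half recovery: first return to half the peak amplitude after the peak.
    after = np.abs(post[peak_idx:])
    half = np.nonzero(after <= 0.5 * peak)[0]
    half_recovery = (peak_idx + half[0]) / fs if half.size else None

    # Full recovery: return to near baseline (5% of peak, again an assumed
    # threshold); if never reached within the record, infer it from the half
    # recovery by linear extrapolation, as the paper notes it can be inferred.
    full = np.nonzero(after <= 0.05 * peak)[0]
    if full.size:
        full_recovery = (peak_idx + full[0]) / fs
    elif half_recovery is not None:
        full_recovery = activation + 2.0 * (half_recovery - activation)
    else:
        full_recovery = None

    return {"delay": delay, "activation": activation,
            "half_recovery": half_recovery, "full_recovery": full_recovery}
```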
2.3 Emotion Induction

Emotion was induced using images. The images were chosen based on a previous study [8], in which 100 university students (33 females and 67 males), none of them visually impaired, participated and were asked to score their subjective emotion after watching the images. Images that significantly induced emotion were categorized according to a two-dimensional emotion model [10]. Six images were selected for evoking unpleasantness-arousal emotion and ten for pleasantness-relaxation emotion, as shown in Fig. 2. In this study, pleasantness-relaxation is referred to as the positive emotion and unpleasantness-arousal as the negative emotion.
Fig. 2. The images evoking emotions
2.4 Experiment

Four university students (average age 26.5 years), all healthy and with no vision problems, participated in the experiments. The 24 prepared images were presented to the participants to induce emotion, and PPG, GSR, RSP, and SKT were measured while the images were presented. The experimental procedure is shown in Fig. 3. A participant first experienced a non-image (reference) state for 30 seconds, followed by the presentation of an image for 10 seconds; then a non-image state, called the neutral state, was presented for 30 seconds. One procedure consisted of presenting four images, and each participant went through the procedure six times. One procedure took 190 seconds, and the total experimental time was 1,440 seconds per participant.
Fig. 3. Experimental procedure
3 Analysis

3.1 Data Acquisition

Six data sets, each consisting of four physiological signals and a subjective emotion score, were collected. For the purpose of analysis, the data unit was set at 70 seconds, comprising a 30-second neutral state, a 10-second stimulation, and another 30-second neutral state. In total, 74 data units (6 sets × 4 pictures × 4 participants) were prepared for tonic analysis, TDP analysis, and adaptive TDP analysis.

3.2 Running Average for Noise Reduction and Normalization

The running average is effective for noise reduction [3]. In this research, the time interval for the running average was determined by the response rate of each signal. The time intervals of the physiological signals were set for noise reduction, and the determination was made by visual inspection to confirm signal stability. The time interval for GSR and SKT was set at 0.5 seconds, while that for RSP was 3 seconds. PPG was converted to HR (heart rate) by frequency analysis; therefore, the time interval for PPG and HR was set at 2 seconds. The running average was computed with a sliding window at the pre-determined time interval on all physiological signals. Then, the stimulus state of each physiological signal was normalized with respect to the neutral state, which made it possible to observe the activation (tonic) level of the signal. Normalization was computed by equation (1) and performed every 0.5 seconds.

Normalized state = (Stimulus state − Neutral state) / Neutral state    (1)
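The smoothing and normalization steps above can be sketched as follows. The 0.5-second window and equation (1) come from the text, while the synthetic GSR trace and its parameters are placeholders for illustration.

```python
import numpy as np

def running_average(x, fs, window_s):
    """Smooth signal x (sampled at fs Hz) with a sliding-window mean of
    length window_s seconds, as used for noise reduction in the paper."""
    n = max(1, int(round(window_s * fs)))
    kernel = np.ones(n) / n
    return np.convolve(x, kernel, mode="same")

def normalize_to_neutral(stimulus, neutral):
    """Equation (1): (stimulus - neutral) / neutral, with the neutral level
    taken as the mean of the preceding neutral-state segment."""
    baseline = float(np.mean(neutral))
    return (np.asarray(stimulus) - baseline) / baseline

# Example with assumed values: a synthetic GSR trace sampled at 200 Hz,
# smoothed with the paper's 0.5-s window; the first 30 s are the neutral state.
fs = 200
gsr = np.random.default_rng(0).normal(10.0, 0.2, 70 * fs)
smooth = running_average(gsr, fs, 0.5)
neutral, stimulus = smooth[:30 * fs], smooth[30 * fs:40 * fs]
norm = normalize_to_neutral(stimulus, neutral)
```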
3.3 TDP Rule for Emotion Recognition

The TDP rule for emotion recognition was determined from a previous study [8]. Visual stimuli from the prepared images induced the corresponding emotions, as shown in Fig. 2. The physiological signals were then analyzed into TDPs, and threshold values were set to construct the rule for emotion recognition, as shown in Table 1. Table 1 gives the mean and standard deviation of the physiological response times for each emotion, based on the TDP definition. The recognition range for the respective neutral, positive, and negative emotions was defined as the mean plus or minus one standard deviation.

3.4 Adaptive TDP Rule for Individualization

Since the physiological response to the same emotion differs between individuals, the TDP rule for emotion recognition needed to be individualized. The process of the adaptive TDP rule is shown in Fig. 4. First, emotion is recognized by the non-adaptive TDP rule. Second, the difference between the measured subjective emotion and the emotion estimated from the physiological signals is calculated. If a difference exists, the TDP rule is adaptively reset using the individual's input of subjective emotion; otherwise, emotion recognition proceeds. Through these processes, the rule adaptively becomes individualized and more accurate for a particular person.
Fig. 4. The process of adaptive TDP rule for individualization
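One adaptation step of the loop in Fig. 4 might look like the following sketch. The paper does not specify how the thresholds are reset on a mismatch, so the range representation and the update rate below are purely illustrative assumptions.

```python
def adapt_tdp_rule(rule, subjective, recognized, tdp_features, rate=0.2):
    """One step of the adaptive TDP rule (a sketch of Fig. 4; the update
    rate and the range representation are assumptions, not from the paper).

    rule         : dict mapping emotion -> {parameter: (low, high)} ranges
    subjective   : emotion reported by the participant
    recognized   : emotion recognized by the current rule
    tdp_features : measured TDPs, e.g. {'delay': 1.2, 'activation': 4.0}
    """
    if recognized == subjective:
        return rule  # recognition matched; no adaptation needed
    # Mismatch: shift each range for the true emotion toward the observed TDPs.
    for param, value in tdp_features.items():
        low, high = rule[subjective][param]
        center, half_width = (low + high) / 2.0, (high - low) / 2.0
        center += rate * (value - center)      # move center toward observation
        rule[subjective][param] = (center - half_width, center + half_width)
    return rule
```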
4 Result

The results show the accuracies of emotion recognition obtained from the three different methods, tonic analysis, TDP analysis, and adaptive TDP analysis, as presented in Tables 2-4. The accuracy was determined as the match rate between the subjective emotion and the emotion determined from the physiological signals. The tonic response rule was derived from previous research on autonomic response patterns for emotion [5, 11]: if a negative emotion is evoked, the electrodermal and cardiovascular responses increase and the thermal response decreases, and vice versa [11]. As shown in Table 2, the accuracy of emotion recognition was about 62% for negative emotion and about 50% for positive emotion. Under the same conditions, the accuracy was enhanced, rising to 60% or more when TDP analysis was used, as shown in Table 3; it was about 70% for the recognition of both positive and negative emotion. Thus, emotion recognition by TDP analysis could capture more responses than tonic analysis. There were also findings regarding the individualization of emotion recognition, as shown in Table 4. The adaptive TDP rule enhanced the accuracy a little more; interestingly, for participants whose accuracy was lower than 70%, the accuracy increased to 70% or more. Therefore, adaptive TDP analysis can be effective in increasing the accuracy for a particular person whose accuracy is low. Figure 5 shows the overall accuracies of the three analyses. TDP analysis showed an improvement in accuracy, but adaptive TDP did not add much; since the accuracy improved only for individuals whose accuracy had been low, it did not contribute much to the overall accuracy.
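The accuracy computation itself is simple; for example, participant A's row of Table 2 below yields the reported 61.1%.

```python
# Accuracy as defined in the paper: the match rate between subjective
# emotions and emotions determined from physiological signals.
# The numbers are participant A's row from Table 2.
neg_subj, neg_match = 8, 5     # negative: subjective / correctly recognized
pos_subj, pos_match = 10, 6    # positive: subjective / correctly recognized
accuracy = (neg_match + pos_match) / (neg_subj + pos_subj)
print(f"{accuracy:.1%}")       # -> 61.1%
```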
Table 2. Accuracy of emotion recognition from tonic analysis

Participant | Negative emotion (subjective / recognized by physiological signals) | Positive emotion (subjective / recognized by physiological signals) | Accuracy
A           | 8 / 5   | 10 / 6  | 61.1%
B           | 6 / 3   |  9 / 4  | 46.7%
C           | 7 / 4   |  8 / 4  | 53.3%
D           | 8 / 6   |  9 / 4  | 58.8%
Sum         | 29 / 18 | 36 / 18 |
Accuracy    | 62.1%   | 50.0%   | 55.0%
Table 3. Accuracy of emotion recognition by TDP rule

Participant | Negative emotion (subjective / recognized by physiological signals) | Positive emotion (subjective / recognized by physiological signals) | Accuracy
A           | 8 / 7   | 10 / 7  | 77.8%
B           | 6 / 3   |  9 / 6  | 60.0%
C           | 7 / 4   |  8 / 7  | 73.3%
D           | 8 / 6   |  9 / 5  | 64.7%
Sum         | 29 / 20 | 36 / 25 |
Accuracy    | 69.0%   | 69.4%   | 69.0%
Table 4. Accuracy of emotion recognition by adaptive TDP rule

Participant | Negative emotion (subjective / recognized by physiological signals) | Positive emotion (subjective / recognized by physiological signals) | Accuracy
A           | 8 / 6   | 10 / 7  | 72.2%
B           | 6 / 4   |  9 / 6  | 66.7%
C           | 7 / 4   |  8 / 7  | 73.3%
D           | 8 / 6   |  9 / 6  | 70.6%
Sum         | 29 / 20 | 36 / 26 |
Accuracy    | 69.0%   | 72.2%   | 70.1%
Fig. 5. Accuracy comparison of the three analyses: tonic, TDP, and adaptive TDP analysis
5 Conclusion

TDP and adaptive TDP analyses were newly proposed for analyzing physiological responses for emotion recognition, and the methods succeeded in enhancing recognition accuracy. Adaptive TDP was effective in accounting for individual differences in physiological response. Comparing the accuracies of emotion recognition among tonic analysis, TDP analysis, and adaptive TDP analysis, this study reaches the following conclusions. First, TDP analysis achieved higher accuracy than tonic analysis: the average accuracy was 55% for tonic analysis and 69% for TDP analysis. Second, adaptive TDP analysis reduced the individual differences in accuracy. With the TDP rule, individual accuracies ranged between 60% and 77%, while with the adaptive TDP rule they ranged between 66.7% and 73.3%. The results show that adaptive TDP could enhance accuracy for participants whose accuracy was relatively low, but contributed less for participants whose accuracy was already high. Therefore, the TDP and adaptive TDP methods may be useful for emotion recognition and for observing significant details of physiological responses.
References

1. Lisetti, C.L., Nasoz, F.: Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP J. Appl. Signal Process., 1672–1687 (2004)
2. Allanson, J., Fairclough, S.H.: A research agenda for physiological computing. Interacting with Computers 16, 857–878 (2004)
3. Haag, A., Goronzy, S., Schaich, P., Williams, J.: Emotion recognition using bio-sensors: First steps towards an automatic system. In: André, E., Dybkjær, L., Minker, W., Heisterkamp, P. (eds.) ADS 2004. LNCS, vol. 3068, pp. 36–48. Springer, Heidelberg (2004)
4. Mandryk, R.L., Atkins, M.S.: A fuzzy physiological approach for continuously modeling emotion during interaction with play technologies. International Journal of Human-Computer Studies 65, 329–347 (2007)
5. Boucsein, W.: Electrodermal Activity. Plenum Press, New York (1992)
6. Whang, M.: The emotional computer adaptive to human emotion. Phillips Research: Probing Experience 8, 209–219 (2008)
7. Whang, M., Lim, J., Boucsein, W.: Preparing computers for affective communication: a psychophysiological concept and preliminary results. The Journal of the Human Factors and Ergonomics 45, 623–634 (2003)
8. Kim, J., Whang, M., Kim, J., Woo, J.: The study on emotion recognition by time-dependent parameters of autonomic nervous response. Korean Journal of the Science of Emotion & Sensibility 11, 637–644 (2008)
9. Fredrickson, B.L., Losada, M.F.: Positive Affect and the Complex Dynamics of Human Flourishing. American Psychologist 60, 678–686 (2005)
10. Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology 39, 1161–1178 (1980)
11. Whang, M., Chang, G., Kim, S.: Research on Emotion Evaluation using Autonomic Response. Korean Journal of the Science of Emotion & Sensibility 7, 51–56 (2004)
Students’ Visual Perceptions of Virtual Lectures as Measured by Eye Tracking

Yu-Jin Kim, Jin Ah Bae, and Byeong Ho Jeon

Dept. of Media Image Art & Technology, Kongju National University, Sinkwandong, Gongju, South Korea
{yujinkim,jinabae,bhjeon}@kongju.ac.kr
Abstract. In this paper, we used eye tracking methodologies to investigate students’ visual perceptions of lectures using 3D real-time virtual studio technology. For measuring learning performance, we also gave the students multiple-choice paper quizzes at the end of the lectures. Three virtual lectures were created with different types of lecture materials (text-centered, image-centered, and lecturer-centered) and 3D virtual sets (classroom, cyberspace, and lecture-theme space). Through analyzing students’ eye movements in viewing still and moving scenes of the virtual lectures, we found that layouts and movements of design elements on lecture screens significantly influenced students’ scanpaths and areas of interest (AOIs). Lecture material types affected learning performance while 3D virtual sets had no effect due to students’ inattention to the virtual background areas. We discuss effective ways to develop virtual lectures and design lecture screens for better presentation of lecture content and higher learning performance.

Keywords: Virtual lectures, virtual studios, eye tracking, visual perception, learning performance, user-centered screen design.
2 Related Work

2.1 Lectures Using Virtual Studio Technology

Virtual education, through the use of cyberspace, eliminates spatial and temporal limitations by removing the need for the lecturer and students to be present at an instructional site at a designated time [4]. Recently, the proliferation of information and communication technologies (ICTs) has spawned a boom in virtual education by creating a variety of virtual lecture content production techniques such as Flash, Web 3D, 3D real-time virtual studio, and others. Among these production technologies, several researchers have explored the effectiveness of applying virtual studios to lectures. In 2003, Morozov, Debelov, and Zhmulevskaya claimed that the application of virtual studio systems enhances the efficiency of instructional processes by providing lecturers more possibilities for diverse forms of education [5]. According to Brown and Cruickshank (2003), a virtual lecture can be more effective than a classroom lecture by achieving significant savings in the costs of lecture delivery and student support, though a very large investment of time and effort is required of the lecturer [6]. Along with the previously mentioned research interests in the educational effects of applying new technology, researchers have studied the factors affecting the delivery of lecture content in virtual education in order to increase the motivation and effects of learning. These factors include screen design [7, 8], content design [9], content suggestion style, and lecture material [10].

2.2 Eye Tracking in Visual Information Processing

Eye movements are driven both by properties of the visual world and by processes in a person’s mind [11]. Therefore, many scholars have used eye-tracking techniques to explore the relations between eye movements and visual information processing in diverse research fields. In particular, researchers in HCI have tracked eye movements to understand visual and display-based information processing, as well as to discover the factors that may impact the usability of system interfaces [12]. In fact, eye movement data provide more detailed and specific information about a user’s cognitive processes in many different kinds of displays [13, 14]. They also help researchers determine the roots of some usability problems and then come up with effective solutions. Meanwhile, the recent trend of eye-tracking studies in HCI shows that eye-tracking methods have been rapidly adopted for usability testing of websites. Laarni et al. (2003) tracked the eye movements of users while they were selecting and reading online news items on a small computer [15]. Cowen (2004) asked subjects to perform two tasks on a website and measured their total fixation duration, the number of total fixations, average fixation duration, and the distribution of fixations on the screen [13]. Through these measurements, he suggested effective ways to design webpages in terms of user interface. In 2007, Cutrell and Guan also measured web surfers' eye movements to observe their web search strategies [16]. Like the aforementioned studies, many researchers have analyzed the interfaces of and interaction with web content, mainly focusing on sequences of still scenes of web pages.
However, there have been increasing interests in employing eye-tracking methods in moving visual images along with the proliferation of more dynamic web content, which is developed by adopting new types of web authoring tools and media, such as Flash, Web 3D, Virtual Studio, and others. 2.3 Eye Tracking Metrics The main measurements used in eye-tracking research are “fixations,” which are moments when eyes are relatively stationary, taking in or encoding information, and “saccades,” which are quick eye movements occurring between fixations [12]. From these basic measurements, a multitude of metrics are also derived [17]: (1) “gaze duration,” the cumulative duration and average spatial location of a series of consecutive fixations within an area of interest; (2) “area of interest (AOI),” the area of a display or visual environment that is of interest to the research or design team and, thus, defined by them (not by the participant); and (3) “scanpath,” a special arrangement of a sequence of fixations.
3 Hypotheses

The focus of our study is the investigation of students’ visual perceptions in virtual lectures. In order to identify the patterns of students’ visual perceptions, we tracked their eye movements according to different layouts and movements of design elements on lecture screens. We also studied the effects of lecture material types and 3D virtual sets on students’ learning performance. In line with these purposes, we investigated the following research hypotheses:

Hypothesis 1a. Students’ eye movements (AOIs, scanpaths) are affected by the layouts of design elements on lecture screens.
Hypothesis 1b. Students’ eye movements (AOIs, scanpaths) are affected by the movements of design elements on lecture screens.
Hypothesis 2a. Students’ learning performance varies according to lecture material types.
Hypothesis 2b. Students’ learning performance differs according to 3D virtual set types.
4 Experiment

4.1 Virtual Lecture Prototypes

In order to test the aforementioned hypotheses, we designed and conducted an experiment to analyze students’ eye movements and learning performance in the context of virtual studio-based lectures. As experimental material, three virtual lecture prototypes were created under the theme of “Media and Culture” with varying layouts and movements of lecture screen design elements (lecturer, lecture board, 3D virtual sets, and text, images, and movies on the lecture board). The three prototypes (see Figure 1) were also produced with different types of lecture materials (text-centered, image-centered,
and lecturer-centered) and 3D virtual sets (classroom, lecture-theme space, and cyberspace): (1) Prototype 1 - text-centered lecture materials in classroom background sets; (2) Prototype 2 - lecturer-centered lecture materials in cyberspace background sets; and (3) Prototype 3 - image-centered lecture materials in lecture-theme space background sets. Figure 1 shows screen shots of these three prototypes. Three screen shots located in the first row display their 3D background sets. The other three shots in the second row illustrate how the same lecture content about the phenomena of mass communication is delivered by different types of lecture materials.
Fig. 1. Screen shots of three prototypes (left: Prototype 1, middle: Prototype 2, and right: Prototype 3)
After modeling and animating the above 3D virtual sets using 3D Studio Max 9.0, we combined these 3D background sources with the live-action footage of lecturers using the real-time chroma key technique of the "VS2000" system. In addition, we created animation buttons for controlling lecture screen design elements using the VS scripts of the "HotAction" program.

4.2 Participants

Forty participants were selected for the experiment, but ten of them had problems with the calibration of the eye-tracking system. Calibration was a fine-tuning process for the experiment, so 30 participants successfully completed it. The 30 subjects were divided into three groups of ten, each of which participated in the experiment with one of the three kinds of virtual lectures. The subjects were freshmen and sophomores at K University and had almost no previous knowledge of the content of the experimental lectures. They were between the ages of 18 and 23, and the gender ratio was 1:1. Twenty-four of the students had experienced distance lectures, and eight of them had been exposed to VR lectures.

4.3 Apparatus

This experiment used the "Eyegaze Development System" hardware developed by LC Technologies, Inc. This device automatically tracks the x-y coordinates of the participant’s
gazepoint on the computer screen using the pupil-center-corneal reflection (PCCR) method, which directs infrared light into the eye. The system generates raw eyegaze point location data at a camera field rate of 60 Hz [18] and distinguishes three kinds of gazepoints: (1) a moving point, which a participant looks at for only 1/60 second before looking at other points; (2) a fixating point, around which a participant's gaze is settling; and (3) a fixation-completed point, on which a participant maintains the gaze. This study used the software "EyeTrack v1.0" and “EyeTrackMovie v1.0” for analyzing eye movements in viewing still and moving scenes, respectively. These programs were developed by the Human Computer Interaction Laboratory (HCIL) of KAIST for the visual monitoring and analysis of eye-tracking results expressed in coordinates [19]. “EyeTrack” and “EyeTrackMovie” provide “Replay” and “Analysis” modes. The "Fixation Mark" and "Color Variation" options enable the monitoring of eye-tracking data with a variety of graphic effects in chronological order. For an efficient analysis of eye movement patterns, the software also provides five analysis options: (1) "shadow," which shows the area that subjects focused on and renders the rest in shades with gradation effects; (2) "frequent area," which marks a clear boundary between the area that the subjects focused on and the rest, highlighting the former; (3) "hotspot," which shows the amount of gaze (red meaning more gaze and green meaning less gaze); (4) "selected area," which shows the amount of gaze in a certain area and its duration in numbers; and (5) "priority order," which shows the duration of a gaze over time in circles. All participants’ eye movements could be monitored at once, since the programs can analyze multiple data sets simultaneously. The patterns of AOIs and scanpaths, described in Section 2.3, were derived from these eye movements.

4.4 Experimental Design and Procedure

The experiment was performed in four stages, and it took about an hour for a participant to complete: (1) questionnaires on demographics, media education level, and cyber-learning experiences; (2) still-scene eye tracking with captured scenes according to distinguishable types of screen layouts (Prototype 1: 6 scenes; Prototype 2: 9 scenes; and Prototype 3: 7 scenes); (3) moving-scene eye tracking with the three prototype movies; and (4) a quiz with multiple-choice questions. Eye movements were tracked twice, for still and moving scenes of the virtual lectures. Compared with the eye tracking of moving scenes, the eye tracking of still scenes allowed more detailed and accurate analysis of eye movements for the different screen layout types. On the other hand, the eye tracking of the moving scenes enabled an additional analysis of the effects of the movements of screen layout elements on eye movements.
5 Results

5.1 AOIs According to the Layouts and Movements of Lecture Screen Design Elements

Through analyzing students’ eye movements in viewing still and moving scenes of the virtual lectures, we found that the layouts and movements of lecture screen design
elements (the lecturer, lecture board, and 3D background set) significantly influenced students’ scanpaths and areas of interest (AOIs) [H1a and H1b]. In the case of AOIs, the parts that students paid close attention to were similar in both the still and moving scene tests. Examining the design elements that received the most attention in the still scenes, we found that students generally gazed at the lecturer’s face in all of the still scenes featuring a lecturer (Prototype 1: 4 scenes, Prototype 2: 5 scenes, and Prototype 3: 5 scenes), regardless of the lecturer’s size and position on the screen. Figure 2 shows the “hotspot” option analysis results of three scenes (left: Prototype 3, middle: Prototype 2, and right: Prototype 3), in which different sizes of a lecturer appear in different positions. We calculated the distribution of fixations on the face of the lecturer in the left scene of Figure 2 using the “selected area” option and found that 52.5% of the total fixations were on the lecturer’s face.
Fig. 2. “Hotspot” option analysis results (Note: The amounts of fixations gradually increase from green to red, and the reddest parts are marked with circles)
In addition, we found that the students’ points-of-regard are likely to stay not just on the lecturer’s face, but also on the faces of people illustrated in the lecture material images (see Figure 3). Another finding was that students’ attention to people shown in profile or back view decreased compared with front-view images.
Fig. 3. “Selected area” option analysis results (1. Front view (28%), 2. Profile view (11.9%), and 3. Back view (12.5%) of people in the image)
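An AOI analysis along the lines of the “selected area” option can be sketched as follows: given detected fixations (e.g., from the I-DT sketch in Section 2.3) and rectangular AOIs, it reports the share of total fixation time spent in each AOI. The AOI names and coordinates in the example are invented for illustration; the "selected area" option itself is part of the EyeTrack software, not this code.

```python
def aoi_shares(fixations, aois):
    """fixations: list of dicts with 'x', 'y', 't_start', 't_end';
    aois: dict name -> (x_min, y_min, x_max, y_max) in pixels."""
    total = sum(f["t_end"] - f["t_start"] for f in fixations) or 1.0
    shares = {name: 0.0 for name in aois}
    for f in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= f["x"] <= x1 and y0 <= f["y"] <= y1:
                shares[name] += (f["t_end"] - f["t_start"]) / total
    return shares

# Hypothetical AOIs for one lecture scene (coordinates are invented):
aois = {"lecturer_face": (850, 80, 1000, 230),
        "lecture_board": (100, 60, 800, 600)}
```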
Fig. 4. Comparing visual attention to text areas (3. Body (40.1%) > 2. Subtitle (29.5%) > 1. Title (7.3%))
Along with the aforementioned visual attention to people’s faces in the still scenes, we also found a tendency in the moving scenes for students’ points-of-regard to be directed mostly at the lecturer’s face as well as the faces of people illustrated in the lecture material images. In addition, the analysis results suggest that students’ eye movements and text sizes on the lecture boards were inversely proportional: there were fewer eye movements and shorter gazes on larger texts. In fact, visual attention was generally highest on the body, then the subtitles, and then the title (see Figure 4). In other words, considering that titles take up more space than subtitles but receive less visual attention, it would be effective to deliver new and important messages using subtitles.

5.2 Scanpaths According to the Layouts and Movements of Lecture Screen Design Elements

In order to measure the students’ scanpaths and scanpath durations, we selected a student whose eye movement pattern could represent the visual attention of the ten students participating in the same still scene experiment and then analyzed his/her eye movements over time using the “fixation map” and “priority order” options. The results showed that the farther from the center objects are positioned, the
Fig. 5. Scanpath tracking (left: from left to right, right: from top to bottom)
more likely they are to be out of the scanpath or stared at later. This analysis also verified the commonly accepted idea that people in Korean-speaking culture read words and sentences from left to right and from top to bottom (see Figure 5). On the other hand, students’ scanpaths showed different patterns in the still and moving scene tests in terms of the starting point of the eye’s gaze. In the still scene test, the starting points of the students’ gazes were related to screen design element types as well as their positions. For example, the majority of students first gazed at the lecturer’s face, regardless of its position, and then moved their gazes to other elements located around the center of the screen. In the moving scene test, the movement of screen design elements, rather than their type, dominantly influenced the starting points, as illustrated in Figure 6. We also found that students moved their gazes to empty parts of the screen where the next lecture content was expected to appear. Properly timed anticipation could enable students to better perceive animated lecture content by preparing them for what comes next in the lecture.
Fig. 6. Starting points of students' gazes according to the lecture screen layout type
5.3 Students’ Learning Performance According to Lecture Materials and 3D Virtual Sets

The average quiz scores of the three groups who respectively watched the text-centered, image-centered, and lecturer-centered virtual lectures (12.9, 9.7, and 10.2, respectively) differed significantly at the alpha level of 0.05 (F=4.694). These results support H2a, which suggests a relationship between students’ learning performance and lecture material types. Students could easily understand and memorize lecture content with text-centered lecture material, which presented the lecturer’s
explanations in clear print on the screen. While lecture material types affected learning performance, the 3D virtual sets had no effect on learning performance, owing to students’ lack of attention to the background areas. Consequently, H2b, which suggests a strong relationship between students’ learning performance and 3D virtual set type, was not supported. In fact, the effects of the virtual sets on the learning process are very small because students’ visual attention is drawn to the background only when they find some noticeable images or figures in the set.
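The reported group comparison corresponds to a one-way ANOVA over the three groups' quiz scores. The sketch below shows the computation; since the paper reports only the group means, the per-student scores are invented placeholders and will not reproduce F=4.694 exactly.

```python
# A minimal sketch of the one-way ANOVA behind the reported F statistic;
# the per-student quiz scores below are hypothetical illustration data.
from scipy import stats

text_centered     = [14, 12, 13, 15, 11, 13, 12, 14, 12, 13]
image_centered    = [10, 9, 11, 8, 10, 9, 10, 11, 9, 10]
lecturer_centered = [11, 10, 9, 12, 10, 11, 9, 10, 11, 9]

f_stat, p_value = stats.f_oneway(text_centered, image_centered, lecturer_centered)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```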
6 Conclusions

This study investigated students’ visual perceptions of virtual lectures by analyzing their AOIs and scanpaths in the still and moving scenes of the lectures. It also explored how to develop lecture materials and 3D background sets for virtual lectures in order to improve learning performance, by testing the students’ comprehension with multiple-choice questions. Our research revealed that the following issues in virtual lecture production should be carefully considered for better presentation of lecture content: (1) the lecturer’s size and position on lecture screens, and the postures of people illustrated in lecture materials; (2) the sizes and positions of text in lecture materials; (3) the movements of screen design elements (the lecturer, the lecture board, and animating parts on the lecture board); and (4) balanced layouts between the lecturer, lecture materials, and virtual background sets. Educators and material designers should also consider that text-centered lecture materials were more effective for higher learning performance than image-centered and lecturer-centered ones. Even though statistical significance was not found between 3D background sets and learning performance, an effective background set design suited to the lecture objectives and content is necessary for enhancing students’ learning interest and motivation. Finally, it is hoped that this study will assist lecturers in understanding students’ visual information processing in virtual lectures, in designing virtual lecture content more effectively, and in improving the instructional effects of virtual lectures.
Acknowledgment We thank Kun-pyo Lee and Jung-mi Park for their comments and help in conducting our eye-tracking experiment.
References

1. Fukaya, T., Fujikake, H., Yamanouchi, Y., Mitsumine, H., Yagi, N., Inoue, S., Kikuchi, H.: An Effective Interaction Tool for Performance in the Virtual Studio - Invisible Light Projection System. NHK Science & Technical Research Laboratories, Japan (2003)
2. Dolgovesov, B.S., Morozov, B.B., Shevtsov, M.Y.: The System for Interactive Virtual Teaching Based on “Focus” Virtual Studio. In: International Conference Graphicon 2003, Moscow, Russia (2003)
3. In this paper, a virtual lecture means instructional content that is created using virtual studio techniques for virtual education
4. Kuroda, K., Shanawez, H.D.: Strategies for Promoting Virtual Higher Education: General Considerations on Africa and Asia. Africa and Asian Studies 2(4), 565–575 (2003)
5. Morozov, B.S., Develov, B.B., Zhmulevskaya, M.Y.: The System for Interactive Virtual Teaching Based on “Focus” Virtual Studio. In: International Conference Graphicon, Moscow, Russia (2003), http://www.graphicon.ru/
6. Brown, S., Cruickshank, I.: The Virtual Studio. International Journal of Art & Design Education 22(3), 281–288 (2003)
7. Kim, M.R.: Strategies on Screen Design of Learner-Centered Web-based Instructional Systems. Journal of Educational Technology 16(4), 51–65 (2008) (in Korean)
8. Shon, M., Chung, H.H.: An Analysis on the Learning Hindrance Factors in Blended-Learning Environment. Journal of Educational Information and Media 13(2), 251–276 (2007) (in Korean)
9. Ryu, I.: Factors Influencing the Effectiveness of Web-Based Distance Learning. Management Education Review 6(2), 7–27 (2003) (in Korean)
10. Kang, M.H., Gu, M.H., Moon, H.N., Jung, S.Y., Chung, J.Y., Kim, J.S.: Examining the Effects of Tutor Delivery Modes on Cognitive Presence and Learning Outcomes in Online Lectures. Journal of Educational Information and Media 13(4), 155–181 (2007) (in Korean)
11. Richardson, D.C., Spivey, M.J.: Eye-Tracking: Characteristics and Methods. In: Wnek, G., Bowlin, G. (eds.) Encyclopedia of Biomaterials and Biomedical Engineering, pp. 1–9. Informa HealthCare (2004)
12. Poole, A., Ball, L.J.: Eye Tracking in Human-Computer Interaction and Usability Research: Current Status and Future Prospects. In: Ghaoui, C. (ed.) Encyclopedia of Human Computer Interaction. Idea Group (2004)
13. Cowen, L.: An Eye Movement Analysis of Web-Page Usability. Masters by Research in the Design and Evaluation of Advanced Interactive Systems (2001)
14. Lohse, G.L.: Consumer Eye Movement Patterns on Yellow Pages Advertising. Journal of Advertising 26(1), 61–73 (1997)
15. Laarni, J., Isotalus, P., Kojo, I., Kärkkäinen, L.: Reading News from a Pocket Computer: An Eye-Movement Study. In: Harris, C.D., Duffy, V., Smith, M.J., Stephanidis, C. (eds.) Human-Centered Computing: Cognitive, Social and Ergonomic Aspects. The Proceedings of HCI International 2003. Lawrence Erlbaum, Mahwah (2003)
16. Cutrell, E., Guan, Z.: What Are You Looking for?: An Eye-Tracking Study of Information Usage in Web Search. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA (2007)
17. Jacob, R.J.K., Karn, K.S.: Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises (Section Commentary). In: Hyönä, J., Radach, R., Deubel, H. (eds.) The Mind’s Eye: Cognitive and Applied Aspects of Eye Movement Research. Elsevier Science, Oxford (2003)
18. Eyegaze Development System Information, http://www.eyegaze.com
19. Park, J., Lee, K.: Eyetrack - Developing Eyegaze Analysis Visualization Software for Designers’ Use. In: KEER 2007, Sapporo, Japan, p. 10 (2007)
Toward Constructing an Electroencephalogram Measurement Method for Usability Evaluation

Masaki Kimura*, Hidetake Uwano, Masao Ohira, and Ken-ichi Matsumoto

Graduate School of Information Science, Nara Institute of Science and Technology
{masaki-k,hideta-u,masao,matumoto}@is.naist.jp
Abstract. This paper describes our pilot study toward constructing an electroencephalogram (EEG) measurement method for usability evaluation. The measurement method consists of two steps: (1) measuring the EEGs of subjects for several tens of seconds after the events or tasks that are the targets of evaluation, and (2) analyzing how much of the alpha and/or beta rhythm components the measured EEGs contain. However, there exists only an empirical rule for the measurement time length of EEGs for usability evaluation. In this paper, we conduct an experiment to reveal the optimal time length of EEGs for usability evaluation by analyzing changes in EEGs over time. From the results of the experiment, we found that the time length suitable for usability evaluation was 0~56.32 seconds or more.
(1) experimenters (or usability experts) measure a subject's EEG for several tens of seconds after the subject finishes the target tasks for usability evaluation, and (2) experimenters analyze how much the measured EEG contains the alpha and beta rhythm components, which indicate a comfortable or uncomfortable state of the subject's mind, respectively. The results of the evaluation will change according to the time length over which the EEG is analyzed. After the tasks, subjects' EEGs return to the usual condition as time passes; hence, if the time length of the analysis is too long, the proportion of the EEG changed by the tasks decreases. Conversely, if the time length is too short, evaluation accuracy decreases because the ratio of noise to the entire EEG increases. However, the measurement time for EEG analysis has so far been decided by experiential standards. We therefore need to analyze EEGs as time series for usability evaluation. In this paper, we try to obtain the proper time length of EEG data.
2 Related Research

In this paper, we quantitatively evaluate the psychological state of computer users while a system is in use, using the alpha and beta rhythms that compose the frequency components of brain waves. The power spectra of the alpha and beta rhythms obtained by the discrete Fourier transform, the ratios of the alpha and beta rhythms to all brain waves, and beta/alpha, the ratio of the beta rhythm to the alpha rhythm, are often used as common indicators for observing the psychological state of human beings. Matsunaga et al. developed a brain wave measurement system for evaluating the satisfaction of human beings and validated the hypothesis that people feel comfortable if the amount of information processing in the brain is small, while they feel uncomfortable if it is large [5]. In this paper, we also use the ratios of the alpha and beta rhythms to all brain waves and beta/alpha as indicators. Since these indicators are often used in studies of brain waves, our experimental results are easy to compare with the implications and insights of previous work.
3 Experiment

3.1 Overview

Using Microsoft Excel 2003 and Excel 2007, the most popular spreadsheet software, participants performed eight kinds of spreadsheet tasks, and we measured the participants' EEGs after each task. Excel 2003 and Excel 2007 are different versions of the software, released in 2003 and 2007, respectively. Both versions have almost the same functions but a different look and feel in their graphical user interfaces (GUIs). Excel 2007 has a new GUI called the "ribbon," which is designed to improve task performance and user experience. However, the newly designed ribbon interface introduced a lot of changes to menus, tool bars, and working windows. So, even if users would like to use a familiar function such as "Save As," they need to select a menu or button with a different name and/or position in the two versions. Furthermore, even if the names and positions of menus are unchanged, the design of the working window displayed after selecting a menu often differs from Excel 2003. In this way, not only new users but also existing users of Excel 2003 need to learn how to operate the new interface of Excel 2007.
Table 1. Subjects’ usage frequency of Excel 2003 and Excel 2007

Usage frequency          | Excel 2003 | Excel 2007
never                    | 0          | 6
several times per year   | 2          | 1
several times per month  | 3          | 2
several times per week   | 5          | 1
Using Excel 2003 and Excel 2007, we can investigate the relationship between software experience and the attributes of brain waves without the effect of functional differences. In our experiment, we also analyzed the relationship between the results of subjective evaluation by questionnaire and the attributes of brain waves.

3.2 Participants

Ten master's students from the graduate school of information science participated in the experiment. Table 1 shows the participants' usage frequency of Excel 2003 and Excel 2007. All participants had experience with Excel 2003 and understood its basic operations and functions, but half of the participants had never used Excel 2007.

3.3 Task

Participants performed eight tasks (four types of tasks for each version of Excel) operating spreadsheets given in advance. Table 2 shows the list of tasks used in the experiment. All tasks can be performed in both Excel 2003 and Excel 2007. The content of the data file used in this experiment (a grade report) is the same for all tasks. Participants could continue a task until the task time exceeded five minutes. We counterbalanced the order of the tasks to minimize learning and fatigue effects. The details of each task are as follows.

Same Place Task. The participant selects a particular menu that has the same name in the same position in both versions of Excel. The task is completed when the participant selects the objective menu.

Different Place Task. The participant selects a menu that has a different name in a different position in the two versions of Excel. The task is completed when the participant selects the objective menu.

Same Interface Task. This task uses functions that have the same modal dialog interfaces in the two Excel versions. The menu name and position were given to the participant before the task.

Different Interface Task. This task uses functions that have different dialog interfaces in Excel 2003 and Excel 2007. As with the Same Interface Task, the menu name and position were given to the participant before the task.
Table 2. The task list used in the experiment

Task type           | Task name                      | Description
Same place          | Open Clip Art Pane             | Open the clip art pane to select clip art from a list.
Same place          | Filter Setting                 | Set options for data filtering.
Different place     | Display of version information | Display the version information of Excel.
Different place     | Record of macros               |
Same interface      | Format Cells                   | Change date formats from Mar-01 to 03/01.
Same interface      | Page Orientation               | Change the page orientation to landscape and set margins.
Different interface | Conditional Formatting         | Indicate cells that have less than “C” or “Absence” as red font.
Different interface | Insert Bar Chart               | Insert a stacked bar chart of students' scores with chart/axis titles.
3.4 Environment

The Emotional Spectrum Analysis System ESA-16 was employed to record the participants' EEGs. After each task, we recorded the participant's EEG for two minutes at a 200 Hz sampling frequency in an eyes-closed, resting condition. Electrode locations were based on the International 10-20 System, shown in Figure 1. We adopted referential derivation to observe the EEG, using the right earlobe (A2) as the reference electrode. The center of the forehead (Fpz) was employed as the ground electrode, and the center of the parietal region (Pz) was used as the exploring electrode to minimize electromyogram (EMG) artifacts. We also recorded an electrocardiogram (ECG) from both arms. In addition, we used a headrest and an elastic net bandage to secure the electrodes placed on the head. Before the first task, each participant adjusted the height of the chair and the position of the mouse and keyboard.

3.5 Procedure

The procedure of the experiment was as follows.
1. Preparation: The authors informed the participant about the experiment and the EEG measurement.
2. Environment setting: The electrodes were placed on the participant at the points described in Section 3.4, and the EEG analyzer was set up.
3. Practice tasks: The participant performed two practice tasks to understand the procedure of EEG measurement. These tasks were excluded from the analysis.
4. Task: The participant performed one of the main tasks described in Table 2.
5. EEG measurement: After each task, the participant's EEG was measured.
6. Repetition: The participant performed tasks repeatedly until finishing all eight tasks and EEG measurements.
7. Questionnaire: After the tasks, the participant filled out the questionnaire described in Section 3.6.
3.6 Questionnaire

After the eight tasks, participants answered a questionnaire to investigate their subjective satisfaction with each version of Excel and their usage frequency of each function used in the tasks. The questionnaire was created by the authors based on the Questionnaire for User Interaction Satisfaction (QUIS). Each question about usage frequency used a four-point scale (from "Never" to "Few times per week"), and each question about subjective satisfaction used a seven-point scale (from "Strongly disagree" to "Strongly agree"). Figure 2 shows a part of the questionnaire sheet used in the experiment.
Fig. 1. Electrode Locations in the International 10-20 System
Brain wave time[sec]
Fig. 2. Questionnaire Sheet
Fig. 3. Two Kinds of Analysis Methods for Electroencephalogram
4 Analysis for EEG

We applied power spectral analysis to the EEG data we collected at a sampling frequency of 200 Hz. To understand clearly how the frequency components of brain waves changed over time in the setting of our experiment, and how the analysis results varied according to the length of the analysis window, we used the following two analysis methods for the EEG data. Figure 3 illustrates the difference between the analysis methods.
Method 1. Power spectral analysis at even intervals. This analysis aims to observe how brain waves change over time. We analyzed the EEG data in intervals of 5.12 seconds by cutting the entire EEG record into non-overlapping 5.12-second segments. Nineteen intervals (with start times from 0 to 92.16 seconds) were analyzed.

Method 2. Power spectral analysis using different lengths of time window. This analysis aims to observe how the analysis results differ according to the length of the analysis window. We analyzed the EEG data by increasing the time length of the analysis window in steps of 5.12 seconds, without changing the start position of the analysis. The time length was increased from 5.12 seconds (min.) to 97.28 seconds (max.) (i.e., 0~5.12 sec., 0~10.24 sec., ..., and 0~97.28 sec.).

Next, the target data was filtered to reduce artifacts from eye blinking, myoelectric activity, and so on. We used a high-pass filter (HPF, 3 Hz cutoff frequency, +6 dB/oct attenuation), a low-pass filter (LPF, 60 Hz cutoff frequency, -6 dB/oct attenuation), and a band-elimination filter (BEF, 60 Hz center frequency, 47.5 Hz~72.5 Hz stopband, second order). The band-elimination filter was used to remove the influence of the alternating-current power supply. After the EEG data was multiplied by a Hamming window and processed with the fast Fourier transform (FFT), we obtained the power spectrum of the EEG data.

From the obtained power spectrum, we calculated the respective proportions of the alpha rhythm and the beta rhythm to all brain waves, as well as beta/alpha, the ratio of the beta-rhythm proportion to the alpha-rhythm proportion. Following the standard EEG frequency classification, we set the frequency ranges of the alpha rhythm and the beta rhythm to 8~13 Hz and 13~30 Hz respectively, and the range of all brain waves to 3~30 Hz. Since the proportions of alpha and beta rhythms to all brain waves are widely used for observing various activities in the brain, we decided to use them as indexes for measuring the physiological state of subjects after the tasks. However, because the proportions and intensity of alpha and beta rhythms vary from individual to individual, comparing brain waves by absolute values would be inappropriate. In this paper, we therefore normalized each subject's EEG data by the average value of that subject's power spectrum before comparison.
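The processing chain just described (artifact filtering, Hamming windowing, FFT, band proportions, and the two windowing schemes) can be summarized in a short sketch. This is a minimal illustration, not the authors' implementation: Butterworth filter designs stand in for the paper's 6 dB/oct filters, and all function names are our own.

```python
# Minimal sketch of the EEG power-spectral pipeline described above.
import numpy as np
from scipy import signal

FS = 200  # sampling frequency [Hz], as in the experiment

def filter_eeg(eeg, fs=FS):
    """Apply HPF (3 Hz), LPF (60 Hz), and a 47.5-72.5 Hz band-elimination filter."""
    b, a = signal.butter(1, 3.0, btype="highpass", fs=fs)
    eeg = signal.filtfilt(b, a, eeg)
    b, a = signal.butter(1, 60.0, btype="lowpass", fs=fs)
    eeg = signal.filtfilt(b, a, eeg)
    b, a = signal.butter(2, [47.5, 72.5], btype="bandstop", fs=fs)
    return signal.filtfilt(b, a, eeg)

def band_proportions(window, fs=FS):
    """Return (alpha, beta, beta/alpha) proportions for one analysis window."""
    spec = np.abs(np.fft.rfft(window * np.hamming(len(window)))) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    total = spec[(freqs >= 3) & (freqs <= 30)].sum()   # "all brain waves": 3-30 Hz
    alpha = spec[(freqs >= 8) & (freqs < 13)].sum() / total
    beta = spec[(freqs >= 13) & (freqs <= 30)].sum() / total
    return alpha, beta, beta / alpha

# Method 1: non-overlapping 5.12 s segments (19 segments, start times 0..92.16 s).
def method1_windows(eeg, fs=FS, step_s=5.12, n=19):
    step = int(step_s * fs)
    return [eeg[i * step:(i + 1) * step] for i in range(n)]

# Method 2: windows growing by 5.12 s from a fixed start (5.12 s up to 97.28 s).
def method2_windows(eeg, fs=FS, step_s=5.12, n=19):
    step = int(step_s * fs)
    return [eeg[:(i + 1) * step] for i in range(n)]
```

The per-subject normalization described above would then divide each proportion by the average over that subject's own windows before any between-version comparison.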
5 Results

5.1 Results of the Power Spectral Analysis at Even Intervals

Figures 4, 5, and 6 respectively show the mean and the standard deviation of the alpha rhythm, the beta rhythm, and beta/alpha in the power spectral analysis at even intervals. In each graph, the left y-axis, the right y-axis, and the x-axis represent the mean, the standard deviation, and time, respectively. Figure 4 indicates that the mean of the alpha rhythms for Excel 2003 was larger than that for Excel 2007 after 56.32 seconds, and that the difference in alpha rhythms between Excel 2003 and 2007 was largest at 81.92 seconds. The standard deviation was comparatively small throughout and lowest at 56.32 seconds. Figure 5 shows that the mean of the beta rhythms for Excel 2007 was larger than that for Excel 2003 after 46.08 seconds, and the difference in beta rhythms between Excel 2003 and 2007
was greatest at 87.04 seconds. The standard deviation was higher on the whole than that of the alpha rhythms; the lowest standard deviation was observed at 40.96 seconds. Figure 6 shows that the mean of beta/alpha for Excel 2007 was larger than that for Excel 2003 after 46.08 seconds, and that the difference in beta/alpha between Excel 2003 and 2007 was largest at 81.92 seconds. The standard deviation was larger than those of the alpha and beta rhythms and smallest at 40.96 seconds.

5.2 Results of the Power Spectral Analysis Using Different Lengths of Time Window
Figures 7, 8, and 9 respectively show the mean and the standard deviation of the alpha rhythm, the beta rhythm, and beta/alpha in the power spectral analysis using different lengths of time window. In each graph, the left y-axis, the right y-axis, and the x-axis represent the mean, the standard deviation, and the time-window length, respectively.
Figure 7 shows that the mean of the alpha rhythms for Excel 2003 was larger than that for Excel 2007 when time windows over 40.96 seconds were used. As the time window became longer, the alpha rhythms for Excel 2003 tended to increase and the standard deviation decreased. Figure 8 shows that the mean of the beta rhythms for Excel 2007 was larger than that for Excel 2003 when time windows over 56.32 seconds were used. As the time window became longer, the standard deviation tended to decrease, as with the alpha rhythms. Figure 9 indicates that the mean of beta/alpha for Excel 2007 was larger than that for Excel 2003 when time windows over 40.96 seconds were used. As the time window became longer, the standard deviation tended to decrease, as with the alpha and beta rhythms.

Table 3. Results of the Questionnaire

                     Usage frequency  Understand  Productivity  Simple to use  Interface  Easy to use  Satisfaction
Excel 2003 Average        3.3            5.0         5.4            4.8           4.9         5.3          5.0
Excel 2003 SD             0.82           1.33        1.26           1.32          1.45        1.34         1.15
Excel 2007 Average        1.8            3.5         3.5            4.0           3.1         3.4          3.3
Excel 2007 SD             1.14           2.17        1.72           1.63          2.13        1.90         1.95
p < 0.05                  yes            no          no             yes           yes         yes          yes
5.3 Results of Questionnaire

Table 3 shows the mean, the standard deviation, and the result of a two-sample t-test for each questionnaire item. In the table, there were significant differences in "Productivity," "Interface," "Easy to use," and "Satisfaction" between Excel 2003 and Excel 2007. Our subjects gave Excel 2007 lower scores than Excel 2003.
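The per-item comparison behind Table 3 is a standard two-sample t-test. A minimal sketch with made-up scores (the variable names and values are hypothetical, not the study's data):

```python
# Hypothetical sketch of Table 3's per-item two-sample t-test.
from scipy import stats

# Seven-point satisfaction scores for one questionnaire item (made-up values).
excel2003 = [5, 6, 5, 4, 6, 5, 5, 4, 6, 5]
excel2007 = [3, 4, 2, 3, 4, 3, 5, 2, 4, 3]

t, p = stats.ttest_ind(excel2003, excel2007)
print(f"t = {t:.2f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```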
6 Discussion

From the results of the power spectral analysis at even intervals, we could confirm that all three indexes tended to be relatively stable after 56.32 seconds. Specifically, the alpha rhythms for Excel 2003 were larger than those for Excel 2007, and the beta rhythms and beta/alpha for Excel 2007 were larger than those for Excel 2003. The standard deviations of all three indexes were also lowest, or relatively low, from 40.96 seconds to 56.32 seconds. These results imply that the EEG data from 56.32 seconds to 61.44 seconds is stable across subjects, is less influenced by artifacts, and appropriately reflects the influence of the difference between Excel 2003 and 2007. However, for all indexes, the difference in influence between the two versions of Excel and the standard deviations fluctuated constantly. We consider that this might be due to individual differences in subjects' brain waves, fatigue from the long experiment duration, myogenic potentials caused by postural changes, and so forth. Therefore, to conduct an accurate usability evaluation, it is necessary to use a long time window rather than a short one.
From the results of the power spectral analysis using different lengths of time window, we observed that the alpha rhythms for Excel 2003 were stably higher when time windows over 40.96 seconds were used, the beta rhythms for Excel 2007 were stably higher when time windows over 56.32 seconds were used, and beta/alpha for Excel 2007 was stably higher when time windows over 40.96 seconds were used. For all indexes, the standard deviations tended to become smaller as the time window became longer. This might be because the proportion of the EEG influenced by artifacts becomes smaller as the time window becomes longer. These results suggest that we can analyze EEG data with little influence from artifacts by using a time window of over 56.32 seconds.

The results of our analysis showed that the alpha rhythms for Excel 2003 were larger than those for Excel 2007, and that the beta rhythms and beta/alpha for Excel 2007 were larger than those for Excel 2003. Previous studies on EEG measurement have clarified that the amount of alpha rhythm decreases, and the amount of beta rhythm and the value of beta/alpha increase, when a subject's mental workload is high. The results of the questionnaire showed that our subjects preferred Excel 2003 to Excel 2007. The results of our questionnaire and analysis thus agree with this previous work on EEG, and we can conclude that EEG measurement would be useful for evaluating software usability.
7 Conclusion and Future Work

In this paper, we have conducted an experiment to gain a clear understanding of the appropriate timing and length of the time window for analyzing EEG data for accurate usability evaluation. From the results of the experiment, we obtained the following insights.
• A short time window (e.g., 5.12 seconds) is not suitable for usability evaluation because the frequency components of brain waves fluctuate constantly.
• The accuracy of usability evaluation can be improved by using a time window of 56.32 seconds or longer.
In this experiment, we could observe only EEG influenced by the tasks, not the normal (resting) state of the EEG. In the near future, we need to conduct another experiment to observe the latter.
References
1. Ericsson, K.A., Simon, H.A.: Protocol Analysis: Verbal Reports as Data. MIT Press, Cambridge (1993)
2. Osgood, C.E., Suci, G.J., Tannenbaum, P.H.: The Measurement of Meaning. University of Illinois Press, Urbana (1957)
3. Chin, J.P., Norman, K.L., Shneiderman, B.: Subjective User Evaluation of CF PASCAL Programming Tools. Technical Report CAR-TR-304 (1987)
4. Hart, S.G., Staveland, L.E.: Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In: Hancock, P.A., Meshkati, N. (eds.) Human Mental Workload, pp. 139–183. Elsevier, Amsterdam (1988)
5. Matsunaga, H., Nakazawa, H.: A Study on Human-Oriented Manufacturing System (HOMS) – Development of Satisfaction Measurement System (SMS) and Evaluation of Element Technologies of HOMS using SMS. In: Int. Conf. Manufacturing Milestones toward 21st Century, pp. 217–222 (1997)
Automated Analysis of Eye-Tracking Data for the Evaluation of Driver Information Systems According to ISO/TS 15007-2:2001

Christian Lange, Martin Wohlfarter, and Heiner Bubb

Lehrstuhl für Ergonomie, Technische Universität München, Boltzmannstrasse 15, 85747 Garching
{lange,m.wohlfarter,bubb}@lfe.mw.tum.de
Abstract. First, the most important content of the ISO/TS 15007-2:2001 standard for performing eye-tracking experiments is described. The text then gives a detailed description of how gaze experiments using the Dikablis eye-tracking system are conducted according to the above-mentioned standard, and of how statistical evaluations of the recorded data can be automated and visualized.

Keywords: ISO/TS 15007-2:2001, Eye tracking, Driver Assistance Systems, Driver Information Systems.
1 Introduction

With the guidelines of the European Statement of Principles (ESoP), the need for good and less distracting design of driver information systems will grow enormously. Future driver information systems will have to hinder or distract the driver from the driving task as little as possible. The advantage of standardized experimental conduct lies in improved comparability between experiments and in faster, less error-prone processing; the duration of the analysis of the collected data is thereby enormously reduced. In the following, we show that the standardized conduct of gaze experiments using the Dikablis Toolkit complies with the ISO/TS 15007-2:2001 standard.
Fig. 1. Workflow schema for standardized testing and experimentation with Dikablis according to ISO/TS 15007-2:2001
With the Recording Software, the gaze behavior of the test subject can be recorded precisely. The test subject wears the Dikablis Head Unit, on which two cameras are installed. One camera is directed towards the test subject's eye and is used to track the subject's gaze behavior (pupil movements). The other camera is directed straight ahead in front of the subject and monitors the subject's environment. By processing both of these video streams, the gaze direction of the test subject can be determined almost precisely. Offline re-calibration and post-processing of the eye detection can be performed with the Dikablis Analysis Software on the recorded data, even after the recording software has finished recording. These possibilities to adjust and post-process help guarantee clean and analyzable test results under almost any circumstances.

The D-LAB module contains a package for conducting experiments, from test planning and the definition of Areas of Interest (AOIs) visible within the gaze region, to the automatic calculation of glance durations and the graphical presentation of these results. The procedure for standardized testing and experimentation is described subsequently. Herein, the capital and lowercase letters always refer to the procedural (workflow) plan shown in Figure 1 and to the relationships between the small angled boxes located within the figure.
2.1 Construction of an Experimental Plan

The test partitioning required by ISO/TS 15007-2:2001 into "experimental condition," "task," and "subtask" can be defined in one test plan with D-LAB (see Figure 2, left). This way, a test can be represented in the form of intertwined and nested intervals in which:
• an "experimental condition" spans an entire experiment (e.g., driving on country roads);
• a "task" defines the interaction with a particular system within the experiment (e.g., operation of a navigation system);
• a "subtask" is a specification of a "task" (e.g., operation of a navigation system via a touch-screen display).
D-LAB offers the possibility to additionally define "subsubtasks" as a fourth layer, e.g., to mark the appearance of critical events within the experiment (e.g., sharp braking situations) or for automatic analysis of a display screen (e.g., an individual input screen within the navigation system, such as inputting a destination address).
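Such a nested interval structure (experimental condition, then task, subtask, and subsubtask) can be pictured as a simple tree. The sketch below is purely illustrative and is not D-LAB's actual data model:

```python
# Illustrative sketch (not D-LAB's actual data model) of the nested
# trial plan: experimental condition -> task -> subtask -> subsubtask.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interval:
    name: str
    level: str                      # "condition", "task", "subtask", "subsubtask"
    children: List["Interval"] = field(default_factory=list)

plan = Interval("driving on country roads", "condition", [
    Interval("operate navigation system", "task", [
        Interval("input via touch screen", "subtask", [
            Interval("enter destination address", "subsubtask"),
        ]),
    ]),
])
```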
Fig. 2. Left: constructing a test sequence; Right: Automatically created block shifting diagram/interface for online test segment sequence marking
2.2 Record Gaze Behavior and Mark the Beginning and End Points of Each Trial Interval

Using one of the integrated D-LAB applications, one can automatically create a shifting-block interface from the trial definitions, shown on the right side of Figure 2. Each shifting block represents a test segment. By pressing a block, a network event is created, which marks the start or end of a task interval directly in the Recording Software. The functionality of the Dikablis recording software is described
in detail in Lange et al., 2006a and Lange et al., 2006b. These events mark the beginning and the end, respectively, of a trial segment and are saved synchronously with the calculated gaze data. The events can also be triggered from another data recorder, such as a driving simulator.

2.3 Validating Gaze Data and Trial Intervals

The Dikablis analysis software supports validating and optimizing the gaze data after a trial. In preparation, the gaze data is re-calibrated after the trial if necessary, so as to optimize offline pupil recognition (see Lange et al., 2006a and Lange et al., 2006b). After this optional adjustment, further processing follows in D-LAB. The first step consists of checking whether all trial intervals were marked correctly. The marked trial intervals are shown under the gaze player window, synchronized to the playtime line on the user interface (see Figure 3). D-LAB also offers a testing function which automatically identifies inconsistencies within the trial segments, as sketched below. Segments can be manually adjusted or changed, and tasks can be added or deleted. Figure 3 shows the D-LAB interface for the management of a trial interval.
Fig. 3. Validation and post-processing of a trial segment in D-LAB
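The consistency check mentioned above essentially verifies that every start event has a matching end and that intervals on the same level do not overlap. A rough sketch of such a check (our own formulation, not D-LAB code):

```python
# Rough sketch (not D-LAB code) of checking marked trial intervals:
# every interval needs a start and a matching end, and sibling
# intervals must not overlap.
def check_intervals(intervals):
    """intervals: list of (name, start_s, end_s); end_s may be None."""
    problems = []
    for name, start, end in intervals:
        if end is None:
            problems.append(f"{name}: start event without matching end")
        elif end <= start:
            problems.append(f"{name}: end ({end}) not after start ({start})")
    closed = sorted((i for i in intervals if i[2] is not None),
                    key=lambda i: i[1])
    for (n1, _, e1), (n2, s2, _) in zip(closed, closed[1:]):
        if s2 < e1:
            problems.append(f"{n1} overlaps {n2}")
    return problems

# Example: the second interval was never closed; the third overlaps the first.
print(check_intervals([("task A", 0.0, 12.5),
                       ("task B", 12.5, None),
                       ("task C", 10.0, 20.0)]))
```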
2.4 Definition of Areas of Interest

The areas of interest used in the calculation of glance durations in relevant gaze regions (e.g., the navigation system display, the street view, the rear-view mirror, the left side mirror, the dashboard, etc.), as required by ISO/TS 15007-2:2001, are defined using D-LAB's functionality for marking specific AOIs. In D-LAB, an arbitrary number of AOIs can be defined in the form of polylines and labeled with names. Figure 4 shows the Head-Up Display and the Dashboard as defined AOIs.
Fig. 4. Defining selected AOIs within the gaze region such as the Head-Up Display and the Dashboard
2.5 Automatic Calculation of Glance Metrics for the Defined AOIs

In order to automatically calculate glance duration and glance frequency, the pupil of the test subject as well as his/her head position in relation to the test environment (relative to the defined AOIs) must be recognized. The determination of the head position is carried out with the help of so-called markers (see the square black-and-white object in Figure 4), which provide the environment reference. These markers are found using image processing in the image of the field camera and are used to calculate the head position of the test subject. For every defined AOI, the D-LAB function "Calculate Gaze Durations" computes when the test subject glanced at the AOI and when their glance left it. The result of this calculation is displayed analogously to the graphical representation of the trial intervals, under the gaze film player window and synchronous with the playtime line. Hence, the operator can check the calculation at any time and manually correct it if necessary.

The gaze-specific values from the ISO/TS 15007-2 standard can be calculated from the automatically determined glance durations on the defined AOIs. The operator can input the values to be calculated for a gaze experiment in the form of an "Analysis Series"; for this, the specific values pertaining to specific tasks and AOIs must be defined. The following values are available to choose from (a sketch of how such metrics can be derived follows the list):
• Total Glance Time
• Glance Frequency
• Time off road-scene-ahead
• Total glance time as a percentage
• Fixation probabilities
• Link value probabilities
• Maximum Glance Duration
• Mean Glance Duration
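As a rough illustration, several of these metrics can be derived directly from the per-AOI glance intervals. The data layout and function names below are our own assumptions; this is not D-LAB's implementation.

```python
# Sketch (assumed data layout, not D-LAB's implementation) of deriving
# glance metrics from a list of glance intervals on one AOI.
def glance_metrics(glances, task_duration_s):
    """glances: list of (start_s, end_s) glance intervals on one AOI
    within a single task interval."""
    durations = [end - start for start, end in glances]
    total = sum(durations)
    return {
        "total_glance_time_s": total,
        "glance_frequency": len(durations),
        "total_glance_time_pct": 100.0 * total / task_duration_s,
        "max_glance_duration_s": max(durations, default=0.0),
        "mean_glance_duration_s": total / len(durations) if durations else 0.0,
    }

# Example: three glances at the navigation display during a 60 s task.
print(glance_metrics([(2.0, 3.1), (10.4, 11.0), (25.2, 26.9)], 60.0))
```

Time off road-scene-ahead, fixation probabilities, and link value probabilities would additionally require the glance intervals of the other AOIs and the transition sequence between them.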
D-LAB then calculates the requested values and saves the results in the form of a text file. These automated calculations can be conducted for all defined AOIs, all terms of the experiment plan, and all gaze metrics. The text file is structured so that it can be opened directly for further statistical analysis in the SPSS statistics program.

2.6 Graphical Representation of Calculated Metrics

In addition to the calculation of values, D-LAB offers graphical representation of trial results. For this purpose, several graphical diagrams are displayed which support the interpretation of the results. Figure 5 shows two examples. At the top, the progressive course of glance duration on the defined AOIs for a single subtask is shown for four independent test subjects. At the bottom, the course of the users' mean glance fixation duration in a critical situation is shown, in order to allow conclusions about mental strain.
Fig. 5. Top: Glance duration on the defined AOIs on all trial experiments. Bottom: User glance fixation duration in critical situations.
References
1. ISO/TS 15007-2:2001: Road vehicles - Measurement of driver visual behavior with respect to transport information and control systems - Part 2: Equipment and procedures
2. Lange, C., Yoo, J.-W., Wohlfarter, M., Bubb, H.: Dikablis - Operation mode and evaluation of the human-machine interaction. In: Spring Conference of the Ergonomics Society of Korea, Seoul, May 12 (2006)
3. Lange, C., Wohlfarter, M., Bubb, H.: Dikablis - engineering and application area. In: Proceedings of IEA 2006, 16th World Congress on Ergonomics, Maastricht, The Netherlands (2006)
Brain Response to Good and Bad Design

Haeinn Lee (1), Jungtae Lee (2), and Ssanghee Seo (2)

(1) 107A Kiehle Visual Arts Center, 720 Fourth Avenue South, St. Cloud, MN 56301-4498, USA
(2) 6409-1 Dept. of Computer Science & Engineering, Pusan National University, Jangjeon-dong, Geumjeong-gu, Busan 609-735, Republic of Korea
Abstract. This paper examines how the decision of whether a design is good or bad results from human brain processes. Our research team used functional MRI and electroencephalogram (EEG) techniques to address the question of how the brain responds while subjects view different designs. Classifying the designs as good or bad, subjects pressed a mouse button to indicate their perception, and we analyzed their patterns of EEG rhythms and fMRI. The fMRI results showed that the perception of different feelings about designs is associated with the frontal lobe and the occipital lobe. After analyzing the EEG with the event-related brain potential (ERP) method, we also found that the amplitude of ERP components in the perception of bad design is greater, and the latency shorter, than for good design. The human brain therefore responds sooner and more strongly to the perception of a bad feeling.

Keywords: Human Behaviors, EEG, fMRI, ERP, Interaction and Interface Design, Usability Test, Brainwork, Visual Brain.
Brain research has explored diverse aspects of many different major areas such as medical science, engineering, psychology, biotechnology, linguistics, economics, music, etc. This research has provided new insight into underlying disease mechanisms and is beginning to suggest new treatments. Using the Neuro-Linguistic Programming (NLP) method, researchers also seek to understand human behavior and treat the human mind to lead a better life [2]. The brain controls body activities, feels the five senses, and shapes thoughts, hopes, dreams, and imagination. In short, the brain is what makes people human [1].

Recently, there has been much brain research relating to linguistics and musical activities. For example, a person hears two different sentences, "He takes coffee with cream" and "He takes coffee with dog," while his or her brain activity is examined. There are clear distinctions between the two sentences: the human brain reacts strongly when people hear the latter, abnormal sentence. Likewise, there are differences in brain activity when people hear natural and unnatural melodies in music [3,4,5].

This can be linked to the areas of art and design, so that research can address how the human brain acts when people engage in artistic activities, and how people feel when they look at artistic works or designs. If people look at the smile of the Mona Lisa, they usually get a good feeling. On the other hand, if people look at a photo showing a terrible car accident, they get a bad feeling. Accordingly, if we can clarify and measure how the brain works when people get a good or bad feeling, we can arrive at an objective evaluation of which aspects of a design give people a good or bad feeling, and determine the main aspects of a design with good value.

We have used functional MRI and EEG to address the question of how the brain responds while subjects view different designs. Twenty-five subjects participated in this research and viewed fifty design images in random order. Classifying the designs as good or bad, subjects pressed a mouse button to indicate their perception. We analyzed their patterns of EEG rhythms and fMRI to find the characteristics of brain activity for good and bad designs.
2 Brain Structure and Movement

2.1 Brain Structure

As I mentioned before, the brain is divided into four sections: the occipital lobe, the temporal lobe, the parietal lobe, and the frontal lobe, and functions such as vision, hearing, and speech are distributed across these regions. The occipital lobe is located at the back of the brain and plays a role in processing visual information (Fig. 1). The parietal lobe plays a role in sensory processes, particularly spatial sense and navigation, attention, and language. The frontal lobe has a role in controlling movement and in planning and coordinating behavior. The temporal lobe is involved in auditory processing and is home to the primary auditory cortex [1].
Fig. 1. Brain Classifications [1]
2.2 Brain Movement and Technology

Functional magnetic resonance imaging (fMRI). People use MRI, which provides high-quality, three-dimensional images of organs and structures inside the body, to examine body structure. To examine brain activity, however, one needs fMRI, which is the most popular neuroimaging technique today. This technique compares brain activity under resting and active conditions, with high spatial resolution, on a signal that is a correlate of neuronal activity. It allows detailed maps of the brain areas underlying human mental activities in health and disease. Our team used fMRI to find out which parts of the brain are more activated when subjects view good or bad designs [1].

Electroencephalogram (EEG). Many of the recent advances in understanding the brain are due to the development of techniques that allow scientists to directly monitor neurons. The electroencephalogram is the recording of electrical activity along the scalp produced by the firing of neurons within the brain. In this method, electrodes placed at specific locations on the scalp, which vary depending on which sensory system is being tested, make recordings that are then processed by a computer [1].

EEG has several strengths as a tool for exploring brain activity; for example, its time resolution is very high (on the level of a single millisecond), and it measures the brain's electrical activity directly, whereas other methods of observing brain activity, such as PET and fMRI, have a time resolution between seconds and minutes. EEG can also be used simultaneously with fMRI, so that high-temporal-resolution data can be recorded at the same time as high-spatial-resolution data [6].

Our team used EEG to address the question of how the brain responds while subjects viewed images. The most important consideration in using EEG is where to place the electrodes to measure the brain waves. Electrode locations and names are specified by the 10-20 system for most research applications. This system is an internationally recognized method of describing and applying the location of electrodes in the context of an EEG test [6] (Fig. 2).
Fig. 2. Electrodes location to check the EEG (10-20 System) [8]
Event-related potential (ERP). During the experiment, we used event-related brain potentials (ERPs), a method of examining the electroencephalogram at the moment a subject receives an event, such as viewing a design. An ERP is any measured brain response that is directly the result of a thought or perception. It can be reliably measured using the EEG, a procedure that measures the electrical activity of the brain through the skull and scalp [7].

There are two important components in the ERP waveform: the P300 and the N400. The N400 component is described as a negative voltage deflection occurring approximately 400 ms after stimulus onset, whereas the P300 component is described as a positive voltage deflection approximately 300 ms after stimulus onset. The presence, magnitude, topography, and timing of these signals are often used as metrics of cognitive function in decision-making processes. While the neural substrates of these ERPs remain hazy, the reproducibility of the signals makes them a common choice for related research [6].
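As an illustration of how P300 amplitude and latency can be read off an averaged ERP waveform, consider the following sketch. The 250-500 ms search window and the data layout are our assumptions, not taken from the paper.

```python
# Sketch (assumed search window and data layout, not the paper's code)
# of reading P300 amplitude and latency off an averaged ERP waveform.
import numpy as np

def p300_peak(erp_uv, fs, window=(0.25, 0.50)):
    """erp_uv: averaged waveform in microvolts, time-locked to stimulus onset;
    fs: sampling rate in Hz. Returns (amplitude_uv, latency_s) of the most
    positive deflection inside the assumed 250-500 ms search window."""
    t = np.arange(len(erp_uv)) / fs
    mask = (t >= window[0]) & (t <= window[1])
    i = np.argmax(erp_uv[mask])
    return erp_uv[mask][i], t[mask][i]
```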
3 Brain Response to Good and Bad Design

Our team used functional MRI and EEG to address the question of how the brain responds while subjects viewed different designs. Twenty-five subjects participated in this research and viewed fifty design images in random order. First, as I am a professional graphic designer, I subjectively chose twenty-five designs that evoked good feelings and twenty-five that evoked bad feelings (Fig. 3). Because the author's judgment of good or bad design was purely subjective, the individual subjects were asked to categorize each design as good or bad themselves. As they viewed each image, they clicked the right mouse button to classify the image as good, or the left mouse button to classify it as bad. As they did this, we analyzed their patterns of EEG rhythms and fMRI.
Fig. 3. Design image examples for experiment
Fig. 4. The fMRI response of good design
3.1 The Results of the fMRI Based on Good and Bad Design

Our team used an fMRI machine (ISOL FORTE) in the Brain Science Research Center at the Korea Advanced Institute of Science and Technology (KAIST). The scanning parameters were as follows: TR=3000 ms, TE=35 ms, number of slices=25, FOV=24 cm, image matrix=64x64, slice thickness=5 mm. The scenario of the experiment was as follows:
1. Show "+" on a screen for 3 seconds.
2. Show a good or bad design, chosen randomly, for 3 seconds.
3. Show "+" on a screen for 3 seconds again.
We repeated this process to show all fifty designs. At the same time, subjects pressed a mouse button to indicate their perception of good or bad design.

Figure 4 presents the parts of the brain that showed activity when one subject saw a good design. The top-left brain image shows the midsagittal section, the top-right image shows the coronal section, and the bottom image shows a horizontal section. The left part of the midsagittal section is the location of the occipital lobe, and the right part is the location of the frontal lobe. The top part of the horizontal section corresponds to the right part of the midsagittal section, and the bottom part to the left part.
Fig. 5. The fMRI response of bad design
The table at the bottom of Figure 4 shows the coordinates at which the brain was activated. Using these coordinates, we can identify the Brodmann areas among the activated brain regions. The most activated areas were Brodmann areas 19, 7, 11, and 6, which lie in the occipital lobe, the parietal lobe, the prefrontal cortex, and the frontal lobe. The decisions to judge images as either good or bad are made in the occipital lobe and the frontal lobe. Brodmann area 7, which is involved in visual-motor coordination, is located in the parietal lobe, but it is close to the occipital lobe.

Figure 5 presents the parts of the brain that showed activity when one subject saw a bad design. When subjects judged a design as bad, the brain was activated in Brodmann areas 18, 19, 7, 11, 6, and 20, which lie in the occipital lobe, parietal lobe, prefrontal cortex, frontal lobe, and temporal lobe. Brodmann area 20, which is involved in high-level visual processing and recognition, is located in the temporal lobe. As with good designs, the occipital lobe, frontal lobe, and parietal lobe were activated while subjects looked at a bad design. We noticed differences in the specific activated areas; for example, the brain activity when subjects had a bad feeling was much stronger than when they had a good feeling. Also, there was little difference in activity between the left and right brain when the subject perceived the design as good, but left-brain activity was strong when subjects perceived the design as bad.

3.2 The Results of the EEG Based on Good and Bad Design

Based on the fMRI results, our researchers decided to attach electrodes in the areas of the occipital and frontal lobes, i.e., Fp1, Fp2, O1, and O2 in the 10-20 system, and to examine the EEG in these areas (see Fig. 2). Twenty-five subjects participated in this research, and the scenario of the experiment was as follows:
1. Show "+" on a computer screen for 1 minute to bring the brain to a steady state.
2. Show a good or bad design, chosen randomly, for 3 seconds.
3. Subjects press a mouse button to indicate their perception of good or bad design.
4. Show "+" on a computer screen for 1 minute to return the brain to a steady state.
We repeated this process for all fifty designs. During the experiment, we used event-related brain potentials (ERPs) to check the EEG at the moments when subjects received the different feelings.

Figure 6 shows the amplitude at the occipital lobe when the subjects perceived a design as good or bad. The latency of the P300 component at channels O1 and O2 was shorter when subjects had a bad feeling than when they had a good feeling. The amplitude for the perception of a bad feeling was also higher than that for the perception of a good feeling. Figure 7 shows the difference in latency based on the subjects' perception of good or bad design. In this figure, the x-axis represents the latency (B) for a bad design subtracted from the latency (G) for a good design, and the y-axis is the subject number. Most points lie in the positive area, which means that the latency (G) for a good design is longer than the latency (B) for a bad design. Therefore, the human brain responds sooner and more strongly when subjects perceive a design as bad. We also examined the latency at the frontal lobe when the subjects perceived a design as good or bad, and the result was similar to that for the occipital lobe: as mentioned before, the human brain responds sooner to the perception of a bad feeling.
However, these results are mean values across subjects; there were individuals who responded sooner to the perception of a good feeling.
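The per-subject comparison in Figure 7 then reduces to a simple latency difference, latency(G) minus latency(B). A sketch with hypothetical values (not the study's data):

```python
# Hypothetical sketch of the Fig. 7 analysis: per-subject P300 latency
# difference, latency(good) - latency(bad). Positive values mean the
# brain responded sooner to designs perceived as bad. Values are made up.
latency_s = {
    "S01": {"good": 0.41, "bad": 0.36},
    "S02": {"good": 0.38, "bad": 0.35},
    "S03": {"good": 0.33, "bad": 0.37},  # an exception: faster for good
}
for subject, lat in latency_s.items():
    print(subject, f"{lat['good'] - lat['bad']:+.3f} s")
```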
Fig. 6. Comparison of the occipital lobe characteristic
Fig. 7. Comparison of the occipital lobe latency
4 Conclusion

With advances in brain measurement and imaging technologies such as EEG, fMRI, PET, and SPECT, brain research has become a contemporary issue. Based on this, many different major areas such as medical science, engineering, psychology, linguistics,
biotechnology, economics, and music have explored diverse aspects of brain research. Hence, there should be many possibilities for brain research related to art and design.

When one decides that something is good or bad, this decision is a result of human brain processes. It is obvious that the brain works differently depending on whether the feeling is good or bad; in other words, the brain will respond in different ways when people look at good design or bad design. Although judgments of design value differ between individuals, in the case of a masterpiece most people acknowledge its artistic value. Accordingly, if we can clarify how the brain works when people look at good and bad design, we can arrive at an objective evaluation of how people judge design value, and determine the main aspects of a design with good value.

We have used functional MRI and EEG to address the question of how the brain responds while subjects viewed different designs. The fMRI results show that the perception of different feelings about designs is associated with the frontal lobe and the occipital lobe. The occipital lobe is located at the back of the brain and contains the visual cortex, so it processes visual information. The frontal lobe is located at the front of the brain and takes care of complicated functions such as thinking, planning, and deciding. Based on the fMRI results, our researchers decided to attach electrodes in the areas of the occipital and frontal lobes and examine the EEG there. During the experiment, we used the ERP method to check the EEG at the moments when subjects received events such as viewing designs. After analyzing the EEG by the ERP method, we found that the amplitude of ERP components in the perception of bad design was higher, and the latency shorter, than for good design.

From all examinations, we found significant distinctions between individuals due to brain characteristics: some people show strong left-brain activity and others do not. But in general, the human brain responds sooner and more strongly to the perception of a bad feeling. Thus, we assume that a slower response with a longer latency can serve as an objective indicator of good design value. Considering this, we can also determine what aspects create good design value and apply them as a solution to create better designs. In the future, this research can be applied not only to pure design but also to human interface design. One of the most important issues in human interface design is usability testing, which aims to grasp a person's individual feeling and measure emotional satisfaction. The idea of this paper can serve as a method of usability testing that classifies interface designs as good or bad and links this with ease of use.

Acknowledgments. I want to give a special thank you to my research team and family: my father, mother, and brother, for their constant and unconditional love, encouragement, and support. I am deeply indebted to them and dedicate this study to them.
References
1. Society for Neuroscience: Brain Facts, A Primer on the Brain and Nervous System (2005), http://www.sfn.org
2. Bear, M.F., Connors, B.W., Paradiso, M.A.: Neuroscience: Exploring the Brain, 3rd edn. Lippincott Williams & Wilkins (2006)
3. Lee, J.Y.: Neurophysiology and Brain-imaging Study of Music - music & language, music & emotion. Nangman Music Magazine 18(3) (2006)
4. West, W.C., Rourke, T., Holcomb, P.J.: Event-Related Brain Potentials and Language Comprehension: A Cognitive Neuroscience Approach to the Study of Intellectual Functioning. Tufts University (1998)
5. Lu, H., Wang, M., Yu, H.: EEG Model and Location in Brain when Enjoying Music. In: Proceedings of the 2005 IEEE Engineering in Medicine and Biology Conference, Shanghai, China, pp. 2695–2698 (2005)
6. Wikipedia: http://en.wikipedia.org/wiki/EEG
7. Coles, M.G.H., Rugg, M.D.: Event-related brain potentials: an introduction. In: Electrophysiology of Mind, pp. 1–27. Oxford Scholarship Online Monographs (1996)
8. Kim, D.S.: Electroencephalogram. Korea Medical (2001)
An Analysis of Eye Movements during Browsing Multiple Search Results Pages

Yuko Matsuda, Hidetake Uwano, Masao Ohira, and Ken-ichi Matsumoto

Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara, Japan
{yuko-m,hideta-u,masao,matumoto}@is.naist.jp
Abstract. In general, most search engines display a certain number of search results on a search results page at one time, separating the entire search results into multiple search results pages. Therefore, lower-ranked results (e.g., the 11th-ranked result) may be displayed in the top area of the next (second) page and might be more likely to be browsed by users than results displayed at the bottom of the previous (first) results page. To better understand users' activities in web search, it is necessary to analyze the effect of the display positions of search results while browsing multiple search results pages. In this paper, we present the results of our analysis of users' eye movements. We conducted an experiment to measure eye movements during web search and analyzed how long users spend viewing each search result. From the analysis results, we found that search results displayed at the top of the latter page were viewed for a longer time than those displayed at the bottom of the former page.

Keywords: Eye tracking, Web search, User activity, Search results page.
When a large number of search results exist, most search engines separate the entire results into multiple search results pages and display one results page at a time (e.g., 10 search results per page; the number of results displayed on one page depends on user preference). In this case, lower-ranked results (e.g., the 11th result) may be displayed in the top area of the next page and might be more likely to be browsed by users than results displayed at the bottom of the previous (first) results page. Therefore, it is important to analyze and understand user activities in searching multiple search results pages in order to provide users with a means or a new interface that enables them to more naturally browse search results in the ranked order recommended by a search engine. In this paper, we conduct an experiment to observe the effects of search result positions on user activities in web search. We measured eye movements during web search tasks and analyzed them to understand how long users spend browsing each result.
2 Related Work

There have been many studies on user activities in web search. A popular approach to analyzing user activities in web search is to use browsing histories or access logs [6][7][8]. Although using history data helps us understand a user's access paths or interests in a specific web page, it cannot be used for analyzing a user's attention to search results during web search or the influence of the displayed positions of search results on web search activities.

Another approach is to use eye-tracking instruments to capture user activities based on eye movements. Cutrell et al. [1] used eye tracking to analyze the influence of the length of the site summary (snippet text) presented in web search results on a user's search activities. The results of their experiments indicated that a long site summary decreases search time and increases a user's search correctness in informational search tasks, where users are trying to find web pages that include some kind of information. In contrast, in navigational search tasks, where users are trying to find a specific web site or homepage, a long summary increases search time and decreases the user's search correctness. Guan et al. [2] also measured eye movements and analyzed the influence of the positions of the target results that users are looking for. They found that users took longer to find the target results and were less successful in finding them when the targets were placed in low positions on a search results page. Although that study provides useful insights into the design of a new interface for web search engines, it only focused on the first search results page, so it still fails to capture user activities in browsing multiple search results pages. Lorigo et al. [9] analyzed differences between task types in web search. They reported that users performing informational search tasks took longer to complete those tasks than navigational ones and spent more time on the pages linked from search results. However, that study also did not focus on the time spent on each search result or on user interactions with multiple search results pages.

In this paper, using multiple search results pages and two types of tasks (i.e., informational and navigational tasks), we measure and analyze users' eye movements on each set of search results to provide a new understanding of user activities in web search.
3 Experiment

3.1 Overview

To observe how users look at each set of search results while browsing multiple search results pages, we analyzed the total time of eye movements on each search result. In the experiment, participants were asked to search for appropriate web pages (target results) from Google search results pages to find particular information with predetermined words. The information and words were specified by the experimenters, who measured the eye movements of the participants during the tasks. Analyzing users' eye movements helps us understand how users browse search results. In the experiment, WebTracer [10] was used as the eye-tracking system. WebTracer allows us to collect and analyze data on a user's eye movements and operations (e.g., mouse and keyboard operations) during web search. After the tasks, participants answered a questionnaire about their usual search activities and were interviewed about the interactions observed in the tasks. The participants were 21 undergraduate students studying information science. All participants used web search in daily life and used Google as their main search engine.

3.2 Apparatus

In the experiment, the following equipment was used.
• Display: 21-inch LCD monitor (viewable screen size: H30 x W40 cm, resolution: 1,024 x 768 pixels)
• Distance from subject's face to display: approx. 50 cm
• Device for measurement of sight line: NAC EMR-NC (view angle: 0.28 degrees, resolution on the screen: approx. 2.4 mm)
• Recording and playback of sight-line data: WebTracer

3.3 Task

The tasks performed by participants were (1) to search for appropriate web pages linked from the search results pages, (2) to find particular information specified by the experimenters, and (3) to bookmark the target pages.
Fig. 1. Example of Rearrangement Search Results
The time limit for each task was ten minutes, whether or not participants could complete the task. In the experiment, participants had to use predetermined words and were prohibited from changing the search words during the tasks. Since the purpose of this experiment was to observe users' activities in using multiple web search results pages, participants were only permitted to move to web pages linked from Google's search results. The order of the tasks was counterbalanced to control for learning effects. The tasks themselves were based on the test collection provided by NTCIR (NTCIR4 WEB) [11][12]. In this experiment, two types of tasks were selected as follows.
• Informational Task: required participants to find specific information (e.g., web pages including information on university entrance exams). The task was completed by finding three web pages linked from target results and bookmarking them.
• Navigational Task: required participants to find specific web pages (e.g., the official web page of a university). The task was completed by finding a web page linked from a target result and bookmarking it.
Each participant performed ten tasks (five tasks of each type).

3.4 Design of Web Search Results Pages

To prevent bias from the number of target results and their positions, we modified the results pages, which were saved on a local computer when we searched with Google. The participants performed the search tasks with the modified search results pages. A previous study showed that users search about 2.35 pages [3]; therefore, we prepared three search results pages and allocated target results randomly on them. Advertisements and information irrelevant to the web search were removed. We used Google's default setting, in which 10 search results are displayed at a time. In addition, Google's original (unmodified) search results pages were used for the fourth and later search results pages. Note that each search result and the display positions of the search results followed Google's page rank. To prevent participants from finishing their search on the first page, we allocated target results on the second or third page. Figure 1 shows an example of inserting target search results.
Fig. 2. Inserted Positions of Target Search Results
For the design of the search results pages, we prepared four rearrangement patterns of search results (Figure 2). In the Informational Task, we displayed target results at the top (I-1), middle (I-2), and bottom (I-3), or distributed evenly (I-4), on the second and third search results pages. In the Navigational Task, we displayed target results at the top (N-1, N-3) and bottom (N-2, N-4) of the second and third search results pages. Since participants might notice the experimenters' intention (i.e., that target results were displayed only from the second page onward), we inserted dummy tasks (with original Google search results) among the tasks.

3.5 Experimental Procedure

1. Explanation of the experiment and preparation: Experimenters explained the experiment and the eye-tracking system to participants.
2. Configuration of the eye-tracking system: We configured the devices for measuring the sight line and calibrated them.
3. Practice task: To understand the flow of the experiment, participants practiced one task. The task was an Informational Task, and original Google search results pages were used.
4. Performing the tasks: Experimenters explained each search task and participants started to search. This was repeated until all the tasks were finished.
5. Questionnaire: At the end of the experiment, participants were asked to answer a questionnaire about their daily use of web search engines.
6. Interview: Participants were also interviewed about characteristic activities observed during the tasks.
4 Results

4.1 Eye Movement

Figure 3 shows an example of the eye movements gathered in the experiment. The vertical axis shows the position (rank) of the search results and the horizontal axis shows the time at which the sight line appeared. In the figure, the horizontal lines describe eye movements on search results and the circle shows a user click on a search result. The figure shows that this user scanned the results from top to bottom, and that the total time of eye movements on the clicked result was longer than on the other results.
Fig. 3. Eye Movement and Clicked Search Result during the Task
Table 1. Classification of Search Completion Pages

Group  Last Search Result  Last Search Results Page  Informational Tasks  Navigational Tasks
G0     ~5                  Less than 1               6                    5
G1     6~15                1                         18                   21
G2     16~25               2                         33                   38
G3     26~                 3 or over                 27                   20
4.2 Analytic Procedure

To calculate the total time of eye movements, we classified the tasks by their search completion pages. The search completion page is determined from the position of the lowest search result looked at by each subject. Table 1 shows the classification of search completion pages by group. In this paper, we analyze users' activities when searching multiple search results pages; hence, we analyze the tasks which finished on the second page (G2) and on or after the third page (G3).

We use the length of time to analyze the eye movements. Even if a user's gaze falls on a certain search result, it does not necessarily mean that the user is interested in that result. Hence, we have to distinguish a focus (i.e., interest) within users' eye movements. In this paper, we defined a focus as the eye remaining on a certain search result for more than 100 ms.

To increase the correctness of the analysis, we also removed eye movements that stayed a long time at a particular position. When the user reads a search result intensively, the time of eye movements on the result greatly increases. However, this increase is not an effect of display position but of the content of the result itself, that is, the title of the web page, the snippet (description of the web page), and the URL. To identify a user's intensive reading, the average reading time of clicked search results was adopted as a threshold. Basically, users read a result before clicking it, to decide whether to move to the web page. Hence, it is reasonable to remove eye movements on search results that stay longer than the average time for clicked search results. In this experiment, the average time was 3.18 seconds in the Informational Task and 2.44 seconds in the Navigational Task.

4.3 Analysis Result

Figures 4 and 5 show the mean time of eye movements on each search result for the tasks classified as G2 and G3, respectively. The vertical axis shows the mean time of eye movements and the horizontal axis shows the rank of the results. In both groups, users tended to view search results longer in informational tasks than in navigational tasks. To evaluate the effects of search result positions on user activities in web search, we calculated the total time of eye movements on top search results and bottom search results. Table 2 shows the average time of eye movements for the three results that were ranked high and displayed at the bottom of a page (HB), and for the three results that were ranked low and displayed at the top of a page (LT), in G2. Table 3 shows the corresponding times for G3.
Fig. 4. Mean Time of Eye Movements on each Search Result in Informational Task (square) and Navigational Task (triangle) of G2
Fig. 5. Mean Time of Eye Movements on each Search Result in Informational Task (square) and Navigational Task (triangle) of G3

Table 2. Mean Time of Eye Movements for High-Rank Results Displayed in the Bottom Area (HB) and Low-Rank Results Displayed in the Top Area (LT) in G2

Group  Search Results  Informational (sec)  Navigational (sec)
LT_1   1~3             1.67                 1.24
HB_1   8~10            1.46                 1.04
LT_2   11~13           1.51                 0.96
HB_2   18~20           0.66                 0.79
Table 3. Mean Time of Eye Movements for High-Rank Results Displayed in the Bottom Area (HB) and Low-Rank Results Displayed in the Top Area (LT) in G3

Group  Search Results  Informational (sec)  Navigational (sec)
LT_1   1~3             1.38                 0.95
HB_1   8~10            1.56                 0.88
LT_2   11~13           1.43                 0.84
HB_2   18~20           1.25                 0.83
LT_3   21~23           1.48                 0.95
HB_3   28~30           0.75                 0.61
The tables show that the mean time of eye movements on LT is almost the same as that on HB, or longer in some cases (e.g., between HB_2 and LT_3 in G3). This result indicates that users did not view the search results in proportion to their page rank.
In particular, we focused on the top three search results (LT) of each page (see Figures 4 and 5). In Figure 4 (eye movements of G2), the mean time of eye movements on the first result of each page (ranks 1 and 11) was shorter than on the second and third results (ranks 2, 3, 12, and 13) for both task types. In G3 as well (Figure 5), eye movements on the first result of each page were shorter than on the second and third results.
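The dwell-time filtering of Section 4.2, which underlies the times in Tables 2 and 3, reduces to a simple pass over per-result dwell times. A sketch under an assumed data layout (not the authors' code):

```python
# Sketch (assumed data layout) of the Section 4.2 filtering: keep dwells
# over 100 ms as "focus", and drop dwells longer than the average reading
# time of clicked results (3.18 s informational, 2.44 s navigational).
def total_view_time(dwells, clicked_avg_s):
    """dwells: list of (result_rank, dwell_s); returns rank -> total seconds."""
    totals = {}
    for rank, dwell in dwells:
        if dwell <= 0.1 or dwell > clicked_avg_s:
            continue  # below focus threshold, or intensive reading
        totals[rank] = totals.get(rank, 0.0) + dwell
    return totals

# Example for an informational task (threshold 3.18 s): rank 1 is below
# the focus threshold and the long dwell on rank 3 counts as intensive reading.
print(total_view_time([(1, 0.05), (2, 1.4), (2, 0.8), (3, 4.0)], 3.18))
```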
5 Discussion

5.1 Effect of Task Differences

The results of the experiment show that users tend to view search results longer in informational tasks than in navigational tasks. In the Informational Task, users read the snippet of the search result and then decide whether or not to click it. On the other hand, in the Navigational Task, users read the title and URL of the result instead of the snippet to decide whether to click. Reading the title and URL requires less time than reading the snippet; therefore, the eye movements in Informational Tasks are longer than in Navigational Tasks. This result suggests that when users browse multiple web search results pages, they adopt different reading patterns for each task type.

5.2 Effect of Position within a Result Page/Screen

The results of the experiment show that the time of eye movements on LT is longer than on HB. This shows that users are influenced not only by the rank but also by the position of the search results within the results page. That is, the length of eye movements on search results is influenced by the position within a results page. The detailed analysis showed that the time of eye movements on the second and third results of each page was longer than on the first result. This suggests that users' eye movements are attracted by the position on the screen. In the experiment, users viewed the middle area of the screen more than the top area. The second and third results are displayed in the middle of the screen when the user arrives at a search results page, whereas the first result is displayed at the top of the screen. The first result is moved out of view by scrolling; hence, the users' interest moved to the second and third results. This assumption was supported by interviews with the subjects.

5.3 Design Implications

Using the results of the experiment, we propose a design that encourages users to browse the search results based on rank. To increase the time of users' eye movements, the results displayed at the bottom of the page should be emphasized to get more attention. In the Navigational Task, users concentrated their eye movements on the title of the web page or the URL, since the result page itself was the goal of the task. Hence, a thumbnail and/or an attribute (e.g., official page or blog) of the web page are useful information for users searching for a specific web site.
6 Conclusion

In this paper, we experimentally analyzed the effect of result position on search results pages. In the experiment, we measured users' eye movements during web search tasks to analyze how long users spend on each result of the results pages. We found that the results displayed at the bottom of a page were viewed for a shorter time than the results displayed at the top of the next page. We also found a tendency for the second and third results of each page to be viewed longer than the first result. As future work, we will analyze the effect of display position on the screen.

Acknowledgments. We would like to thank all the participants. In this paper, we used part of the data of the NTCIR-4 WEB task, which is sponsored by the National Institute of Informatics as organizer of the NTCIR-4 WEB task project.
References
1. Cutrell, E., Guan, Z.: What Are You Looking For?: An Eye-tracking Study of Information Usage in Web Search. In: CHI 2007: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 407–416 (2007)
2. Guan, Z., Cutrell, E.: An Eye Tracking Study of the Effect of Target Rank on Web Search. In: CHI 2007: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 417–420 (2007)
3. Jansen, B.J., Spink, A., Saracevic, T.: Real life, real users, and real needs: a study and analysis of user queries on the web. Inf. Process. Manage. 36(2), 207–227 (2000)
4. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)
5. Rose, D.E., Levinson, D.: Understanding User Goals in Web Search. In: WWW 2004: Proceedings of the 13th International Conference on World Wide Web, pp. 13–19 (2004)
6. Murata, T., Saito, K.: Extraction and Visualization of Web Users' Interests Using Site-Keyword Graphs. Journal of Japan Society for Fuzzy Theory and Intelligent Informatics 18(5), 701–715 (2006) (in Japanese)
7. Clarke, C.L.A., Pan, B., Agichtein, E., Dumais, S., White, R.W.: The Influence of Caption Features on Clickthrough Patterns in Web Search. In: SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 135–142 (2007)
8. Otsuka, S., Toyoda, M., Kitsuregawa, M.: A Study for Analysis of Web Access Logs with Web Communities. Transactions of Information Processing Society of Japan 44(18), 32–44 (2003) (in Japanese)
9. Lorigo, L., Pan, B., Hembrooke, H., Joachims, T., Granka, L., Gay, G.: The influence of task and gender on search and evaluation behavior using Google. Inf. Process. Manage. 42(4), 1123–1131 (2006)
10. Sakai, M., Nakamichi, N., Shima, K., Nakamura, M., Matsumoto, K.: WebTracer: A New Web Usability Evaluation Environment Using Gazing Point Information. Transactions of Information Processing Society of Japan 44(11), 2575–2586 (2003) (in Japanese)
11. Eguchi, K., Oyama, K., Aizawa, A., Ishikawa, H.: Overview of the Informational Retrieval Task at NTCIR-4 WEB. In: Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies: Information Retrieval, Question Answering and Summarization (2004)
12. Oyama, K., Eguchi, K., Ishikawa, H., Aizawa, A.: Overview of the NTCIR-4 WEB Navigational Retrieval Task 1. In: Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies: Information Retrieval, Question Answering and Summarization (2004)
Development of Estimation System for Concentrate Situation Using Acceleration Sensor Masashi Okubo and Aya Fujimura Doshisha University, 1-3 Miyakodani, Tatara, Kyotanabe, Kyoto, 610-0321, Japan [email protected], [email protected]
Abstract. Recently, training to increase one's powers of concentration has become popular. One reason is that it is difficult to concentrate on anything these days because of the flood of information. However, even if we train our concentration using how-to books and portable games, we cannot evaluate the training effect in practical life. In this paper, we propose an evaluation system for a user's powers of concentration in which a method for estimating the user's sitting situation is utilized. The system is composed of two methods: one estimates the sitting situation, and the other evaluates the user's concentration situation. Both methods use the user's motion, obtained from an acceleration sensor fixed on the chair. We also prepare three kinds of Graphical User Interface (GUI) that present the concentration situation to the user. Keywords: Powers of concentration, GUI, Sensory evaluation, Self-management.
However, even if our thinking ability and powers of concentration can be improved by such games and training, we cannot evaluate the effects in practical life. Methods using biosignals, including breathing, heartbeat, and brain waves, have been proposed to measure a person's powers of concentration [1]. For example, it has been found that a brain wave called frontal midline theta is generated when solving a problem and concentrating on abstract thinking. It has also been reported that transitions in a learner's skin potential level reflect his or her interest in lectures. However, this sort of method puts a heavy burden on the measured person, and it is costly and time-consuming. Therefore, in this study, assuming situations in which users are working with a computer, studying, or doing paperwork while seated, we propose an evaluation system for a user's powers of concentration that processes information obtained from a motion sensor fixed on a chair and presents the concentration situation to the user. Along with the development of the actual system, we also prepare three kinds of Graphical User Interface (GUI) and perform sensory evaluation experiments on the proposed system and an interface evaluation using these GUIs.
2 Hardware Configuration of Estimation System in Sitting Situation

The hardware configuration of the proposed system is shown in Fig. 1. The motion sensor measures the movement of a chair and sends acceleration data to a Personal Computer (PC) via Bluetooth. The user's sitting or standing-up situation and his or her powers of concentration are estimated on the PC from the received data, and the estimation result is presented to the user on a monitor.

Fig. 1. System configuration (user, chair with acceleration sensor, Bluetooth link to PC, monitor)
In the proposed system, we estimate the user's sitting situation on a revolving chair of the kind generally used in households and offices, as shown in Fig. 2. Since a revolving chair reflects the user's various movements, we can estimate those movements by measuring the acceleration of the chair's motions and swings with the motion sensor.
We use the remote controller of the Nintendo Wii game console as the motion sensor. This remote controller connects to the main unit via Bluetooth, so we can easily connect it to a PC. Since the Wii is available at a relatively low cost, it is already popular in many households. As shown in Fig. 2, we fix the remote controller on the upper side of the backrest of the chair, because the inclination of the backrest and the revolution of the seating face appear prominently there, and yet the controller does not disturb the user. The three-dimensional motion sensor is installed near the A button, located around the center of the remote controller's surface. Fig. 2 also shows the coordinate axes with the controller placed lengthwise: the X axis is the horizontal direction, the Y axis is the vertical direction, and the Z axis is the depth direction [2].
Fig. 2. Coordinates of the 3-axis accelerometer (X: horizontal, Y: vertical, Z: depth)
3 Software Configuration of Estimation System in Sitting Posture

3.1 Estimation of Time of Sitting and Leaving

The proposed system estimates whether the user is sitting down or standing up, as well as the user's powers of concentration while sitting. First, we propose a method for estimating whether the user is sitting down or standing up. Two kinds of methods can be used: the first relies on the momentary change of acceleration at the moment of sitting down or standing up; the second relies on the change of acceleration over time. The proposed system uses both methods to ensure reliability.

Estimation by Momentary Change of Acceleration. We sample the acceleration at about 100 Hz from the Wii remote controller via Bluetooth. In a preparatory experiment, we instructed several subjects to sit down on a chair with the remote controller fixed to it, work, and stand up, and we measured the acceleration of these movements. We found that the acceleration changes significantly at the moments of sitting down and standing up in all axis directions. However, relatively large accelerations are also measured on the X and Z axes at the moment of sitting down, so we estimate sitting down and standing up from the Y-axis acceleration, which shows less incidental change. Moreover, we focus on the momentary change of acceleration at the moments of sitting down and standing up, respectively. Fig. 3 shows an example of the transition of the Y-axis acceleration at these moments. When sitting down, the acceleration shifts greatly in the positive direction and then in the negative direction; this occurs because the seat front sinks down once as the subject is seated and then rises as a reaction. Conversely, when standing up, the acceleration shifts greatly in the negative direction and then in the positive direction, because the seat front sinks down as a reaction and then rises as the subject stands up. In short, the Y-axis acceleration shifts from positive to negative when sitting down and from negative to positive when standing up, and these shifts occur within 50 ms. Using these shift patterns of the acceleration, we estimate the subjects' sitting-down and standing-up situations.

Fig. 3. Transition of the Y-axis acceleration at (a) sitting down and (b) standing up
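The shift-pattern detection just described can be sketched in a few lines of Python (a minimal sketch, not the authors' implementation; the ±0.3 G thresholds are the values derived in the preparatory experiment described next, and the 100 Hz sampling rate is as reported above):

```python
import numpy as np

FS = 100        # sampling rate (Hz), as reported above
WINDOW = 0.05   # 50 ms window for the paired positive/negative peaks
THRESH = 0.3    # threshold in G, from the preparatory experiment below

def detect_transition(y_acc, state):
    """Scan Y-axis acceleration (in G, NumPy array) for a sit-down or
    stand-up event. state is 'standing' or 'sitting'; returns the new
    state, or None if no transition pattern is found."""
    y_acc = np.asarray(y_acc)
    n = int(WINDOW * FS)
    for i, a in enumerate(y_acc):
        if state == "standing" and a > THRESH:
            # sitting down: positive peak followed by a negative one within 50 ms
            if np.any(y_acc[i + 1:i + 1 + n] < -THRESH):
                return "sitting"
        elif state == "sitting" and a < -THRESH:
            # standing up: negative peak followed by a positive one within 50 ms
            if np.any(y_acc[i + 1:i + 1 + n] > THRESH):
                return "standing"
    return None
```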
We next performed a preparatory experiment to set the threshold value of the acceleration for distinguishing sitting down and standing up from movements made while sitting and working. Among the nine subjects, a maximum value of 0.27 G was measured for two subjects and a minimum value of -0.23 G for another subject. We therefore set the thresholds for estimating sitting down and standing up at 0.3 G in the positive direction and -0.3 G in the negative direction. To summarize the estimation method using the momentary acceleration change: if an acceleration above 0.3 G is measured while the subject is standing, followed by an acceleration below -0.3 G within 50 ms, we estimate that the subject is sitting down; if an acceleration below -0.3 G is measured while the subject is sitting, followed by an acceleration above 0.3 G within 50 ms, we estimate that the subject is standing up.

Estimation Using the Acceleration Changes with Time Course. The acceleration measured at the moments of sitting down and standing up varies between individuals, and it is sometimes difficult to tell whether a given acceleration was caused by sitting down or standing up or by movements made while seated. For example, according to the maximum and minimum Y-axis accelerations, one subject showed only small changes when sitting down: 0.11 G and -0.19 G. If the acceleration when sitting down or standing up is between 0.1 G and 0.3 G, or between -0.3 G and -0.1 G, estimation from the momentary acceleration alone is inaccurate. Therefore, we also use an estimation method based on the power spectrum obtained by repeatedly sampling the acceleration over a certain period of time. First, we compute the power spectrum by applying a Fourier transform to 256 samples, covering about 2.5 s of acceleration on the three axes. The frequency component at 0.01 Hz, regarded as noise, is eliminated. We find that the power spectrum is barely present in any frequency domain when the user is absent but appears in many frequency domains while the user is sitting. We then add up the power spectrum values, excluding the 0.01 Hz component, and judge between absence and sitting by this sum. Fig. 4 shows the transition of the sum of the 3-axis acceleration power spectrum during absence, sitting, and working, using acceleration data obtained in the preparatory experiment.

Fig. 4. Transition of the sum of the 3-axis acceleration power spectrum (X-, Y-, and Z-axis traces over time; annotated events: sitting down, working, rotation of the seat, inclining the backrest, standing up, absence)
Comparing absence and working in Fig. 4, the sum of the Y-axis acceleration power spectrum changes little compared with that of the other axes, whereas the sums for the X and Z axes change greatly depending on the user's motion. For example, the sum of the X-axis acceleration power spectrum increases when the user rotates the seat, and the sum of the Z-axis acceleration power spectrum increases when the user inclines the backrest. To summarize the estimation method using the power spectrum: if an acceleration between -0.3 G and -0.1 G is measured during absence and the sum of the power spectrum subsequently exceeds 50, we estimate "sitting down"; otherwise, we estimate "absence". Conversely, if an acceleration between -0.3 G and -0.1 G, or between 0.1 G and 0.3 G, is measured and the sum of the power spectrum is between 5 s and 10 s, we estimate "standing up"; otherwise, we estimate "working".
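A minimal sketch of the spectrum computation and the absence/sitting judgment follows (not the authors' code; the 256-sample window, the discarded lowest-frequency component, and the threshold of 50 come from the text above, while treating the DC bin as the "0.01 Hz" noise component is our reading of that remark):

```python
import numpy as np

FS = 100  # sampling rate in Hz, as reported above
N = 256   # window length (~2.56 s at 100 Hz)

def spectrum_sum(acc_window):
    """Sum of the power spectrum of one axis over a 256-sample window,
    dropping the lowest-frequency (near-DC) bin as noise."""
    spec = np.abs(np.fft.rfft(acc_window, n=N)) ** 2
    return spec[1:].sum()

def absent_or_sitting(x, y, z):
    """Judge absence vs. sitting from the summed 3-axis power spectrum,
    using the threshold of 50 given in the text."""
    total = spectrum_sum(x) + spectrum_sum(y) + spectrum_sum(z)
    return "sitting" if total > 50 else "absent"
```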
3.2 Relationship between Sitting Situation and Power Spectrum

Experiment Objective. Generally, a person's motion tends to lessen markedly when he or she is concentrating. That is, when the user concentrates, the sum of the X-axis and Z-axis acceleration power spectrum obtained from the motion sensor fixed on the chair (hereinafter, the sum of the acceleration power spectrum) is likely to become smaller. We performed preparatory experiments to examine this assumption, in which the subjects were instructed to type on a computer so that concentration could be evaluated quantitatively.

Experiment and Evaluation Method. In this experiment we examined the relationship between the sum of the acceleration power spectrum and the typing speed per unit of time while three subjects typed on a computer for 30 minutes. The subjects were instructed to sit on the chair for 30 minutes and either type or take a rest. Excluding the first and last minute of each 30-minute recording, we divided the remaining 28 minutes of data into 1-minute segments and computed the cross-correlation functions.

Evaluation Result. The transitions of the sum of the power spectrum and the typing speed, together with examples of the cross-correlation function, are shown in Fig. 5. According to Fig. 5(a), the sum of the power spectrum increases when the typing speed becomes 0, that is, when the subject takes a rest; Fig. 5(b) shows that, correspondingly, a negative correlation of approximately -0.7 is obtained. Fig. 5(c) shows the decrease of the sum of the power spectrum when the typing speed increases, and Fig. 5(d) shows that a negative correlation of approximately -0.7 is likewise obtained. The average and standard deviation of the minimum of the cross-correlation function for the three subjects are shown in Fig. 6; an average negative correlation stronger than -0.4 can be seen in each case. We thus find that a strong negative cross-correlation exists between the sum of the power spectrum of the user's motion and the typing speed per unit of time, supporting the validity of estimating that subjects are concentrating when the sum of the power spectrum becomes small.

Fig. 5. Cross-correlation functions (b), (d) between the power spectrum of the user's motion and the typing speed (a), (c)
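The cross-correlation computation reported above can be sketched as follows (a minimal sketch assuming both series are sampled over the same 1-minute intervals; this is not the authors' code, and the normalization shown is one common choice):

```python
import numpy as np

def normalized_xcorr(power_sums, typing_speeds):
    """Normalized cross-correlation between per-interval sums of the
    motion power spectrum and typing speed (Section 3.2). Returns
    (lags, correlation); a minimum near -0.7 around lag 0 would match
    the strong negative relationship reported in the text."""
    x = (np.asarray(power_sums, float) - np.mean(power_sums)) / np.std(power_sums)
    y = (np.asarray(typing_speeds, float) - np.mean(typing_speeds)) / np.std(typing_speeds)
    corr = np.correlate(x, y, mode="full") / len(x)
    lags = np.arange(-len(x) + 1, len(x))
    return lags, corr
```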
Fig. 6. Average and standard deviation of the minimum of the cross-correlation function (Subjects A, B, and C)
3.3 Interface Set-Up

Recently, cars with instantaneous fuel-consumption meters installed have become popular on the market. Drivers are likely to ease off the gas pedal because the meter lets them know their situation at each moment and review their own behavior. The same idea underlies the proposed system: the situation of the sitting user is depicted with an avatar so that the user can monitor his or her own situation objectively. Four types of avatar are prepared ("NOT AVAILABLE", "AVAILABLE", "WORKING", "PLAYING"), each in two patterns, "male" (upper) and "female" (lower).

If the system is to be used as a self-management tool, it becomes important to present not only whether the user is concentrating, but also the degree of concentration. Therefore, in addition to the avatar, we created a line graph that presents the powers of concentration over the long term and a level meter that describes the momentary powers of concentration, yielding three patterns of GUI. Fig. 7(a) shows the GUI displaying the avatar only. In Fig. 7(b), a GUI that presents the transition of the powers of concentration is added to the avatar. Since we assume that the smaller the sum of the power spectrum, the higher the degree of concentration, we set an upper limit of 500 on the sum of the power spectrum and plot 500 minus the sum, so that the line goes up when the degree of concentration becomes high. The graph in the lower right of the GUI presents the powers of concentration over the previous 30 s, and the graph in the upper part of the GUI presents the powers of concentration over the last few minutes so that users can check the record. In Fig. 7(c), a level meter that displays the momentary change of the powers of concentration nonlinearly is added: even if the sum of the power spectrum is small because the user is concentrating, the up-and-down movement of the meter's scale can still be confirmed visually, because the sum per scale division is decreased.

Fig. 7. Graphical User Interface: (a) GUI (avatar only), (b) GUI (with line graph), (c) GUI (with level meter)
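A minimal sketch of this display mapping (the ceiling of 500 is from the text; clipping negative values to zero is our assumption):

```python
CEILING = 500  # upper limit set on the sum of the power spectrum (Section 3.3)

def concentration_value(power_sum):
    """Value plotted on the line graph: 500 minus the power-spectrum sum,
    clipped at 0, so that less motion (more concentration) plots higher."""
    return CEILING - min(power_sum, CEILING)
```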
4 System Sensory Evaluation Experiment

4.1 Experiment Objective

We examine the usefulness and usability of the proposed system and the understandability of the powers of concentration presented in the three types of GUI. Specifically, we confirm the operation of the estimation system for the powers of concentration and perform sensory evaluation experiments on the three types of GUI described above.

4.2 Experiment Conditions and Methods

In the experiment, ten subjects (five males and five females in their twenties) were instructed to sit on chairs with the Wii remote controller fixed to them and to type on computers running the proposed system (Fig. 8). They were asked to type letters printed on paper into a word-processing program for five minutes. The GUI was displayed in the upper left of the screen so that it would not disturb their work and the subjects could check it while working.

Fig. 8. Experimental scene
The experiments were conducted under four conditions: (1) without the proposed system, (2) with the GUI using only the avatar, (3) with the GUI using the avatar and line graph, and (4) with the GUI using the avatar and level meter. To account for order effects, the order of the conditions was changed for every subject. After the experiment, the subjects answered five-grade evaluation questionnaires.

4.3 Result of Experiment

The average and standard deviation obtained by scoring the answers to the questionnaire on the understandability of the powers of concentration and the interface are shown in
Fig. 9. These results show that, compared with having no system or the GUI with the avatar only, the subjects found it easier to understand their degree of concentration with the GUI with the line graph and with the level meter. We also conducted a sign test for all combinations in all questionnaires. For the understandability of the powers of concentration, significant differences at the 5% significance level were found between having no system and the line graph, and between having no system and the level meter. This result also shows that the GUIs using the line graph and the level meter make the powers of concentration easier to understand than having no system.

4.4 Discussion

As mentioned above, the GUI with the line graph and the one with the level meter were preferred about equally. The subjects who preferred the line graph valued the ability to check the record, while those who preferred the level meter valued the ability to understand the powers of concentration. However, some subjects said that they were preoccupied by the system and could not concentrate. A system in which users can choose their favorite GUI and save the record to look back over the day is therefore desirable.

Fig. 9. Result of the questionnaire on the understandability of the powers of concentration (scale: no good / neutral / good; conditions: no system, avatar only, line graph, level meter; P < 0.05)
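For reference, the sign test mentioned in Section 4.3 can be computed as follows (a minimal sketch; SciPy is assumed to be available, and the paper does not describe its exact computation):

```python
from scipy.stats import binomtest

def sign_test(scores_a, scores_b):
    """Two-sided sign test between paired five-grade questionnaire scores
    for two conditions, ignoring tied pairs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    if not diffs:
        return 1.0  # all pairs tied: no evidence of a difference
    n_pos = sum(d > 0 for d in diffs)
    return binomtest(n_pos, n=len(diffs), p=0.5).pvalue
```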
5 Conclusion

This study aims to let users conduct self-management by presenting their situation to them while they use the system. We developed a system that estimates the user's powers of concentration and the sitting-down/standing-up situation, and presents the powers of concentration to the user through an interface. We then conducted sensory evaluation experiments to assess the operational effectiveness of the system and interface. As a result, the interface using the line graph, which presents the transition of the long-term powers of concentration, and the interface using the level meter, which presents the transition of the momentary powers of concentration, were preferred equally by the users.
The proposed system was developed so that users can review their own behavior by understanding their own situation. Although we consider that the system can facilitate concentration, long hours of concentration produce stress and can harm mental and physical health. We therefore believe it is possible to build on the system to encourage appropriate rests during long periods of concentration. We will validate its effectiveness in long-term experiments, along with improvements to the existing system and interface.
References
1. Tamura, H.: Human Interface. Ohmsha, 44–68 (1998) (in Japanese)
2. http://www.wiili.org/index.php/Wiimote
Psychophysiology as a Tool for HCI Research: Promises and Pitfalls

Byungho Park

Graduate School of Information and Media Management, Korea Advanced Institute of Science and Technology (KAIST), 207-43 Cheongryangri 2-dong, Dongdaemun-gu, Seoul, 130-722, Korea
[email protected]
Abstract. Psychophysiology, an area of psychology that measures an individual's physiological responses to infer his or her psychological state, can provide a set of useful measures that HCI researchers can take advantage of. However, the method has inherent limitations and leaves room for misinterpretation. This paper introduces psychophysiology, shows how the research methods it offers can be used for HCI research, and discusses the advantages and disadvantages of using research tools from psychophysiology.
2 Psychophysiology and HCI

2.1 Psychophysiology as an Academic Discipline

Psychophysiology is an academic discipline that studies the interrelationships between the physiological and psychological aspects of human behavior [15]. A typical study in this area observes a person's physiological responses to understand his or her psychological state and/or its changes. Due to its interdisciplinary nature, it incorporates research from various disciplines including, but not limited to, psychology, cognitive science, medicine, anatomy, and neuroscience. The types of physiological responses used in psychophysiology studies include blood flow patterns in the brain (functional magnetic resonance imaging, fMRI), heart rate variability, sweat production (skin conductance, SC), respiration, electroencephalography (EEG, commonly known as 'brain waves'), muscle contraction (electromyography, EMG), eye movement, and much more.
3 Useful Psychophysiological Research Techniques As mentioned above, a large variety of physiological responses are subject for study in psychophysiology. This section will introduce a few of them that HCI researchers are more familiar with, or may find more useful than others.
3.1 Sweat Production (SC) and Stress

It is known that people sweat under stress. This is also true when computer users are stressed by the difficulty of a given task. To find out how easy or difficult an interface is to use, HCI researchers may therefore measure the subject's sweat production, or skin conductance (SC) [11, 16]. Physiologically, skin conductance is directly related to the activation of the sympathetic branch of the central nervous system. This is convenient because it means skin conductance is independent of the activation of the parasympathetic system; the activity of organs under the influence of both systems is easy to misinterpret (for example, a human heart may beat faster because the sympathetic system is activated, because the parasympathetic system is deactivated, or both).

However, skin conductance has downsides. One is sensor (electrode) placement. Sweat glands in humans are concentrated in the palms and soles, and many computer-related tasks and interfaces require the users' hands. Collecting skin conductance data from the palm may restrict subjects from using both hands freely, and in tasks that require both hands (such as typing text with a keyboard), using the palm is out of the question. Alternatively, the sole of the foot may be used, but this is inconvenient, since subjects must remove their socks and keep the foot lifted throughout the whole session to prevent the electrodes from touching the floor. Another drawback of using skin conductance as an index of task (or interface) difficulty is its slow response speed. Rather than producing sweat constantly, human sweat glands 'spout out' sweat, and it takes about six seconds for them to respond to an arousing event, though the exact timing varies by individual. This makes it hard for researchers to pinpoint the exact event that causes stress in the subject. For this reason, comparing the total amount of sweat or the average skin conductance level within a subject across different tasks (each taking longer than 10 seconds to complete) is recommended, rather than attempting to identify the exact moment that causes stress [15].

3.2 Heart Rate (HR) Variability and Attention

The human heart is affected by both the sympathetic and parasympathetic branches of the central nervous system, which means its activity reflects both arousal and attention to external stimuli. Research shows that the heart beat slows down when one is paying attention to a presented stimulus [17]. This can be explained from an evolutionary psychology perspective: when an unknown change is noticed in the environment, the body automatically responds by calming down (or slowing) the activity of the internal organs until a proper assessment of the situation is made. If the change turns out to be life-threatening, the famous 'fight or flight' reaction follows, including intense sweat gland activity and quick acceleration of the heart rate. If the change poses no harm, the heart rate speeds up again, but to no more than its normal rate.
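A minimal sketch of how such a deceleration can be read off beat timestamps (illustrative only; beat detection from ECG or PPG is assumed already done, and the 5 bpm criterion is a placeholder of ours, not a value from the text):

```python
import numpy as np

def beat_to_beat_hr(beat_times):
    """Instantaneous heart rate (bpm) from beat timestamps in seconds."""
    ibi = np.diff(np.asarray(beat_times))  # inter-beat intervals (s)
    return 60.0 / ibi

def deceleration_indices(hr, baseline_bpm, drop_bpm=5.0):
    """Indices where HR falls at least drop_bpm below a resting baseline,
    a crude marker of the attention-related deceleration described above."""
    return np.where(hr < baseline_bpm - drop_bpm)[0]
```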
It is important to acknowledge that deceleration of the heart rate is an index of external attention. Measuring external attention is useful for HCI research, since it indicates whether the subject is paying attention to the menu on the computer screen, or whether that gentle chime has actually caught the subject's attention. The other type of attention is internal attention, which occurs when one puts mental effort into, say, solving math questions [9]; when this happens, heart rate acceleration is observed. Hence, it is important to consider the context in which the heart rate data were collected and to determine whether a rising heart rate is an acceleration produced by internal attention, or simply the end of a deceleration as attention lapses. The subject's arousal level must also be considered: excitement arouses the sympathetic system, which makes the heart beat faster. When analyzing heart rate data, context matters a great deal.

There are two ways to measure heart rate. One is to measure the electrical pulse produced by the heart every time it contracts and pumps blood out to the body; this measurement is the electrocardiogram (ECG). The other is to measure the blood flowing in and out of the tip of a finger or toe, which is called photoplethysmography (PPG). PPG is typically measured by emitting infrared light into the skin: the level of light absorption changes with the amount of blood flowing underneath, and this is used to derive the heart rate. ECG monitors the electrical activity of the heart, while PPG monitors its mechanical activity. Ideally, both would serve as equally good indices of heart activity, but neither is perfect. ECG requires three electrodes, attached to both arms, both legs, or the chest. Chest placement is rarely used outside a hospital setting, though it provides the clearest ECG. Attaching electrodes to the arms or legs is more practical for HCI research, but the distance from the heart makes the pulse weaker and the signal more vulnerable to noise (caused by the subject's body movement and internal organ activity). In my experience, arms tend to provide cleaner ECG than legs. PPG needs only one sensor, attached to a finger or toe; fingers are generally better than toes for collecting PPG data because they are closer to the heart. Nevertheless, if the subject's heart is relatively weak and the hand or foot is positioned so that blood flows in poorly, there may not be enough blood flow for the infrared sensor to register the PPG. Both methods are therefore somewhat vulnerable to errors, especially missed heart beats, but many of these issues can be prevented by preparing the best possible setting for the subject. As long as the study conditions are good, ECG is as reliable as PPG [13]. Sensor placement, again, can be a problem: ECG requires three locations on the arms or legs, while PPG requires only one, but it must be a finger or a toe, and it is best to keep the sensors from touching hard surfaces. As a result, the type of heart rate data to collect and the sensor placement depend on the type of task given to the subjects.

3.3 Eye Tracking and Attention

As the maxim "the eye is the window to the mind" suggests, it has long been thought that human gaze reflects the top priority of cognitive processes [6]. This is one of the reasons HCI researchers consider eye tracking to have great potential as a research tool. In the past, electrooculography (EOG), which measures the resting potential of the retina, was used to track eye movements; however, EOG can only show the general direction in which the eye has moved, not exactly where the gaze is fixed [6, 15]. With advances in real-time optical data processing, monitoring of the retina using infrared light has become more sophisticated. The data produced by eye tracking are useful not only because they are quantitative and can be subjected to statistical analysis, but also because they can be visualized in ways that are intuitive for an audience. One visualization is the heat map, which color-codes each area of the computer screen, usually coloring red the areas where eye fixation lasted longest and dark blue (or no color) the areas with little or no gaze. The other is an animation over time of which part of the screen the gaze has moved to (which some companies call "gaze replay"). When used properly, both visualization techniques are powerful enough to convince an audience with little or no knowledge of statistics [3, 4, 6].

In the past, eye tracking equipment required a head-mounted camera that monitored retina movement, and also a head mount to fix the subject's head, since the equipment could not adjust to head movements. Today, there is equipment on the market that does not require head-mounted cameras, some of it capable of adjusting to minor (and gentle) head movements. However, the eye tracking technology available today still has its limitations. The largest challenge is that it is hard to use with subjects who have seriously bad vision; not because of the vision itself, but because of the lenses (both glasses and contact lenses). In theory, the equipment should be adjustable for such lenses, but in practice virtually no machine on the market provides such calibration, although quite a few automatically adjust to subjects who wear glasses for relatively light myopia. The inability to track gaze through sudden and/or fast head movements is another limitation of contemporary eye tracking technology.
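As a rough illustration of how a heat map of the kind described above can be built from fixation data, here is a minimal sketch (fixations are assumed to be given as (x, y, duration) tuples in screen pixels; the screen and cell sizes are arbitrary choices, not values from the text):

```python
import numpy as np

def fixation_heatmap(fixations, screen_w=1024, screen_h=768, cell=32):
    """Accumulate fixation durations into a coarse grid over the screen.
    fixations: iterable of (x, y, duration_seconds) tuples.
    Returns a 2-D array that can be color-mapped (red = longest fixation)."""
    grid = np.zeros((screen_h // cell, screen_w // cell))
    for x, y, dur in fixations:
        row, col = int(y) // cell, int(x) // cell
        if 0 <= row < grid.shape[0] and 0 <= col < grid.shape[1]:
            grid[row, col] += dur
    return grid / grid.max() if grid.max() > 0 else grid
```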
activated when one is experiencing other types of non-positive emotion (e.g., One screaming “Eeeek!” out of disgust), so the data has to be analyzed with caution. The Orbicularis oculi is another muscle group that is getting attention as an alternative for the Zygomatic major muscle group as an index for positive emotion. The Orbicularis oculi muscle group is located right below the lower eye lids, and it is primarily responsible for eye blinks. It is also responsible for gathering of skin around the eyes when one is smiling, which is called the Duchenne smile. A Duchenne smile involves contraction of both the zygomatic major muscle group and the orbicularis oculi muscle group, and it is seen as an indicator of one experiencing true happiness (a Non-Duchenne smile only involves the zygomatic major muscle group and interpreted as an indicator of social smiling which does not involve happiness; for review, see Ekman, Davidson, and Friesen [2]). One of the challenges using facial EMG is that the muscle groups are small and the electrical activity is very weak. Especially, orbicularis oculi muscle group is so small that researchers are forced to place two mini electrodes (diameter of 4-millimeters) very close (about 5-millimeters), which makes it vulnerable to errors during data collection.
4 Conclusion: No One Measure Is Perfect

This paper has reviewed skin conductance, heart rate, eye tracking, and facial EMG as tools for HCI research. There are, of course, other tools, such as electroencephalography (EEG, also known as 'brain waves'), functional magnetic resonance imaging (fMRI, also known as 'brain imaging'), and many more. Each has its own unique advantages and disadvantages as a research tool. Though psychophysiology promises useful research tools for HCI, some limitations apply to all of them [8]. The largest one is external validity. As mentioned above, skin conductance, heart rate, and facial EMG all require some kind of sensor to be attached to parts of the subject's body, and this adds an awkward feeling to the already unnatural setting (e.g., being put into a usability lab with a one-way window and, possibly, a video camera rolling) the subject may experience during a session. Eye tracking is no less artificial than the other measurements: even the least intrusive technology still requires participants to keep their head movements minimal and, if needed, to move gently. The subject has to be reminded of this, which inevitably causes unnatural tension that, depending on the type of study, may have a fairly large negative impact on the data collected.

Another limitation is that the data collected are open to interpretation, though many people tend to believe that psychophysiology provides absolutely objective data and interpretations. For example, Hornbaek (2005) argues that physiological measures are objective, using "physiological measures of fun in playing computer games" (p. 92) as an example. Unfortunately, this view can be wrong, especially in the context of HCI research. Studies in psychology tend to use simple stimuli, which helps keep unexpected factors from interfering. But in many cases, HCI research has to use real-life equipment and/or interfaces as stimuli, which are much more complex than the stimuli used in a typical psychology experiment. So the data collected always have to be interpreted with the context in mind. Could the heart rate have picked up speed because the subject was not paying attention, or because the subject was puzzled by the menu system and had to think about it? Was the subject sweating more because the interface was hard to use, or because there was an image of a huge spider on the screen? When designing a session, it is best to eliminate all factors that might cause unwanted interference during data collection, and even after the data have been collected, it is best to go back and consider whether there is any room for alternative explanations of particular physiological responses. Researchers should also keep in mind that, though the research tools provided by psychophysiology may offer new insights, none of them is perfect by itself, and they are best used in combination with other research methods.

In his online column, Jakob Nielsen, a well-known HCI guru and consultant, advises design agencies seeking ways to convince clients to pay for usability testing that "using sound methodology is the true sign of professionalism" and that they should "point out usability's astounding return on investment" [10]. Including psychophysiology in the set of available tools may also be a good idea for design agencies aiming to improve the usability of the final product. Psychophysiology should be able to make good friends with both HCI researchers and HCI practitioners. It is just a matter of choosing the right tool and applying it the right way to get the right answer to the right question.
References
[1] Allanson, J., Wilson, G.M.: Physiological Computing. In: Proceedings of the Conference on Human Factors in Computing Systems (CHI 2002), pp. 912–913 (2002)
[2] Ekman, P., Davidson, R.J., Friesen, W.V.: The Duchenne Smile: Emotional Expression and Brain Physiology II. Journal of Personality and Social Psychology 58(2), 342–353 (1990)
[3] Hewig, J., Trippe, R.H., Hecht, H., Straube, T., Miltner, W.H.R.: Gender Differences for Specific Body Regions When Looking at Men and Women. Journal of Nonverbal Behavior 32, 67–78 (2008)
[4] Jacob, R.J.K., Karn, K.S.: Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. In: Hyona, Radach, Deubel (eds.) The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, Oxford, England (2003)
[5] Jones, C.M., Dlay, S.S.: The Face as an Interface: The New Paradigm for HCI. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), vol. 1, pp. 774–779 (1999)
[6] Just, M.A., Carpenter, P.A.: Eye fixations and cognitive processes. Cognitive Psychology 8, 441–480 (1976)
[7] Kim, S., Godbole, A., Huang, R., Panchadhar, R., Smari, W.: Toward an Integrated Human-centered Knowledge-based Collaborative Decision Making System. In: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, Las Vegas, NV, November 2004, pp. 394–401 (2004)
[8] Lin, T., Hu, W., Omata, M., Imamiya, A.: Do physiological data relate to traditional usability indexes? In: Proceedings of the 17th Australia Conference on Computer-Human Interaction: Citizens Online: Considerations for Today and the Future (OZCHI), vol. 122, pp. 1–10 (2005)
[9] Mansell, W., Clark, D.M., Ehlers, A.: Internal versus external attention in social anxiety: An investigation using a novel paradigm. Behaviour Research and Therapy 41(5), 555–572 (2003)
[10] Nielsen, J.: Convincing Clients to Pay for Usability (May 19, 2003), http://www.useit.com/alertbox/20030519.html
[11] Pecchinenda, A., Smith, C.A.: The affective significance of skin conductance activity during a difficult problem-solving task. Cognition and Emotion 10(5), 481–504 (1996)
[12] Prendinger, H., Mori, J., Ishizuka, M.: Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International Journal of Human-Computer Studies 62, 231–245 (2005)
[13] Selvaraj, N., Jaryal, A., Santhosh, J., Deepak, K.K., Anand, S.: Assessment of heart rate variability derived from finger-tip photoplethysmography as compared to electrocardiography. Journal of Medical Engineering & Technology 32(6), 479–484 (2008)
[14] Spiegel, D.: Human Computer Interaction (October 22, 1998), http://xenia.media.mit.edu/~spiegel/papers/HCI.pdf
[15] Stern, R., Ray, W.J., Quigley, K.S.: Psychophysiological Recording, 2nd edn. Oxford University Press, New York (2001)
[16] Svebak, S.: The effect of task difficulty and threat of aversive electric shock upon tonic physiological changes. Biological Psychology 14(1-2), 113–128 (1982)
[17] Ward, R.D., Marsden, P.H.: Physiological responses to different WEB page designs. International Journal of Human-Computer Studies 59, 199–212 (2003)
Assessing NeuroSky’s Usability to Detect Attention Levels in an Assessment Exercise Genaro Rebolledo-Mendez1,3, Ian Dunwell1, Erika A. Martínez-Mirón2, María Dolores Vargas-Cerdán3, Sara de Freitas1, Fotis Liarokapis4, and Alma R. García-Gaona3 1
Serious Games Institute, Coventry University, UK 2 CCADET, UNAM, Mexico 3 Facultad de Estadistica e Informatica, Universidad Veracruzana, Mexico 4 Interactive Worlds Applied Research Group, Coventry University, UK {GRebolledoMendez,IDunwell,Sfreitas, F.Liarokapis}@cad.coventry.ac.uk, [email protected], {dvargas,agarcia}@uv.mx
Abstract. This paper presents the results of a usability evaluation of NeuroSky's MindSet (MS). Until recently, most Brain Computer Interfaces (BCI) have been designed for clinical and research purposes, partly due to their size and complexity. However, a new generation of consumer-oriented BCI has appeared for the video game industry. The MS, a headset with a single electrode, is based on electroencephalogram (EEG) readings that capture faint electrical signals generated by neural activity. The electrical signal across the electrode is measured to determine levels of attention (based on alpha waveforms) and then translated into binary data. This paper presents the results of an evaluation assessing the usability of the MS, in which a model of attention is defined to fuse attention signals with user-generated data in a Second Life assessment exercise. The results suggest that the MS provides accurate readings of attention, since there is a positive correlation between measured and self-reported attention levels. They also suggest that there are some usability and technical problems with its operation. Future research is presented, consisting of the definition of a standardized reading methodology and of an algorithm to level out the natural fluctuation of users' attention levels if they are to be used as inputs.
person interacting with it. The MS1 also provides a measurement of the user's meditative state (derived from alpha wave activity). In this paper, however, only the levels of attention are used, given their role and importance in educational settings. The objective of this study is threefold: first, the general usability of the MS is examined; second, we analyze how well information generated as part of normal interactions can be fused with brain activity; third, we analyze the MS's adaptability to different able users. The significance of this work lies in presenting evidence of the usability of a commercially available BCI and of its suitability for incorporation into serious games. The paper is organized in five sections. Section 2 presents a literature review of Brain Computer Interfaces and their use for learning. Section 3 describes the assessment exercise used as a test bed and presents the materials, participants, and methodology followed during the evaluation. Section 4 presents the results of the evaluation, and section 5 provides the conclusions and future research.
2 Brain Computer Interfaces (BCI)

Brain Computer Interface (BCI) technology represents a rapidly emerging field of research, with applications ranging from prosthetics and control systems [6] through to medical diagnostics. This study only considers BCI technologies that use sensors measuring and interpreting brain activity (commonly termed neural bio-recorders [14]) as a source of input. The longest-established method of neural bio-recording, developed in 1927 by Berger [3], is the application of electrodes that measure the changes in field potential over time arising from synaptic currents; this forms the basis for EEG. In the last two decades, advances in medical imaging technology have presented a variety of alternative means of bio-recording, such as functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), and positron emission tomography (PET). A fundamental difference between bio-recording technologies used for diagnostic imaging and those used for BCI applications is the typical requirement for real or quasi-real-time performance in order to translate user input into interactive responses. In 2003, a taxonomy by Mason and Birch [8] identified MEG, PET, and fMRI as unsuitable for BCI applications, due to the equipment required to perform and analyze the scan in real time, but more recent attempts to use fMRI as a BCI input device have demonstrated significant future potential in this area [12]. Bio-recording BCIs have become a topic of research interest both as a means of obtaining user input and for studying responses to stimuli. Several studies have already demonstrated the ability of an EEG-based BCI to control a simple pointing device similar to a mouse [9, 12], and advancing these systems to give users more accurate and responsive control is a significant area for research. Of particular interest to this study is the use of BCI technologies in learning-related applications. The recent use of fMRI to decode mental [4] and cognitive [11] states illustrates a definite capability to measure affect through bio-recording, but the intrusiveness of the scanning equipment makes it difficult to utilize the information gained to provide feedback to a user performing typical real-world learning activities. In this study, the effectiveness of one of the first commercially available lightweight EEG devices, NeuroSky's MS, is considered. Via the application of a single
1 The MS is a developer-only headset. NeuroSky's newest headset has been designed to address comfort and fitting problems and is available to both developers and consumers.
Assessing NeuroSky’s Usability to Detect Attention Levels
151
electrode and signal-processing unit in a headband arrangement, the MS provides two 100-state outputs operating at 1 Hz. These outputs are described by the developers as separate measures of 'attention' and 'meditation', and it is thus assumed these readings are inferred from processing beta and alpha wave activity, respectively. Although the MS provides a much coarser picture of brain activity than multi-electrode EEG or the other aforementioned technologies, its principal advantage is its unobtrusive nature, which minimises the aforementioned difficulties in conducting accurate user studies caused by the stress or distraction induced by the scanning process. Research into EEG biofeedback as a tool to aid individuals with learning difficulties [5] represents an area of ongoing study, and the future widespread availability of devices similar to the MS to home users presents an interesting opportunity to utilize these technologies in broader applications.
3 An Assessment Exercise in Second Life

An assessment exercise was developed to examine the MS. The exercise works in combination with a model of attention [10] built around dynamic variables generated by the learner's brain (MS inputs) and the learner's actions in a computer-based learning situation. Combining physiological (attention) variables with data variables is not new [7, 1]. Our approach, however, fuses MS readings (providing a more accurate reading of the learner's attention based on neural activity) with user-generated data: attention readings are combined with information such as the number of questions answered correctly (or incorrectly) and the time taken to answer each question to model attention within the assessment exercise. The MS reports attention levels on an arbitrary scale ranging from 0 to 100. There is an initial delay of between 7 and 10 seconds before the first value reaches the computer, and new attention values are calculated at a rate of 1 Hz (one value per second; see Figure 1). A value of -3 indicates that no signal is being read, and values equal to or greater than 0 indicate increasing levels of attention, with a maximum of 100. Given the dynamic nature of the attention patterns and the potentially large data sets obtained, the model of attention underpinning the assessment exercise is associated with a particular learning episode lasting more than one second. The model of attention not only determines (detects) attention patterns but also provides (reacts with) feedback to the learner [10].

The assessment exercise presents a Second Life AI-driven avatar able to pose questions, use a pre-defined set of reactions, and hold limited conversations with learners in Second Life. The AI-driven avatar was programmed in C# in combination with the libsecondlife library, a project aimed at understanding and extending Second Life's client to allow the programming of features using the C# programming language. This tool enables the manipulation of avatars' behaviors so that they respond to other avatars. To do so, the AI-driven avatar collects user-generated data, including MS inputs, during the interaction. The current implementation of the AI-driven avatar asks questions in a multiple-choice
format, while dynamically collecting information (answers to questions, time taken to respond, and whether users fail to answer). The data generated by the MS are transmitted to the computer via a USB interface and organized via a C# class that communicates with the AI-driven avatar. In this way, the model of attention is updated dynamically, considering input from the MS as well as the learner's performance behavior, while underpinning the AI-driven avatar's behavior.

Fig. 1. Attention readings as read by the NeuroSky

For the purposes of assessing the MS's usability, the assessment exercise consisted of ten questions in the area of Informatics, specifically on algorithms. This area was targeted because first-year students in the Informatics department often struggle with the conception and definition of algorithms, a fundamental part of programming. The assessment exercise asked nine theoretical questions, each with three possible answers. For example, the avatar would ask 'What do you call a finite and ordered number of steps to solve a computational problem?' while offering 'a) Program, b) Algorithm, c) Programming language' as possible answers. The assessment exercise also included the resolution of one practical problem, answered by the learner by hand while still wearing the MS.

3.1 Materials

To evaluate the MS's reliability, two adaptations of the Attention Deficit and Hyperactivity Disorder (ADHD) test and a usability questionnaire were defined. The attention tests consisted of seven items based on the DSM-IV (Diagnostic and Statistical Manual of Mental Disorders) criteria [2]. The items chosen for the attention test were: 1. difficulty staying in one position; 2. difficulty sustaining attention; 3. difficulty keeping quiet, often interrupting others; 4. difficulty following through on instructions; 5. difficulty organizing tasks and activities; 6. difficulty with, or avoidance of, tasks that require sustained mental effort; and 7. difficulty listening to what is being said by others. Each item was adapted to assess attention both in class and at interaction time. To answer individual questions, participants were asked to choose the degree which they believed reflected their behavior on a 5-point Likert-type scale. For example, question 1 of the attention questionnaire asked the participant: 'How often is it difficult for me to remain seated in one position whilst working with algorithms in class/during the interaction?' with the answers 1) all the time, 2) most of the time, 3) some times, 4) occasionally, and 5) never. Note that for both questionnaires the same seven questions were asked, rephrased to refer to the class for the pre-test
Diagnostic and Statistical Manual of Mental Disorders.
Assessing NeuroSky’s Usability to Detect Attention Levels
153
or the interaction for the post-test. The usability questionnaire consisted of three questions adapted from three principles of usability: (a) comfort of the device; (b) ease of wearing; and (c) degree of frustration. To answer the usability questionnaire, participants were asked to select the degree to which they felt the MS fared during the interaction on a 5-point Likert-type scale. For example, question 1 of the usability questionnaire asked the student: 'Was using the NeuroSky...' 1) Very uncomfortable, 2) Uncomfortable, 3) Neutral, 4) Comfortable, 5) Very comfortable. Note that to report the usability of the MS, other factors were also considered, such as battery life, light indicators, and data read/write times and intervals.

3.2 Participants and Methodology

An evaluation (N=40) to assess the usability of the MS was conducted among first-year undergraduate students in the Informatics Department at the University of Veracruz, Mexico. The population consisted of 28 males and 12 females; 38 were undertaking the first year of their studies and 2 the third year. 26 students (65%) were 18 years old, 12 students (30%) were 19 years old and 2 students (5%) were 20 years old. The participants interacted with the AI-driven avatar for an average of 9.48 minutes, answering ten questions posed by the avatar within the assessment exercise (see the previous section). During the experiment, the following procedure was followed: 1) students were asked to read the consent form specifying the objectives of the study and were prompted to either agree or disagree; 2) students were asked to solve an online pre-test consisting of the adapted ADHD questionnaire to assess their attention levels in class; 3) students were instructed on how to use the learning environment; and finally 4) students were asked to answer an online post-test consisting of the usability questionnaire and the adapted ADHD questionnaire to assess their attention levels during the interaction in the assessment exercise. Individual logs registering the students' answers and attention levels as read by the MS were kept for analysis. All students agreed to participate in the experiment, but in some cases (N=6) the data was discarded since the MS did not produce readings for these participants. See the Results section for a description of these problems. Cases with missing data were not considered in the analysis.
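As a rough illustration of how such logs might be cleaned before analysis, the sketch below drops the -3 no-signal markers and excludes participants with no usable readings. This is not the authors' implementation; the one-reading-per-line log format is an assumption.

```python
# Minimal sketch of cleaning MindSet-style attention logs.
# Assumption (not from the paper): one integer reading per line; the
# device emits one value per second and -3 whenever no signal is read.

def load_attention_log(path):
    """Read one participant's log into a list of integer readings."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def clean_readings(readings, no_signal=-3):
    """Keep only valid attention values (0-100); None if none remain."""
    valid = [r for r in readings if r != no_signal and 0 <= r <= 100]
    return valid or None

def mean_attention(valid):
    """Average attention level over the session (readings at 1 Hz)."""
    return sum(valid) / len(valid)
```

Participants whose logs yield no valid readings, as happened in 6 of the 40 cases here, would simply be excluded from further analysis.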
4 Results

The results of this evaluation are organized around the MS's usability, how well the model fuses user-generated data with attention readings, and the MS's adaptability.

4.1 Usability and Appropriateness of the MS for Assessment Exercises

The main aspect of interest was the MS's usability, considering the responses to three questions (see the Materials section). This questionnaire considered three aspects commonly used to assess the usability of new computer-based devices: comfort, ease of use, and degree of frustration. The answers to the questionnaire are organized around each aspect considered; there was one question associated with each usability aspect.
Comfort. The results showed that for 5% (N=2) the MS was uncomfortable, for 10% (N=4) somewhat uncomfortable, for 35% (N=14) neither comfortable nor uncomfortable, for 25% (N=10) somewhat comfortable and for 25% (N=10) comfortable.

Ease of Use. The results showed 15% (N=6) of students found the MS difficult to wear, 12.5% (N=5) found it somewhat difficult to wear, 37.5% (N=15) thought it was neither easy nor difficult to wear, 12.5% (N=5) found it somewhat easy to wear and 22.5% (N=9) thought it was easy to wear.

Degree of Frustration. The answers showed 2.5% (N=1) found the experience frustrating, 2.5% (N=1) thought it was somewhat frustrating, 22.5% (N=9) found the experiment neither frustrating nor satisfactory, 25% (N=10) thought it was somewhat satisfactory and 47.5% (N=19) had a satisfactory experience using the MS.

There were three aspects that only became apparent once the evaluation was over. The first concerns the pace at which readings were collected. The attention model [10] considered readings over the span of time used by learners to formulate an answer to each question. The model sampled the data at 10 Hz, which produced repeated measurements in some logs. This way of collecting data is inefficient, as it makes it difficult to plot attention fluctuations over fixed, regular intervals. People interested in programming the MS device should consider that, due to a hardware processing delay, the MS outputs values at 1 Hz, and should design their algorithms accordingly. The second aspect concerns difficulties wearing the device. When the connection is lost, there is a delay of 7-10 seconds before a new reading is provided. Designers should take this into account, as a constant input stream might not be possible. The third aspect refers to the MS's suitability as an input device for interface control. Developers need to consider that attention levels (and associated patterns) vary considerably between users (see Figure 2), as expected. If developers employ higher levels of attention as triggers for interface or system changes, they should consider that some users normally exhibit higher levels of attention without being prompted to attend more closely. This normal variability creates the need to research and develop an algorithm to level out initial differences in attention levels and patterns. On a related topic, the MS's readings vary on a scale from 0 to 100 (see Figure 1); however, it is not yet clear what relationship exists between wave activity and the processed output, whether the scale is linear, or whether the granularity of the 100-point scale is appropriate for all users. Finally, there were some usability problems that caused data loss, in particular: 1. In 3 cases the MS did not fit the participant's head properly, leading to mid-session adjustments by the participants and, in turn, intermittent and unreliable readings. Participants with longer hair also had problems wearing the device in a way that allowed the sensors to touch the skin behind the ears at all times; during the experiment, extra time was required to make sure these participants placed the device adequately. 2. In another 3 cases the MS ran out of battery. The battery was checked with NeuroSky's associated software before each participant interacted with the assessment exercise. However, despite the precautions taken, and after having checked the green light on one side of the device, battery life was very short. The
device does not alert the user when battery levels are low, so it was not clear when batteries needed to be replaced. This was a problem at the beginning of the experiment, but later on batteries were replaced on a daily basis.

4.2 Adaptability to Different Users

One of the characteristics of the MS reader is that it can be worn by different users, producing different outputs. This would allow for adaptation of the model [10] in the frame of the assessment exercise. It was expected that MS outputs would vary for different users, reflecting varying levels of attention, and that this adaptation would be fast and seamless, without the need to train the device for a new user. To shed light on the issue of adaptability, it was hypothesized that attention readings would differ between individuals. It was also hypothesized that there would be a positive correlation between the readings and the self-assessed attention test (see the Materials section). To assess variability among participants, a test of normality was performed to examine the distribution of the participants' average attention levels. Table 1 shows descriptive statistics of the readings for the population (N=34). The results of a test of normal distribution showed that the data is normally distributed (Shapiro-Wilk = .983, p = .852), suggesting there is no tendency to replicate particular readings. Figure 2 illustrates the Q-Q plot for this sample, suggesting a good distribution of average attention levels during the assessment exercise.

Table 1. Descriptive statistics for average attention readings and self-reported attention (N=34)
Another test, designed to see whether MS readings adapted to individual participants, was a correlation between the readings and self-reported attention from the post-test questionnaire. A positive correlation was expected between these two variables. Table 1 shows the descriptive statistics for the two variables. To calculate self-reported attention levels, the mean of the answers to the 7 items of the attention post-test was calculated per participant; lower values indicate lower attention levels. The results of a Pearson's correlation between the two variables indicated a significant correlation (Pearson's r = -.391, p = .022).

Fig. 2. Q-Q plot of students' average attention levels during the assessment exercise
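Both statistical checks in this section, the Shapiro-Wilk test of normality on average attention levels and the Pearson correlation with self-reported attention, are available in standard packages. A sketch using SciPy follows (the paper does not name its statistics software); the data below are synthetic placeholders standing in for the N=34 participants.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data; the real values would come from the cleaned MS logs
# and from the mean of the 7 post-test attention items per participant.
avg_attention = rng.normal(55.0, 12.0, size=34)
self_reported = rng.normal(3.5, 0.8, size=34)

w, p_norm = stats.shapiro(avg_attention)                  # normality test
r, p_corr = stats.pearsonr(avg_attention, self_reported)  # correlation
print(f"Shapiro-Wilk W = {w:.3f} (p = {p_norm:.3f})")
print(f"Pearson r = {r:.3f} (p = {p_corr:.3f})")
```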
4.3 Fusing User-Generated Information with MS Readings

One way to analyze whether the data was fused correctly was to check the logs for missing or incorrect data. The results of this analysis showed that there were six participants (15% of the original sample, N=40) for which the MS did not produce accurate readings. An analysis of the logs for the remaining participants (N=34) showed the device produced readings throughout the length of the experiment (average time = 9.48 minutes) without a single erroneous datum (attention = -3). The lack of readings in the 6 cases was due to the usability problems described in Section 4.1. Another way of assessing how well the MS readings and user-generated data were fused consisted of analyzing the logs to see whether there was variation in the model's reactions across the sample. Since the reactions given by the AI-driven avatar could be of six types [10], the frequency of each reaction type was calculated for the entire population with correct NeuroSky readings (N=34); see Table 2.
Table 2. Frequencies associated with the model's reaction types for the population (N=34)

Reaction Type | 6 | 5 | 4 | 3 | 2 | 1
Frequency | 128 | 172 | 0 | 77 | 13 | 0
It was expected that the frequencies for Reaction Types 4 and 1 would be 0, given the averages of the four binary inputs. Reaction Type 5 was the most common, followed by Types 6, 3 and 2. Given the 8 possible results of averaging out the four binary inputs [10], it was expected that Reaction Type 3 would be the most frequent. However, this was not the case, suggesting the model did vary and that the reaction types provided were in accordance with the variations in attention, time, and whether answers were correct. Finally, the responses to two questions in the post-test questionnaire gave an indication of students' subjective perceptions of how adequate the reaction types were to their attention needs. The first question asked: 'How frequently did the reactions help you realize there was something wrong with the way you were answering the questions?' The answers showed 25% (N=10) of students felt the avatar helped them all the time, 20% (N=8) said most of the time, 35% (N=14) said sometimes, 12.5% (N=5) said rarely and 7.5% (N=3) said never. The second question asked how appropriate they thought the combined use of the MS and avatars was for computer-based educational purposes. Students answered with 65% (N=26) saying it was appropriate, 12.5% (N=5) saying it was appropriate most of the time, 15% (N=6) saying it was neither appropriate nor inappropriate, 2.5% (N=1) saying it was somewhat inappropriate and 5% (N=2) saying it was inappropriate.
5 Conclusions and Future Work

The reliability of MS readings for assessing attention levels and amalgamating them with user-generated data was evaluated in an assessment exercise in Second Life (N=34). The results showed there is variability in the readings and that they correlate with self-reported attention levels, suggesting the MS adapts to different users and provides accurate readings of attention. The results of analyzing the device's usability suggest some users
have problems wearing the device due to head size or hair interference, and that the device's signals indicating flat batteries are poor. By analyzing individual logs it was possible to determine that, when the device fits properly, the MS provides valid and constant data as expected. Log analyses also helped establish the frequency with which different reaction types were provided in the exercise in the light of attention variability. The frequencies suggested the model did not lean towards the most expected reaction (Type 3) but tended to be distributed amongst Reaction Types 5 and 6, providing an indication that user-generated data fused adequately with attention readings. When asked about their experience, 35% of the population said the avatar helped them realize there was something wrong with how they were answering the questions, and 65% indicated that using the MS in combination with avatars was appropriate in computer-based educational settings. When asked about comfort, 35% thought the device was neither comfortable nor uncomfortable, 37.5% thought it was neither easy nor difficult to wear, and 47.5% said they had a satisfactory experience with the device. There were other results that became apparent only after the evaluation. In particular, it was found that: 1) sampling rates need to be considered so as to organize data in fixed, regular intervals to determine attention; 2) developers need to be aware there is a delay when readings are lost due to usability issues; and 3) variability imposes new challenges for developers who wish to use levels of attention as input to control or alter interfaces. Future work includes the combination of MS readings with other technologies, such as the learner's gaze, body posture and facial expressions, to read visual attention. Future work will also be carried out to determine the degree of attention variability in order to program an algorithm capable of leveling out different patterns of attention. In addition, future work will explore how attention data can be used to develop learner models that help in understanding attention and engagement, informing game-based learning design and user modeling.
Acknowledgements The research team thank the NeuroSky Corporation for providing the device for testing purposes. We also thank the students, lecturers and support staff at the Faculty of Informatics, Universidad Veracruzana, Mexico.
References
1. Amershi, S., Conati, C., McLaren, H.: Using Feature Selection and Unsupervised Clustering to Identify Affective Expressions in Educational Games. In: Workshop on Motivational and Affective Issues in ITS, 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan (2006)
2. American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Press (1994)
3. Berger, H.: On the electroencephalogram of man. In: Gloor, P. (ed.) The Fourteen Original Reports on the Human Electroencephalogram, Amsterdam (1969)
4. Haynes, J.D., Rees, G.: Decoding mental states from brain activity in humans. Nature Reviews Neuroscience 7(7) (2006)
5. Linden, M., Habib, T., Radojevic, V.: A controlled study of the effects of EEG biofeedback on cognition and behavior of children with attention deficit disorder and learning disabilities. Applied Psychophysiology and Biofeedback 21(1) (1996) 6. Loudin, J.D., et al.: Optoelectronic retinal prosthesis: system design and performance. Journal of Neural Engineering 4, 72–84 (2007) 7. Manske, M., Conati, C.: Modelling Learning in an Educational Game. In: 12th Conference on Artificial Intelligence in Education, IOS Press, Amsterdam (2005) 8. Mason, S.G., Birch, G.E.: A general framework for brain-computer interface design. IEEE Transactions on Neural Systems and Rehabilitation Engineering 11, 70–85 (2003) 9. Poli, R., Cinel, C., Citi, L., Sepulveda, F.: Evolutionary brain computer interfaces. In: Giacobini, M. (ed.) EvoWorkshops 2007. LNCS, vol. 4448, pp. 301–310. Springer, Heidelberg (2007) 10. Rebolledo-Mendez, G., De Freitas, S.: Attention modeling using inputs from a Brain Computer Interface and user-generated data in Second Life. In: The Tenth International Conference on Multimodal Interfaces (ICMI 2008), Crete, Greece (2008) 11. Sona, D., Veeramachaneni, S., Olivetti, E., Avesani, P.: Inferring cognition from fMRI brain images. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4669, pp. 869–878. Springer, Heidelberg (2007) 12. Sitaram, R., et al.: fMRI Brain-Computer Interfaces. IEEE Signal Processing Magazine 25(1), 95–106 (2008) 13. Trejo, L.J., Rosipal, R., Matthews, B.: Brain-computer interfaces for 1-D and 2-D cursor control: designs using volitional control of the EEG spectrum or steady-state visual evoked potentials. IEEE Transactions on Neural Systems and Rehabilitation Engineering 14(2), 225–229 (2006) 14. Vaughan, T., et al.: Brain-computer interface technology: a review of the second international meeting. IEEE Transactions on Neural Systems and Rehabilitation Engineering 11(2), 94–109 (2003)
Effect of Body Movement on Music Expressivity in Jazz Performances Mamiko Sakata1, Sayaka Wakamiya2, Naoki Odaka2, and Kozaburo Hachimura3 1
Faculty of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe City, 610-0394 Japan [email protected] 2 Graduate School of Human Development and Environment, Kobe University [email protected], [email protected] 3 College of Information Science & Engineering, Ritsumeikan University [email protected]
Abstract. In this study, we tried to examine empirically how body motion contributes to music expressivity, both in terms of intensity and manner, during impromptu jazz performances. Psychological rating experiments showed that music expressivity in jazz performances is assessed in two aspects, namely power and aesthetic quality. In the assessment of musical performances, the music itself basically contributed to how observers evaluated its expressivity. However, it was also shown that body motion had a greater influence on assessing the quality of music in terms of 'hard or soft' and 'light or heavy.' As a result of the three-dimensional motion analysis using motion capture, we learned that the characteristics of the player's body motions changed with the playing style and the playing dynamics. The player, therefore, is making music not only by producing the 'sound,' but also by showing 'body motions' for creating that sound. Keywords: Jazz Performances, Music Expressivity, Body Movement, Motion Capture.
words, the visual expression of body movements. Recently, some studies (for example, those by Davidson [1], Okada [2], and Maruyama [3]) have made some reference to the role played by the body in musical performances, but such works are sporadic, and the discussion has really only just begun. This study aims to illuminate 'the visual roles of body movements during impromptu performances of jazz music' and to show empirically the modes and intensity of body movements that contribute to music expressivity. For this purpose, we employ 'Kansei' information processing techniques: motion capturing, feature extraction from motion data and statistical analyses. 'Kansei' is a Japanese word whose meaning is close to 'feeling' or 'sensibility' in English. Kansei information processing is a method of extracting features related to the Kansei conveyed by the media we receive. Conversely, it is also a method of adding or generating Kansei factors in media produced by computers [4]. We employ motion capturing techniques for obtaining images of human body motions. This technique is now commonly used in movie and CG animation production, and several systems are commercially available. This study uses motion capturing to analyze jazz performances and quantitatively analyze the roles played by body movements.
2 Study Subjects

For study subjects, we prepared materials with the help of professional jazz musicians in order to study the role of body movements in music expressivity. We asked a male alto-saxophone player (24 years old), a 10-year veteran, to play the jazz standard 'Summertime.' He played the front theme1, an ad-lib solo in the middle and the back theme for two choruses (with each chorus containing 16 bars). He was asked to play them in three different modes: 'ordinary,' 'expressionless' and 'over expressive.' In this study, these three modes of expression are defined as 'expression dynamics.' In order to retain the characteristic feature of freestyle jazz performance, in which players perceive each other's music expressivity in real time and respond to or follow one another, we asked other players to join in the performance with our subject. The backing band consisted of a drummer, a wood bass player and a guitarist. The drummer was asked to keep a BPM=120 tempo, and all the other players were asked to follow the alto-saxophonist's performance.
3 Motion Capture System

We used an optical motion capture system (Motion Analysis Corporation, EvaRT with Eagle cameras) to measure body movements during a jazz performance. Fig. 1 shows a scene from the motion capturing session in our studio. Reflective markers were attached to the joints of the player's body, and several high-precision, high-speed video cameras were used to track the motion. In our case, 33 + 2 (on the instrument) markers were put on the player's body (see Fig. 2), and the movement was
1 The pre-composed part of a jazz number is called the "theme". In common jazz performances, musicians play the theme first, then the solo, and then the theme again.
measured with 10 cameras. The acquired data can be viewed as time series of the three-dimensional coordinate values (x, y, z) of each marker in each frame (frame rate: 120 fps).
Fig. 1. Motion Capture
Fig. 2. Positions of markers
4 Psychological Rating Experiments

In order to determine what kinds of impressions are perceived from jazz performances, the object of our study, and how the different modalities, namely the 'sound' and the 'body,' contribute to musical expressivity, we conducted psychological rating experiments. Thirty-eight observers (20 men and 18 women) participated in this experiment. The mean and standard deviation of age among the 38 observers were 20.5 and 1.43, respectively. All observers had some training in jazz.

4.1 Stimuli for Experiments

The motion measured by the motion capture system described above was filmed by a digital camera (Sony) and then edited to produce experimental stimuli. The stimuli were obtained by editing the performances of the front theme and ad-lib solo played with the three different expression dynamics, and then further edited into the three modalities of 'sound only,' 'visual images only' and 'sound and visuals.' Table 1 shows the order in which the stimuli were presented (the modalities, dynamics and styles) and the length of each stimulus.

4.2 Procedure

We briefed the observers on the experiment, asked them to answer questions concerning their personal attributes and then presented the stimuli, one type at a time, in the order 'visual images only' -> 'sound only' -> 'sound and visuals.' We provided an interval after showing each type of stimulus twice. In the first showing of the stimuli, the subjects were asked to closely and carefully observe the stimuli. In the second showing, they were asked to fill in the Answer Sheet using the Assessment Words on
162
M. Sakata et al.
a scale from 1 to 7 for each word in the adjective pairs, which are shown in Table 2. The videotaped recording was temporarily stopped during the intervals between showing the different types of stimuli. The new stimulus was presented after making sure all subjects had finished filling in their Answer Sheet.

Table 1. Order of stimuli [duration values not recoverable]

Order | Modality | Expression dynamics | Style
1 | visual images only | ordinary | solo
2 | visual images only | expressionless | theme
3 | visual images only | over expressive | theme
4 | visual images only | over expressive | solo
5 | visual images only | expressionless | solo
6 | visual images only | ordinary | theme
7 | sound only | expressionless | solo
8 | sound only | over expressive | theme
9 | sound only | ordinary | theme
10 | sound only | over expressive | solo
11 | sound only | expressionless | theme
12 | sound only | ordinary | solo
13 | sound and visuals | over expressive | theme
14 | sound and visuals | expressionless | solo
15 | sound and visuals | over expressive | solo
16 | sound and visuals | expressionless | theme
17 | sound and visuals | over expressive | theme
18 | sound and visuals | ordinary | solo

Table 2. Assessment Words (20 adjective pairs)

loose-tight, soft-hard, powerful-weak, clear-unclear, impressive-unimpressive, have presence-no presence, neat-messy, plain-passionate, happy-sad, light-heavy, unique-ordinary, fantasy-like-realistic, rich-poor, beautiful-ugly, fast-slow, warm-cold, subdued-bright, inarticulate-articulate, favorable-unfavorable, good-bad
2 In defining the Assessment Words used in our experiment, we referred to Iwamiya [5] and added 10 pairs of adjectives of our own to create a group of Assessment Words, which we used in preliminary experiments. After running a factor analysis, we deleted the terms that did not load on any of the factors, removed one of each pair of highly correlated terms, and finally selected 20 pairs of adjectives.
4.3 Results of Kansei Assessment Experiment

The results of the assessments of each stimulus were converted into scores from 1 to 7 using the SD method. We also obtained the average for each adjective pair.

Extraction of KANSEI information from the stimuli. After conducting a principal component analysis based on the Kansei Assessment Scores obtained, we extracted two principal components with eigenvalues greater than 1. (The cumulative contribution rate was 0.879 up to the second principal component.) Table 3 shows the factor loading of each word pair on the two principal components. In the original table, loadings with a magnitude larger than 0.8 were shaded to mark the word pairs rated significant for each principal component.

Table 3. Results of PCA for the rating experiment

Assessment Words | PC1 | PC2
loose-tight | .882 | .122
soft-hard | .650 | .025
powerful-weak | .959 | -.141
clear-unclear | .948 | .252
impressive-unimpressive | .964 | .108
have presence-no presence | .974 | .158
neat-messy | -.649 | .737
plain-passionate | .932 | .102
happy-sad | .749 | -.524
light-heavy | .673 | -.385
unique-ordinary | .974 | -.076
fantasy-like-realistic | .859 | .036
rich-poor | .789 | .589
beautiful-ugly | .308 | .942
fast-slow | .743 | -.653
warm-cold | .618 | .640
subdued-bright | -.701 | .706
inarticulate-articulate | -.464 | .853
favorable-unfavorable | .490 | .846
good-bad | .468 | .865
Eigenvalue | 11.708 | 5.877
Variance (%, cumulative) | 58.539 | 87.925

Table 4. Result of multiple regression analysis (dashes mark adjective pairs for which no significant regression equation was obtained)

Assessment Words | Standardized coefficient, Sound | Standardized coefficient, Visuals | Adjusted R2
loose-tight | 0.851 | 0.204 | 0.931
soft-hard | 0.593 | 0.601 | 0.897
powerful-weak | 0.926 | 0.055 | 0.876
clear-unclear | 0.942 | 0.081 | 0.932
impressive-unimpressive | 0.999 | -0.031 | 0.928
have presence-no presence | 0.931 | 0.077 | 0.926
neat-messy | 0.773 | 0.222 | 0.914
plain-passionate | 0.841 | 0.193 | 0.959
happy-sad | 0.617 | 0.433 | 0.847
light-heavy | 0.587 | 0.656 | 0.959
unique-ordinary | - | - | -
fantasy-like-realistic | - | - | -
rich-poor | 1.028 | -0.047 | 0.991
beautiful-ugly | 0.943 | 0.051 | 0.945
fast-slow | 0.670 | 0.348 | 0.861
warm-cold | - | - | -
subdued-bright | 0.657 | 0.405 | 0.910
inarticulate-articulate | 0.687 | 0.341 | 0.969
favorable-unfavorable | 0.915 | 0.102 | 0.898
good-bad | 0.884 | 0.145 | 0.988
From Table 3, one can interpret PC1 as the variable concerned with the 'power' of a musical performance, and PC2 as the variable concerned with 'aesthetic quality.' We plotted the principal components PC1 and PC2 on the x- and y-axes and plotted the 18 types of stimuli on a graph (see Fig. 3). On this graph, the presence or power of a performance increases as you move toward the right, while the aesthetic quality increases as you move upward. From the PCA results, it became clear that musical expressivity in jazz performance is perceived from two aspects, namely 'power' and 'aesthetic quality.'
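The component extraction reported above can be reproduced along the following lines, assuming the ratings form an 18 stimuli x 20 adjective-pairs matrix of mean SD-method scores. The scikit-learn call is illustrative; the paper does not state which software was used, and the data here are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
ratings = rng.uniform(1, 7, size=(18, 20))   # placeholder rating matrix

pca = PCA().fit(ratings)
eigenvalues = pca.explained_variance_
kept = int(np.sum(eigenvalues > 1))          # eigenvalue-greater-than-1 rule
cumulative = np.cumsum(pca.explained_variance_ratio_)[kept - 1]

# Factor loadings as reported in Table 3: eigenvector * sqrt(eigenvalue).
loadings = pca.components_.T * np.sqrt(eigenvalues)
scores = pca.transform(ratings)[:, :kept]    # PC scores plotted in Fig. 3
print(f"components kept: {kept}, cumulative contribution: {cumulative:.3f}")
```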
Fig. 3. Plot of PCA score for each motion
The Body and the Music in Music Expressivity. To examine which of the two factors, 'body motions' and 'sound,' contributes more to music expressivity, we performed a multiple regression analysis for each of the adjective pairs listed in Table 2, using the Kansei Assessment Scores obtained by showing 'visual images only' (in other words, body motions only) and 'sound only' (in other words, music only) as independent variables. The Kansei Assessment Scores obtained by showing 'sound and visuals' were treated as dependent variables. As a result, we obtained a multiple regression equation (p < 0.05) with a high contribution rate for many of the adjective pairs, as shown in Table 4. Based on Table 4, we see that music contributes more than body motion (visual images) in the Kansei assessment as expressed by the words 'loose-tight,' 'powerful-weak,' 'clear-unclear,' 'impressive-unimpressive,' 'have presence-no presence,' 'neat-messy,' 'plain-passionate,' 'happy-sad,' 'rich-poor,' 'beautiful-ugly,' 'fast-slow,' 'subdued-bright,' 'inarticulate-articulate,' 'favorable-unfavorable' and 'good-bad.' On the other hand, body motion (visual images) contributes more than music in the Kansei assessment as expressed by words like 'soft-hard' and 'light-heavy.'
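For each adjective pair, this regression predicts the 'sound and visuals' ratings from the 'sound only' and 'visual images only' ratings. Below is a minimal NumPy sketch; standardizing all variables yields standardized (beta) coefficients of the kind reported in Table 4. The actual statistics package used is not stated in the paper.

```python
import numpy as np

def standardized_regression(sound, visuals, audiovisual):
    """Regress audiovisual ratings on sound-only and visuals-only ratings.
    Returns standardized (beta) coefficients and adjusted R^2."""
    z = lambda v: (v - v.mean()) / v.std(ddof=1)
    X = np.column_stack([z(sound), z(visuals)])
    y = z(audiovisual)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # no intercept: data centered
    resid = y - X @ beta
    n, k = len(y), X.shape[1]
    r2 = 1.0 - (resid @ resid) / (y @ y)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return beta, adj_r2
```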
5 Feature Values for Body Motion In the current study, the angles of each body part (back, sides, and knee), velocity (finger tips, elbow, sacral, head, and toe) and the distance moved on the floor (heel movement per unit time) were adopted as “feature values for body motion.” 5.1 Extracting Physical Parameters Angle. This parameter shows how the various body parts change in a time-series manner during musical performances. In our study, we measured the angles of the back, the body’s side and the knee. In Fig. 2, the angle created by marker numbers 5, 18 and 31 shows the angle of the back. The angle created by marker numbers 7, 4 and
18 is the angle of the side of the body. The angle created by marker numbers 19, 21 and 31 is the angle of the knee. For example, in the case of the back, we set the origin at marker no. 18 (x2, y2, z2) and measured the angle θ formed between marker no. 5 (x1, y1, z1) and marker no. 31 (x3, y3, z3). We calculated cos θ using Equation (1), recovered the angle in radians with the arccosine, and then converted it to degrees:

\cos\theta = \frac{(x_1-x_2)(x_3-x_2)+(y_1-y_2)(y_3-y_2)+(z_1-z_2)(z_3-z_2)}{\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}\;\sqrt{(x_3-x_2)^2+(y_3-y_2)^2+(z_3-z_2)^2}} \tag{1}

Velocity. This parameter shows the time-series change in the movement of the body parts during a musical performance. In the current study, we measured the speeds of the fingertips of the right hand (no. 11), elbow (no. 7), sacral (no. 18), head (no. 2) and toe (no. 27). For each marker, we obtained the Euclidean distance between consecutive frames from the data expressed in x, y and z coordinates, and multiplied this distance by the frame rate to obtain the time series of the velocity. When the x, y and z coordinates of a marker in frame i are written x_i, y_i and z_i, the distance d is

d = \sqrt{(x_{i+1}-x_i)^2+(y_{i+1}-y_i)^2+(z_{i+1}-z_i)^2} \tag{2}

Multiplying d by the frame rate of the motion data, 120 frames/sec, gives the velocity |v|:

|v| = 120\,d \tag{3}
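Equations (1)-(3) translate directly into code. The following is a minimal NumPy sketch, not the authors' implementation; the array layout of the marker data is an assumption, and the relative/absolute coordinate choices described next are left to the caller.

```python
import numpy as np

FPS = 120  # motion-capture frame rate

def joint_angle(p1, p2, p3):
    """Angle (degrees) at p2 formed by markers p1 and p3; Eq. (1).
    Each argument is an (n_frames, 3) array of x, y, z coordinates."""
    v1, v2 = p1 - p2, p3 - p2
    cos_t = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def marker_speed(p):
    """Frame-to-frame speed |v| = d * 120; Eqs. (2)-(3)."""
    d = np.linalg.norm(np.diff(p, axis=0), axis=1)  # Euclidean distance
    return d * FPS
```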
For the elbow, we used relative coordinates with the shoulder as the origin. The sacral marker was set as the origin to obtain relative coordinates for the head and toe. This means that the velocities of the elbow, head and toe are expressed as relative velocities based on the shoulder and the sacral marker. We obtained the velocities of the fingertips of the right hand and of the sacral marker using absolute coordinates, with an origin determined in the capture area.

Floor Travel Distance. This parameter shows how much the player moved on the floor during the performance. We obtained the distance traveled by the left heel (no. 32) frame by frame.

5.2 Feature Values for Body Motion

By conducting a principal component analysis on the parameters (raw data) described in Section 5.1, we extracted three components with eigenvalues greater than 1. (The cumulative contribution rate up to the third principal component was 0.782.) From the factor loadings shown in Table 5, we can interpret PC1 as the component showing the velocity of the upper part of the body, PC2 as the component showing floor travel distance, and PC3 as the component showing the bending of the body.
Table 5. Results of PCA for the motion capture data

Physical parameters | PC1 | PC2 | PC3
Angles of the back | -.088 | .181 | .240
Angles of the body's side | .285 | .601 | .801
Angles of the knee | -.396 | .628 | -.107
Speed of the hand | .925 | .251 | -.053
Speed of the elbow | .922 | .278 | -.093
Speed of the head | .922 | .258 | -.155
Speed of the sacral | .913 | .065 | .000
Speed of the toe | .883 | -.312 | .091
Floor travel distance | .422 | -.744 | .190
Eigenvalue | 4.572 | 1.371 | 1.094
Variance (%, cumulative) | 50.798 | 66.029 | 78.186
Fig. 4. Plot of PCA score for each motion by motion capture data
In the left graph of Fig. 4, the centers of gravity for PC1 and PC2 are plotted on the x and y axes for the six types of stimuli. In this figure, the further right a stimulus lies, the greater the velocity of the upper part of the body; the lower it lies, the greater the floor travel distance. Looking at each stimulus, the player showed greater floor travel when playing solo than when playing the theme. In terms of expression dynamics, the velocity of the upper part of the body increased in the order 'ordinary' -> 'expressionless' -> 'over expressive.' Likewise, we plotted the centers of gravity for PC1 and PC3 on the x and y axes for each stimulus, shown in the right graph of Fig. 4. Here, the further right a stimulus lies, the greater the velocity of the upper part of the body, and the lower it lies, the greater the bending of the body. This graph shows that playing solo, rather than the theme, resulted in greater body bending, and that for both solo and theme the player's body bending was greatest in the 'ordinary' mode of playing.
5.3 Relationship between Kansei Assessment and Feature Values for Body Motion

We calculated the average and standard deviation of each of the nine parameters obtained in Section 5.1 for each stimulus. In order to examine the relationship between the Kansei Assessment and the feature values for body motion, we calculated the coefficient of correlation between the principal component scores of each performance obtained in Section 4 and the body-motion feature values of each performance (see Table 6). The shaded areas in the original table show the combinations with a significant correlation at the 5% level. For the Power component, we found a significant correlation with all body-motion parameters except 'average angle of the body's side,' 'average floor travel distance' and 'standard deviation of floor travel distance.' For the Aesthetics component, on the other hand, we could not find a correlation with any of the body-motion parameters.

Table 6. Correlation matrix between the principal component scores and the mean and SD of the nine body-motion parameters [correlation values not recoverable]
This means that in musical performance, body motion contributes in large measure to the Power component, but not to the Aesthetics component.
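The analysis behind Table 6 is a set of pairwise Pearson correlations with a 5% significance flag. A minimal SciPy sketch follows; the dictionary-based data layout is an assumption, not the authors' code.

```python
import numpy as np
from scipy import stats

def correlation_table(pc_scores, motion_features, alpha=0.05):
    """Correlate one component's scores with each body-motion feature.
    motion_features maps a feature name (e.g. 'mean back angle') to an
    array of per-performance values aligned with pc_scores."""
    table = {}
    for name, values in motion_features.items():
        r, p = stats.pearsonr(pc_scores, np.asarray(values))
        table[name] = (r, p, p < alpha)   # flag significance at 5%
    return table
```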
6 Discussion and Conclusion

In this study, we tried to examine empirically how body motion contributes to music expressivity, both in terms of intensity and manner, during impromptu jazz performances. Psychological rating experiments showed that music expressivity in jazz performances is assessed in two aspects, namely power and aesthetic quality. In the Kansei assessment of musical performances, the music itself basically contributed to how observers evaluated its expressivity. However, it was also shown that body motion
had a greater influence on assessing the quality of music in terms of 'hard or soft' and 'light or heavy.' As a result of the three-dimensional motion analysis using motion capture, we learned that the characteristics of the player's body motions changed with the playing mode and the playing dynamics. The player, therefore, is making music not only by producing the 'sound,' but also by showing 'body motions' for creating that sound. It was found that body motions play a great role in creating 'power,' but are not much related to 'aesthetic quality.' Naturally, the Kansei emanating from the sound itself is central to music expressivity. However, we have shown empirically that the body motions people make when making music also contribute greatly to music expressivity. This study offers a basic examination of the role of body motions in musical performances; however, many challenges and problems still remain to be explored.

Acknowledgments. This work was supported in part by a Grant-in-Aid for Scientific Research for Young Scientists (B), No. 19700493, from the Ministry of Education, Culture, Sports, Science and Technology, Japan. The authors would like to express their sincere gratitude to Mr. Junya Kondo for his cooperation with our research. Thanks are also due to Mr. Takahiro Yorino, Ms. Hitomi Toyoka and Mr. Naoyuki Okamoto for their kind help in the motion capture experiments.
References 1. Davidson, J.W.: Visual Perception of performance manner in the movements of Solo Musicians. Psychology of Music 21, 103–113 (1993) 2. Okada, A.: The body playing the piano. Shunjusha Publishing Company (2003) (in Japanese) 3. Maruyama, S.: The Embodied Sense of Music: Case Studies on the Rhetorical Function of Bodily Gestures by Highly Practiced Musicians. Cognitive studies: bulletin of the Japanese Cognitive Science Society 14(4), 471–493 (2007) (in Japanese) 4. Sakata, M., Hachimura, K.: KANSEI Information Processing of Human Body Movement. In: Smith, M.J., Salvendy, G. (eds.) HCII 2007. LNCS, vol. 4557, pp. 930–939. Springer, Heidelberg (2007) 5. Iwamiya, S.: Multimodal Communication of Music and Image. Kyushu University Press (2000) (in Japanese) 6. Iwamiya, S.: Design of Sounds. Kyushu University Press (2007) (in Japanese) 7. Nagashima, Y.: Drawing-in Effect on Perception of Beats in Multimedia. The Journal of the Society for Art and Science 3(1), 108–148 (2004)
A Method to Monitor Operator Overloading Dvijesh Shastri, Ioannis Pavlidis, and Avinash Wesley Computational Physiology Lab, Department of Computer Science, University of Houston, Houston, TX, 77204 {dshastri,ipavlidis,awesley}@uh.edu
Abstract. This paper describes research that aims to quantify stress levels of operators who perform multiple tasks. The proposed method is based on the thermal signature of the face. It measures physiological function from a standoff distance and therefore, it can unobtrusively monitor a machine operator. The method was tested on 11 participants. The results show that multi-tasking elevates metabolism in the supraorbital area, which is an indirect indication of increased mental load. This local metabolic change alters heat dissipation and thus, it can be measured through thermal imaging. The methodology could serve as a benchmarking tool in scenarios where an operator’s divided attention may cause harmful outcomes. A classic example is the case of a vehicle driver who talks on the cell phone. This stress measurement method when combined with user performance metrics can delineate optimal operational envelopes. Keywords: Human-Machine Interaction, divided attention, stress, thermal imaging.
2 Methodology

During dual tasking there is a considerable temperature increase in the supraorbital region of a participant. This locally elevated temperature is the result of increased metabolic activity due to activation of the forehead muscle group. The phenomenon is consistent with findings in prior experiments involving Stroop color conflict testing [1]. In that previous work, the stress signal was extracted from the evolution of the mean thermal footprint of the entire supraorbital region (see Fig. 1a). This approach, however, introduced noise into the extracted signal, partly due to the wide probing area and partly due to sub-optimal tracking performance [4]. In the present work, the tracking region was differentiated from the measurement region. An even bigger area that included sharp contrasts (e.g., skin versus hair) was selected for tracking, which improved tracking performance. However, only a small subset within the tracking region was selected for the thermal measurement. This subset was confined to the area where metabolic changes are most dramatic, to reduce the effect of probing noise (Fig. 1b).
Fig. 1. (a) Tracking and measurement regions coincide in the legacy method. (b) Measurement region (in pink) is a subset of the tracking region in the current method.
For every participant in the experiment, tracking and measurement regions of interest were selected as described above. The mean temperature of the measurement region of interest was computed for every frame in the thermal clip. Thus, a 1D supraorbital signal was produced from the 2D thermal data. Any residual noise in the supraorbital signal was suppressed by a noise cleaning algorithm based on Fast Fourier Transform (FFT) [5]. The supraorbital signal was split into segments corresponding to the phases of the experiment (resting, initial single task, dual task, latter single task, and cool-off). Each segment was approximated with a linear fit (Fig. 2), which described the local metabolic rate at the time.
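The pipeline in this section (per-frame ROI mean -> FFT-based noise suppression -> per-segment linear fit) can be sketched as follows. This is an illustration only: the cutoff frequency and segment boundaries are assumptions, not the authors' values.

```python
import numpy as np

def roi_signal(frames, mask):
    """Mean temperature of the measurement ROI for each thermal frame.
    frames: (n, H, W) temperatures; mask: boolean (H, W) ROI."""
    return frames[:, mask].mean(axis=1)

def fft_lowpass(signal, fps, cutoff_hz=0.1):
    """Suppress residual noise by zeroing FFT bins above a cutoff
    (cutoff chosen here for illustration)."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    spec[freqs > cutoff_hz] = 0
    return np.fft.irfft(spec, n=len(signal))

def segment_slopes(signal, fps, boundaries_sec):
    """Linear-fit slope (degrees/s) for each experimental segment,
    used as the local metabolic-rate indicator."""
    idx = [int(b * fps) for b in boundaries_sec]
    slopes = []
    for a, b in zip(idx[:-1], idx[1:]):
        t = np.arange(a, b) / fps
        slopes.append(np.polyfit(t, signal[a:b], 1)[0])
    return slopes
```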
3 Experimental Design

A high-quality Thermal Imaging (TI) system was used for data collection. The centerpiece of the TI system was a ThermoVision SC6000 Mid-Wave Infrared (MWIR) camera (FLIR Systems; sensitivity = 0.025 °C) [12]. The experimental protocol included thermal imaging of the participant's face while resting, engaging in a
driving simulation game (single task), engaging in a driving simulation game while talking on the cell phone (dual task), and relaxing. The dataset featured participants of both genders, different races, and varying physical characteristics. The participants were placed 6 feet away from the thermal camera (Fig. 3). We used an XBOX 360 game console and the game Test Drive: Unlimited to simulate driving. The participants were asked to follow all traffic signs, drive normally, and not race during the experiment. They were given an opportunity to test-drive before the experiment began to acquaint themselves with the driving simulation. In the first formal phase of the experiment, the participants were asked to rest for 5 min while being imaged. This helped to isolate the effects of other stress factors that participants may have brought with them. This was the baseline phase.
Fig. 2. Supraorbital raw thermal signal (marked in blue color), noise cleaned thermal signal (marked in pink color) and linear fitting (marked in yellow color). Slope values for the linear segments are shown in blue colored text.
Next, the participants were asked to play the driving simulation game. This phase of the experiment also lasted about 5 min. After around 1 min of driving simulation (the initial single task phase), the participant received a cell phone call that played a set of prerecorded questions in the following order:

Instruction: Please do not hang up until you are told so.
Q1: Are the lights ON in the room, yes or no?
Q2: Are you a male or female?
Q3: Who won the American civil war, the north or the south?
Q4: What is 11 + 21?
Q5: How many letters 'e' are in the word experiment?
Q6: I am the son of a mom whose mother in law's son hit. How am I related to the other son?
Q7: My grandma's son hit his son. How are the sons related?
Q8: A man is injured in 1958 and died in 1956. How is that possible?
Q9: What is 27 + 14?
Instruction: You may now hang up the phone and pay attention to the game.
The question set was a combination of basic, logical, simple math, and ambiguous questions. The order of the questions was designed to build up pressure on the participants. Additional pressure was created by repeating once every question that was answered incorrectly. The participants were supposed to keep driving while talking on the cell phone (the dual task phase). At the end of the phone conversation, participants put the phone down and continued driving until the end of the experiment (the latter single task phase). Finally, the participants relaxed for 5 minutes. The purpose of this so-called cool-off segment was to monitor physiological changes after the simulated driving experiment.
4 Experimental Results

The slopes of the linearly fitted segments, computed according to the method described in Section 2, were used as stress indicators. Fig. 4 shows the mean slope values of the various segments for the entire data set (a statistically constructed mean participant). The graph clearly indicates that the temperature increase during dual tasking is the highest among all segments (the sole exception was participant S-6). Since the temperature increase is correlated with metabolic rate, the results indicate elevated metabolism in the supraorbital region during dual tasking. This is presumably due to strong muscle activation associated with frowning, a facial expression autonomically linked to mental engagement. Stress during the latter single task phase was stronger than during the initial single task phase (Fig. 4). Apparently, this was due to a residual effect from the dual task that preceded the latter single task phase. Most of the participants admitted during debriefing that they were thinking about their dual task performance while performing the latter single task.
Fig. 4. Mean slope value of the experimental segments. This stress indicator is the highest during dual tasking.
Interestingly, baseline stress was a bit higher than the initial single task stress. This indicates either that the baseline was poorly designed (just sitting idle can be stressful) or that participants carried some residual stress from the informal test-driving phase that preceded it. Participant S-6 is an interesting case. It appears that his stress started decreasing in the middle of dual tasking (Fig. 5). On careful examination of the data, the investigators found that this is the only participant who started perspiring in the middle of the experiment, apparently due to overwhelming stress. Perspiration reduces the thermal signature, and the current method wrongly interprets this as a lower metabolic rate and thus lower stress, when it is exactly the opposite. A method that identifies the emergence of perspiration and switches measurement metrics is needed to overcome this issue. In all cases, the rate of thermal change of the cool-off segment had an opposite global trend to that of the dual task segment. In most cases, the rate of thermal change of the cool-off segment also had an opposite global trend to those of the initial and latter single task segments. This illustrates that the participants indeed felt relaxed after 5 minutes of intense mental activity, but to various degrees. Those participants who were thinking about their performance during the cool-off period exhibited slow recovery. This is an interesting finding, as it illustrates that not only an action but also thoughts about the action (a past action in this case) can affect stress levels. Performance of the drivers degraded during the dual task segment, as measured by the point system of the simulator, and was inversely proportional to the mean stress level measured through the supraorbital channel.
Fig. 5. During the dual task period, the supraorbital signal (marked in blue) of S-6 showed an ascending global trend in the first half and then a descending global trend in the second half (marked in green). The culprit is the onset of perspiration.
5 Conclusion This research brings to the fore a stress quantification method ideally suited to situations where the attention of the machine operator is divided. Unobtrusive quantification of stress and its correlation to operator performance and emotions are of singular importance in man-machine interaction. A feedback system can be developed that alerts the operator about his/her stress status based on the facial thermal signature. Results from a pilot experiment on the effect of cell phone communication during driving are more than encouraging. They open the way for a plethora of other multitasking experiments drawn from daily life. At the technical level, the issue of perspiration, which corresponds to the onset of extreme stress, cannot be handled with the current method. A method that identifies perspiratory patterns and handles thermal computation in a different manner from that point onward is needed in the future.
Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. #ISS-0812526, entitled “Do Nintendo Surgeons Defy Stress.” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References 1. Puri, C., Olson, L., Pavlidis, I., Levine, J., Starren, J.: StressCam: Non-contact measurement of users’ emotional states through thermal imaging. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, pp. 1725–1728 (2005)
2. Pavlidis, I., Eberhardt, N.L., Levine, J.: Human behavior: Seeing through the face of deception. Nature 415, 35 (2002)
3. Pavlidis, I., Levine, J.: Thermal image analysis for polygraph testing. IEEE Engineering in Medicine and Biology Magazine 21(6), 56–64 (2002)
4. Dowdall, J., Pavlidis, I., Tsiamyrtzis, P.: Coalitional tracking. Computer Vision and Image Understanding 106, 205–219 (2007)
5. Tsiamyrtzis, P., Dowdall, J., Shastri, D., Pavlidis, I., Frank, M.G., Ekman, P.: Imaging facial physiology for the detection of deceit. International Journal of Computer Vision 71(2), 197–214 (2006)
6. Standage, D.I., Trappenberg, T.P., Klein, R.M.: A continuous attractor neural network model of divided visual attention. In: Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 5, pp. 2897–2902 (2005)
7. Yamakoshi, T., Yamakoshi, K., Tanaka, S., Nogawa, M., Shibata, M., Sawada, Y., Rolfe, P., Hirose, Y.: A Preliminary Study on Driver's Stress Index Using a New Method Based on Differential Skin Temperature Measurement. In: 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 722–725 (2007)
8. Yamaguchi, M., Wakasugi, J., Sakakima, J.: Evaluation of Driver Stress using Biomarker in Motor-vehicle Driving Simulator. In: 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1834–1837 (2006)
9. Yili, L.: A queuing network model of human performance of concurrent spatial and verbal tasks. IEEE Transactions on Systems, Man and Cybernetics 27(2), 195–207 (1997)
10. Kenmochi, A., Takaki, Y., Fukuzumi, S.: Psychological tension estimation during the use of a driving simulator: A finger and ear pulse volume study. In: Proceedings of the 18th Annual International Conference of the IEEE Bridging Disciplines for Biomedicine, vol. 5, pp. 1804–1805 (1996)
11. Healey, J.A., Picard, R.W.: Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems 6, 156–166 (2005)
12. FLIR Systems, 70 Castilian Dr., Goleta, California 93117, http://www.flir.com
Decoding Attentional Orientation from EEG Spectra Ramesh Srinivasan, Samuel Thorpe, Siyi Deng, Tom Lappas, and Michael D’Zmura Department of Cognitive Sciences, UC Irvine, SSPA 3151, Irvine, CA 92697-5100 {r.srinivasan,sthorpe,sdeng,tlappas,mdzmura}@uci.edu
Abstract. We have carried out preliminary experiments to determine if EEG spectra can be used to decode the attentional orientation of an observer in three-dimensional space. Our task cued the subject to direct attention to speech in one location and to ignore simultaneous speech originating from another location. We found that during the period when the subject directs attention to one location in anticipation of the speech signal, EEG spectral features can be used to predict the orientation of attention. We propose to refine this method by training subjects with feedback to improve classification performance. Keywords: EEG, attention, orienting, classification.
top-down instruction to orient attention in one direction (or to one location) rather than the orienting elicited by a salient stimulus (bottom-up). Recently, a number of EEG and fMRI studies have been directed at this question and have demonstrated preparatory neural activity when attention is directed by instruction (cued) to one location on a computer screen [12-15]. The fMRI studies examined retinotopically mapped areas of the visual system and demonstrated preparatory increases in neuronal activity as indexed by the BOLD signal. EEG studies showed an increased alpha rhythm in the parietal cortex ipsilateral to the attended visual field. Although these studies have elucidated some of the mechanisms of top-down attentional orienting, they are of limited usefulness for developing a BCI that can decode attentional orientation. In general, orienting attention takes place in a larger sensory field, not just a limited sector of visual space within 10-12 degrees of fixation on a computer monitor. In audition, perception of sources takes place in all directions, even behind the subject. Thus, in our experiment observers selectively attend to auditory rather than visual stimuli in order to investigate attentional orientation across a wider span of sensory space. We have carried out experiments directing attention to one of two directions and identified the attended direction by classification of the EEG spectra. Our results suggest that attentional orientation can potentially be decoded from the EEG, but that further work is needed to train the observers and improve classification methods.
2 Methods

Procedure. Six subjects participated in a speech perception experiment (see Figure 1). The subject was seated in a dimly lit room between two speakers (each at 1 m distance) and instructed to fixate on a point. There were two experimental conditions: attend left or attend right. The subject was given the instruction through both speakers. After a variable ISI (500, 700, 900, 1100, or 1300 ms), two different speech stimuli were presented, one through each speaker. The speech stimuli were synthesized (http://cepstral.com) in two distinct male voices, one played through each speaker; the assignment of voices to speakers was independently randomized on each trial. The stimuli were a simplified version of the Coordinate Response Measure corpus [16]. These sentences were structured as '(Arrow, Baron, Eagle or Tiger) go to (Blue, Green, Red, or White) now.' The subjects' task was to identify the two words played through the attended speaker with a response on a keypad. In the example shown in the figure, the correct response is 'Baron' and 'Blue.' The experiment was designed in this manner to demand that the subject direct attention to the correct speaker before the speech was played; otherwise the subject would miss the first code word. The variable ISI and randomized voices ensured that the observer quickly deployed and maintained attention to the appropriate speaker. In addition, an adaptive staircase procedure was used to control subject performance. When the subject responded correctly to the first word, the volume was reduced by 5% on the attended speaker and increased by 5% on the unattended speaker on the next trial. When the subject responded incorrectly to the first word, the volume was increased by 10% on the attended speaker and decreased by 10% on the unattended speaker. This procedure resulted in subjects performing the task correctly about 70% of the time.
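The adaptive staircase is easy to state in code. A minimal sketch follows, under the assumption that the speaker volumes are simple scalar gains; the update percentages come directly from the text above.

```python
def update_volumes(attended, unattended, first_word_correct):
    """Asymmetric staircase controlling task difficulty trial by trial."""
    if first_word_correct:
        attended *= 0.95     # attended stream 5% quieter on the next trial
        unattended *= 1.05   # distractor 5% louder
    else:
        attended *= 1.10     # attended stream 10% louder
        unattended *= 0.90   # distractor 10% quieter
    return attended, unattended
```

Because the penalty for an error (10%) is twice the reward for a correct response (5%), the procedure converges to a performance level above chance, here about 70% correct.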
Fig. 1. The experimental setup. A. The physical layout of the experiment. Note that each speaker was 45 degrees away from fixation and could not be seen by the observer without moving the eyes. B. The time course of each trial in the experiment. In the example shown, the instruction is to attend to the right and, after a variable ISI, two distinct sentences are played, one through each speaker. In the example shown, the subject responds "Baron" and "Blue" and receives feedback indicating a correct response.
At this threshold, the amplitude of the attended speaker was typically 30 dB below the unattended speaker. A single experimental session comprised 200 trials, presented in two 100-trial blocks with a break. Each subject participated in three such sessions, each lasting around an hour.

EEG recording. EEG was recorded using a 128-channel Geodesic Sensor Net (Electrical Geodesics, Inc., Eugene, OR, USA) in combination with an amplifier and acquisition software (Advanced Neuro Technology, Inc., Enschede, NL). The EEG was sampled at 1024 Hz and on-line average referenced. Artifact editing was performed through a combination of automatic editing using an amplitude threshold and manual editing to check the results. Trials with excessive bad channels (> 15%) were first discarded and then channels with bad trials were discarded. This typically yielded 80–100 usable EEG channels (out of 128) and 500–550 usable trials.

Data Analysis. We focused on three intervals within the ISI – 0-500, 200-700, and 400-900 ms. For the later intervals, we discarded trials where the target stimulus had already started, resulting in different numbers of trials for the three intervals – 600, 480, and 360. The data were Fourier transformed using an FFT (Matlab) for each 500 ms interval (Δf = 2 Hz) and the power spectrum calculated as the squared magnitude of the Fourier coefficients. We limited further analyses to the interval from 4-22 Hz.
This was motivated by the goal of this study to identify robust spectral features that can predict attentional orientation. Below 4 Hz the EEG is often contaminated with movement and eye blink artifacts. Above 20 Hz the EEG is often contaminated with EMG artifacts. EEG power at each frequency was log transformed and normalized against the total power from 4-22 Hz.

Classification. Classification was performed using a naïve Bayes classifier (Matlab), which assumes independent variables, i.e., a diagonal covariance matrix. The data were divided into three conditions based on instruction and performance: Correct Left, Correct Right, and Incorrect.
Fig. 2. Examples of classification performance versus the number of variables for the three classification intervals. Overall S0 had the best classification performance and S5 had the worst performance. The other subjects showed intermediate levels of classification performance.
There were roughly equal numbers of Correct Left and Correct Right trials, and 10-20% fewer Incorrect trials. To facilitate comparisons in classification performance between the analysis intervals, we used a fixed number (determined by the 400-900 ms interval) of randomly selected "training" trials (typically 150) to calculate a linear classifier, which was applied to classify another 150 "test" trials. The classification proceeded in two steps. First we evaluated the performance of each individual variable (80-100 channels x 10 frequencies) in classification of the "test" data, using 30 random samples of "training" and "test" trials. Then we evaluated the performance of the best 10, 20, 50, 75, and 100 variables using 50 random samples of "training" and "test" trials.
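(To make the analysis pipeline concrete (500 ms epochs, an FFT with Δf = 2 Hz, log power normalized over 4-22 Hz, and a diagonal-covariance naïve Bayes classifier), the following Python sketch reconstructs the main steps with NumPy and scikit-learn. It is an illustration under stated assumptions, not the authors' Matlab code; the random stand-in data, array shapes and train/test split are placeholders.)

import numpy as np
from sklearn.naive_bayes import GaussianNB

FS = 1024                                        # EEG sampling rate (Hz)

def spectral_features(epochs):
    # epochs: (n_trials, n_channels, 512), i.e. 500 ms of EEG per trial
    power = np.abs(np.fft.rfft(epochs, axis=-1)) ** 2         # 512 samples -> df = 2 Hz
    freqs = np.fft.rfftfreq(epochs.shape[-1], d=1.0 / FS)
    band = (freqs >= 4) & (freqs <= 22)                       # the 10 analysis frequencies
    p = power[..., band]
    p = np.log(p / p.sum(axis=-1, keepdims=True))             # log power, normalized to band total
    return p.reshape(len(epochs), -1)                         # trials x (channels * frequencies)

# Stand-in data: 300 trials, 90 usable channels; labels 0/1/2 = Correct Left/Right/Incorrect
rng = np.random.default_rng(0)
epochs = rng.standard_normal((300, 90, 512))
labels = rng.integers(0, 3, 300)
X = spectral_features(epochs)
clf = GaussianNB().fit(X[:150], labels[:150])    # naive Bayes = diagonal covariance
print("3-way accuracy on held-out trials:", clf.score(X[150:], labels[150:]))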
Fig. 3. Power spectra from the channel that provided the best three-way classification for each subject. For 3 subjects (S0, S3, and S7) the channel shown is located over parietal cortex. For the other subjects (S2, S4 and S5) the channel shown is located over the frontal lobes. Note that spectral differences are present at most frequencies.
Fig. 4. Topographic distribution of channel sensitivity to attentional orientation. Each channel's classification ability was scored and averaged across frequencies. The results were plotted as a topographic map, where white shows the highest predictive power and black indicates no predictive power.
3 Discussion

In a simple cued spatial attention experiment we were able to demonstrate preliminary results indicating that the EEG contains information that can be used to decode the orientation of attention in space. Our approach here has been very direct and overly
simplistic. We have made use of spectral features of the EEG and a naïve Bayes classification scheme. Both approaches can be significantly improved. Despite the limitations of these methods, we were able to achieve as high as 75% classification performance in 2-way classification and 60% classification performance in 3-way classification. This indicates that the EEG clearly contains information that can be used to decode attentional orientation for BCI applications. The data also suggest that the signatures of attentional orientation can be more robustly decoded at a longer interval following the cue. In our comparisons, 400-900 ms after the cue provided the best classification in each subject. This is consistent with theoretical and experimental studies of the episodic nature of attention, which suggested that following the cue the attention window takes at least 300 ms to open [17]. How long this window remains open and whether more robust classification can be obtained remains to be tested. Estimating the duration for which the neural signatures of attentional orientation are sustained will require a larger amount of data. Our results suggest that electrodes over frontal and parietal cortex have the greatest sensitivity to attentional orientation. This finding is consistent with a large body of experimental research with EEG and fMRI indicating that large-scale networks spanning parietal and frontal cortex mediate selective attention [18-20]. More surprising was the lack of frequency specificity in our classification results. Previous reports had suggested that occipital/parietal alpha rhythms would be sensitive to attentional orientation [15, 21]. However, those studies used visual displays where attention was directed to one of two regions within 4-6 degrees of fixation. Thus, those results were specifically related to attentional orienting within the narrowly defined retinotopically mapped visual space. Our results relate to orienting to two regions of auditory space separated by 90 degrees and out of the field of view while fixating. The results of this paper are preliminary and indicate the potential for an attentional orientation BCI. An important factor we have not yet considered is training subjects to optimize the BCI. Our current work extends this study through a consideration of the full 360 degrees of auditory space surrounding the subject.
Acknowledgements

This work was supported by ARO 54228-LS-MUR.
References

1. Desimone, R., Duncan, J.: Neural mechanisms of selective visual attention. Annu. Rev. Neurosci. 18, 193–222 (1995)
2. Egeth, H.E., Yantis, S.: Visual attention: control, representation, and time course. Annual Review of Psychology 48, 269–297 (1997)
3. Kastner, S., Ungerleider, L.G.: Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience 23, 315–341 (2000)
4. Spence, C., Pavani, F., Driver, J.: Crossmodal links between vision and touch in covert endogenous spatial attention. Journal of Experimental Psychology: Human Perception and Performance 26, 1298–1319 (2000)
5. Spence, C., Driver, J.: Crossmodal attention. Current Opinion in Neurobiology 8, 245–253 (1998)
6. Moore, T., Armstrong, K.M., Fallah, M.: Visuomotor origins of covert spatial attention. Neuron 40, 671–683 (2003)
7. Sheliga, B.M., Riggio, L., Rizzolatti, G.: Orienting of attention and eye movements. Experimental Brain Research 98, 507–522 (1994)
8. Hillyard, S.A., Anllo-Vento, L.: Event-related brain potentials in the study of visual selective attention. PNAS 95, 781–787 (1998)
9. Kastner, S., Pinsk, M., De Weerd, P., Desimone, R., Ungerleider, L.: Increased activity in human visual cortex during directed attention in the absence of visual stimulation. Neuron 22, 751–761 (1999)
10. Corbetta, M., Shulman, G.L.: Control of goal- and stimulus-driven attention in the brain. Nature Reviews Neuroscience 3, 201–215 (2002)
11. Ding, J., Sperling, G., Srinivasan, R.: SSVEP power modulation by attention depends on the network tagged by the flicker frequency. Cerebral Cortex 16, 1016–1029 (2006)
12. Sylvester, C., Jack, A., Corbetta, M., Shulman, G.: Anticipatory Suppression of Nonattended Locations in Visual Cortex Marks Target Location and Predicts Perception. Journal of Neuroscience 28(26), 6549–6556 (2008)
13. Shulman, G., Ollinger, J., Akbudak, E., Conturo, T., Snyder, A., Petersen, S., Corbetta, M.: Areas Involved in Encoding and Applying Directional Expectations to Moving Objects. The Journal of Neuroscience 19(21), 9480–9496 (1999)
14. Fu, K., Foxe, J., Murray, M., Higgins, B., Javitt, D., Schroeder, C.: Attention-dependent suppression of distracter visual input can be cross-modally cued as indexed by anticipatory parieto-occipital alpha-band oscillations. Cognitive Brain Research 12, 145–152 (2001)
15. Worden, M., Foxe, J., Wang, N., Simpson, G.: Anticipatory Biasing of Visuospatial Attention Indexed by Retinotopically Specific Alpha-Band Electroencephalography Increases over Occipital Cortex. The Journal of Neuroscience 20, RC63 (2000)
16. Moore, T.J.: Voice communication jamming research. In: AGARD Conference Proceedings 331: Aural Communication in Aviation, vol. 2, pp. 1–6 (1981)
17. Weichselgartner, E., Sperling, G.: Dynamics of automatic and controlled visual attention. Science 238, 778–780 (1987)
18. Gitelman, D., Nobre, A., Parrish, T., LaBar, K., Kim, Y., Meyer, M., Mesulam, M.: A large-scale distributed network for covert spatial attention: Further anatomical delineation based on stringent behavioural and cognitive controls. Brain 122(6), 1093–1106 (1999)
19. Posner, M., Petersen, S.: The attention system of the human brain. Annual Review of Neuroscience 13, 25–42 (1990)
20. Coull, J.T.: Neural correlates of attention and arousal: insights from electrophysiology, functional neuroimaging and psychopharmacology. Progress in Neurobiology 55(4), 343–361 (1998)
21. Foxe, J., Simpson, G., Ahlfors, S.: Parieto-occipital ~10 Hz activity reflects anticipatory state of visual attention mechanisms. NeuroReport 9, 3929–3933 (1998)
On the Possibility about Performance Estimation Just before Beginning a Voluntary Motion Using Movement Related Cortical Potential

Satoshi Suzuki1, Takemi Matsui1, Yusuke Sakaguchi1, Kazuhiro Ando1, Nobuyuki Nishiuchi1, Toshimasa Yamazaki2, and Shin'ichi Fukuzumi3

1 Tokyo Metropolitan University, Asahigaoka 6-6, Hino, Tokyo 191-0065, Japan
2 Kyusyu Institute of Technology, Kawazu 680-4, Iizuka, Fukuoka 820-8502, Japan
3 NEC Common Platform Software Research Laboratories, Shibaura 2-11-5, Minato-ku, Tokyo 108-8557, Japan
[email protected]
Abstract. The present study aimed to investigate the tripartite relationship among MRCP as a physiological index, ballistic movement as an index of operation, and the accuracy of task performance. Experiments were conducted using a 'reaching' task, in which the subject touches with the forefinger a target that appears 300 pixels away from the start point in the vertical direction on a touch-sensitive screen. During the experiments, the EEG, the EMG (as a trigger), images from a high-speed camera and task performance were acquired. As a result, significant differences between the high- and low-performance groups were clear in the NS component of the MRCP acquired from Fz (p < 0.05), Cz (p < 0.05) and Pz (p < 0.05). Furthermore, a difference was confirmed in the duration of the ballistic movement. Based on our findings, we attempted to extract the MRCP rapidly and automatically without using signal averaging, and we discuss whether it is possible to estimate accuracy just before the motion is executed.
Keywords: Accuracy, ballistic movement, movement-related cortical potential (MRCP), reaching, voluntary motion.
to use as a method of medical diagnostics [6], although not all details have been fully clarified. Previous work has also shown that the thinking process and information processing within the brain are broken into several steps. Many useful models of this information processing in the head have been proposed [7, 8]. These models commonly have four steps: perception, cognition, motion planning and motion. Related to this model, the MRCP is generally assumed to reflect a preparation or planning stage for beginning the motion [9]. On the other hand, "reaching", a voluntary goal-directed movement, is known to be one of the most important components of motion by human arms. Mathematical models considering jerk, torque and the dynamics of the musculoskeletal system have been used to understand how the motion of reaching is planned and controlled in the brain [10, 11, 12]. Movement efficiency has also been studied, and Fitts' Law showed that movement time during reaching changes systematically with the difficulty of the task [13, 14]. During the motion process, reaching has two characteristics, the ballistic and corrective movements [15, 16], which influence both the accuracy and the duration of the motion. Observation of these characteristics enables us to comprehend the efficiency of the task performance. There is believed to be a relationship between the ballistic movement and the MRCP, as the former is a feed-forward movement and the MRCP reflects a planning stage for this motion. Building on this, a relationship has also been suggested between the MRCP, the ballistic movement and the accuracy of motion (Figure 1). Considering the tripartite relationship among MRCP as a physiological index, ballistic movement as an index of operation and accuracy of the task performance has meaning for the field of ergonomics.
[Figure 1 diagram: information processing from motion start through perception, cognition and controlled processing (MRCP components BP, IS and NS in the EEG) to action; feedforward ballistic movement and feedback corrective movement determine task performance (motion time and accuracy).]
Fig. 1. The concept of this research. The aim of this study is an investigation of the tripartite relationship, regarding MRCP as a physiological index, ballistic movement as an index of operation and accuracy of the task performance.
The present study aimed to investigate this tripartite relationship, regarding MRCP as a physiological index, ballistic movement as an index of operation and accuracy of the task performance, within the field of ergonomics. Based on our findings, we attempted to extract the MRCP rapidly and automatically without using signal averaging. We explain and consider the results of this extraction system and discuss whether it is possible to estimate accuracy just before the motion is executed.
2 Methods

2.1 Electroencephalogram

MRCP is generally observed at scalp positions Fz, Pz and Cz, on the midline of the frontal and parietal regions of the head. To achieve high spatial resolution around these regions, electroencephalogram (EEG) data were acquired using a 128-channel sensor net (Geodesic Sensor Net, Electrical Geodesics Inc., OR, USA) and an analysis system (Net Station 4.3, Electrical Geodesics Inc., OR, USA). Electrode impedance was set between 10 and 50 kΩ. EEG was recorded using a 0.1–50 Hz bandpass (3 dB attenuation). Signals were sampled at 1 kHz and digitized.

2.2 EMG and Trigger

The trajectory of voluntary motion was observed using a high-speed camera (Fastcam 512PCI, Photron Co., Tokyo, Japan) at 125 fps (Figure 2). The trigger signal was generated using the surface EMG on the common digital extensor muscle. Data from the EEG and the high-speed camera were clipped at each trial using triggers. EMG data were acquired using a bio-amp system (Biotop, NEC-Sanei Co., Tokyo, Japan) with a sampling rate of 10 kHz and were synchronized and analyzed in real time using a Field Programmable Gate Array (FPGA) module (PCI-7831, National Instruments Co., TX, USA).

2.3 Subjects and Task Procedure

Experiments were conducted on eight healthy male subjects ranging in age from 22 to 24 years old (average 22.88 +/- 0.83 years). All subjects were right-handed as confirmed by the Edinburgh handedness test [17]. Experiments were conducted in an electromagnetically shielded room using the following protocol: the subject touches the center of a 17-inch touch-sensitive screen (LCD-AD172F2-T, I/O Data Co., Tokyo, Japan) located 30 cm in front of them with their forefinger. A small cross-shaped target then appears 300 pixels away from the point previously touched, in the vertical direction (screen pixel pitch 0.264 × 0.264 mm). The subject moves their forefinger to the center of the displayed target and touches it. The trial is repeated a total of 50 times in two sets. Specific instructions regarding motion speed and accuracy during trial performance were not given to the subjects. The subject's head was placed on a jaw rest and the right forearm on an armrest to reduce artifacts other than motion of the forefinger.
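(The paper specifies only full-wave rectification of the EMG (PC1 in Fig. 2); how the trigger is derived from it is not described. A common implementation, sketched below in Python as an assumption rather than the authors' method, thresholds the rectified, smoothed EMG against a resting baseline; the window length, threshold factor and baseline period are arbitrary choices.)

import numpy as np

def emg_onset(emg, fs=10_000, win_ms=20, k=5.0, baseline_ms=500):
    # Return the first sample where the rectified, moving-average-smoothed
    # EMG exceeds k standard deviations above the resting baseline.
    # fs matches the 10 kHz EMG sampling rate; the rest are assumptions.
    rect = np.abs(emg - emg.mean())                           # full-wave rectification
    win = int(fs * win_ms / 1000)
    env = np.convolve(rect, np.ones(win) / win, mode="same")  # smoothed envelope
    base = env[: int(fs * baseline_ms / 1000)]                # assume the trial starts at rest
    thresh = base.mean() + k * base.std()
    above = np.flatnonzero(env > thresh)
    return int(above[0]) if above.size else None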
2.4 Analysis

MRCP and high-speed camera data were clipped in each trial by a trigger signal generated using the surface EMG on the common digital extensor muscle. EEG data were clipped from 1500 msec before the start of the movement to 500 msec afterward, in accordance with previous studies. The gap from the center of the target to the position actually touched was used as an evaluation index for the accuracy of the voluntary movement.
Fig. 2. Block diagram and data flow. Experiments were conducted on eight healthy male subjects using EEG, high-speed camera, EMG and task systems.
Fig. 3. Analytical process. MRCP and high-speed camera data were clipped in each trial by a trigger signal generated using the surface EMG. Based on the gap average of task performance, the clipped EEG and high-speed camera data from each trial were separated into two groups corresponding to the ‘A’ and ‘B’ groups. The typical waveforms in each group were calculated by averaging and compared with each other.
Based on the gap average of task performance, we divided the performance data from each trial into two groups: a high-performance group 'A' and a low-performance group 'B'. The clipped EEG and high-speed camera data from each trial were separated into the corresponding 'A' and 'B' groups. The typical waveforms in each group were calculated by averaging and compared with each other (Figure 3).
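(In outline, this epoching and grouping step reduces to clipping a window around each EMG trigger and splitting trials at the mean gap. The Python sketch below restates Fig. 3 in code; the array shapes and names are illustrative assumptions, not the authors' implementation.)

import numpy as np

def clip_and_group(eeg, triggers, gaps, fs=1000):
    # eeg: (n_channels, n_samples); triggers: EMG-onset sample indices;
    # gaps: touch error in pixels, one per trial. Clips -1500..+500 ms
    # around each trigger and splits trials at the mean gap.
    pre, post = int(1.5 * fs), int(0.5 * fs)
    trials = np.stack([eeg[:, t - pre: t + post] for t in triggers])
    gaps = np.asarray(gaps)
    group_a = trials[gaps < gaps.mean()]       # high performance: small gap
    group_b = trials[gaps >= gaps.mean()]      # low performance: large gap
    # typical waveforms per group, obtained by averaging across trials
    return group_a.mean(axis=0), group_b.mean(axis=0)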
3 Results

3.1 Task Performance

Figure 4 shows the distribution of gaps in each trial for one subject (S1). The distribution resembled a gamma distribution with an average gap of 5 pixels. The same trend in the distribution shape was seen in all subjects. In subject S1's case, EEG signals corresponding to trials where the gap was less than 5 pixels were placed in group A and those over 5 pixels were placed in group B. Typical waveforms corresponding to the two groups were calculated by averaging. High-speed camera data were also divided into two groups.

3.2 MRCP

Figures 5 a) and 5 b) show samples of typical MRCP waveforms corresponding to the two groups acquired from Fz and Cz for the same subject S1 shown in Figure 4. Differences between the BPs (observed from 2000 msec before movement) and ISs (observed from 900 to 500 msec before movement) of groups A and B were not obvious. However, differences in the NSs (observed from 500 msec before movement) were clearly observed between the two groups, with group A showing a steeper slope than group B. This trend was confirmed around Fz, Cz, and Pz in all subjects (Figure 6), and the difference in average values between the two groups was found to be significant (Fz: p < 0.05 (p = 0.019), Cz: p < 0.05 (p = 0.050), Pz: p < 0.05 (p = 0.017)). This suggests that there is a relationship between the performance of the arm and the NS slope.
Fig. 4. Sample performance data (Subject S1). The distribution resembled a gamma distribution with an average gap of 5 pixels in this subject's case.
Fig. 5. Sample data of MRCP (Subject 1). Typical MRCP waveforms corresponding to the two groups acquired from Fz (Fig. 5 a)) and Cz (Fig. 5 b)) for the same subject S1 shown in Figure 4.
Fig. 6. Comparison of NS slopes in 2 groups (All subjects). Differences in NS (observed from 500 msec before movement) were clearly observed between the two groups, with group A showing a steeper slope than group B.
3.3 Process of the Reaching Motion

Figure 7 shows the peak movement distance of the forefinger. A small difference in ballistic movement between groups A and B was confirmed, but we could not confirm a difference in corrective movement. This is probably because the experimental task was simple and primitive, so only a small part of the movement needed to be allocated to corrective adjustment.
Fig. 7. Comparison of movement processes in 2 groups (All subjects). A small difference in ballistic movement between groups A and B was confirmed.
4 Discussion and Conclusion

In the present study, we attempted to confirm the tripartite relationship between MRCP, ballistic movement and accuracy of task performance. We could not confirm any difference between the high-performance group 'A' and the low-performance group 'B' in the BP and IS components of the MRCP acquired from Cz and Fz. However, significant differences between the groups were clear in the NS component of the MRCP acquired from Cz and Fz. Furthermore, a difference was confirmed in the duration of the ballistic motion. As the NS is generally believed to represent a preparation stage for voluntary motion, it appears that the observed difference between groups influences the process of motion and the task performance. These results show the possibility of estimating the performance just before beginning the voluntary motion using the MRCP.

On the other hand, the motion of reaching is nearly optimized in terms of smoothness over the entire movement in studies using mathematical models. Various optimality criteria have been proposed for trajectory planning of this motion in terms of multi-joint arm movements, like human arms. The minimum-jerk criterion [10] plans smooth trajectories in the extrinsic task space, while the minimum-joint-angle-jerk criterion, the minimum-torque-change criterion [11], and the minimum-motor-command-change criterion [12] plan smooth trajectories in the intrinsic body space. These models and criteria are discussed under the assumption that humans plan the trajectory of arm movement before beginning the motion. However, Harris & Wolpert [18] pointed out that it is difficult to explain the biological relevance of factors such as jerk or torque change in previous models of arm trajectory. They showed that the movement can be achieved by reducing the variance of errors at the end of the movement. The final goal of the reaching movement is to minimize the gap at the end of the motion, as suggested by their minimum-variance theory. Although it is not clear how the cerebrum and cerebellum contribute to achieving this movement, this model is an effective way of considering it. If we place our own results in the context of this model, the concept is validated, as it is known that humans plan and learn the trajectory of reaching with optimal efficiency just before beginning the motion.

Finally, in the current study, we attempted to develop a prototype system to derive the NS slope in the MRCP from EEG data automatically and in real time. When event-related potentials (ERPs) such as the N200 and P300 are observed, signal averaging is generally used, not only to remove artifacts but also to extract more information, i.e., latency, amplitude, the shape of the waveform and so on. However, in the case of the MRCP, we only need to observe slopes during the 500 ms just before the motion is executed, namely, just before the trigger. We are therefore developing a prototype system that keeps a sequential memory of the EEG over 2000 ms and sequentially calculates the NS slope over the most recent 500 ms. This prototype system is currently being developed using LabVIEW (National Instruments Co., TX, USA), and we have already confirmed that the performance can be estimated correctly up to 75% of the time using an index based on NS values. Thus, in the future, our concept and method appear promising for use in safety-related device control, such as car driving.
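(A minimal sketch of this kind of sliding-window slope estimator, written in Python as our interpretation rather than the authors' LabVIEW implementation: keep the last 2000 ms of one EEG channel in a ring buffer and continuously fit a line to the most recent 500 ms.)

from collections import deque
import numpy as np

class NSSlopeEstimator:
    # Keeps a 2000 ms buffer of single-channel EEG (e.g. Cz) and reports the
    # least-squares slope over the most recent 500 ms, where the NS component
    # is expected just before movement onset.
    def __init__(self, fs=1000):
        self.fs = fs
        self.buf = deque(maxlen=2 * fs)          # 2000 ms of samples at fs Hz

    def push(self, sample):                      # call once per incoming sample
        self.buf.append(sample)

    def ns_slope(self):
        n = self.fs // 2                         # last 500 ms
        if len(self.buf) < n:
            return None
        y = np.asarray(self.buf, dtype=float)[-n:]
        t = np.arange(n) / self.fs               # time axis in seconds
        return float(np.polyfit(t, y, 1)[0])     # slope (e.g. uV/sec if y is in uV)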
References

1. Kornhuber, H.H., Deecke, L.: Hirnpotentialänderungen bei Willkürbewegungen und passiven Bewegungen des Menschen: Bereitschaftspotential und reafferente Potentiale. Pflügers Archiv für die gesamte Physiologie des Menschen und der Tiere 284, 1–17 (1965)
2. Barrett, G., Shibasaki, H., Neshige, R.: A computer-assisted method for averaging movement-related cortical potentials with respect to EMG onset. Electroencephalogr. Clin. Neurophysiol. 60(3), 276–281 (1985)
3. Shibasaki, H., Hallett, M.: What is the Bereitschaftspotential? Clin. Neurophysiol. 117(11), 2341–2356 (2006)
4. MacKay, D.M., MacKay, V.: Behind the eye. Basil Blackwell, Malden (1991)
5. Yamamoto, J., Ikeda, A., Satow, T., et al.: Human eye fields in the frontal lobe as studied by epicortical recording of movement-related cortical potentials. Brain 127, 873–887 (2004)
6. Barrett, G., et al.: Cortical potential shifts preceding voluntary movements are normal in parkinsonism. Electroencephalogr. Clin. Neurophysiol. 63, 340–348 (1986)
7. Card, S.K., Moran, T.P., Newell, A.: The psychology of human-computer interaction. Lawrence Erlbaum Associates, New Jersey (1983)
8. Gopher, D., Sanders, A.F.: S-Oh-R: Oh Stages! Oh Resources! In: Prinz, W., Sanders, A.F. (eds.) Cognition and motor processes. Springer, Heidelberg (1984)
9. Revelle, W.: Individual differences in personality and motivation: 'non-cognitive' determinants of cognitive performance. In: Baddeley, A., Weiskrantz, L. (eds.) Awareness Control. Clarendon Press, Oxford (1991)
10. Flash, T., Hogan, N.: The Coordination of Arm Movements: An Experimentally Confirmed Mathematical Model. Journal of Neuroscience 5, 1688–1703 (1985)
11. Uno, Y., Kawato, M., Suzuki, R.: Formation and control of optimal trajectory in human multijoint arm movement - Minimum torque-change model. Biological Cybernetics 61(2), 89–101 (1989)
12. Kawato, M.: Optimization and learning in neural networks for formation and control of coordinated movement. In: Meyer, D., Kornblum, S. (eds.) Attention and Performance XIV, pp. 821–849. MIT Press, Cambridge (1993)
13. Fitts, P.M., Peterson, J.R.: Information capacity of discrete motor responses. Journal of Experimental Psychology 67, 103–112 (1964)
14. Accot, J., Zhai, S.: Beyond Fitts' Law: Models for trajectory-based HCI tasks. In: CHI 1997, pp. 295–302 (1997)
15. Flowers, K.: Ballistic and corrective movement on an aiming task. Neurology 25, 413–421 (1975)
16. Pratt, J., Abrams, R.A.: Practice and component submovements: The roles of programming and feedback in rapid aimed limb movements. Journal of Motor Behavior 28(2), 149–156 (1996)
17. Oldfield, R.C.: The assessment and analysis of handedness: The Edinburgh Inventory. Neuropsychologia 9, 97–113 (1971)
18. Harris, C.M., Wolpert, D.M.: Signal-dependent noise determines motor planning. Nature 394, 780–784 (1998)
A Usability Evaluation Method Applying AHP and Treemap Techniques Toshiyuki Asahi, Teruya Ikegami, and Shin’ichi Fukuzumi Common Platform Software Research Laboratories, NEC Corporation, 8916-47, Takayama-Cho, Ikoma, Nara 630-0101, Japan [email protected]
Abstract. This report proposes a visualization technique for checklist-based usability quantification methods. By applying the Treemap method, the hierarchical structure of checklists, the weights of check items and the evaluation results for target systems can be viewed at a glance. Effective support for usability analysis and for the presentation of usability evaluation results is expected. A prototype tool was implemented on a PC and experimental studies assuming actual usability evaluation tasks were conducted. The results indicate that the proposed method improves the performance time of some typical tasks. Usability engineers gave higher subjective scores to the usefulness of the proposed method than to that of a printed table presentation.
it was still hard to read out which check items contribute to (or lower) the synthetic scores, and by how much, since the detailed check items below the second hierarchy (88 items) are not presented on the graph. In this paper, we propose to apply the well-known Treemap technique to visualize the evaluation results of checklist-based heuristics. One of the quantification techniques using checklists is explained first; then the Treemap visualization techniques for that method are demonstrated with examples from the PC tool prototype. Section 4 describes the procedure and the results of the experimental study validating the effectiveness of the proposed method. Finally, we discuss the result and its implications for future study.
2 Checklist-Based Usability Quantification

Treemap visualization was applied to a checklist-based heuristic method developed by Ikegami et al. [4]. The checklist consists of 126 check items categorized into five sections and 18 sub-sections, which were extracted and arranged from various user interface (UI) guidelines, ISO standards [5][6] and consultation know-how. Each check item is described from the viewpoint of UI components or system functions, such as "Are titles attached to each window?" or "Are substitutive operations provided for double-click operations?" However, the usability score is expected to be given from user viewpoints (e.g. learnability, memorability) when it is utilized in usability testing or benchmarking. Therefore, a weighting value was given to each checklist item for selected user viewpoints. Several techniques are known for weight calculation, such as expert opinion, task usage frequency or task importance, the KANO model, the entropy model, the geometric mean of pairwise comparison results in AHP (Analytic Hierarchy Process), and the number of problems collected [7]. Ikegami et al. adopted the AHP method considering reliability and execution cost: a few usability experts applied a pairwise comparison method to give weighting value sets for every viewpoint. Thus, four weighting value sets, which correspond to "learnability," "memorability," "efficiency" and "low error rate" (selected by referring to [8]), were given to the check items, their sections and sub-sections (Fig. 1).
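(For reference, the geometric-mean step of AHP mentioned above can be written compactly. The Python sketch assumes a reciprocal pairwise-comparison matrix as input; the example matrix is invented and is not taken from the actual checklist.)

import numpy as np

def ahp_weights(pairwise):
    # Approximate AHP priority weights as the normalized geometric means
    # of the rows of a reciprocal pairwise-comparison matrix.
    m = np.asarray(pairwise, dtype=float)
    gm = m.prod(axis=1) ** (1.0 / m.shape[1])
    return gm / gm.sum()

# Invented example: item 1 judged 3x as important as item 2, 5x as item 3
A = [[1, 3, 5],
     [1 / 3, 1, 2],
     [1 / 5, 1 / 2, 1]]
print(ahp_weights(A))        # approx. [0.65, 0.23, 0.12]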
3 Usability Visualization Using Treemap

In order to visualize the evaluation result of the checklist effectively, we think the major requirements are:
1. Giving the overall view of the checklist including its hierarchical structure, and
2. Displaying detailed information such as each check item's weight and checking result.
In other words, both structural and quantitative information, and both overall and detailed information, should be displayed in one view in a form easy to understand. We assumed the Treemap method [9] developed at the University of Maryland would fill these requirements and tried to apply it to the checklist heuristics. The following parts of this section describe detailed techniques of the Treemap implementation.
Fig. 1. Outline of checklist weighting method [4]
3.1 Visualize the Checklist

Treemap techniques were applied to the checklist introduced in Section 2. Figure 2 shows examples of checklist visualization, (a) without weights (assuming all sibling items or sections have equal weighting values), and (b) with the weight set given for the "efficiency" viewpoint. The "slice and dice" and "offset" techniques introduced in reference [9] were adapted to utilize the display area effectively and to present the check item structure and titles clearly (full titles of check items appear by a simple mouse-over operation). The structure and weight distribution can be viewed in a given display area even for a checklist consisting of more than 100 items. By providing four sets of weighting values and the corresponding maps, each of which represents one of the four viewpoints, the entire checklist can be visualized. Although it would be quite easy to merge them into one map by adding one more level of hierarchy, we chose to show them separately because the merged Treemap looked too busy and there seemed to be little need for viewing the four viewpoints simultaneously.
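(The "slice and dice" layout referenced above is easy to state in code: recursively split a rectangle along alternating axes in proportion to the weights. The Python sketch below is a generic reimplementation of the classic algorithm [9], not the code of our tool, and it omits the "offset" insets used for titles.)

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    # node: dict with 'name', 'weight' and optional 'children'.
    # Returns a list of (name, x, y, width, height) rectangles.
    out = [] if out is None else out
    out.append((node["name"], x, y, w, h))
    kids = node.get("children", [])
    total = sum(k["weight"] for k in kids)
    offset = 0.0
    for k in kids:
        frac = k["weight"] / total
        if depth % 2 == 0:   # "slice" along x at even depths
            slice_and_dice(k, x + offset * w, y, frac * w, h, depth + 1, out)
        else:                # "dice" along y at odd depths
            slice_and_dice(k, x, y + offset * h, w, frac * h, depth + 1, out)
        offset += frac
    return out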
Fig. 2. Checklist visualization: (a) no weights; (b) with a set of weights for "Efficiency"
3.2 Visualize the Checking Result

Two large-scale Web application systems (tentatively named "A" and "B") were evaluated along the checklist, and the result was displayed as shown in Fig. 3. The preceding study [10] and Ikegami [4] proposed doing checklist evaluation with a yes (meaning the target user interface satisfies the corresponding check item) or no judgment for every check item in order to minimize the effect of individual differences. In the example of Fig. 3, colored areas (yellow for A, red for B) mean "yes" and black areas represent "no" judgments. The total scores, which are the summations of the colored areas, are displayed as the bar chart below the map. Intuitively, it is quite easy to read out which check items have a bigger influence on the total score, or why system B gets a higher score than A from the viewpoint of efficiency, for instance.
(Legend: black means "no" for A or B; yellow means A satisfied the check item; red means B satisfied the item.)
Fig. 3. Visualization of checking result with a bar chart
On the other hand, it is hard to explore the checking results for the low-weighted items. (However, this problem has been partly solved by implementing the zooming function, which enables zooming any node (rectangle) out to the base rectangle area with its subsidiary nodes.)
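The total score in the bar chart is, in effect, the sum of the weights of the check items judged "yes". A short Python sketch of this assumed scoring form follows; the actual tool may normalize differently, and the weights and judgments are invented.

def usability_score(weights, judgments):
    # weights: AHP weight per check item for one viewpoint;
    # judgments: True ("yes") / False ("no") per item for one system.
    return sum(w for w, ok in zip(weights, judgments) if ok)

w = [0.5, 0.3, 0.2]                              # invented weights
print(usability_score(w, [True, False, True]))   # system A -> 0.7
print(usability_score(w, [True, True, False]))   # system B -> 0.8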
4 Experimental Study

In the previous sections, Treemap visualization for a checklist-based quantification method was proposed. The simulated map seems to be useful for usability analysis tasks and for reporting their results. In this section, we try to validate its usefulness experimentally, assuming actual data analysis tasks in usability consultation.

4.1 Experimental Design

We assumed two kinds of situations in which Treemap visualization would be helpful: 1) usability engineers analyze the checking result, and roughly estimate the usability level of the target systems or detect where they should be changed for effective usability improvement, and 2) usability engineers or consultants explain the analysis result to developers of the targeted systems or clients of the consultation. The tasks in the experiment were designed assuming situation (1).

Outline. The Treemap on the PC screen (such as Fig. 2) and the table forms printed on paper sheets, both of which represent the checklist items, their weighting values, and the checking results for the two systems (A, B), were presented to subjects. Each subject executed tasks assuming situation (1). Task completion time was measured by an experimenter along with the correctness of the answers. Subjective questionnaires were administered after completing all tasks.

Table 1. Experimental tasks (task set TA)
Task | Task description | Meaning of task
a | From the "learnability" viewpoint, what is the most significant category? | Understand which category is supposed significant in each viewpoint.
b | Select all viewpoints in which system A's total score is higher than that of B. | Grasp target systems' usability features roughly.
c | From the "low error rate" viewpoint, what is the most important check item? | Understand which check item is most influential on the final score.
d | From the "low error rate" viewpoint, what is the least important check item? | Estimate which check item is not so critical.
e | From the "memorability" viewpoint, which system's score is higher? | Compare overall scores among target systems.
f | From the "memorability" viewpoint, which check item should be improved to raise system B's score most effectively? | Identify where to improve for raising the usability score most effectively.
g | Count up the number of check items in sub-category "data output," where only B is OK. | Compare system usability for a certain (restricted) usability aspect.
Participants. Nine usability engineers, aged from 20 to 40 years old and with usability testing or consulting experience, participated in the experiment. They were asked to do the tasks supposing they were in a consulting situation.

Tasks. Seven simple tasks supposed to be typical in actual usability analysis work were selected, as shown in Table 1. In addition to the seven tasks (task set "TA"), similar tasks (doing the same thing using another data viewpoint) a' - g' were also prepared as task set "TB." Each subject tried TA first, then tried TB. Half of them used table data for TA and Treemap data for TB, and the other half executed TA with the Treemap data and TB with the tables. This order is considered to minimize the training effect possibly appearing in the performance time. Just after completing TA and TB, all subjects were asked to answer the seven subjective questionnaire items. All subjects completed all tasks and questionnaire responses in 40-60 min without any serious problem.

4.2 Results

Figure 4 shows the mean and standard deviation of task completion time and the number of correctly executed tasks. In five of the seven tasks (all except d and e), completion times on the map appear shorter than those on the table data when simply comparing mean values. ANOVA shows a significant difference in task c (t=2.67, P<0.02) and in task f (t=4.94, P<0.01).
[Fig. 4 data: mean task completion time in seconds (0–80 scale) per task a–g for the printed table vs. the Treemap, with counts of correctly and incorrectly executed tasks per task and condition; correct counts ranged from 6 to 9 of the 9 participants.]
Fig. 4. Task completion times and numbers of tasks correctly executed or not
Figure 5 shows the results of the subjective ratings for the seven questionnaire items. As for the overall impression, participants (usability engineers) gave higher scores to the table form for "simplicity," to the map form for "comfortableness," and the same score for "easiness to understand." As for the readability of check item weights, higher scores were given to the Treemap. (As for the check item structure, the same scores were given.) Also, participants tended to feel the Treemap was more useful for both usability analysis and presentation tasks.

4.3 Discussion of Experimental Result

From this experiment alone, we could not reach a clear conclusion because the number of subjects was not sufficient for strict statistical analysis. One of the reasons was that participants were screened according to their experience as usability engineers. However, some implications or tendencies about the usefulness of the proposed method can be extracted. The experimental results indicate the Treemap presentation should:

…be useful in data analysis tasks for usability consultation. We are paying attention to the result that significant task performance improvement was seen in tasks c and f. These tasks are to "select the most important check item" (c) and "pinpoint the check item for improving the overall score most effectively" (f). In many cases, usability engineers need to examine heavily weighted check items prior to others for improving target system usability effectively and promptly. Tasks c and f were designed with the intention of checking the adaptability to this requirement. The effectiveness and usefulness of the Treemap method are highly expected since significant improvement was observed in both tasks.

…not be suitable for examining low-weight check items. Although it was not statistically significant, performance on the tables was better than that on the Treemap (approximately 10% shorter mean time) in task d (select the lowest-weight check item). When there are a lot of check items, weights are sometimes set smaller than 1/1000. In the Treemap representation, where the weights are displayed as areas of rectangles, it is often hard to read out these areas exactly. In real situations, there are not many cases that require elaborate examination of low-weight check items, and Treemap may not support such tasks sufficiently. (If engineers are accustomed to using the zooming function, though, the task completion time will be greatly improved.)

…not have clear advantages in "rough examination" tasks. As for tasks a, b, e and g, a significant difference in task completion time was not observed, contrary to the authors' expectations. These tasks include roughly comparing target system characteristics, such as selecting viewpoints in which system A's score is higher. In these tasks, participants did not have to search for items by comparing weighting values, and they could complete the task easily just by reading numerical values in the tables.

…be welcomed by usability engineers. Participants gave higher subjective scores to the Treemap presentation for the two questionnaire items about usefulness. Taking into consideration that every participant was a beginner in using Treemap, this indicates their expectations of Treemap are considerably high. More experience with Treemap and additional software functions that support usability analysis tasks will raise the subjective rating further.
[Fig. 5 rating scales: five-point scales per questionnaire item (overall impression: very easy to understand to very hard to understand, very simple to very confusing, very comfortable to very painful; structure of check items and weights of check items: very easy to understand to very hard to understand; usefulness for usability analysis and for presentation to developers or clients: very useful to quite useless), comparing the printed table and the Treemap.]
Fig. 5. Result of subjective questionnaires
5 Concluding Remarks and Future Work

In this report, a method applying Treemap visualization to checklist-based heuristics was proposed. In some tasks that were intended to simulate usability analysis, an improvement of task completion time was observed. Subjective ratings of usefulness for both data analysis and presentation tasks were higher than for the table form presentation. Users' (usability engineers') experience and additional software functions will raise this score further. We think the following three issues should be overcome to make the checklist-based quantification method practical and widespread:
1. Provide theoretical and reliable bases for the quantification method.
2. Enable objective evaluation by eliminating/minimizing the score difference caused by individual skills or impressions.
3. Provide practical and useful tools for evaluation and analysis tasks.
As for issue (2), Ikegami has claimed it can be achieved to some extent by designing check items and their terms elaborately and tuning them iteratively through experimental studies [4]. The proposed method was intended to contribute to resolving issue (3), and some effect was shown in the experimental studies. Of course, we will need many more functions and tools to support real usability analysis tasks. We are considering a tool for weight assignment that uses the Treemap as a data input tool [11]. In order to create breakthroughs on issue (1), we need to present scientific bases for quantification, but this is hard to accomplish with short-term research. Although preceding studies [2][3][4][10] have tried to add reliability by adopting well-used guidelines or regulations, they have not become widespread as established methods. We think we need to ensure the validity of the method by developing or refining user cognitive/behavioral models for checklist-based quantification, as with the GOMS and KLM models for performance prediction with CogTool. Both scientific research and practical field activities should be merged harmoniously to develop reliable and systematic methodologies.
References

1. Hirasawa, N.: Usability Engineering for Software Development. IPSJ Magazine 44(2), 136–144 (2003)
2. Smith, S.L., Mosier, J.N.: A Design Evaluation Checklist for User-System Interface Software. Technical Report ESD-TR-84-358, MITRE, MA (1984)
3. Yamada, S., Tsuchiya, K.: A Study of Usability Evaluation of PC – Discussion of Model on PCs. The Japanese Journal of Ergonomics 32, 350–351 (1996)
4. Ikegame, T., Okada, H.: Toward Quantification of Usability. NEC Technical Journal 3(2), 53–56 (2008)
5. ISO 9241-10: Ergonomic requirements for office work with VDTs – Dialog principles (1996)
6. ISO 9241-12: Ergonomic requirements for office work with VDTs – Presentation of information (1998)
7. Ham, D.-H., Heo, J., Fossick, P., Wong, W., Park, S.-H., Song, C., Bradley, M.: Model-based approaches to quantifying the usability of mobile phones. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 288–297. Springer, Heidelberg (2007)
8. Nielsen, J.: Usability Engineering. Academic Press, London (1993)
9. Johnson, B., Shneiderman, B.: Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. In: Proc. of the 2nd Conference on Visualization 1991, pp. 284–291 (1991)
10. Kato, S., Horie, K., Ogawa, K., Kimura, S.: A Human Interface Design Checklist and Its Effectiveness. Transaction of Information Processing Society of Japan 36(1), 61–69 (1995)
11. Asahi, T., Turo, D., Shneiderman, B.: Using Treemaps to Visualize the Analytic Hierarchy Process. Information Systems Research 6(4), 357–375 (1995)
Evaluation of User-Interfaces for Mobile Application Development Environments Florence Balagtas-Fernandez and Heinrich Hussmann Media Informatics Group, Department of Computer Science, University of Munich, Germany {florence.balagtas,heinrich.hussmann}@ifi.lmu.de
Abstract. This paper discusses the different user interfaces of mobile development and modeling environments in order to extract important details of how the user interfaces for such environments are designed. The goal of studying such environments is to come up with a simple interface that helps people with little or no experience in programming develop their own mobile applications through modeling. The aim of this research is to find ways to present the user interface in a clear manner such that the balance between ease-of-use and ease-of-learning is achieved.
1 Introduction

Nowadays, the development of software applications is no longer bounded within the confines of people with programming skills. People are no longer limited to just being end-users of an application, but are encouraged to be the creators of their own applications as well. An example of this is the growth of the World Wide Web and how the creation of web pages is no longer restricted to people who have skills in writing HTML code and scripts. The introduction of WYSIWYG HTML editors such as Microsoft FrontPage and Google Page Editor has made this possible. Hiding the HTML code in the background and allowing components to be dragged and dropped onto a page makes it easy for novices to create their own web pages. The same thing is happening now in the mobile industry. Mobile phone users are no longer limited to using pre-installed applications on their devices or buying ready-made mobile applications for their personal purposes. People now have the power to create their own applications given the right motivation, creativity, skills and tools. Mobile phone companies and organizations have now opened up their application programming interfaces (APIs), which allow anyone to develop applications for their mobile devices. Examples of these are the Java Platform Micro Edition (Java ME) API from Sun Microsystems, the Android API from the Open Handset Alliance and the iPhone API from Apple. However, even though many users may have ideas for novel applications for mobile phones, software development is simply too difficult for most people. It takes a large amount of skill and familiarity with how
the framework is used before a person can create a decent amount of code for even a simple application. Even setting up the programming environment is a complex task, let alone figuring out how to use the APIs and compiling, running and deploying the application on the actual device. Other things that make developing applications for mobile devices more difficult compared to desktop applications are factors such as device limitations (e.g. screen size, computing power, power consumption) [4], the different operating systems for mobile devices, different data representations and additional device capabilities (e.g. Bluetooth, WiFi, GPS, camera) which are not standard on all devices and therefore must be considered when developing a uniform application that can run on different mobile devices. In this research, we are investigating ways to make application development accessible to people with little or no programming skill. We propose applying model-driven development (MDD), which is an approach to creating complex software systems by first creating a high-level, platform-independent model of the system, and then generating specific code based on the model for the target platform [5]. In ordinary software development, models are just thought of as tools for capturing system requirements and for documentation purposes; in MDD, however, the models are actually part of the implementation of the system. The basic idea of our work is to come up with a modeling environment that is specific to modeling mobile applications and that targets non-experts as the main users. Non-expert users here are defined to be people who have little or no experience in programming for mobile platforms. We want to present to the user one application that they can use to model their mobile applications without having to worry about low-level coding. In order to do this, we are developing a tool called Mobile Applications Modeler (Mobia). The focus of discussion in this paper is the design of the user interface for the Mobia modeling environment. The aim here is to find out which user interface design concepts are most suitable for non-expert users to develop their own mobile applications with ease. The goal is to present the interface such that the balance between ease-of-use and ease-of-learning [10] is achieved. We have focused on non-expert users in this research and do not include expert users in general, since these two types of users often differ in their experiences and needs [6]. Unlike existing modeling tools such as MagicDraw and Eclipse with the Eclipse Modeling Framework (EMF) plugins, which are more general-purpose modeling tools, we want to present to the user a domain-specific modeling tool that specializes only in modeling mobile applications. The focus of this part of our research is on how to present the user interface of the Mobia modeler such that it is easy to use for non-expert users. The rest of the paper is organized as follows: Section 2 discusses research related to our work, particularly in the area of model-driven development. In Section 3 we discuss the different user interfaces of existing development and modeling tools that are the basis of some of our designs. The remainder of the paper discusses our approach, namely the design of our prototypes and the evaluation results.
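To make the MDD idea concrete, consider a toy transformation in Python: a platform-independent description of a screen is turned into platform-specific markup by a generator. Everything below, the model format and the Android-style output alike, is an invented miniature for illustration, not the Mobia tool.

# Platform-independent model of one screen (invented miniature "DSL")
model = {"screen": "Login",
         "widgets": [{"type": "label", "text": "Name"},
                     {"type": "input", "id": "name"},
                     {"type": "button", "text": "OK"}]}

# Per-platform templates; a real tool chain would have one set per target
ANDROID = {"label": '<TextView android:text="{text}"/>',
           "input": '<EditText android:id="@+id/{id}"/>',
           "button": '<Button android:text="{text}"/>'}

def generate_android_layout(model):
    # The transformation step of MDD: abstract model -> concrete artifact
    rows = [ANDROID[w["type"]].format(**w) for w in model["widgets"]]
    return "<LinearLayout>\n  " + "\n  ".join(rows) + "\n</LinearLayout>"

print(generate_android_layout(model))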
2 Related Work

Integrated development environments (IDEs) are tools that are made to ease the application development process. Most IDEs provide an environment that features a text editor, compiler, debugger and simulator, to name a few, all integrated into one application. They have evolved throughout the years, adding more features (e.g. GUI designer, version control, etc.) that help developers accomplish their tasks in the most efficient way. For mobile application development in particular, examples of IDEs that allow plugins for mobile application development are the Netbeans, Eclipse and Xcode development environments. A problem with IDEs for mobile application development, though, is that different mobile phones have different application programming interfaces and platforms. Thus, creating a common application that would run on different mobile phone platforms tends to get tedious and redundant, since developers have to write different code for each of them. One solution to this problem is to apply the model-driven approach, in which models are used to describe the application and, through transformation tools, these models are transformed into code that runs on specific platforms [5]. An example of research that applies MDD is the Multimedia Modeling Language (MML), a platform-independent language used for the model-driven development of multimedia applications [7]. MML models are transformed into Flash models which can then be loaded into the Flash authoring tool for further completion of the application [8]. This approach [7,8] is usually for teams wherein graphic designers and software designers need to work together on a certain project. Each group of users has its own expertise in terms of skills and tools. However, when non-expert users are involved in the development process, this approach can be quite complicated. Extensive knowledge of how to make UML models is necessary in order to create applications using this approach, and since the tools are not yet integrated, mastery in using these tools is a must [8]. Dunkel and Bruns [3] also present a model-driven way of producing business applications for mobile devices, with BAMOS (Base Architecture for Mobile Applications in Spontaneous networks) as the target platform. Their models are expressed in UML activity diagrams to specify control flow, together with a description of mobile services through a DSL they have defined using UML profiles. As with MML, the approach uses different tools which are not yet integrated [3]. Another research project that applies the MDD process and targets non-experts as the primary developers is the Simple Mobile Services (SMS) project. This project aims to create service authoring tools and mobile services that are simple to use, find and set up [9]. It focuses on non-expert users as the people assembling these mobile services on their own. SMS applies the MDA [5] approach in building its services [2]. Our approach is similar to SMS in that we target non-expert users as the main users of our tool for developing mobile applications. However, while SMS focuses on mobile web-based services, our research focuses on mobile-based applications. In the next section, we will discuss the different user-interface components present in various development and modeling environments. We want to find out which existing approaches in the UI, and some new ones, are most suitable for non-experts.
3 A Closer Look into the User Interfaces
In this section, we compare the user interfaces of existing IDEs that support mobile application development (Netbeans and Eclipse) and of a modeling tool (MetaEdit+) that supports domain-specific modeling of mobile applications. We want to explore what features these tools have, and which of these features are essential parts of an environment from which non-experts can benefit.
Fig. 1. General Parts of a Development/Modeling Environment
In studying these tools, we have identified five basic areas that are usually present in such environments. For the purpose of discussion in this paper, we attach a general name to each area, which may or may not match how it is labeled in a given environment. Fig. 1 shows the typical default location of the main areas and their names, and Table 1 lists the areas together with some of their possible contents.

Table 1. The different areas and their possible contents

Navigation/Browsing Area: the different components in a development project (e.g. files and folders, classes and packages)
Main/Central Area: the component on which the user is currently working (e.g. source code, a user-interface design, a data source)
Palette/Properties Area: components that can be dragged and dropped onto the main/central area (e.g. UI components, datasets)
Toolbar Area: button controls (e.g. run, debug) and editing controls (e.g. copy, paste)
Output Area: program output, compiler errors, debugging messages, etc.
Fig. 2 shows an overview of the Netbeans 6.5 environment. The components described in our general UI model for a development environment are all present in Netbeans. One additional feature of Netbeans is the ability to switch between different views in the main area, depending on what the user is focusing on. The source view allows the user to make changes to the source code; the screen view allows drag-and-drop design of the mobile application's user interface; the flow view allows adding logic to the program by dragging flow arrows between the different
screens; and the analyzer view shows unused resources and MIDP compliance. Switching between the views changes the contents of the palette area, depending on what components are needed in that view. Netbeans also has the ability to bind a screen component's data to information taken from a database.
Fig. 2. Overview of the Netbeans Environment
The next IDE interface we discuss is the Eclipse IDE. There are several projects that aim to develop plugins for Eclipse to allow mobile application development (e.g. EclipseME, the Eclipse plugin for Android). For the purpose of this paper, we focus on analyzing the interface of the Eclipse IDE as used for developing Android applications, since the basic components of the IDE are similar in any case. As shown in Fig. 3, the positioning of the components in the environment is similar to that of Netbeans. However, the features offered by the Eclipse environment are just a subset of those in Netbeans. At the time of writing, it does not feature a drag-and-drop GUI environment for developing Android applications; instead, the GUI is built by editing an XML file that specifies the placement of the GUI components on the screen, or by adding lines directly to the source code (a sketch of the latter approach follows below). As the platform matures, developers are expected to add more features for easy GUI development. DroidDraw is one example of a UI editor for the Android platform; it generates an XML file that can be copied into the main code.
The MetaEdit+ Modeler is a DSM tool that allows the modeling of different domain-specific applications (e.g. mobile, automotive, telecom, embedded). One supported domain is the modeling of smartphone applications. Fig. 4 shows an overview of the basic user interface of MetaEdit+ for modeling mobile applications. Unlike the first two IDEs described above, the MetaEdit+ modeler features a simpler interface, with several of its components positioned in different areas. The palette area contains fewer components than those of the first two tools, featuring specialized constructs specific to the mobile platform. The navigation area contains a list of components in the model, and below it is the properties area, which contains information about the component currently in focus.
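To illustrate the code-centric GUI approach mentioned above for Eclipse/Android, here is a minimal sketch of an Android activity that builds its user interface entirely in Java source code, with no XML layout. It uses only standard Android SDK classes of that era; the activity itself is an invented example, not from any project discussed here.

```java
// A minimal sketch of the code-centric GUI approach on Android;
// a real project would more commonly declare this layout in res/layout/main.xml.
import android.app.Activity;
import android.os.Bundle;
import android.view.View;
import android.widget.Button;
import android.widget.LinearLayout;
import android.widget.TextView;

public class HelloActivity extends Activity {
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        LinearLayout layout = new LinearLayout(this);
        layout.setOrientation(LinearLayout.VERTICAL);

        final TextView label = new TextView(this);
        label.setText("Not yet clicked");
        Button button = new Button(this);
        button.setText("Click me");
        button.setOnClickListener(new View.OnClickListener() {
            public void onClick(View v) {
                label.setText("Clicked");   // UI logic lives in the source code
            }
        });

        layout.addView(label);
        layout.addView(button);
        setContentView(layout);            // no XML layout involved
    }
}
```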
Fig. 3. Overview of the Eclipse Environment
From these three examples, we want to extract the most desirable features of each that can be applied to the design of Mobia. The Netbeans environment, for instance, features the ability to switch between different views, which allows the user to concentrate on one task at a time. However, it contains so many features that it can take a while before the user can actually take advantage of them. The MetaEdit tool, on the other hand, contains only a limited number of components, with specialized constructs that the user can easily identify, and all tasks, such as designing the screens and adding flow to the program, are modeled in one view. The disadvantage, though, is that because the tool is very specialized, users are restricted in the type of application they can create. The Eclipse environment also offers a very simple interface that does not show too many features, but it is clearly a tool for expert developers, who already know what source code to type in for the applications they are developing. In the next section, we discuss our approach to finding an ideal user interface for the Mobia Modeler, one that non-expert users will be able to use. We apply some of the design patterns seen in the tools described above and evaluate them in order to find out which features are most desirable for such an environment.
Fig. 4. Overview of the MetaEdit+ Modeler
4 The Mobia Modeler User Interface
The Mobia Modeler is a modeling tool designed specifically for modeling mobile applications. The target users for Mobia are non-expert users: people who have little or no experience in programming for mobile platforms. For this particular study, we feature a module of Mobia focused on modeling applications in the domain of mobile health monitoring. For the moment we concentrate on one domain, since different domains may require different modeling constructs. With this module, users model applications for health monitoring, using modeling constructs that represent data from different medical gadgets, or medgets (e.g. ECG meter, thermometer). To find the ideal interface for Mobia, we created two prototypes using Flash, offering two different UI designs. To clarify, these prototypes are focused solely on evaluating the different user interface designs and interactions; they do not yet have code transformation features.
4.1 Mobia with One View
Fig. 5 shows a screenshot of the first version of our Mobia prototype, which we call Mobia One View. This prototype offers a single view in which the user can design instances of the mobile screens and add data and application control flow. The user can concentrate on designing a single screen by zooming into that area, and see an overview of the whole system by zooming out. The palette on the right side of the screen contains screen components that can be dragged onto the mobile screen; for our prototypes, we only feature a subset of the possible screen components a mobile application can have. The right palette also contains data input constructs, which we call medget (short for medical gadget) input. The medget constructs contain abstract representations of information coming from health monitoring devices capable of sending their data to a mobile device (see the sketch after Fig. 5). The different representations of medget data are not discussed in this paper but in a separate paper [1].
Fig. 5. Mobia with One View. (In the foreground) The main area is zoomed-in to see the screen designs better.
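As a hedged illustration of the placeholder idea just described, the following Java sketch shows how a design-time medget construct might merely declare the kind of sensor data it stands for, leaving the binding to live data to the generated application. The class, enum and method names are all invented; Mobia's actual implementation (in Flash) is not shown in the paper.

```java
// Hypothetical sketch of a "medget" placeholder: at design time it only
// declares what kind of sensor data it stands for; the generated application
// would later bind it to a live data source.
public class MedgetPlaceholder {
    enum MedgetType { ECG, THERMOMETER, BLOOD_PRESSURE }

    private final MedgetType type;
    private final String label;

    MedgetPlaceholder(MedgetType type, String label) {
        this.type = type;
        this.label = label;
    }

    /** Design-time rendering: shows the label instead of real readings. */
    String designTimeText() {
        return "[" + type + "] " + label;
    }

    public static void main(String[] args) {
        MedgetPlaceholder temp =
            new MedgetPlaceholder(MedgetType.THERMOMETER, "Body temperature");
        System.out.println(temp.designTimeText()); // [THERMOMETER] Body temperature
    }
}
```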
4.2 Mobia with Multiple Views
The second design approach for Mobia, shown in Fig. 6, is what we call multiple views. This is similar to the Netbeans IDE, in which the main area features different views depending on the specific task the user is doing. The reason behind this design is that we want the user to focus on one task at a time.
Fig. 6. Mobia with Multiple Views (Design, Data and Navigation View)
The default view is the Design View, in which the user designs individual screens by dragging and dropping screen components from the palette onto the screen. The left panel contains all the mobile screens of the application; clicking on an individual screen in the left panel shows it in the main view, where it can be further edited and designed. Screens can be added and deleted by pressing the add and delete buttons, respectively. This design is borrowed from presentation programs such as Microsoft PowerPoint and OpenOffice Impress, in which each slide can be viewed in a panel and the user can switch between slides by clicking their miniature versions. The Data View is similar to the Design View, except that the palette contents on the right panel change to medget data. In this view, users can concentrate on how they want data taken from health monitoring devices to be displayed on the screen. These medget components act as placeholders where the real information from the devices will appear in the final application. The last view is the Flow View, which shows all the screens in the model and how the screens transition from one to the next. The user can add basic control logic to the application by dragging arrows that link the screens together. In this view, a small component palette contains buttons that the user can drag onto the screens. The logic behind this design is that, in the application, only pressing a control component such as a button can trigger the transition from one screen to the next.
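The flow model this view edits can be stated very compactly. Below is a hypothetical Java sketch of the underlying data structure: each arrow the user drags becomes a (screen, button) to screen transition. Names are illustrative, not Mobia's actual code.

```java
// Hypothetical model behind the Flow View: arrows are stored as
// (screen, button) -> screen transitions.
import java.util.HashMap;
import java.util.Map;

public class ScreenFlow {
    // key: "screenId/buttonId", value: id of the target screen
    private final Map<String, String> transitions = new HashMap<>();

    public void link(String fromScreen, String button, String toScreen) {
        transitions.put(fromScreen + "/" + button, toScreen);
    }

    /** Returns the next screen, or null if the button has no outgoing arrow. */
    public String next(String currentScreen, String pressedButton) {
        return transitions.get(currentScreen + "/" + pressedButton);
    }

    public static void main(String[] args) {
        ScreenFlow flow = new ScreenFlow();
        flow.link("welcome", "okButton", "measurement");
        flow.link("measurement", "backButton", "welcome");
        System.out.println(flow.next("welcome", "okButton")); // measurement
    }
}
```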
5 User Study Evaluation and Results
Given the prototypes described in the previous section, we want to find out which of them provides a simpler UI for the user and gets the task done
quickly. For a more subjective evaluation, we also want to find out which design is more fun and easier to use. To do this, we conducted a user study in which each user was given tasks to accomplish using both prototypes. To measure efficiency, we recorded the time the user took to accomplish each task. To eliminate order bias, we alternated which prototype each participant used first. The participants were instructed not to ask the evaluators any questions; the goal was to let participants explore the tool and learn how to use it on their own, without outside intervention. At the beginning of the study, each participant was asked to explore the prototypes and give comments. After studying the tool in whatever way they chose, they were given two tasks: first to design the contents of the screens, and then to add control flow to the screens. They were asked to create three screens containing some screen components. After designing the screens, they were asked to add control flow that switches to a different screen whenever a button is pressed.

Table 2. The average times for accomplishing the tasks using the two prototypes

Version | Screen Design Task | Adding Control Flow Task
Mobia One View | 4.036 minutes | 1.126 minutes
Mobia Multiple Views | 5.833 minutes | 2.223 minutes
There were 10 participants in our user study: 60% had backgrounds in Computer Science and the rest came from Educational Psychology, Archeology, Architecture and Social Welfare. Only 10% of the participants had any background in programming for mobile platforms, and only at a very basic level. Table 2 shows the average times in which the users accomplished the tasks, while Table 3 shows the results of the subjective evaluation in terms of which prototype is easier and more fun to use. Based on the results shown in the tables, Mobia One View allowed the users to do the tasks faster than the multiple-view version. A factor that might contribute to this is that in multiple views the user has to switch from one view to the next in order to add a different component or do another design task. Based on the subjective feedback of the participants, Mobia One View also provides an environment that is both easy and fun to use.

Table 3. Subjective evaluation for the Mobia prototypes

Version | Easier to Use | More Fun to Use
Mobia One View | 60% | 50%
Mobia Multiple Views | 40% | 40%
None | 0% | 10%
6 Summary and Future Work
In this paper, we have presented different design ideas for a mobile application modeling environment that targets non-experts as its main users. The design and results presented here are just the initial phase of our iterative approach to finding the ideal interface for a tool that helps users accomplish tasks with ease. Aside from continuing to polish Mobia's user interface design, our future work is to develop an underlying framework to support code transformation from the models. We also envision a user-adaptive tool that changes according to each user's existing skills and preferences, to enhance user experience and learning.
Acknowledgments
We would like to thank the German Academic Exchange Service (DAAD) for funding this research. We would also like to thank Ugur Örgün for helping with the prototypes, and the people who participated in our user study.
References
1. Balagtas-Fernandez, F., Hussmann, H.: Modeling Information from Wearable Sensors. In: MDDAUI 2009 – Model Driven Development of Advanced User Interfaces 2009. CEUR Proceedings (2009)
2. Bartolomeo, G., Casalicchio, E., Salsano, S., Melazzi, N.B.: Design and Development Tools for Next Generation Mobile Services. In: International Conference on Software Engineering Advances, ICSEA 2007, p. 16 (2007)
3. Dunkel, J., Bruns, R.: Model-Driven Architecture for Mobile Applications. In: Business Information Systems, pp. 464–477 (2007)
4. Gaedke, M., Beigl, M., Gellersen, H.-W., Segor, C.: Web Content Delivery to Heterogeneous Mobile Platforms. In: ER 1998: Proceedings of the Workshops on Data Warehousing and Data Mining, pp. 205–217. Springer, London (1998)
5. Kleppe, A., Warmer, J., Bast, W.: MDA Explained: The Model Driven Architecture: Practice and Promise. Pearson Education, Inc., Boston (2003)
6. Petre, M.: Why Looking Isn't Always Seeing: Readership Skills and Graphical Programming. Commun. ACM 38, 33–44 (1995)
7. Pleuss, A.: MML: A Language for Modeling Interactive Multimedia Applications. In: ISM 2005: Proceedings of the Seventh IEEE International Symposium on Multimedia, pp. 465–473. IEEE Computer Society, Washington, DC, USA (2005)
8. Pleuß, A., Vitzthum, A., Hussmann, H.: Integrating Heterogeneous Tools into Model-Centric Development of Interactive Applications. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 241–255. Springer, Heidelberg (2007)
9. The SMS Project, http://www.ist-sms.org
10. Weiss, S.: Handheld Usability. John Wiley and Sons, Chichester (2002)
User-Centered Design and Evaluation – The Big Picture
Victoria Bellotti, Shin'ichi Fukuzumi, Toshiyuki Asahi, and Shunsuke Suzuki
Abstract. This paper provides a high-level overview of the field of usability evaluation as context for a panel “Systematization, Modeling and Quantitative Evaluation of Human Interface” in which several authors report on a collaborative effort to apply CogTool, an automated usability evaluation method, to mobile phone interfaces and to assess whether usability predictions made by CogTool correlate with user subjective impressions of usability. If the endeavor, which is still underway at the time of writing, is successful, then CogTool may be applied economically within the product development lifecycle to reduce the risk of usability problems. Keywords: Usability evaluation, methods, metrics, systematization.
is presented as context for a series of position papers within a panel that reflects on one particular effort to systematize usability evaluation in a large corporation where a number of fairly common constraints such as lack of availability of user-centered design experts and tight product deadlines and budgets apply.
2 Challenges for Usability Evaluation in Design
Despite the obvious importance of evaluation in user-centered design (obvious at least to our own HCI community), it is not always the case that applications built to be placed in the hands of hapless end-users enjoy the benefits of any kind of objective evaluation; i.e., a method relying on more powerful evidence than the intuitions of inexperienced engineers [see, for example, 6]. And HCI researchers have expressed concern that the evaluations that do take place are inadequate (for example, they may involve the wrong user representatives [19], or only quality control testers [41]). In this section I review three obstacles that may stand as explanations for this unfortunate phenomenon.
2.1 A Plethora of Methods to Choose From
One key obstacle to understanding which usability evaluation methods one should adopt is the abundant diversity of methods and tools, starting with "quick-and-dirty" methods such as expert reviews or guidelines walkthroughs [18] and simple tools such as surveys [e.g. 42], all the way through to extended in situ evaluation methods incorporating multiple data sources such as logging and interview-based data collection [e.g. 33], or sophisticated tools such as eye-tracking systems [e.g. 11]. Each method has its strengths and drawbacks and is appropriate in different circumstances; for example, discount usability methods [36] are appropriate when resources are constrained, even if they are not as sensitive at picking up problems as a full-scale evaluation. If there were a one-size-fits-all solution available for standardized usability evaluation, it would surely be easier to train designers and developers to apply it in all projects that are likely to impact end users. But instead, diversity opens the doorway to confusion and suboptimal choice.
2.2 A Diversity of Influential Design Circumstances
At the time of writing in 2009, based on my own experiences interacting with representatives of a variety of commercial and research application development organizations, it is still not uncommon for a design effort to take place without a serious usability evaluation. We are all familiar with the baffling results of such endeavors, which we encounter regularly in our interactions with hardware, software and web-based user interfaces. Many circumstances can exert influence over whether usability evaluation takes place at all and over what type of evaluation with what metrics is most appropriate. Consider the following variables (which are both contextual and inherent to the design) as examples:
• Application domain
• Standards and performance criteria that pertain to the application domain
• Target users and their particular characteristics
• Novelty of the design and its interaction elements
• User-centered design expertise within the design team
• Budget
• Time available
• Organizational culture around the design team and perceptions of the importance of usability
Let me briefly illustrate the kinds of impact these factors can have. In my own past work exploring novel solutions in the application domain of personal information management (PIM) [8, 9 and 10], it quickly became apparent that experimental evaluations made no sense, since the proof of the PIM pudding, so to speak, can only be in the extended use of a solution with one's own real personal information, leading to a need for in situ evaluation of real use over weeks rather than the hours that Nielsen [38] suggests can usefully be applied to web-site evaluation. As another example, Grudin [19] reported extensively on various organization-related constraints that can lead to suboptimal design results, and Bak et al. [6] more recently also highlight organizational obstacles as significant, both in the literature and in their own survey, together with developer mindset (a culture of greater focus on functionality and efficient code, and a lack of user-centered design expertise). Perhaps the key factor impacting usability evaluation is the overall culture of the host organization for the design effort (or even the culture within which that organization exists), which can in turn influence the other factors listed above. Specifically, if few members of the organization are aware of user-centered design as a discipline and the value of a good user experience, and fewer still have the relevant skills to apply the appropriate methods, then budget and time will not be allocated to a serious effort to evaluate the user experience, and people with the required expertise will of course not be available to engage in that effort. In many countries today usability experts (or engineers who also have user-centered design skills) are indeed still an extremely rare species and, even if a corporation wishes to hire them, it may find that they simply cannot be found. In such circumstances, how might a large corporation make the best of limited usability expertise? This is an issue to which I will return in a subsequent section.
2.3 Metrics
Usability is defined in the ISO 9241-11:1998 standard as the "extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use." Taking this standard as widely agreed upon, effectiveness, efficiency and satisfaction thus have to be measured somehow in order to know how usable a prototype or product is. Unfortunately, metrics also present something of a challenge to evaluators since, as Sauro & Kindlund [43] point out without overlooking the obvious irony, "Usability Metrics Need to be Easier to Use." Bak et al. [6] surveyed 2795 papers in the HCI literature, amongst which they found 28 with a focus on usability in organizations. Out of these, 11
mentioned poor understanding of usability as an obstacle (behind resource demands, 17/28; test participant issues such as identification of and access to users, 14/28; and organizational obstacles including an anti-usability culture, 14/28). In particular, according to Bak et al., usability is often confused with functionality, which, at least to this author's mind, may explain why so many applications have so many unused, often hard to discover, and near useless features. In fact there are five basic (although not cleanly independent) usability metrics that have been applied repeatedly and seem to be accepted as fairly standard within the HCI community [e.g. 32, 34 and 43]. These are:
• Time taken to learn to execute tasks
• Time taken to execute tasks
• Task completion rate (proportion of tasks in an evaluation that can be completed to some standard of correctness)
• Number of errors (deviations from viable task completion paths or production of a result or state that must be undone)
• User satisfaction (a composite of a variable host of subjective assessments)
Other, less common metrics, such as objectively measurable stress [e.g. 45] and analytically derived cognitive complexity [27], that can be correlated with at least one of these five have also been discussed in the literature. However, the five basic metrics can all be measured directly without special equipment (although perhaps not always as accurately as with special equipment) and cover the most significant possible consequences of bad design. The question then is, should all of these dimensions be measured, or do some matter more than others in different design circumstances? In fact, design circumstances can heavily weight the importance of one metric over another and may even require trade-offs to be made between metrics. For example, a UI optimized for novice users, with lots of easy-to-find-and-learn menus and buttons, will tend to be slower and less efficient for an expert, who will usually look for keyboard shortcuts that are faster to execute. So the evaluator must understand the importance of each metric to the design circumstances at hand and the extent to which any given tool or method is likely to provide reliable values for the metrics that matter. Different usability evaluation methods vary in the extent to which they are able to provide these metrics. For example, a cognitive walkthrough [40] will not allow the evaluator to measure task times very accurately, although it may be better at measuring the extent to which a system is likely to be error-prone. A laboratory experiment may allow an evaluator to measure time, task completion and errors quite accurately, but render no reliable measurement of user satisfaction. A user survey may measure satisfaction quite well, but provide only subjective (and thus unreliable) appraisals of time, task completion and error-proneness. Of course, it can make sense to combine multiple methods in one evaluation, such as an experiment and a survey, usually at a reduced cost for each method, since study participants need only be recruited, scheduled and paid once for a single session in which they perform more than one exercise. Whatever the case may be, it requires some expertise to know which aspects of the design situation to pay attention to in deciding what evaluation metrics are best.
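Since the first four of these metrics are directly countable, a usability logging tool can compute them mechanically. Below is a minimal, illustrative sketch (in Java) of such a computation; the record fields and sample values are assumptions for the example, not data from any study discussed here.

```java
// Illustrative computation of the basic metrics listed above from logged
// task attempts; the record layout is an assumed format, not a standard one.
import java.util.List;

record TaskAttempt(double seconds, int errors, boolean completed) {}

public class Metrics {
    public static void main(String[] args) {
        List<TaskAttempt> log = List.of(
            new TaskAttempt(42.0, 1, true),
            new TaskAttempt(67.5, 3, false),
            new TaskAttempt(38.2, 0, true));

        double meanTime = log.stream().mapToDouble(TaskAttempt::seconds).average().orElse(0);
        double completionRate = log.stream().filter(TaskAttempt::completed).count()
                                / (double) log.size();
        int totalErrors = log.stream().mapToInt(TaskAttempt::errors).sum();

        System.out.printf("time %.1fs, completion %.0f%%, errors %d%n",
                meanTime, completionRate * 100, totalErrors);
        // Learning time and satisfaction need separate instruments
        // (repeated sessions and questionnaires respectively).
    }
}
```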
3 Systematizing Usability Evaluation
Given the above challenges for usability evaluation in the design process, it is hardly surprising that some professionals have sought to develop systematic methods in an attempt to help those with less expertise assess usability without the added time and expense of bringing in real users or an expert who may be hard to find. Three approaches to systematization are:
• Guidelines
• Procedures
• Automation
Usability guidelines have been common for quite some time. For example, Jakob Nielsen describes participating in a US Air Force exercise to compile "existing usability knowledge into a single, well-organized set of guidelines for its user interface designers" between 1984 and 1986 [37]. Many corporate, governmental and quite a few international user interface design guidelines have been compiled and updated since then [e.g. 5, 24, and 28]. They seek to describe what the designer must aim to accomplish or the constraints she or he must work within. However, conforming to guidelines can be a tricky business for the unskilled, especially when they are not well articulated, as in the ISO guideline for "Suitability for learning," which reads: "A dialogue is suitable for learning when it supports and guides the user in learning to use the system."
Procedures is the term I use here for well-defined usability evaluation methods, a subset of the methods described in Section 2.1 of this paper. The Cognitive Walkthrough [40] is one such procedure that has been evaluated [31] to show that it can be followed by a knowledgeable person and, depending on the extent of that person's skill, produce consistent predictions without requiring a complex modeling effort or a real user evaluation. Another similar approach is the Heuristic Evaluation method [39], which, in evaluation, has been shown to be better performed by usability experts, and best of all by usability experts who are also application domain specialists [e.g. 34]. These general-purpose methods have been accepted for a long time and have stood the test of time, still being in use even over 15 years after their invention [22]. Earthy et al. [16] provide a review of the ISO 13407 human-centred design processes, which represent an attempt to set standards for interactive systems design in general. Other evaluation procedures have been developed more recently for specific platforms (e.g. the mobile phone [30]) and specific application domains (e.g. e-learning [50]).
Automation may be the Holy Grail of usability evaluation, since the possible cost savings in design endeavors are immense. Card, Moran & Newell developed the foundational example of an evaluative human information processing model and the GOMS (Goals, Operators, Methods and Selection rules) approach to computational modeling of human interaction with computers in the early 1980s [13]. Since then, many attempts have been made to achieve full automation [25]. Quite a number of early efforts focused only on specifying the rules that a user would need to learn to operate a system; these were never automated and took far too long to apply successfully [7]. More successful has been the work based on sustained development of working software models of a human information processor
such as ACT [1] and its descendants (ACT* [2], ACT-R [3 and 4] and ACT-R/PM [12]), a line of work still under development at Carnegie Mellon University (CMU). Another, albeit less well-known, example of such a system is the SOAR cognitive architecture developed by Laird, Rosenbloom and Newell [29]. Building upon the ACT-R computational cognitive architecture, Bonnie John at CMU and her colleagues and students have developed a GOMS-based system called CogTool [26] that has perhaps come closest to requiring minimal effort from the system developer. CogTool uses performance measurements taken from real user interactions and is able to generalize them to specifications of user interfaces that contain the same basic features (e.g. buttons, menus and other GUI elements). The user interface specification is provided to CogTool in the form of a storyboard (based on sketches or screenshots) that preserves the dimensions of the target GUI, upon which the evaluator demonstrates to CogTool the actions required to execute tasks. Using its models of user thinking times and actions (plus expected system response times), CogTool is able to output task completion time predictions for a skilled user, and, for novices, completion times and deviating actions together with the time they will take. CogTool uses an augmentation of ACT called SNIF-ACT [17], which assumes that novice users read text labels and click on items that are semantically close to their goal; sometimes this leads to mistakes, since interfaces often contain ambiguous or misleading elements [46 and 47].
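To give a feel for the family of models CogTool descends from, the following is a toy Java sketch of the original Keystroke-Level Model idea: a skilled user's task time is predicted as the sum of standard operator times. The operator values are the commonly cited Card, Moran & Newell estimates, and the whole sketch is a drastic simplification; CogTool itself relies on the much richer ACT-R machinery described above.

```java
// A back-of-the-envelope Keystroke-Level Model (the simplest member of the
// GOMS family). Operator times are commonly cited textbook estimates.
import java.util.Map;

public class Klm {
    static final Map<Character, Double> SECONDS = Map.of(
        'K', 0.28,  // keystroke or button press (average typist)
        'P', 1.10,  // point with the mouse
        'H', 0.40,  // home hands between keyboard and mouse
        'M', 1.35); // mental preparation

    static double predict(String ops) {
        return ops.chars().mapToDouble(c -> SECONDS.get((char) c)).sum();
    }

    public static void main(String[] args) {
        // e.g. think, point at a menu, click, think, type three characters
        System.out.printf("predicted skilled-user time: %.2f s%n",
                predict("MPKMKKK"));
    }
}
```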
4 Seeking Systematization in the Enterprise
The HCI International 2009 panel "Systematization, Modeling and Quantitative Evaluation of Human Interface," with which this paper is associated, includes a number of positions from collaborators who have participated in an effort to systematize usability evaluation in a large corporation, NEC, based in Japan, which frequently develops software and hardware products for both consumer and business use. The discussion in this paper has sought to provide some context for the approach adopted in the work reported, by addressing some of the key considerations that relate to its rationale. The chosen approach reflects a desire to simplify the choice of usability evaluation method in NEC, where usability expertise is not as pervasive as would be ideal and where tight deadlines and budgets always apply. A small team of usability specialists in the research division of NEC began a collaborative effort with researchers at the Palo Alto Research Center and at Carnegie Mellon University to validate the use of CogTool (introduced above) in assessing the usability of products under development. CogTool was chosen as an ideal method because it is easy to apply to a graphical UI specification (possibly early in the design process) without much expertise. This approach sidesteps the problems the corporation experiences of lack of expertise and limited time and budget for usability. By providing one general-purpose systematic approach, it also seeks to get around two further problems: the possible bewilderment of non-experts at the corporation faced with the number of methods available, and the reliance of many of the economical and fast discount usability methods that would otherwise be most suitable (such as Heuristic Evaluation) on experts to apply them well. Because of their importance to the corporation as a product category, and the discrete, and thus easy to model, nature of the tasks users perform on them, mobile
phones were chosen as having an ideal user interface upon which to first test the CogTool approach. However, CogTool at the time of writing mainly generates predictions of the occurrence of user behaviors such as looking, thinking and gesturing, and of the time they each take during task execution (which may include trial-and-error exploration for novice users). So it was necessary to determine whether these predictions correlate with the kind of usability that really matters to a product vendor developing applications where user performance demands (i.e. time to execute tasks and error rates) are not stringent: the subjective experiences of both experts and novices. Experts, of course, will form this opinion over extended periods of use, but novices can only form this opinion based on hearsay from existing users (probably experts) or on their own initial exploration of the product, perhaps in a store (also known as "shelf usability") or when given the opportunity to try the product for the first time by a friend or colleague. A literature search was undertaken to find the best possible method for assessing user subjective impressions of usability. Out of many possible candidates, the Mobile Phone Usability Questionnaire (MPUQ) developed by Ryu [42] was chosen because it built upon and systematically refined questions from a number of previously well-accepted subjective usability assessment questionnaires. At the time of writing, an intensive effort is underway to obtain naïve and expert user performance data on a set of tasks across three types of mobile phone interface (between-subjects measures) and to correlate that data with user subjective impressions of usability obtained using the MPUQ both before and after using the mobile phones. The anticipated outcomes of the research will be:
• A comparison of three mobile phone models in terms of time taken by expert and naïve users to complete a set of tasks on each of the three phones.
• CogTool predictions for both experts and naïve users on each of the phone interfaces, and an assessment of the accuracy of those predictions.
• A comparison of the real user data with the CogTool predictions and with user subjective impressions of usability.
If the team is able to demonstrate that predictions of CogTool do indeed correlate with user subjective impressions, then we have evidence that CogTool may be used by commercial mobile phone developers to improve the usability of their mobile phones by using its predictions as a means to identify usability problems and areas for improvement in new phone interface design efforts. Whilst this might not be the ideal method for formative usability evaluation in the product development lifecycle, it will be a practical solution that can be used by non-experts under non-ideal circumstances and should reduce instances of at least some types of usability problem going undetected before product release.
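As a purely illustrative sketch of the planned correlation check, the following computes Pearson's r between hypothetical CogTool time predictions and hypothetical MPUQ scores. The numbers are invented placeholders, since the study's data were not yet available at the time of writing.

```java
// Sketch of the correlation check described above: Pearson's r between
// predicted task times and subjective usability scores. Data are invented.
public class Correlation {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] predictedSeconds = { 12.1, 18.4, 25.0 }; // one value per phone model
        double[] mpuqScore        = { 4.2, 3.1, 2.6 };    // placeholder questionnaire scores
        System.out.printf("r = %.2f%n", pearson(predictedSeconds, mpuqScore));
    }
}
```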
References 1. Anderson, J.R.: Language, Memory and Thought. Erlbaum Associates, Hillsdale, NJ (1976) 2. Anderson, J.R.: The Architecture of Cognition. Harvard University Press, Cambridge, MA (1983)
3. Anderson, J.R.: Rules of the Mind. Lawrence Erlbaum Associates, Hillsdale (1993) 4. Anderson, J.R.: ACT: A Simple Theory of Complex Cognition. American Psychologist 51, 355–365 (1996) 5. Apple: Apple Human Interface Guidelines for Mac OS X (2008), http://developer.apple.com/documentation/UserExperience/ Conceptual/AppleHIGuidelines 6. Bak, J.O., Nguyen, K., Risgaard, P., Stage, J.: Obstacles to Usability Evaluation in Practice: A Survey of Software Development Organizations. In: Proceedings of the 5th Nordic Conference on Human-Computer interaction: Building Bridges, NordiCHI 2008, vol. 358, pp. 23–32. ACM, New York (2008) 7. Bellotti, V.: Implications of Current Design Practice for the Use of HCI Techniques. In: Jones, D.M., Winder, R. (eds.) People and Computers IV, pp. 13–34. Cambridge University Press, Cambridge (1988) 8. Bellotti, V., Dalal, B., Good, N., Bobrow, D.G., Ducheneaut, N.: What a To-do: Studies of Task Management Towards the Design of a Personal Task List Manager. In: ACM Conference on Human Factors in Computing Systems, CHI 2004, pp. 735–742. ACM, New York (2004) 9. Bellotti, V., Ducheneaut, N., Howard, M.A., Smith, I.E.: Taking Email to Task: The Design and Evaluation of a Task Management Centered Email Tool. In: CSCW 2002 Workshop: Redesigning Email for the 21st Century, New Orleans, LA, ACM, New York (2003) 10. Bellotti, V., Smith, I.: Informing the Design of an Information Management System with Iterative Fieldwork. In: Proceedings of the 3rd conference on Designing interactive systems: processes, practices, methods, and techniques, ACM, New York (2000) 11. Benel, D.C.R., Ottens, D., Horst, R.: Use of an Eye Tracking System in the Usability Laboratory. In: Proceedings of the Human Factors Society 35th Annual Meeting, Santa Monica, Human Factors and Ergonomics Society, pp. 461–465 (1991) 12. Byrne, M.D.: ACT-R/PM and Menu Selection: Applying a Cognitive Architecture to HCI. International Journal of Human-Computer Studies 55, 41–84 (1999) 13. Card, S.K., Newell, A., Moran, T.P.: The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates Inc., Hillsdale (1983) 14. Constantine, L.L.: Beyond User-Centered Design and User Experience: Designing for User Performance. Cutter IT Journal 17(2), 16–25 (2004) 15. Dix, A., Finlay, J., Abowd, G., Beale, R.: Human Computer Interaction, 2nd edn. PrenticeHall, Englewood Cliffs (1993) 16. Earthy, J., Sherwood Jones, B., Bevan, N.: The Improvement of Human-centred Processes – Facing the Challenge and Reaping the Benefit of ISO 13407. International Journal of Human Computer Studies 55(4), 553–585 (2001) 17. Fu, W.-T., Pirolli, P.: SNIF-ACT: A Cognitive Model of User Navigation on the World Wide Web. Human-Computer Interaction 22, 355–412 (2007) 18. Gray, W.D., Salzman, M.C.: Damaged Merchandise? A Review of Experiments that Compare Usability Evaluation Methods. Human Computer Interaction 13(3), 203–261 (1998) 19. Grudin, J.: Systematic Sources of Suboptimal Interface Design in Large Product Development Organizations. Human-Computer Interaction 6(2), 147–196 (1991) 20. Gunther, R., Janis, J., Butler, S.: The UCD Decision Matrix: How, When, and Where to Sell User-Centered Design into the Development Cycle (2001), http://www.ovostudios.com/upa2001/ (retrieved January 24, 2009) 21. Hartson, H.R., Andre, T.S., Williges, R.C.: Criteria for Evaluating Usability Evaluation Methods. International Journal of Human-Computer Interaction 15(1), 145–181 (2003)
22. Hollingsed, T., Novick, D.G.: Usability Inspection Methods After 15 Years of Research and Practice. In: Proceedings of the 25th Annual ACM international Conference on Design of Communication, SIGDOC 2007, pp. 249–255. ACM, New York (2007) 23. ISO 9241-11:1998. Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) – Part 11: Guidance on Usability. International Organization for Standardization (1998) 24. ISO 9241-110:2006 Ergonomics of Human-System Interaction – Part 110: Dialogue Principles (2006) 25. Ivory, M.Y., Hearst, M.A.: The State of the Art in Automating Usability Evaluation of User Interfaces. ACM Computing Surveys 33(4), 470–516 (2001) 26. John, B.E., Prevas, K., Salvucci, D.D., Koedinger, K.: Predictive Human Performance Modeling Made Easy. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2004, pp. 455–462. ACM, New York (2004) 27. Kieras, D., Polson, P.: An Approach to the Formal Analysis of User Complexity. Int. Journ. of Man-Machine Studies 22, 365–394 (1985) 28. Koyani, S.J., Bailey, R.W., Nall, J.R.: Research-Based Web Design & Usability Guidelines. U.S. Department of Health and Human Services (HHS) and the U.S. General Services Administration (GSA), Washington DC (2004), http://www.usability.gov/pdfs/guidelines.html#1 (retrieved February 21, 2009) 29. Laird, J., Rosenbloom, P., Newell, A.: Universal Subgoaling and Chunking: the Automatic Generation and Learning of Goal Hierarchies. Kluwer Academic Publishers, Dordrecht (1986) 30. Lee, Y.S., Hong, S.W., Smith-Jackson, T.L., Nussbaum, M.A., Tomioka, K.: Systematic Evaluation Methodology for Cell Phone User Interfaces. Interacting with Computers 18(2), 304–325 (2006) 31. Lewis, C., Polson, P.G., Wharton, C., Rieman, J.: Testing a Walkthrough Methodology for Theory-Based Design of Walk-up-and-use Interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems CHI 1990, pp. 235–242. ACM, New York (1990) 32. Macleod, M., Bowden, R., Bevan, N., Curson, I.: The MUSiC performance measurement method. Behaviour & Information Technology 16(4-5), 279–293 (1997) 33. Muller, M.J., Geyer, W., Brownholtz, B., Wilcox, E., Millen, D.R.: One-Hundred Days in an Activity-Centric Collaboration Environment Based on Shared Objects. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems CHI 2004, pp. 375– 382. ACM, New York (2004) 34. Nielsen, J.: The Usability Engineering Life Cycle. Computer 25(3), 12–22 (1992) 35. Nielsen, J.: Usability engineering. Academic Press, Boston (1993) 36. Nielsen, J.: Using Discount Usability Engineering to Penetrate the Intimidation Barrier. In: Bias, R.G., Mayhew, D.J. (eds.) Cost-Justifying Usability, Academic Press, London (1994) 37. Nielsen, J.: Durability of Usability Guidelines. Jakob Nielsen’s Alertbox (January 17, 2005), http://www.useit.com/alertbox/20050117.html (retrieved January 27, 2009) 38. Nielsen, J.: Cost of User Testing a Website. Alertbox (May 3, 1998), http://www.useit.com/alertbox/980503.html (retrieved January 24, 2009) 39. Nielsen, J., Molich, R.: Heuristic evaluation of user interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 1990, pp. 249–256. ACM Press, New York (1990)
40. Polson, P.G., Lewis, C., Rieman, J., Wharton, C.: Cognitive Walkthroughs: A Method for Theory-Based Evaluation of User Interfaces. International Journal of Man-Machine Studies 36, 741–773 (1992) 41. Poltrock, S.E., Grudin, J.: Organizational Obstacles to Interface Design and Development: Two Participant-Observer Studies. ACM Trans. Comput.-Hum. Interact 1(1), 52–80 (1994) 42. Ryu, Y.S.: Development of Usability Questionnaires for Electronic Mobile Products and Decision Making Methods, Doctoral dissertation, State University, Blacksburg, VA, USA (2005) 43. Sauro, J., Kindlund, E.: A Method to Standardize Usability Metrics into a Single Score. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems CHI 2005, pp. 401–409. ACM, New York (2005) 44. Scholtz, J.: Usability Evaluation. National Institute of Standards and Technology (2006), http://www.itl.nist.gov/iad/IApapers/2004/ Usability%20Evaluation_rev1.pdf (retrieved January 24, 2009) 45. Stickel, C., Scerbakov, A., Kaufmann, T., Ebner, M.: Usability Metrics of Time and Stress - Biological Enhanced Performance Test of a University Wide Learning Management System. In: Holzinger, A. (ed.) Proceedings of the 4th Symposium of the Workgroup HumanComputer interaction and Usability Engineering of the Austrian Computer Society on HCI and Usability For Education and Work. Lecture Notes In Computer Science, vol. 5298, pp. 173–184. Springer, Heidelberg (2008) 46. Teo, L., John, B.E.: Towards Predicting User Interaction with CogTool-Explorer. In: Proceedings of the Human Factors and Ergonomics Society 52nd Annual Meeting, pp. 950– 954 (2008) 47. Teo, L., John, B.E., Pirolli, P.: Towards a Tool for Predicting User Exploration. In: CHI 2007 Extended Abstracts on Human Factors in Computing Systems, CHI 2007, pp. 2687– 2692. ACM, New York (2007) 48. Vredenburg, K., Mao, J., Smith, P.W., Carey, T.: A Survey of User-Centered Design Practice. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 471–478. ACM, New York (2002) 49. Wixon, D.: Evaluating Usability Methods: Why The Current Literature Fails the Practitioner. Interactions 10(4), 28–34 (2003) 50. Zaharias, P.A.: Usability Evaluation Method for E-Learning: Focus on Motivation to Learn. In: CHI 2006 Extended Abstracts on Human Factors in Computing Systems, pp. 1571–1576. ACM, New York (2006)
Web-Based System Development for Usability Evaluation of Ubiquitous Computing Device
Jong Kyu Choi, Han Joon Kim, Beom Suk Jin, and Yonggu Ji
Dept. of Information and Industrial Engineering, Yonsei University, Seoul, Korea
{jk.choi,khjoon,kbf2514jin,yongguji}@yonsei.ac.kr
Abstract. Recently, with the development of electronic technology, information technology (IT) devices that satisfy user requirements, such as PMPs (Portable Multimedia Players), PDAs (Personal Digital Assistants), UMPCs (Ultra Mobile Personal Computers) and mobile phones, have been developed. These devices are making wireless and network communication more accessible and, under the ubiquitous paradigm, provide access to information everywhere. With the appearance of these devices and the development of the underlying technology, IT devices are increasingly integrated and converged. As a result, there are significant changes in the purposes and environments of IT device use: devices are used not only in a stationary state but also in motion, which has a strong influence on usability. A new methodology is therefore required to evaluate the usability of these devices. In previous studies, by gathering and integrating usability factors and ubiquitous characteristics, Ubiquitous Evaluation Factors were obtained. The device was then deconstructed so that each part could be evaluated separately; through this process, the components of ubiquitous devices could be extracted. From the usability evaluation, an evaluation score for each ubiquitous device component and a score for each usability factor can be obtained. This evaluation framework was developed as a Web-based system so that users can perform the usability evaluation regardless of their location. The system was developed on the Windows Server 2003 Enterprise Edition platform, with IIS (Internet Information Server) 6.0 as the Web server and MS-SQL 2000 as the database server. ASP (Active Server Pages), which runs in IIS, was used as the development language. This study is meaningful in that, through a Web-based system, many different people can easily access the evaluation, and in that a portion of the device as well as the entire device can be evaluated.
Keywords: Ubiquitous computing device, usability, web-based system, system development.
more accessible and providing the possibility of access to information everywhere [1]. Therefore, the development of these IT devices and the improvement of technology together require integration and convergence [2]. The environments in which IT devices are used and the purposes for which they are used are changing; for instance, IT devices are used not only in a stationary state but also in motion, which influences the usability of the device. The distinguishing characteristic of ubiquitous computing is that it is a communication system [3] that allows users to obtain the required information in any place. Therefore, previous usability evaluation tools need to be improved to take into consideration new user environments and ubiquitous computing [1]. From previous studies, we selected and integrated usability factors and ubiquitous characteristic factors, developing new Ubiquitous Evaluation Factors. Each part of the ubiquitous device was deconstructed for usability evaluation, so that the separate components of the device could be obtained. Through the usability evaluation, an evaluation score and a score for each usability factor of a ubiquitous device component can be obtained. In this study, the evaluation framework was developed as a Web-based system, allowing users to perform a usability evaluation anywhere.
2 Background
2.1 Ubiquitous Computing Device
The development of ubiquitous computing technology and the convergence of mobile information devices provide information to the user everywhere, at every moment, with any device. The basis of ubiquitous computing is to provide service at the request of a user and to grasp the user's intention and situation, with one service system actively supporting another; that is to say, the ubiquitous computing service. A ubiquitous computing device is a device for the ubiquitous service that allows a user to interact with the service anywhere, at any time, and that grasps the user's intention and situation in order to support the user. Ubiquitous devices acquire information without people being aware of it, through embedded, pervasive, portable and mobile functions, thereby realizing the ubiquitous environment [5].
Table 1. Characteristics of ubiquitous devices
Pervasiveness, Ubiquity, Diversification, Portability, Interconnectivity
2.2 Previous Ubiquitous Computing Research
By understanding the ubiquitous computing user's intention and exploiting the characteristics of the user's environment, the interaction with the user can be reflected in the system [4]. In this respect the Context-Aware Computing model [6]
and the ubiquitous computing model are similar in many ways, especially in their focus on the context-of-use of mobile devices [8,9]. J. Scholtz and S. Consolvo [10] presented a framework (UEA, Ubiquitous computing Evaluation Area) to evaluate ubiquitous computing applications; the evaluation areas for ubiquitous computing were attention, conceptual model and appeal, each with conceptual measures and metrics. In relation to Context-Aware Computing, Nigel and Miles [11] presented the idea that to calculate usability with confidence, it is necessary to evaluate a representative environment, user and task; it is thus essential to have a deep understanding of the context of use of the product.
2.3 Limitation
J. Scholtz [10] defined the aspects important to ubiquitous computing as "areas," categorized them, and presented conceptual measurement variables through systematic analysis. However, this work focused mainly on ubiquitous services rather than on device usability evaluation, so the user's task was insufficiently considered. Moreover, as seen in Nigel's studies [12,13], most context-aware computing studies present only information on the diverse types of context, lacking a concrete connection with usability principles.
3 Framework
The ubiquitous device usability evaluation framework was established from previous studies [14]. New suggestions for usability evaluation were proposed after modifying the context deconstruction. Figure 1 shows how the main user and main task are selected from the user information of the device. In this way, the device context information is specified, which is later used in an evaluation checklist. Considering the characteristics of each device makes further ubiquitous device usability evaluations possible.
Fig. 1. Evaluation Framework
Fig. 2. Generating Evaluation Factors
3.1 Evaluation Factors
Figure 2 shows how the usability factors for the evaluation framework were extracted from the properties of the ubiquitous computing environment. In previous studies, the basic properties for usability evaluation were efficiency, effectiveness and satisfaction, which we call the 'General Evaluation Factors.' The General Evaluation Factors are considered suitable for evaluating devices in general, but they are not specific to ubiquitous computing devices. Proposing factors for ubiquitous devices requires drawing on work on pervasive computing quality and ubiquitous computing quality tools.

Table 2. Ubiquitous Device Evaluation Factors

Adaptability: adaptable or easily adjusted to changes in context
Controllability: able to control the device in any circumstances
Interconnectivity: interconnected network among devices, allowing sharing of information
Mobility: the device can move with the user, who carries it along
Predictability: from past experience, the result of a system execution can be predicted
Simplicity: the user interface and instructions are simple
Transparency: provides the current status of the system, including when it is running an execution
Table 2 shows the factors assorted and integrated from related studies on ubiquitous services and ubiquitous software: Adaptability, Controllability, Interconnectivity, Mobility, Predictability, Simplicity and Transparency. These are the ubiquitous-device-related factors. The usability evaluation factors for devices are organized as: (visual) Clarity, Accessibility, Affect, Compatibility, Consistency, Effectiveness, Efficiency, Error prevention, Feedback, Forgiveness, Helpfulness, Learnability, Memorability, Multi-threading, Responsiveness, Safety and User tailorability.
3.2 Evaluation Area
Figure 3 shows an elementary deconstruction of a device for evaluation. The usability evaluation is implemented on each device component so as to obtain the degree of usability
(high or low) for each factor. The developed factors can be applied to evaluate each device component: LUI (Logical User Interface), GUI (Graphical User Interface) and PUI (Physical User Interface), respectively. By making this separation, each device component can be evaluated individually.
Fig. 3. Evaluation area
The LUI is divided into application software, menu structure and contents, system awareness and system acceptance. The GUI is divided into indicator, icon and menu. In the H/W area, the device hardware is separated into body and screen, while the PUI is separated into control keys and touch screen. Touch screens must be further subdivided by input method, since it is risky to evaluate them with the same factors and checklists as standard controls. Consequently, when evaluating devices with a touch screen, the touch-screen PUI evaluation is performed; if there is no touch screen, that evaluation area is not taken into account.
3.3 Context of Use
Figure 1 shows the context information, solidified as user type, device type, task type and use type. Use type is information about the environment and condition (situation) in which the user is using the device. Each type of context information has significance for the evaluation target, such as information access and entertainment systems. User type is divided into novice and expert, while device type is divided into PMP, music player, PDA, UMPC, smartphone and game device. Through expert evaluation, each evaluation factor and checklist item was rated for different contexts, yielding relative importances for the usability evaluation. Each evaluation factor thus has its own weight, which changes the importance of each checklist item depending on the device and context information.
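As a hedged illustration of this weighting scheme (the paper does not give its actual formula), the sketch below combines per-factor checklist scores with context-dependent weights. All names and numbers are invented for the example.

```java
// Illustrative weighted scoring: mean checklist scores per factor are
// combined using weights chosen for a given device/context.
import java.util.Map;

public class WeightedScore {
    public static void main(String[] args) {
        // mean checklist score per factor (scale 0-100); invented values
        Map<String, Double> factorScore = Map.of(
            "Mobility", 80.0, "Simplicity", 65.0, "Transparency", 90.0);
        // context-dependent weights, normalized to sum to 1; invented values
        Map<String, Double> weight = Map.of(
            "Mobility", 0.5, "Simplicity", 0.3, "Transparency", 0.2);

        double total = factorScore.entrySet().stream()
            .mapToDouble(e -> e.getValue() * weight.get(e.getKey()))
            .sum();
        System.out.printf("overall usability score: %.1f / 100%n", total);
    }
}
```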
4 System Development
4.1 System Structure
In this study, the system was developed as a Web-based system so that users can perform an evaluation regardless of their location. The system is composed of a client, a Web server and a database server. The client sends requests to the Web server through a browser connected to the Internet. The Web server then sends a Web page to the client
and provides the data that the client requested from the database server. The database server receives queries from the Web server for the data the user wants, carries out the work, and finally returns the results to the Web server. As shown in Figure 4, the system was developed using Windows Server 2003 Enterprise Edition, IIS (Internet Information Server) 6.0 as the Web server and MS-SQL 2000 as the database server. ASP (Active Server Pages) was used as the development language.
Fig. 4. System structure
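The request flow can be illustrated with a minimal sketch. Python and sqlite3 stand in here for the ASP and MS-SQL stack actually used, and the table and column names are illustrative assumptions, not the system's real schema.

```python
# Illustrative three-tier flow: browser -> Web server -> database server.
# sqlite3 stands in for the MS-SQL 2000 server used in the actual system.
import sqlite3

def handle_request(db, device_type):
    """Web-server role: forward the client's request to the database tier
    and return the rows to be rendered into a Web page."""
    cur = db.execute(
        "SELECT checklist_item, score FROM evaluations WHERE device_type = ?",
        (device_type,),
    )
    return cur.fetchall()  # sent back to the client as part of the page

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE evaluations (device_type TEXT, checklist_item TEXT, score REAL)")
db.execute("INSERT INTO evaluations VALUES ('PMP', 'Menu navigation', 78.5)")
print(handle_request(db, "PMP"))  # [('Menu navigation', 78.5)]
```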
4.2 Evaluation Procedure
The first step shows a Web page where the information for each context type has to be entered. The context-type information saved in the database is recalled and displayed. In this step, the device to be evaluated is selected, and the user is identified as a novice or an expert with the device under User Type. Under Device Type, a selection is made between PMP, MP3 and PDA types. Task type is divided into video playing, music playing, information reading and games, and use type depends on whether the device is used in a wearable or portable form. The data are then entered in the form of a questionnaire, which describes each context type so the user can understand it easily, and which also offers a 'not considering' option for context types the user does not wish to evaluate.

In the second step, after the context information has been selected, it is stored in the server session and a Web page appears showing the evaluation areas that can be selected. This page recalls the information saved in the database and displays it. To allow more than one area to be selected, the user checks a box for each area he or she wishes to evaluate; to help the user understand what is being evaluated, a description of each area is provided.

In the third step, the information about the areas to be evaluated is saved in the server session, and the corresponding checklist for each area is retrieved from the database and shown. In the upper part of the checklist page, information about the area being evaluated is displayed. Each area is displayed on a separate page to avoid confusion and disorder.

In the last step, the user's checklist selections, the context-type information saved in the server session in previous steps, and the information about the
evaluation area are saved. From the saved data about a given device, it is possible to obtain its average evaluation score. After the data have been saved in the database, the result is shown on a page with an eight-column graph. The results of the evaluation of ubiquitous characteristics are shown in a graph indicating the score of each ubiquitous factor, and the results of the evaluation of general characteristics are shown in a similar graph. Moreover, by providing a graph for each factor on a 100-point scale, insufficient areas can be seen more clearly. The result for each device evaluation area (LUI, GUI, PUI, Device H/W) is likewise represented in a graph on a 100-point scale so as to show the areas that have to be improved.
Fig. 5. Evaluation system
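A minimal sketch of how the per-factor averages behind these graphs might be derived from the saved evaluations is shown below; the record layout and function name are hypothetical, not the system's actual schema.

```python
# Hypothetical aggregation of saved evaluation records into per-factor
# averages on a 100-point scale, as displayed in the result graphs.
from collections import defaultdict

def average_by_factor(records):
    """records: iterable of (factor, score) rows loaded from the database."""
    totals = defaultdict(lambda: [0.0, 0])
    for factor, score in records:
        totals[factor][0] += score
        totals[factor][1] += 1
    return {factor: s / n for factor, (s, n) in totals.items()}

rows = [("Mobility", 70), ("Mobility", 80), ("Simplicity", 95)]
print(average_by_factor(rows))  # {'Mobility': 75.0, 'Simplicity': 95.0}
```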
5 Conclusion and Further Study
This study developed a Web-based system implementing a framework to evaluate the usability of ubiquitous computing devices. Three aspects are important. First, because the system is Web-based, the user can evaluate anywhere the Internet is accessible; it is also more convenient, as the user can see the results immediately. Second, the user is able to select the areas he or she wishes to evaluate, so either a complete or a partial evaluation of the selected areas is possible. Third, as the system uses a database, the evaluated data can be saved; through these saved data, it is possible to see an average over previous and other users' evaluations. However, the system has so far been implemented for only a small number of devices, not for every type of device. In further studies it is therefore necessary to increase the validity of the system by evaluating a more diverse range of devices. Once its validity has been established, it will be possible to update the system.
Evaluating Mobile Usability: The Role of Fidelity in Full-Scale Laboratory Simulations with Mobile ICT for Hospitals Yngve Dahl1, Ole Andreas Alsos2, and Dag Svanæs2 1
Telenor Research & Innovation, Otto Nielsensvei 12, 7004 Trondheim, Norway [email protected] 2 Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Sælandsvei 7-9, 7491 Trondheim, Norway {Ole.Andreas.Alsos,Dag.Svanes}@idi.ntnu.no
Abstract. We have applied full-scale simulations to evaluate the usability of mobile ICT for hospitals in a realistic but controllable research setting. Designing cost-effective and targeted simulations for such a purpose raises the issue of simulation fidelity. Evaluators need to identify which aspects of the research setting should appear realistic to simulation participants, and which aspects can be removed or represented more abstractly. Drawing on research on training simulations, this paper discusses three interrelated fidelity components: equipment/prototype fidelity, environmental fidelity, and psychological fidelity. These components need to be adjusted according to the design aspects on which evaluators want to gather feedback. We present examples of how we have configured the components in various simulation-based usability assessments of mobile ICT for hospitals. The paper concludes by providing a set of guiding principles concerning the role of fidelity in simulation-based usability evaluations.

Keywords: Clinical information systems, fidelity, evaluation, human factors, mobility, simulation, training simulation, usability, user-centered design.
1 Introduction
As human-computer interaction moves "beyond the desktop" and into highly dynamic work settings, such as hospitals, the old standards for usability testing arguably no longer hold water. Work situations in hospitals often involve mobility and bodily work [2, 3]. This makes the usability of mobile ICT more subject to external factors that are not related to the GUI and the software being evaluated as such. These factors fall beyond what evaluations conducted in conventional usability laboratories can reveal.

We have attempted to meet these challenges by means of full-scale laboratory simulations: simulated, natural-like hospital environments in which nurses and physicians act out clinical scenarios using mock-ups or prototype systems. Such an approach raises the need for evaluators to think about the fidelity, or level of detail, of the research setting. A critical issue for applying simulations as a cost-effective usability evaluation methodology is deciding on the right level of fidelity.

This paper aims first to show that fidelity in usability evaluations of mobile ICT is a concept that extends beyond the software prototype being evaluated. Particularly when addressing hospital settings, where the technology is likely to be used as part of work activities requiring manual labor with hands and feet in addition to high situational awareness, the physical environment and the work tasks become vital components of the total system being simulated. The second objective of this paper is to demonstrate how the test environment, the prototype, and the test scenario can be adjusted to achieve targeted evaluations and to induce behavior among participants that is desirable in terms of informing specific design aspects relevant to the usability of mobile systems in hospitals.
2 Methodological Motivation
The work conditions under which ICT supporting clinical care is used are very different from those of office settings and desktop-based computer interaction. The call for mobile ICT in hospitals essentially stems from the distributed nature of clinical work and from rapidly changing work situations. In order to conduct valid usability assessments of mobile interactive technology for such environments, the design solutions must be evaluated in a relevant use context [4]. Consequently, aspects that are characteristic of clinical work, such as mobility, clinician-patient interaction, and frequent context shifts, must be reflected in the setting in which the evaluation takes place.

Conducting usability evaluations in actual clinical situations is challenging. Firstly, the hospital is a high-risk environment, in which it can be critical to avoid affecting ongoing work. Secondly, patient information confidentiality is likely to prevent video and audio recording. At the same time, conventional laboratories intended for controlled desktop-based usability evaluations are unsuited for reconstructing the rapidly changing conditions of hospital work. The combined need for a realistic yet controllable research setting has motivated us to build a customized full-scale model of a hospital ward section, with advanced video and audio recording facilities.
3 Simulation Fidelity Principles
HCI literature in general provides little practical guidance on how to compose full-scale simulations with the aim of evaluating the usability of mobile ICT. This has motivated us to study the literature on training simulations in search of guiding principles. Given the relatively long history of training simulators, we will first highlight some of the central principles described in the related research literature. Next, we will provide a brief overview of studies within the field of mobile HCI where simulations have been employed to evaluate prototypes and early concepts.

3.1 Simulations Applied for Training Purposes
Within high-risk industries such as aviation, naval shipping, health care, and nuclear power production, there has been a long tradition of using simulations for training purposes in risk-free environments. Obviously, the objectives of training simulations and of simulations for usability purposes differ. Simulations applied in the context of usability assessment aim at gathering data about the effectiveness, efficiency, and user satisfaction of a product used by specific users in a realistic situation; the focus is on product performance. Training simulations, on the other hand, are typically used for educational purposes and for maintaining or enhancing human work-related skills. In this sense, they are human-centric rather than product-centric. Despite these differences in focus, many of the concepts developed from research on training simulations are highly relevant when designing simulation-based usability evaluations.

Among the most central concepts described in the research literature on training simulations are equipment fidelity, environment fidelity, and psychological fidelity [5]. Equipment fidelity refers to the extent to which the appearance and feel of the real tools, devices, or systems that simulation participants operate are duplicated. For example, aircraft cockpit procedures have been trained both with hi-fi representations of aircraft instruments and with lo-fi mock-ups [6]. Environment fidelity concerns the extent to which physical characteristics of the real-world environment (beyond the training devices) are realistically represented in the simulation. High-fidelity aircraft simulators are full-size replicas of cockpits that duplicate the operational aircraft environment and its motions in great detail [7]. In flight training environments of lower fidelity (e.g., desktop environments), visual and motion cues are typically reduced or lacking. Lastly, psychological fidelity relates to the realism of the simulation as perceived by its participants; in other words, it is the extent to which participants are able to engage in the simulated situation as they would have done in the natural setting. This is intimately dependent on the psychological demands the simulated tasks place on the participants. Human perception, attention, decision-making, memory, and action are factors that may influence psychological fidelity [8, p. 420]. Developing scenarios that replicate the task demands of the real-world system is a common technique for enhancing psychological fidelity [5].

In training simulations, psychological fidelity is often considered the most important, because it is the attribute most relevant for learning. Prophet and Boyd [6] found that the transfer of training was equal for students practicing ground cockpit procedures in real airplanes and students practicing the same routines on lo-fi representations of the relevant devices.
Each of the three components described above can be set along a continuum ranging from low to high fidelity (Fig. 1). As pointed out by Beaubien and Baker [5], the level of fidelity for each component typically depends on the purpose of the training simulation. For example, low-fidelity role-plays have been used to train teamwork-related attitudes and skills, while simulations of higher fidelity are required if the goal is to learn the specific consequences of actions. Taken together, equipment, environment, and psychological fidelity form the overall simulation fidelity (Fig. 1).
Fig. 1. Three interrelated simulation fidelity components. Each component can be set along a continuum ranging from low to high fidelity.
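One way to make this three-component view concrete is the small configuration sketch below. The three components and their low-high continua come from the text; the numeric encoding and the class and field names are our illustrative assumptions.

```python
# Sketch: a simulation configuration with three fidelity components, each
# set along a continuum encoded here as 0.0 (low) to 1.0 (high).
from dataclasses import dataclass

@dataclass
class SimulationFidelity:
    prototype: float      # equipment/prototype fidelity
    environment: float    # realism of the physical use setting
    psychological: float  # perceived realism of tasks and situation

    def validate(self) -> None:
        for name, value in vars(self).items():
            assert 0.0 <= value <= 1.0, f"{name} must lie on the low-high continuum"

# E.g., the mixed-fidelity study in Sect. 4.3: hi-fi sensors and a realistic
# ward model, but a deliberately lo-fi GUI (the values are illustrative).
config = SimulationFidelity(prototype=0.5, environment=0.9, psychological=0.8)
config.validate()
```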
3.2 Mapping the Simulation Fidelity Concepts to HCI Terminology
In the context of HCI and usability assessment, we consider equipment fidelity to be the equivalent of computer system or application fidelity (we will refer to this as prototype fidelity). As previously noted, this component encompasses physical appearance, interaction style, and functionality. Environmental fidelity is the realism of the physical use setting (i.e., the point of interaction), while psychological fidelity corresponds to the user-perceived realism of the tasks participants are given as part of a test.

3.3 Simulations Applied in Mobile HCI
The use of simulations in mobile HCI is in many ways a result of the recognition that conventional usability laboratories and testing do not duplicate the factors affecting mobile usability [9, 10]. It can also be considered an efficient approach to overcoming many of the challenges related to studying mobile systems in the field [11]. Examples of usability studies in which contextual features have been simulated in laboratory settings are described in work by Bohnenberger et al. [12], Pirhonen et al. [13], and Kjeldskov and Skov [14]. Some simulation-based usability studies of relevance to mobile ICT in hospital settings can be found in Refs. [4, 15, 16].
4 Applying the Fidelity Principles in Practice In the current section we will provide examples from conducted simulations, showing how we have attempted to carefully adjust simulation fidelity to promote reflection
among participants regarding specific design aspects. Our examples also highlight the close interrelationship between the various simulation fidelity components presented earlier.

4.1 Case Study: Point-of-Care Scenarios
There are many hospital scenarios in which mobile ICT may prove helpful [2]. We have concentrated on a limited set of situations that have been found appropriate for usability testing in our full-scale hospital ward model. In particular, we have focused on situations where nurses and physicians are located at the patient bedside, i.e., at the point of care. Examples of hospital routines where such situations occur include ward rounds, administration of medicine, and responses to patient calls. These situations form suitable test candidates for mobile ICT because they occur frequently in hospital wards, require mobility, and involve personnel who need quick and effortless access to patient-related information.

4.2 Environment Fidelity
As previously pointed out, the circumstances under which mobile ICT supporting hospital work is used are radically different from those of office work and conventional desktop computer interaction. Human-computer interaction in clinical settings is more physical in nature, both in the sense that hospital workers use devices while on the move and in situations requiring physical interaction with patients (e.g., assistance and examination). We have observed that to capture the physical and bodily usability aspects of mobile ICT used at the point of care, prototype solutions must be evaluated in research environments that closely mimic the physical environment of real patient rooms. For example, realistically proportioned rooms furnished with patient beds make it possible for participants to move naturally in the model, both around and between patient beds. For simulations addressing the usability of point-of-care systems this is essential, because it can help give participants a realistic impression of how well different solutions are physically adapted to the care situations. Examples of physical design factors which, according to simulation participants, can increase the usability of mobile ICT at the point of care include the possibility of easily sharing screen content with patients (Fig. 2, left) and the opportunity to have digital media ready at hand, to be used and put aside depending on what the immediate care situation calls for (Fig. 2, right). Both examples illustrate the intimate relationship between the physical environment and the physical placement and form factors of interactive devices.
Fig. 2. Examples showing the close relationship between the physical environment and the physical placement (left) and form factors (right) of digital media
There are often subtle details that need to be in place to capture physical and bodily usability factors. For example, to form a realistic impression of how well an interaction design solution accommodates the dialogue between clinicians and in-bed patients, patients need to be represented by human actors in real hospital beds (Fig. 2, left). Likewise, simulation participants need to wear their daily work uniforms so that they can bring forth or temporarily put aside digital media (Fig. 2, right).

4.3 Prototype Fidelity
The Spectrum of Prototype Fidelity. Prototypes can generally be divided into low, medium, or high fidelity. In low-fidelity prototyping, props (e.g., foam and cardboard models, paper and post-it notes, etc.) are often used as physical representations of interactive devices, with rough sketches of envisioned graphical interfaces. Medium-fidelity prototypes are functional (computer-based) models of systems; they generally have simplified GUIs, but little or no functionality behind the GUI elements. High-fidelity prototypes are sophisticated and functional versions of envisioned designs, and may show sample information content.

Prototype Fidelity for Point-of-Care Scenarios. In our studies of mobile ICT applied in hospital settings we have used prototypes of different fidelities, depending on the design phase. As part of specifying user requirements for mobile ICT for point-of-care usage, we have conducted simulations using lo-fi props in realistic models of patient rooms [17]. The main rationale for this is to put the focus on the context in which the technology will be used, rather than on details concerning GUIs and software functionality. By applying mock-ups one can avoid restricting reflection among participants by committing to particular hardware and interfaces. In simulations with functional prototypes, we have attempted to be more particular about the design aspects we wanted to address. For these simulations we have applied mixed-fidelity prototypes [18], i.e., models that combine different fidelities with regard to GUIs, functionality, interaction styles, and information content. For example, in simulations addressing the usability of location-based access to medical information at the point of care (the full study is described in Ref. [16]), we found it useful
Fig. 3. Mixed-fidelity prototype providing location-based access to a patient's medical record. High-fidelity sensors and radio tags detect the physical position of a test participant and nearby patients. The graphical representation of the medical record, which is automatically retrieved and presented on the bedside terminal, is of low fidelity.
to implement the interaction techniques using high-fidelity hardware and sensors, to give participants a realistic impression of actual use. At the same time, we deliberately kept the detail of the GUI at a low level and avoided linking the prototype to realistic medical sample data. The main rationale for this delimitation is that we primarily wanted to help participants focus on, and give feedback regarding, the usability of the interaction styles rather than GUI-related aspects of the design. Fig. 3 shows the mixed-fidelity prototype that we applied in the example described above.

4.4 Psychological Fidelity
This section gives a brief summary of different techniques we have employed to increase the psychological fidelity of our simulations.

Domain Expertise. As pointed out above, developing scenarios that mimic the task demands of the real-world system can increase psychological fidelity. To facilitate this, and to make sure that the simulations reflected a sufficient degree of realism, the scenarios were designed with assistance from domain experts (physicians and nurses).

Baseline Scenarios. To help participants relate the prototypes and evaluated concepts to their everyday work, we have learned that running an introductory baseline scenario reflecting current paper-based practices is useful (Fig. 4). These "as-is" scenarios can effectively act as a reference or benchmark for the participants when they later act out the same scenario using functional prototypes. This helps participants "make sense" of scenarios involving information media they are not familiar with.
Fig. 4. Baseline scenarios reflecting current paper-based practices (left) can act as references for test subjects when they later try out digital solutions for the same purposes (right)
Targeted Simulations. Because the conducted simulations have mainly focused on evaluating early concepts and partial prototypes, we have tried to tailor the scenarios to promote feedback on particular design aspects. In some cases, this has resulted in delimitations that, depending on the purpose of the simulation, might reduce psychological fidelity. Using only mock-up representations of electronic medical records in some of the simulations, as explained above, is an example of such a limitation of scope. These, however, we consider necessary compromises to achieve targeted evaluations. Feedback from the participants indicated that they found the simulations realistic in spite of such simplifications.

Scripted Simulations. We have also experimented with ways to increase the participants' perceived realism of the simulations by trying to integrate their professional experience. One technique we have applied is to use patient actors instructed to reveal
certain pieces of information during the scenario, but to leave it to the participants (i.e., clinicians) to decide how to act on that information [19]. For other types of simulations we have found it sufficient to give the patient actors a more passive role, e.g., acting as physical markers in the simulations.
5 Setting the Scene Right
A fundamental dilemma related to simulations, whether applied for usability assessment or training purposes, is specifying a sufficient fidelity level a priori. Typically, approximating mobile use contexts involves trade-offs between control over the research setting, realism, and available resources [4]. In the current section we present and discuss some guiding principles for "setting the scene right" in simulation-based usability evaluations of mobile ICT for hospitals.

5.1 Psychological Fidelity First
In Sect. 3.1 we pointed out that psychological fidelity is often considered the most significant fidelity component in training simulations, because it governs the transfer of skills learnt in the simulated setting back into the real world. Based on our experiments, we argue that the same component is also the most critical when designing full-scale simulations for usability assessment purposes, but for different reasons. We see psychological fidelity first and foremost as a key premise for provoking reflection among simulation participants. This is especially valuable in early phases of the design process, before critical design choices are made, and when end-user feedback is most likely to inform the actual design. In order to motivate reflection among simulation participants, it is essential that they see the on-the-job relevance of the simulation. As discussed in Sect. 4, a realistic test environment and prototypes with certain functional features can help evoke the central psychological mechanisms triggered in everyday clinical work.

5.2 How Much Fidelity Is Enough?
In our full-scale simulations we have attempted to follow a "just enough" principle with regard to the amount of realism reflected: duplicating the aspects that we, along with domain experts, consider most likely to affect the perceived usability of the prototypes to be tested. This, as pointed out earlier, depends closely on the objective of the simulation. Because the usability of mobile ICT for hospitals is highly contextual, there is no "one size fits all" approach to the fidelity of simulation-based usability evaluations.
6 Summary and Conclusions
In this paper we have investigated the role of fidelity in full-scale laboratory simulations used for usability assessment of mobile ICT for hospitals. Drawing on training simulation research, we identified three components of relevance: environment fidelity, equipment or prototype fidelity, and psychological fidelity. We have shown
by examples from practical simulations how the different components can be modified to match the focus of different evaluations. The key principles the current paper has suggested regarding the fidelity of simulation-based usability assessments of mobile ICT for hospitals are as follows:

• Simulations need to be specific about the design aspects that are being evaluated. Cost-effective and targeted simulations should replicate the features of the mobile ICT solution, physical environment, and work tasks that are considered relevant for the design aspects one wants to gather feedback on.
• There is no direct correlation between the overall fidelity of experimental simulations and their effectiveness in terms of informing design. Realistic prototypes and work environments may enhance the perceived realism of the simulation, but this does not guarantee that feedback from participants is valuable in terms of informing design.
• Simulations for usability assessment purposes should prioritize psychological fidelity. This component is the most relevant for provoking reflection on design among simulation participants. To provoke such reflections it is essential that participants are able to relate the design concept to their everyday work.
• The requirements on simulation fidelity will typically increase as the mobile ICT solution is developed.

We expect that future simulations in our full-scale ward model will enable us to further explore the role fidelity plays in simulation-based usability evaluations of mobile ICT for clinical work.
Acknowledgements
The current work has been supported in part by the Norwegian Research Council through grant 176761 (POCMAP) of the VerdIKT program, DIPS ASA, The Industrial Research Fund for NTNU, St. Olav University Hospital, Akershus University Hospital, NTNU, and Telenor R&I.
References 1. Rudd, J., Stern, K., Isensee, S.: Low vs. high-fidelity prototyping debate. Interactions 3, 76–85 (1996) 2. Sørby, I.D., Melby, L., Nytrø, Ø.: Characterising cooperation in the ward: framework for producing requirements to mobile electronic healthcare records. International Journal of Healthcare Technology and Management 7, 506–521 (2006) 3. Bardram, J.E., Bossen, C.: Mobility Work: The Spatial Dimension of Collaboration at a Hospital. Computer Supported Cooperative Work 14, 131–160 (2005) 4. Kjeldskov, J., Skov, M.B., Als, B.S., Høegh, R.T.: Is It Worth the Hassle? Exploring the Added Value of Evaluating the Usability of Context-Aware Mobile Systems in the Field. In: Mobile HCI 2004, pp. 61–73 (2004) 5. Beaubien, J.M., Baker, D.P.: The use of simulation for training teamwork skills in health care: how low can you go? Qual Saf Health Care 13 (2004)
6. Prophet, W.W., Boyd, H.A.: Device-Task Fidelity and Transfer of Training: Aircraft Cockpit Procedures Training. Tech. Report. Human Resources Research Organization, Alexandria, VA (1970)
7. Rehmann, A., Mitman, R., Reynolds, M.: A handbook of flight simulation fidelity requirements for human factors research. Tech. Report No. DOT/FAA/CT-TN95/46. Wright-Patterson AFB, OH: Crew Systems Ergonomics Information Analysis Center (1995)
8. Patrick, J.: Training. In: Tsang, P.S., Vidulich, M.A. (eds.) Principles and Practice of Aviation Psychology. CRC Press, Boca Raton (2002)
9. Johnson, P.: Usability and mobility; Interactions on the move. In: Proceedings of the First Workshop on Human-Computer Interaction with Mobile Devices (1998)
10. Graham, R., Carter, C.: Comparison of speech input and manual control of in-car devices while on-the-move. In: Mobile HCI 1999 (1999)
11. Pascoe, J., Ryan, N., Morse, D.: Using while moving: HCI issues in fieldwork environments. Transactions on Computer-Human Interaction 7, 417–437 (2000)
12. Bohnenberger, T., Jameson, A., Krüger, A., Butz, A.: Location-aware shopping assistance: Evaluation of a decision-theoretic approach. In: Paternò, F. (ed.) Mobile HCI 2002. LNCS, vol. 2411, pp. 155–169. Springer, Heidelberg (2002)
13. Pirhonen, A., Brewster, S., Holguin, C.: Gestural and Audio Metaphors as a Means of Control for Mobile Devices. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2002)
14. Kjeldskov, J., Skov, M.B.: Creating Realistic Laboratory Settings: Comparative Studies of Three Think-Aloud Usability Evaluations of a Mobile System. In: Interact 2003, pp. 663–670. IOS Press, Amsterdam (2003)
15. Alsos, O.A., Svanæs, D.: Interaction techniques for using handhelds and PCs together in a clinical setting. In: NordiCHI 2006, Oslo, Norway, pp. 125–134. ACM, New York (2006)
16. Dahl, Y., Svanæs, D.: A comparison of location and token-based interaction techniques for point-of-care access to medical information. Personal and Ubiquitous Computing 12, 459–478 (2008)
17. Svanæs, D., Seland, G.: Putting the users center stage: role playing and low-fi prototyping enable end users to design mobile systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vienna, Austria. ACM, New York (2004)
18. Petrie, J.N., Schneider, K.A.: Mixed-fidelity prototyping of user interfaces. In: Doherty, G., Blandford, A. (eds.) DSVIS 2006. LNCS, vol. 4323, pp. 199–212. Springer, Heidelberg (2007)
19. Alsos, O.A., Dabelow, B.: Stylus, Finger, or Buttons? A Comparative Evaluation Study of Interaction Techniques for PDAs in Point-of-Care Situations (submitted manuscript)
A Multidimensional Approach for the Evaluation of Mobile Application User Interfaces José Eustáquio Rangel de Queiroz and Danilo de Sousa Ferreira Federal University of Campina Grande, Electrical Engineering and Computer Science Center – Computer Science Department, Av. Aprígio Veloso, s/n – Bodocongó, Campina Grande, CEP 58109-970, Paraíba, Brazil {rangel,danilo}@dsc.ufcg.edu.br, {rangeldequeiroz,danilo.sousa}@gmail.com
Abstract. This paper focuses on a hybrid approach for the evaluation of mobile application UIs, based upon a set of well-known techniques for usability evaluation. Two perspectives on the problem are considered: (i) the user's perspective, expressed by the user's perception of the application; and (ii) the specialist's perspective, expressed by his/her considerations from the point of view of the user-application interaction and of the HCI community as well. Comparisons between lab and field evaluation approaches are also given for a case study involving an Internet tablet. Conclusions are drawn concerning how to apply the experience acquired in evaluating conventional UIs to the mobile technology domain.

Keywords: Usability evaluation, mobile devices, multidimensional approach.
levels of choice and for a successful choice, it seems equally essential to know about the effectiveness of the chosen approach. Practitioners need to know which methods are more effective and in what ways and for what purposes. Otherwise the evaluation process may result in a big effort with a small payoff.
2 Usability Evaluation for Mobile Devices
Usability data typically consist of any kind of information that can be used as measures or identification keys for factors affecting the usability of a system. Such data are collected by usability evaluation methods and techniques that can assign values to usability dimensions for evaluating different kinds of UI [1] and/or indicate usability problems or other design deficiencies in a UI [2]. Usability data are usually gathered via either analytic or empirical methods [2][3]. Analytic or expert-based methods are often conducted by HCI experts and do not involve human participants performing the tasks, i.e., they rely on the specialists' judgment. Empirical or user-based methods, in spite of also being conducted by HCI experts, involve the collection of human usage data.

Usability diagnosis basically begins with raw observational data, often categorized into models/frameworks emphasizing either (i) the nature/fidelity of the artifact being evaluated; (ii) the context of use (involving user and social relations, tasks and psychological factors, and environmental aspects); (iii) the approach adopted for capturing the data (including the expended resources, the involved degree of formality and rigor, and the number of designers/evaluators and users); or (iv) the goal of the collection effort [4][5][6]. Some of those models and frameworks are aligned with the ISO usability dimensions [7], which are commonly taken to include efficiency, effectiveness, and subjective satisfaction in an HCI process.

It is undeniable that the usability evaluation effort for desktop systems has grown, especially in the last decade. In spite of debates still taking place within the HCI area, they are often based on a tacit understanding of basic concepts. Extensive guidelines have been written describing how usability evaluation in controlled environments should be conducted (e.g., [3][8]), and experimental results highlighting the pros and cons of different techniques are available (e.g., [9]).

Especially in the past decade, technological advances and methodological approaches in HCI have been challenged by the growing focus on applications for mobile computing devices. Several authors (e.g., [10][11]) argue that mobile computing demands not only real users but also a real or simulated context, with device interaction tasks as well as real tasks or realistic task simulations. The question of whether mobile device evaluation should be carried out in a lab or field context has also been discussed (e.g., [9][12]), the effectiveness of the approach depending on the relevance of the results presented and on the quality of the data analysis process. However, despite presenting data analysis results, the reports usually omit important details of the data gathering and analysis process; such details could guide choices and give a comprehensive view of the approach. While a strong effort of HCI research has been devoted to alternatives for data collection, data analysis/validation is presented only in rare cases (e.g., [3][12]). In consequence, the evaluator is unable to replicate the reported findings appropriately and successfully in other contexts. As for empirical data analysis, many methods and techniques have been employed for field testing data, video data, expert data, or
head-mounted video and cued recall [3][13][14]. In essence, the usual method triangulation seems to be field testing, with or without video analysis, and transcriptions of usability test sessions. The absence of in-depth usage data analysis seems to be due to the fact that it is often not applicable for industrial purposes because of several constraints. Nonetheless, for research purposes it is strongly recommended to provide sufficient detail to allow for replication.
3 The Multidimensional Evaluation Approach
The present approach was originally proposed for evaluating desktop application UIs [15], and further adapted to evaluate the usability of mobile application UIs [16]. It is based upon a hybrid strategy which encompasses the best features of: (i) standards inspection; (ii) user performance measurement; and (iii) user inquiry. It rests on the premises that (i) each evaluation technique provides a different level of information, which helps the evaluator identify usability problems from a specific point of view; and (ii) triangulation can be used to compare the data collected from the various techniques with the aim of producing complementary and more robust results.

3.1 Product Standard Conformity Assessment, User Performance Measurement and User Subjective Satisfaction Measurement
According to [7], conformity assessment means checking whether products, services, materials, processes, systems, and personnel measure up to the requirements of standards. For conformity assessment, the desktop version of the multidimensional approach adopts the standard ISO 9241 (Ergonomic Requirements for Office Work with Visual Display Terminals). In its mobile application UI evaluation version, and more specifically for the Internet tablet case study presented in this paper, it was found that only some parts of ISO 9241 could be applied: Part 14 [17], Part 16 [18], and Part 17 [19]. Some other standards applicable to this kind of device were also used, such as ISO/IEC 14754 [20] and ISO/IEC 24755 [21].

In general, user performance measurement aims to enable real-time monitoring of user activities, providing data on the effectiveness and efficiency of the user's interaction with a product. It also enables comparisons with similar products, or with previous versions of the same product along its development lifecycle, highlighting areas where the product usability can be improved. When combined with the other methods, it can provide a more comprehensive view of the usability of a system. The major change introduced into the original evaluation approach concerns the introduction of field tests as a complement to the original lab tests.

The measurement of user subjective satisfaction has been widely adopted as a measure of IS success, and has been the subject of a number of studies since the 1980s (e.g., [22][23]). User satisfaction diagnosis provides insight into the level of user satisfaction with the product, highlighting the relevance of the problems found and their impact on product acceptance. In this approach, user subjective satisfaction data are gathered by three methods: (i) automated questionnaires administered before and after test sessions; (ii) informal think-aloud trials performed during test sessions; and (iii) unstructured interviews conducted at the end of test sessions.
In essence, ISO defines usability as the extent to which a product can be used by specified users, in a specified context of use, to achieve specified goals with effectiveness, efficiency and satisfaction [7]. It also states that at least one indicator for each of these aspects should be measured to determine the level of usability achieved. As briefly exposed in this section, the multidimensional approach presented here meets the requirements set by ISO 9241-11 because it uses: (i) the task execution time as an efficiency indicator; (ii) the number of incorrect actions, the number of incorrect choices, the number of repeated errors, and the number of accesses to the online/printed help as effectiveness indicators; and (iii) the think-aloud comments, the unstructured interview responses, and the questionnaire scores as subjective satisfaction indicators.
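This grouping of indicators can be summarized in a small sketch; the dictionary form and the completeness check are our illustrative assumptions, while the indicator names come directly from the text.

```python
# The approach's indicators grouped by the three ISO 9241-11 usability aspects.
iso_9241_11_indicators = {
    "efficiency": ["task execution time"],
    "effectiveness": [
        "incorrect actions", "incorrect choices",
        "repeated errors", "accesses to online/printed help",
    ],
    "satisfaction": [
        "think-aloud comments", "unstructured interview responses",
        "questionnaire scores",
    ],
}

# ISO 9241-11 asks for at least one measured indicator per aspect:
assert all(len(v) >= 1 for v in iso_9241_11_indicators.values())
```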
4 Comparative Study of Lab versus Field Use of an Internet Tablet
The main objective of this study was to investigate the need for adapting the original evaluation approach to the context of mobile UI applications, based on an analysis of the influence of the context: lab versus field, mobile versus stationary interaction.

4.1 Experiment Design
The experiment was designed to investigate the influence of the context (field and lab, and related aspects, e.g., mobility and settings) and of the user experience on the evaluation results. Consequently, independent and dependent variables were chosen, and objective and subjective usability indicators were defined.

The independent variables chosen were: (i) Task context, which comprised external factors (e.g., noise level and light intensity) and internal factors (e.g., stress or other health conditions) that could affect the user behavior and performance during the usability test; (ii) User mobility, which referred to the conditions under which the task was being performed (e.g., moving between places or standing still while working with the device); and (iii) User experience level, which referred to the user's knowledge of mobile devices and desktop computers in general.

The dependent variables chosen were: (i) Task execution time (time taken by a device user to perform a task); (ii) Number of incorrect choices (number of times the user made incorrect choices while selecting menu options in the interface); (iii) Number of incorrect actions (number of times the user performed incorrect actions while selecting menu options in the interface, excluding incorrect menu choices); (iv) Number of repeated errors (number of times the same error was made by the user while performing a task, excluding the number of incorrect choices); (v) Number of accesses to the online/printed help (number of times the user accessed the online and/or printed help while performing a task); (vi) Perceived usefulness (user opinion about the usefulness of the mobile application for the prescribed task); and (vii) Perceived ease of use (user subjective satisfaction when using the mobile device).

The chosen objective usability indicators were: (i) Task execution time; (ii) Number of incorrect actions; (iii) Number of incorrect choices; (iv) Number of repeated errors; and (v) Number of accesses to the online help and/or printed manuals. Additionally, the chosen subjective usability indicators were: (i) Product ease of use; (ii)
Task completion easiness; (iii) Input mechanism ease of use; (iv) Text input modes ease of use; (v) Ease of understanding terms and labels; (vi) Ease of understanding messages; and (vii) Ease of use of help instructions.

4.2 Test Environment, Materials and Participants
In both realistic test environments, all the elements (e.g., tasks, informal think-aloud, unstructured interviews) were identical; only the test environment itself was different. The lab test was conducted in a typical usability lab, while the field test was conducted in an environment in which users could walk, stand still, sit or do whatever they would normally do while performing their tasks. To minimize moderator bias, the tests were conducted by three experienced usability practitioners with 3 to 12 years of experience in usability testing, and the instructions given to participants were predefined. All moderators participated in data gathering and data analysis. The statistical analysis was performed by one moderator and revised by another.

The mobile device chosen for this experiment was the Nokia 770 Internet Tablet with some of its native applications. Tests were performed both in a controlled environment (a usability lab) and in the field. In the lab, the interaction was recorded using three cameras installed in the room: one focused on the user's facial expressions, and a second, wider one focused on the table where the user performed the test tasks with the device either fixed to the table or free in his/her hand. In the field experiment, a micro-camera connected to a transmitter was coupled to the device to remotely record and transmit user-device interaction data to the lab through a wireless connection (see Fig. 1).
Fig. 1. Apparatus to support the video micro-camera
Additionally, remote screen capture software (VNC) was used to take screenshots of test sessions, and a Web tool named WebQuest [24] was used to support the user subjective satisfaction measurement. WebQuest supports the specialist during data collection, computes scores automatically, performs statistical analyses, and generates graphical results. Currently WebQuest supports two questionnaires: (i) a pre-test questionnaire, USer (User Sketcher), conceived to draw the profile of the system users; and (ii) a post-test questionnaire, USE (User Satisfaction Enquirer), conceived to assess the user's degree of satisfaction with the system.

The participants were divided into two groups of 20 for the field and lab tests. According to their experience levels, both groups were then subdivided into three subgroups, adopting a ratio of 8 beginners to 8 intermediates to 4 experts. For the lab
tests only, the 20 participants were subdivided again into two subgroups of 10, to perform the task script with the device fixed on the table or free in their hands. A ratio of 4 beginners to 4 intermediates to 2 experts was adopted for each subgroup.

4.3 Experimental Procedure
Observation and retrospective audio/video analysis were employed for quantitative and qualitative data. Participants were required to provide written consent to be filmed prior to, during and immediately after the test sessions, and to permit the use of their images/sound for research purposes without limitation or additional compensation. In turn, the evaluation team committed itself not to disclose user performance or other personal information.

Following the approach, the first step consisted in defining the evaluation scope for the product and designing a test task scenario, in which the target problems addressed were related to: (i) the shape/dimensions of the device; (ii) the mechanisms for information input/output; (iii) the processing power; (iv) the navigation between functions; and (v) the legibility of information. Since the test objectives focused on (i) investigating the target problems and (ii) detecting other problems affecting usability, a basic but representative set of test tasks was selected and implemented. The test tasks consisted of (i) initializing the device; (ii) searching for books in an online store; (iii) visualizing a PDF file; (iv) entering textual information; (v) using the e-mail client; and (vi) using the audio player.

After planning, two pilot tests (lab and field) were conducted to verify the adequacy of the experimental procedure, materials, and environment. To prevent user tiredness, the session time was limited to 60 minutes, and the test scenario was re-dimensioned to six tasks. Thus, each test session consisted of (i) introducing the user to the test environment by explaining the test purpose and procedure; (ii) applying the pre-test questionnaire (USer); (iii) performing the six-task script; (iv) applying the post-test questionnaire (USE); and (v) conducting an unstructured interview. For the participants who declared not having had any previous contact with the Internet tablet, an introductory explanation of the device's I/O modes and main resources was given, considering that, at the time of the experiment, the device was not yet widespread in Brazil.
5 Results
Conformity assessment results can be summarized by computing an Adherence Rating (AR), which is the percentage of the Applicable recommendations (Ar) that were Successfully adhered to (Sar) [17]. The results of the conformity assessment are summarized in Table 1. As can be observed, all the ARs, except the one related to ISO 14754, are higher than 75%, which means successful results. As for ISO 14754, the result indicates the need to improve text input via handwriting recognition. These results corroborate the idea that the efficacy of standards inspection can be considerably improved if it is based upon standards conceived specifically for mobile devices, which could reveal more usability problems.
Table 1. Nokia 770 conformity assessment with standards

STANDARD            #Sar   #Ar   AR (%)
ISO 9241 Part 14    45     53    84.9
ISO 9241 Part 16    26     33    78.8
ISO 9241 Part 17    47     52    90.4
ISO 14754           4      11    36.4
ISO 24755           6      7     85.7
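As a sanity check on Table 1, the AR definition above can be computed directly; the two-line helper below is ours, not part of the study's tooling.

```python
# Adherence Rating: percentage of applicable recommendations (Ar)
# that were successfully adhered to (Sar).
def adherence_rating(sar: int, ar: int) -> float:
    return 100.0 * sar / ar

print(round(adherence_rating(45, 53), 1))  # ISO 9241 Part 14 -> 84.9
print(round(adherence_rating(4, 11), 1))   # ISO 14754 -> 36.4
```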
As for the user subjective satisfaction measurement, both the questions and the answers of the post-test questionnaire (USE) were configured in advance. The questionnaire was applied soon after the usability test and answered on the mobile device itself, with the purpose of collecting information on the user's degree of satisfaction with the device by means of 38 questions about menu items, navigation cues, understandability of messages, ease of use of functions, I/O mechanisms, online help and printed manuals, the user's impression, and the product acceptance level.

With the support of the pre-test questionnaire (USer), the user sample profile was drawn. The sample was composed of 16 male and 24 female users, of whom 16 were undergraduate students, 17 were post-graduate students, 5 had a graduate degree and 2 had a post-graduate degree. Ages varied between 18 and 29 years. The users were mainly right-handed and used some sort of reading aid (glasses or lenses). All of them had at least one year of previous experience with computer systems, were currently using computers on a daily basis, and had previous experience with mobile devices.

The ranges for the USE normalized user satisfaction are 0.67 to 1.00 (Extremely satisfied), 0.33 to 0.66 (Very satisfied), 0.01 to 0.32 (Fairly satisfied), 0.00 (Neither satisfied nor unsatisfied), -0.01 to -0.32 (Fairly dissatisfied), -0.33 to -0.66 (Very dissatisfied), and -0.67 to -1.00 (Extremely dissatisfied). The normalized user satisfaction achieved was 0.330 (Very satisfied) for the lab experiment, and 0.237 (Fairly satisfied) for the field experiment.

During the test sessions, 23 usability problems were identified: 21 problems (91.3%) were detected in the lab experiment, while 14 (60.8%) were found in the field experiment. On the other hand, 12 problems (60.0%) were found with the device fixed on the table, while 15 (75.0%) were identified with the device free, in the user's hands. Since the multidimensional approach is based upon the triangulation of results, Table 2 summarizes the usability problem categories identified during the evaluation process. Some of the usability problem categories were more associated with the performance measurement (e.g., hardware aspects, help mechanisms), whereas others (e.g., menu navigation, presentation of menu options) were identified by the conformity assessment. Combining the results of the post-test questionnaire with the comments made during the test sessions and the unstructured interviews at the end of each session showed that the user opinion was sometimes in agreement (e.g., location and sequence of menu options) and sometimes in disagreement (e.g., menu navigation) with the results obtained from the other two evaluation techniques. This discrepancy can originate from the users' perception of product quality, and from the perception of their own skills to perform the task.
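The mapping from a normalized USE score to the satisfaction labels listed above is mechanical; the sketch below reproduces it, with the exact boundary handling being our assumption.

```python
# Map a normalized USE satisfaction score (-1.0 .. 1.0) to the labels above.
def satisfaction_label(score: float) -> str:
    bands = [(0.67, "Extremely"), (0.33, "Very"), (0.01, "Fairly")]
    for threshold, word in bands:
        if abs(score) >= threshold:
            return f"{word} {'satisfied' if score > 0 else 'dissatisfied'}"
    return "Neither satisfied nor unsatisfied"

print(satisfaction_label(0.330))  # lab experiment   -> Very satisfied
print(satisfaction_label(0.237))  # field experiment -> Fairly satisfied
```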
Table 2. Overlay of results obtained from the different techniques described above

Problem categories: Location and sequence of menu options; Menu navigation; Presentation of menu options; Information feedback; Object manipulation; Symbols and icons; Text entry via stylus (writing recognition); Text entry via virtual keyboard; Processing power; Hardware issues; Fluent task execution; Online and offline help; Form manipulation
The statistical analysis consisted of: (1) building a report with univariate statistics; (2) generating the covariance matrices for the predefined objective and subjective indicators; (3) applying one-way F ANOVA tests to the data from the previous step in order to investigate possible differences; and (4) applying the Tukey-Kramer procedure to the one-way F ANOVA results to investigate whether the differences found were statistically significant enough to support inferences from the selected sample.

According to the results (see Table 3), the series of two-factor ANOVAs involving Time, Errors (Incorrect actions, Incorrect choices, and Repeated errors), and Help accesses showed that the user experience level had a more significant effect on the number of incorrect choices in the field experiment than in the lab experiment. The pre- and post-test questionnaire analyses and the informal interview results reinforced that domain knowledge and computer literacy have a significant influence on user performance concerning the incidence of errors, both in the lab and in the field.

Table 3. Lab x Field and Fixed x Free experiment results
p-Values (α = 0.05)

Variable Pair                      Lab      Field    Fixed    Free
Experience x Task Time             0.019    0.056    0.025    0.026
Experience x Incorrect Actions     0.003    0.003    0.001    0.043
Experience x Incorrect Choices     0.049    0.0006   0.164    0.270
Experience x Repeated Errors       0.017    0.127    0.194    0.133
Experience x Help Accesses         0.164    0.563    0.148    -
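For readers who want to reproduce this kind of analysis, a minimal sketch of steps (3) and (4) is given below; the grouping variable and the example task times are hypothetical, and only the general pipeline follows the text:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical task times (seconds) grouped by user experience level.
novice       = [62.1, 58.4, 71.0, 66.3]
intermediate = [49.7, 53.2, 47.8, 51.1]
expert       = [38.5, 41.2, 36.9, 40.0]

# Step (3): one-way ANOVA F test across experience levels.
f_stat, p_value = f_oneway(novice, intermediate, expert)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Step (4): Tukey-Kramer post-hoc test to locate the significant pairs.
times = np.concatenate([novice, intermediate, expert])
levels = (["novice"] * len(novice)
          + ["intermediate"] * len(intermediate)
          + ["expert"] * len(expert))
print(pairwise_tukeyhsd(times, levels, alpha=0.05))
```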
6 Final Considerations

Studies in the literature fall basically into two categories: (i) user mobility while using the device, inside a lab or outdoors; and (ii) user distraction in pervasive computing. This study considered both aspects as part of the task context. In the field test, subjects were free to choose between moving or remaining still as they performed the task with the mobile device. The movement registered was limited to situations in which the user waited for some device processing (e.g., web page downloads). During the field tests, while the user was moving, there was a clear interference of the environment on the user's attention. Outdoors, in ambient light, the legibility of the device's screen was reduced by glare and reflections. Although the users' opinion was that the camera apparatus did not interfere with task execution, the vast majority decided to lay the device down during task execution.

Confirming previous findings, the experiments demonstrated that applications that require a lot of interaction and user attention are inappropriate for use while walking, due to attention distraction. This reinforces that, in spite of the mobility of the device targeted in this study, the evaluation settings did not need to differ substantially from those employed in the evaluation of stationary devices, since users tend not to wander while performing tasks that demand their attention.

Recently, studies have been published which deal with new paradigms and evaluation techniques for mobile devices. Few of the proposed new techniques are really innovative compared to those traditionally employed. The data gathered and analyzed support the initial assumption that minor adaptations to the traditional evaluation techniques and their respective settings are adequate to accommodate the evaluation of the category of mobile devices targeted by this study.

The above comments corroborate the views of the authors and of [15] that laboratory and field evaluations do not diverge but are complementary. As shown in this study, they both add to the evaluation process, producing data that is significant to it and reinforcing the relevance of a multidimensional approach to mobile device usability evaluation.
References

1. Rosson, M.B., Carroll, J.M.: Usability Engineering: Scenario-Based Development of Human-Computer Interaction. Academic Press, San Diego, CA (2002)
2. Hartson, H.R., Andre, T.S., Williges, R.C.: Criteria for evaluating usability evaluation methods. IJHCI 15(1), 145–181 (2003)
3. Nielsen, J.: Usability Engineering. Academic Press, Boston (1993)
4. Wixon, D., Wilson, C.: The usability engineering framework for product design and evaluation. In: Helander, M., Landauer, T.K., Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd edn., pp. 653–688. John Wiley and Sons, Chichester (1997)
5. Jones, M., Marsden, G.: Mobile Interaction Design. John Wiley and Sons, Inc., Chichester, West Sussex (2006)
6. Danielson, D.R.: Usability data quality. In: Ghaoui, C. (ed.) Encyclopedia of Human-Computer Interaction, pp. 661–667. Idea Group Reference (2006)
7. ISO 9241-11: Ergonomic requirements for office work with visual display terminals (VDTs) - Part 11: Guidance on usability. International Organization for Standardization, Geneva, Switzerland (1998)
8. Dumas, J.S., Loring, B.A.: Moderating Usability Tests: Principles and Practices for Interacting, illustrated edn. Morgan Kaufmann, San Francisco (2008)
9. Kjeldskov, J., Stage, J.: New techniques for usability evaluation of mobile systems. IJHCI 60(5-6), 599–620 (2004)
10. Ballard, B.: Designing the Mobile User Experience. John Wiley and Sons, Chichester (2007)
11. Goren-Bar, D., Graziola, I., Pianesi, F., Zancanaro, M., Rocchi, C.: Innovative Approaches for Evaluating Adaptive Mobile Museum Guides. In: Stock, O., Zancanaro, M. (eds.) PEACH - Intelligent Interfaces for Museum Visits, pp. 245–265. Springer, Heidelberg (2007)
12. Po, S., Howard, S., Vetere, F., Skov, M.B.: Heuristic evaluation and mobile usability: Bridging the realism gap. In: Proceedings of Mobile HCI, pp. 49–60 (2003)
13. Sanderson, P., Fisher, C.: Usability testing of mobile applications: A comparison between lab and field testing. Human-Computer Interaction 9, 251–317 (1994)
14. Omodei, M.A., Wearing, J., McLennan, J.P.: Head-mounted video and cued recall: A minimally reactive methodology for understanding, detecting and preventing error in the control of complex systems. In: Proceedings of the 21st European Annual Conference of Human Decision Making and Control (2002)
15. de Queiroz, J.E.R.: Abordagem Híbrida para avaliação da usabilidade de interfaces com o usuário. Tese de Doutorado, UFPB, Brazil, p. 410 (2001) (in Portuguese)
16. Turnell, M.F.Q.V., de Queiroz, J.E.R., Ferreira, D.S.: Multilayered Approach to Evaluate Mobile User Interfaces. In: Lumsden, J. (ed.) Handbook of Research on User Interface Design and Evaluation for Mobile Technology, vol. 1, pp. 847–862. IGI Global (2008)
17. ISO 9241-14: Ergonomic requirements for office work with visual display terminals (VDTs) - Part 14: Menu dialogues. ISO, Geneva, Switzerland (1997)
18. ISO 9241-16: Ergonomic requirements for office work with visual display terminals (VDTs) - Part 16: Direct manipulation dialogues. ISO, Geneva, Switzerland (1999)
19. ISO 9241-17: Ergonomic requirements for office work with visual display terminals (VDTs) - Part 17: Form filling dialogues. ISO, Geneva, Switzerland (1998)
20. ISO/IEC 14754: Information technology - Pen-based interfaces - Common gestures for text editing with pen-based systems. ISO, Geneva, Switzerland (1999)
21. ISO/IEC 24755: Information technology - Screen icons and symbols for personal mobile communication devices. ISO, Geneva, Switzerland (2007)
22. Bailey, J.E., Pearson, S.W.: Development of a Tool for Measuring and Analyzing Computer User Satisfaction. Management Science 29(5), 530–545 (1983)
23. Aladwani, A.M., Palvia, P.C.: Developing and validating an instrument for measuring user-perceived Web quality. Information & Management 39, 467–476 (2002)
24. De Oliveira, R.C.L., de Queiroz, J.E.R., Vieira Turnell, M.F.Q.: WebQuest: A Configurable Web Tool to Prospect the User Profile and User Subjective Satisfaction. In: Salvendy, G. (ed.) Proceedings of the 2005 Human-Computer Interaction Conference, vol. 2. Lawrence Erlbaum Associates, Nevada (2005) (CD-ROM, multi-platform)
Development of Quantitative Usability Evaluation Method

Shin'ichi Fukuzumi 1, Teruya Ikegami 1, and Hidehiko Okada 2

1 NEC Corporation, Common Platform Software Research Laboratories, 7-1, Shiba 5-chome, Minato-ku, Tokyo 108-8001, Japan
{s-fukuzumi@aj,t-ikegami@ct}.jp.nec.com
2 Kyoto Sangyo University, Kamigamo Motoyama, Kita-ku, Kyoto 603-8555, Japan
[email protected]
Abstract. A variety of evaluation methods are practiced in order to improve the usability of computer systems and make them more appealing. The authors have developed a quantitative usability evaluation method that uses a checklist outlining an evaluation procedure and clarifying judging standards. This paper describes this quantitative usability evaluation method, which is not influenced by an evaluator's subjective impression. Such clear and precise definitions make checklist-based evaluations more repeatable (and thus more reliable) and less affected by differences among evaluators. The effectiveness of our checklist has been evaluated in experiments with novice and experienced evaluators. This article reports the method and results of the experiments.

Keywords: Usability, evaluation, checklist.
1 Introduction

A usability evaluation method using a checklist [1], which is typical in usability evaluations [2], can be applied in the later stages of a development process [3]. However, obtaining justifiable evaluation results is difficult because the results depend on the evaluators' skill, experience, and subjectivity. To solve this problem, we have developed a usability checklist that minimizes the deviation of evaluation results in order to realize usability quantification [4-5]. In this paper, we introduce this checklist, validate it, and apply it to system operation products.
The authors made sure that each item could be judged as "Has a problem", "No problems", or "Irrelevant" by clarifying its procedure, its evaluation target, and its gauge. Moreover, to prevent evaluators' differing understandings and interpretations from blurring the results, samples in the checklist and a collection of terminology definitions were also prepared.

Visualization of the effect on the user. An evaluation axis often consists of elements directly connected with design and development, such as a layout or a button, because it is assumed that UI design specialists and developers generally use a checklist; the effect on the user is therefore hard to determine. To measure the degree to which each item of the checklist satisfies the user, it is important to weight the qualities of each item. The authors weighted the items using the analytic hierarchy process (AHP) method [7], and evaluation results were expressed on the basis of four qualities: "efficiency", "ease to learn", "errors", and "ease to memorize".

2.2 Maintenance of a Checklist

Selection of the items. The authors referred to various standards and guidelines, made a rough draft of the checklist, selected items, improved them through verification, and then evaluated them. Moreover, the AHP method of weighting the items was contemplated, and the chapter structure of the items was decided so that no part of the hierarchy became too deep (Table 1).

Table 1. Chapters of the checklist
No.  Section name                               Number of items
1    Consistency of indication/operation        17
2    Legibility of information                  8
3    Presentation of the present state          22
4    Conformability to the user/environment     18
5    Conformability to the work                 19

Fig. 1. Items of the checklist (each item comprises its content, procedure, target, and weight)
Procedure of an evaluation. This checklist consists of five sections and 84 items and is equipped with an "evaluation procedure", an "evaluation target", and "the weight (4 axes)" for every item (Figure 1). The flow of the evaluation was turned into an explicit procedure, and the gauge and the result of each step were described clearly to make sure that evaluation results could be judged correctly for each designated evaluation target. When a button or pull-down menu was right-clicked and the appropriate operation was performed, the judgment result was "No problems"; when the appropriate operation was not performed, the result was "Has a problem". Additionally, when an item does not really apply, for instance because a precondition is not satisfied or because the behavior depends on customization, it is judged "Irrelevant". Figure 2 shows an item of the checklist and an example case.
Fig. 2. Item of the checklist (details and example): evaluation item, evaluation target, evaluation procedure, and an example
Weight of items. To decide the weight of each item, the AHP method was applied. The feature of the AHP method is that it applies paired comparisons to evaluation targets according to given gauges. This method yields weights of higher validity than deciding the weight of each element holistically. Of the five usability attributes that Nielsen advocates [1], the authors chose four, "ease to learn", "errors", "ease to memorize" and "efficiency", as gauges, and decided the weight of each of these qualities for every item. The values of the paired comparisons of the items were decided in a conference of three user interface specialists.
3 Effectiveness Evaluation of Checklist

3.1 Experiment Method

Evaluation targets. Three or five GUI windows of an e-mail application were selected as evaluation targets.

Participants (Evaluators). In total, 50 people participated in this experiment. Of these, 30 were novices: college students without much experience or understanding of software usability. The remaining 20 participants were expert evaluators who had experience and understanding of software usability; they were researchers employed by a company.

Procedure. In accordance with the gauges described in Section 2, participants evaluated some of the GUI windows prepared as evaluation targets, as explained above. By comparing the results reported by novices and experts, it can be verified whether novice participants can obtain the same results as experts.

3.2 Experimental Result

Each evaluation result was judged as "Has a problem", "No problems", or "Irrelevant". By comparing results, the possibility of both experts and novices obtaining the same results was tested. As an index of the degree of agreement between expert and novice results, the concordance rate is defined as follows [4]:

Concordance rate (%) = 100 x (number of novices whose results agreed with those of an expert) / (number of novices)

The average concordance rate obtained in this experiment was 73.75%. The average concordance rates of "Has a problem", "No problems", and "Irrelevant" are shown in Table 2.

Table 2. Average concordance rate
Evaluation result    Concordance rate
Has a problem        50.6%
No problems          78.7%
Irrelevant           80.6%
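As a minimal illustration of the concordance-rate formula above (the judgment labels follow the paper; the per-item novice results are hypothetical):

```python
def concordance_rate(novice_results, expert_result):
    """Percentage of novices whose judgment agreed with the expert's."""
    agreed = sum(1 for r in novice_results if r == expert_result)
    return 100.0 * agreed / len(novice_results)

# Hypothetical judgments of one checklist item by ten novices,
# compared against the expert judgment "No problems".
novices = ["No problems"] * 8 + ["Has a problem", "Irrelevant"]
print(concordance_rate(novices, "No problems"))  # 80.0
```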
3.3 Discussion

As shown in Table 2, the concordance rates of "No problems" and "Irrelevant" are relatively high, while that of "Has a problem" is relatively low. That is, when a problem existed, novice evaluators often overlooked it. However, a concordance rate of 50.6% means that the probability that at least one out of n evaluators agrees with the expert is 1 - (1 - 0.506)^n, which is sufficiently large: 94.0% at n = 4 and 87.9% at n = 3. From this, a correct result can be expected when three or more people evaluate an item.
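The probability argument above can be checked directly; the following one-line calculation uses only the figures quoted in the text (the function name is ours):

```python
def prob_at_least_one_agrees(rate: float, n: int) -> float:
    """Probability that at least one of n independent novice evaluators
    agrees with the expert, given a per-evaluator concordance rate."""
    return 1.0 - (1.0 - rate) ** n

print(f"{prob_at_least_one_agrees(0.506, 3):.1%}")  # 87.9%
print(f"{prob_at_least_one_agrees(0.506, 4):.1%}")  # 94.0%
```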
4 Practical Use of the Checklist

This section describes the operation procedure of this checklist.

4.1 Application of the Evaluation Items

An evaluator judges whether an evaluation target is described as "Has a problem", "No problems", or "Irrelevant" in accordance with the evaluation procedure. The goodness of fit of the evaluation result is calculated from the judgment result and the weight of the item. Next, the methods of judging and of calculating the goodness of fit of the evaluation result are described.

Judgment of a result. The items concerning the consistency of represented information and of operation are applied to the whole screen. When evaluators found any part of a screen with a problem regarding such an item, they judged the item as "Has a problem". If there was a problem in only part of a screen within a screen group, even when the other screens were consistent, evaluators judged it as "Has a problem" (Figure 3). Each screen and part that was an evaluation target was evaluated against the items regardless of consistency.
Fig. 3. Evaluation of consistency (the arrangement location of the button): examples of button layouts and table/list layout orders judged as "without problem" or "there is a problem"
Calculation of the goodness of fit of the evaluation results. Even within the same item, there may be several evaluation targets with different judgment results, so the goodness of fit of the evaluation results needs to be calculated by integrating the results. An
evaluator makes a basic overall judgment by prioritizing "Has a problem" over "No problems", and then "Irrelevant". For example, if one "Has a problem" appears among the results, the overall judgment is also "Has a problem". Items judged as "No problems" are weighted to compute the goodness of fit of the evaluation results along the respective evaluation axes.

4.2 Calculation of an Evaluation Result

The sum of the concurring results, weighted by each quality, gives the overall score on each evaluation axis. Evaluation result examples for three similar products are shown in Figure 4. Since product A clearly has the highest efficiency and product B is obviously the easiest to learn, it is possible to grasp the distinctive quality of each product. Thus, by using this checklist, it becomes possible to examine the usability evaluation result from four angles.

Fig. 4. Example of an evaluation result: scores of products A, B, and C on the four axes (efficiency, little error, easy to learn, easy to memorize)
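A minimal sketch of the scoring scheme described in Sections 4.1-4.2 is given below. The items, judgments, and weight values are hypothetical; only the prioritized overall judgment and the per-axis weighted sum follow the text:

```python
AXES = ("efficiency", "little error", "easy to learn", "easy to memorize")
PRIORITY = ("Has a problem", "No problems", "Irrelevant")

def overall_judgment(results):
    """'Has a problem' dominates 'No problems', which dominates 'Irrelevant'."""
    return min(results, key=PRIORITY.index)

def axis_scores(items):
    """Sum per-axis weights of items whose overall judgment is 'No problems'."""
    scores = dict.fromkeys(AXES, 0.0)
    for judgments, weights in items:
        if overall_judgment(judgments) == "No problems":
            for axis in AXES:
                scores[axis] += weights[axis]
    return scores

# Two hypothetical checklist items, each with judgments from several
# evaluation targets and AHP-derived weights per axis.
items = [
    (["No problems", "No problems"],
     {"efficiency": 0.4, "little error": 0.2,
      "easy to learn": 0.3, "easy to memorize": 0.1}),
    (["No problems", "Has a problem"],
     {"efficiency": 0.1, "little error": 0.5,
      "easy to learn": 0.2, "easy to memorize": 0.2}),
]
# The second item is excluded: its overall judgment is "Has a problem".
print(axis_scores(items))
```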
5 Summary

The authors have developed a usability quantification method using a checklist that excludes the blurring of results by detailing the evaluation target, the evaluation procedure, and the acceptance standard for each item. Even an evaluator who knows little about system usability can obtain objective results by using this evaluation method.
References

1. Nielsen, J.: Usability Engineering. Academic Press, London (1993)
2. Ravden, S., Johnson, G.: Evaluating Usability of Human-Computer Interfaces: A Practical Method. Prentice-Hall, Englewood Cliffs (1989)
3. ISO 13407: Human-centred design processes for interactive systems (1999)
4. Ikegami, T., Okada, H., Yoshizaka, S., Fukuzumi, S.: Proposal of usability quantification method (1) - checklist for excluding blurring among evaluators. In: Annual Conference of Information Processing Society Japan (2008) (in Japanese)
5. Okada, H., Ikegami, T., Yoshizaka, S., Fukuzumi, S.: Proposal of usability quantification method (2) - experiment for validation of a checklist. In: Annual Conference of Information Processing Society Japan (2008) (in Japanese)
6. Kato, S., Horie, K., Ogawa, K., Kimura, S.: A Human Interface Design Checklist and Its Effectiveness. Transactions of Information Processing Society of Japan 36(1), 61–69 (1995) (in Japanese)
7. Ham, D.-H., Heo, J., Fossick, P., Wong, W., Park, S.-H., Song, C., Bradley, M.: Model-based approaches to quantifying the usability of mobile phones. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 288–297. Springer, Heidelberg (2007)
Reference Model for Quality Assurance of Speech Applications

Cornelia Hipp and Matthias Peissner

Fraunhofer Institute for Industrial Engineering (IAO), Nobelstr. 12, 70569 Stuttgart, Germany
{cornelia.hipp,matthias.peissner}@iao.fraunhofer.de
Abstract. The acceptance of speech applications is still very low in Germany. The German speech industry has identified this problem and is making an effort to improve the quality of speech applications, which should lead to higher user acceptance. To ensure higher quality standards, a reference model has been developed with special regard to the needs of interactive voice response (IVR) systems. This model includes instructions for improving the process quality of the development process as well as methods, measurements and quality criteria for evaluating the product quality. Furthermore, the presented reference model differentiates between eight application types of IVR and describes which methods, measurements and quality criteria are especially important for each application type.

Keywords: Quality, speech, interactive voice response, automatic speech recognition, measurement, method, voice, speech interaction, reference model, application type.
assurance within the areas of software management [3][4], classical engineering, and usability engineering [5][6]. Additionally, there are first solutions for evaluating speech applications [7][8] and guidelines for improving the usability of speech applications [9]. But there is a lack of knowledge regarding the systematic improvement of the quality of speech applications within an adequate reference model. There are ambitions of the German Initiative Voice Business with their yearly congress Voice Days and their Voice Award, which is awarded to the best German-based speech application [10]. Although the testing method used there is a solid basis for quality measurement, it is not usable within development projects as a model for continuously controlling and optimising the quality of speech applications.
2 Intention

For the reasons mentioned above, we aimed to find a suitable approach for measuring and improving the quality of speech applications. We quickly realized that a holistic approach to improving the dialogue and the user experience of interactive voice response systems is needed: different factors within the development of speech applications affect the quality of the IVR, and they are dependent on each other.

In this paper we describe a new holistic reference model for quality assurance of speech applications. The model has been developed at the Fraunhofer Institute for Industrial Engineering (IAO), Germany, in cooperation with the Initiative Voice Business (IVB) and in exchange with 36 experts from the German speech application market. Additionally, the efforts made were intended to encourage the voice industry on the German-speaking market and to find solutions which are industry-orientated and easy to transfer to concrete projects. The work is intended to sharpen the public awareness of the subject of quality in speech applications and to show possibilities for improving it.
3 Description of Reference Model

First, we adopted an approach common in software development: the differentiation between product quality and process quality [4]. Product quality refers to the product itself, in this case the IVR. Process quality, in contrast, refers to the rules, strategies and requirements of the development process. Securing a high-quality process does not guarantee a high-quality end product, but it makes one very likely. Therefore, in this holistic model, actions are described both to monitor the development process and to continuously evaluate the product quality.

For the product quality, we identified ten quality criteria explicitly for speech applications [11]. These ten criteria describe a good speech application in a holistic way, so that criteria are defined within the areas of voice user interface and usability, strategy and business logic, dialogue platform and integration, and speech technology and linguistics. However, the quality criteria are not assigned precisely to one specific area and can affect several.
Fig. 1. Overview of the Presented Reference Model with Components and their Dependencies
The ten quality criteria can be evaluated with the help of measurements. In total, 34 measurements are defined, tailored to the special needs of IVRs, e.g., caller frequency or no-match rates. As with the quality criteria, the measurements are defined for the four different areas and do not have to be assigned to only one area. The measurements are used to supply concrete values, which can be used for comparison, either between different speech applications or between different versions/development stages of one application.

With the help of methods, the described measurements can be performed to obtain concrete values. In total, 23 methods have been identified with special regard to speech applications, such as the load test, the Wizard-of-Oz test, and expert evaluation. The methods are important elements within the reference model for the process quality. They are assigned to the different steps of the process, which are differentiated as project preparation and analysis, concept and design, implementation, integration and bringing into service, and operation. Methods and measurements are supposed to be used in an iterative process to control and achieve good performance.
Furthermore, quality criteria, measurements and methods carry different weight for different application types. Therefore, eight different application types have been identified [12], with partly very different priorities for criteria, measurements and methods. When using the reference model, the first step is to identify which of the defined application types is going to be implemented. Subsequently, the important quality criteria can be looked up, then the measurements that are meaningful for those criteria, and finally the methods that can be used to obtain results for the measurements.

3.1 Quality Criteria

Within the introduced reference model, ten quality criteria are defined to show what an application has to achieve to be considered a good IVR. They are described with regard to the holistic approach of the reference model and cover the four predefined thematic areas: voice user interface and usability, strategy and business logic, dialogue platform and integration, and speech technology and linguistics. With the aid of measurements, data can be collected to check whether the quality criteria are achieved. The ten quality criteria are listed below:

Appropriate Functionality Coverage and Content Offering: A speech application is good if an added value is created for the customer by means of an attractive and complete offer of functionality.

Faultless Operability and Capability: A speech application is good if secure and faultless functioning with high performance is assured, at peak loads as well.

Administrability and Efficient Operations: A speech application is good if the technical effort after launching the IVR can be kept at a minimum.

Expandability and Scalability: A speech application is good if the system architecture easily allows future enhancements and changes.

Profitability: A speech application is good if the service is economically profitable.

Reliable Recognition of User Utterances: A speech application is good if speech recognition reliably recognizes an appropriate amount of prospective user utterances.

Effective Management of Errors: A speech application is good if recognition errors and errors of usage do not cause major damage.

Effective and Flexible Dialogue Flow: A speech application is good if the navigation structure supports users in reaching their aim fast and securely.

Comprehensible and Goal-Orientated System Output: A speech application is good if the acoustic system output supports the user in orientation and in formulating goal-orientated utterances.

Impression and Emotional Addressing: A speech application is good if a positive and appropriate attitude of the user towards the speech application, its use and its operator can be reached.
3.2 Measurements

Within the presented reference model, 32 measurements have been worked out. With their aid, data can be collected in order to identify whether a quality criterion is fulfilled or not. The measurements are performed by means of the defined methods. Measurements are defined based on the following characteristics: name, synonyms, brief description, reference to quality criteria, reference to application components, actual use in practice, usable methods for collecting data for this measurement, and appraisal of profitability. As an example, the measurement routing rate is displayed in the following:

Name: routing rate
Synonyms: correct routing rate
Brief Description: percentage of calls which can be successfully transferred (according to the wishes of the customer) in proportion to the total amount of callers
Reference to Quality Criteria: faultless operability and capability; profitability; reliable recognition of user utterances; effective management of errors; impression and emotional addressing
Reference to Application Components: model, view, control and access; an optimal co-operation between all components is necessary (referring to the architectural pattern model-view-controller, enhanced with the component access)
Actual Use in Practice: frequently. Comment: the measurement is only relevant for applications where routing has high importance
Usable Methods for Collecting Data for this Measurement: logfile analysis and reporting
Appraisal of Profitability: very high

3.3 Methods

23 different methods are defined within this reference model for quality assurance of speech applications. They are attached to specific process steps, but do not necessarily have to be attached to only one process step. By means of the methods, data can be collected for different measurements. Subsequently, with the aid of the measurements, quality criteria can be evaluated. Within the reference model, a differentiation has been made between methods which should necessarily be carried out and methods which should be applied to achieve an excellent process. This classification differs between the eight application types listed later on. Methods are defined based on the following characteristics: name, synonyms, brief description, which measurements can be covered or optimized, reference to quality criteria, reference to application components, relevance to thematic areas, reference to process steps, maturity of method, actual use in practice, potential for use in practice, requirements for this method, and appraisal of profitability. As an example, the method in-service test is displayed subsequently:

Name: in-service test
Synonyms: watchdog test, keep-alive test, availability test
Brief Description: The in-service test is a permanent test of the availability and functionality of a speech portal in operation. The system generates external controlling calls cyclically over a long timeframe. Time intervals, types of calls, and test scripts are freely selectable.
Which Measurements can be Covered or Optimised: service availability, service accessibility, answering time for the customer, correctness of system output
Reference to Quality Criteria: faultless operability and capability; profitability
Reference to Application Components: access. Comment: the telephone functionalities are affected (referring to the architectural pattern model-view-controller, enhanced with the component access)
Relevance to Thematic Areas: dialogue platforms and integration
Reference to Process Steps: in operation
Maturity of Method: high
Actual Use in Practice: occasional/seldom
Potential for Use in Practice: high
Requirements for this Method: test equipment, test services
Appraisal of Profitability: good; minor costs compared to high benefit

3.4 Product Quality

The final aim of the reference model is to reach a high quality of the final product, the IVR. This can be achieved with the help of high process quality and checked against the ten defined quality criteria.

3.5 Process Quality

The described reference model differentiates between process quality and product quality. While product quality focuses on the evaluation and optimisation of the final result (the IVR), process quality focuses on the process of developing the product. The notion that improving the process quality will subsequently lead to an improvement of the product quality is the reason why process quality plays an important part in the reference model.

The process steps differentiated within the development of a speech application are defined as project preparation and analysis, concept and design, implementation, integration and bringing into service, and operation. These process steps are not strictly separated and should be seen as an iterative process. In addition, evaluation should be done throughout the whole process and not be confined to a specific time-frame of the development.

Within the reference model, methods are defined which should be carried out at specific process steps to ensure a high quality of the process. Furthermore, the model discriminates between ensuring a minimum of process quality and achieving an excellent process. Therefore, it defines methods which are necessary for the former and lists additional methods for the latter. For instance, it is strongly recommended to do functional tests during the implementation phase of self-service portals. To achieve an excellent process, it is recommended to additionally carry out a friendly-user test during this phase.
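To make the routing-rate measurement from Section 3.2 concrete, a minimal sketch is given below; the call-record fields are hypothetical, since the model defines the measurement but not a data format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    routed_to: Optional[str]        # where the IVR transferred the call, if anywhere
    intended_target: Optional[str]  # where the caller actually wanted to go

def routing_rate(calls):
    """Percentage of calls successfully transferred according to the caller's
    wishes, in proportion to the total number of callers (cf. Sect. 3.2)."""
    correct = sum(1 for c in calls
                  if c.routed_to is not None and c.routed_to == c.intended_target)
    return 100.0 * correct / len(calls)

calls = [
    CallRecord("billing", "billing"),
    CallRecord("sales", "support"),  # mis-routed call
    CallRecord(None, "support"),     # caller dropped out of the IVR
]
print(f"Routing rate: {routing_rate(calls):.1f}%")  # 33.3%
```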
3.6 Application Types

Measurements, methods and quality criteria are differently useful and meaningful for different application types. For example, the quality criterion impression and emotional addressing is very important for marketing applications, but much less so for authentication services. Therefore, the reference model does not allocate data to specific measurements for all speech applications alike, but differentiates between unequal application types. With the help of this discrimination, it is possible to compare results between applications of the same application type and to identify their potentials and weaknesses. Within the reference model, the following eight application types are defined, with different benchmarks for quality criteria, measurements, methods and process steps:

Call Routing: Incoming calls are sorted thematically and transferred to the correct person in charge within the call center.

Information Service: Customers can receive information inexpensively, swiftly and up-to-date via a speech-based information service.

Reminding and Alerting Service: Alerting services trigger automatic calls in the event of an emergency (e.g., catastrophes like earthquakes or hurricanes). Reminding services call in case of predefined important appointments (e.g., delivery dates or taking medication).

Authentication Service: The speech application verifies whether the caller is in fact the person he declares to be, based on the unique characteristics of the human voice.

Automated Telephone Switchboard: In case of absence of the callee, the automated telephone switchboard can, e.g., transfer the call to a colleague or start an answering machine.

Track & Trace System: With the aid of track & trace systems, companies can permanently provide information on the actual state of their services.

Marketing Application: Companies can use IVR for marketing purposes, like lotteries, advertisements or the voices of prominent people.

Self Service Portal: Customers can execute different transactions and employ information services by themselves using self-service portals.
4 Concluding Remarks

In Germany, the need to improve the quality of speech applications has been recognized, because the German-speaking voice industry has an issue with user acceptance of IVR. The potential of speech interaction is not adequately exploited yet, and several German speech companies are working together to find consolidated solutions. Quality criteria, methods and measurements have already been defined with special regard to eight application types. But there are still open questions, e.g., how to compare applications of the same type but with different degrees of complexity. Furthermore, the ongoing work should conclude in an acknowledged standard to sensitize customers to quality differences in speech applications.
Acknowledgements. Ongoing work to find solutions for quality of speech applications is done with great support of 36 experts in Germany. We would like to thank the following companies and persons: Cirquent (Dr. Bettina Attallah), D+S solutions (Kerstin Sehnert), E.ON Hanse (Frank Oldorf), Genesys Telecommunications Laboratories (Giancarlo Boi), HFN Medien (Dr. Frank Wanning), IBM Germany Research and Development (Ludovica De Sio, Dr. Carsten Günther, Dr. Marion Mast), mind Business Consultants (Sebastian Paulke, Bernhard Steimel), NEXT ID (Ralf Poplawski), Nortel (Dr. Oliver Huber), SemanticEdge (Jörn Kreutel, Dr. Lupo Pape), Sikom Software (Jürgen Hoffmeister, Dietmar Kneidl), Sparda-Bank Hamburg eG (Jürgen Mehring), SpeechConcept (Dr. Uwe Lay), Strateco (Mark Gutmann), Sympalog Voice Solutions (Dr. Jürgen Haas), tech2biz (Dr. Christian Dugast), Deutsche Telekom Laboratories (Caroline Clemens, Dr. Florian Metze, Prof. Dr. Sebastian Möller, Wiebke Johannsen), Telenet Communication Systems (Dr. Florian Hilger, Markus Kesting), T-Mobile Deutschland (Dr. Guntbert Markefka), T-Systems Enterprise Services (Frank Oberle), Unisys Deutschland (Andreas Schaub), VMA (Dr. Guntbert Markefka, Andreas Schaub, Dr. Frank Wanning), voiceandvision (Tom Houwing), Voice & Visual Design (Paul Hubert Vossen), 4Com (Dennis Jehne).
References

1. Peissner, M., Sell, D., Steimel, B.: Acceptance of Speech Applications (German orig. Akzeptanz von Sprachapplikationen). Fraunhofer Institute for Industrial Engineering (IAO), Stuttgart (2006)
2. Wu, E.: Bill Gates predicts software revolution. MIS ASIA (August 14, 2008), http://mis-asia.com/news/articles/bill-gates-predicts-software-revolution
3. Sommerville, I.: Software Engineering. Pearson Education Germany GmbH, Munich (2001)
4. Ludewig, J., Lichter, H.: Software Engineering. dpunkt.verlag, Heidelberg (2007)
5. Nielsen, J.: Usability Engineering. Academic Press, San Diego (1993)
6. Mayhew, D.: The Usability Engineering Lifecycle: A Practitioner's Handbook for User Interface Design. Academic Press, San Diego (1999)
7. Dybkjaer, L., Hemsen, H., Minker, W.: Evaluation of Text and Speech Systems. Springer, Dordrecht (2007)
8. Möller, S.: Quality of Telephone-Based Spoken Dialogue Systems. Springer Science+Business Media, New York (2005)
9. Hempel, T. (ed.): Usability of Speech Dialog Systems. Springer, Berlin (2008)
10. Initiative Voice Business, http://www.voice-award.de
11. Peissner, M., Hipp, C., Steimel, B.: Quality Criteria, Measurements and Methods for Speech Applications (German orig. Qualitätskriterien, Maße und Verfahren für Sprachapplikationen). Fraunhofer Institute for Industrial Engineering (IAO), Stuttgart (2007)
12. Hipp, C., Paulke, S., Peissner, M., Steimel, B.: Quality Guideline – Cookbook for Good Speech Applications (German orig. Qualitätsleitfaden – Kochbuch für gute Sprachapplikationen). Fraunhofer Institute for Industrial Engineering (IAO), Stuttgart (2008)
Toward Cognitive Modeling for Predicting Usability

Bonnie E. John 1 and Shunsuke Suzuki 2

1 Human-Computer Interaction Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
[email protected]
2 NEC Corporation, 8916-47, Takayama-cho, Ikoma, Nara 630-0101, Japan
[email protected]
Abstract. Historically, predictive human performance modeling has been successful at predicting the task execution time of skilled users on a desktop computer. More recent work has predicted novice behavior in web searches. This paper reports on a collaborative effort between industry and academia to expand the scope of predictive modeling to the mobile phone domain, both skilled and novice behavior, and how human performance relates to the perception of usability. Since, at this writing, only preliminary results to validate models of mobile phone use are in, we describe the process we will use to progress towards our modeling goals. Keywords: Cognitive modeling, GOMS, KLM, CogTool, Information Foraging.
Despite research progress in creating and validating theory, UI developers have not adopted predictive human performance modeling as a frequently used tool for design. Recent work has embodied these theories in tools that allow practicing developers to achieve the benefits of modeling without investing considerable time in learning to model and in constructing each new model (e.g., [7, 8, 9]). However, it is difficult to make a trustworthy tool for practical design problems. This paper explains the process of doing so in the context of collaborative research between NEC, PARC and Carnegie Mellon University. Our project is aimed at producing a tool for predicting the task execution time of skilled users, novice exploration to accomplish a goal, and the subjective perception of the usability of mobile phones.
2 The Process of Making a Trustworthy, Practical Tool for Design The process of making a trustworthy, practical tool for design is shown in Figure 1. Each time a new domain is entered, or a new metric is added, the theory, tool and models must be validated with data from appropriate users to produce a trustworthy tool for prediction. If the models’ predictions do not match the human data sufficiently, either the theory or the tool, or both, must be revised until valid predictions are produced. The next question is whether the tool is learnable and usable by UI designers in their work process. User-centered design techniques should be used to design, evaluate, and redesign, until the tool is practical for design.
Fig. 1. General process of human performance modeling research that leads to a practical tool for design
Our project started with CogTool, a tool that allows UI designers to create valid Keystroke-Level Models in one tenth the time of doing them by hand as originally demonstrated by Card, Moran and Newell [2]. It has been shown to be easily learnable by users with no background in psychology or cognitive modeling ([8] and
tutorials at professional conferences like HFES, BRIMS, and HCII). Recent research with CogTool has extended it beyond KLM and predictions of skilled task execution time to information foraging theory and predictions of novice exploration behavior [10, 11]. From this starting point, we set out to expand CogTool’s ability to predict human behavior to a new domain, mobile phones, and to a new metric, subjective impressions of usability as measured by the Mobile Phone Usability Questionnaire (MPUQ) developed by Ryu [12, 13]. Thus, this project will touch all the points in Figure 1. We start by using CogTool as it exists to make predictions of skilled execution time and novice exploration behavior and test those predictions against human data on mobile phones, fully expecting that adjustments to the underlying theory and tool will need to be made. After making changes to the theory and tool to produce valid predictions of these metrics, we intend to correlate various aspects of the predictions with people’s perceptions of usability. After verifying that we can make trustworthy predictions, we will determine whether CogTool can be used by mobile phone designers and adjust CogTool’s UI until it becomes a practical tool. At this writing, we are at the first part of the process, making predictions with CogTool as it exists and comparing those predictions to human data. The remainder of this paper will describe the current state of the research.
3 The New Domain – Mobile Phones Mobile phones were chosen as the domain in which to pursue this approach. This product category is important to the corporation and the discrete nature of the tasks users perform on mobile phones makes them relatively easy for collecting human data and to model. In addition, CogTool had previously been shows to make good predictions of skilled use of a similar hand-held device (PDAs, [14, 15]). Although the project will evaluate several different mobile phones, this paper will use the N905i, shown in Figure 2, as an example of our research process. The tasks we are examining are varied, as follows. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
Call a number from a phone book Store a number into a phone book Put an event into a Schedule Change a Security Setting View a previously sent mail message Set a previously stored picture to be the wallpaper. Delete a previously stored picture Add a function into a shortcut Check memory info Shoot a movie, check it, and save it
Data was collected from skilled users who had owned their phones for at least two months and from novice users who had never used this model of phone. The phone screen was captured on video, which was later transcribed to identify which buttons
Fig. 2. N905i mobile phone shown at the screen that was the start for each task
were pressed, when each button was pressed, and how long it took the phone to respond to each button press (system response time).
4 CogTool and Initial Models CogTool is a prototyping and cognitive modeling tool created to allow UI designers to rapidly evaluate their design ideas. A design idea is represented as a storyboard (inspired by the DENIM Project [16]), with each state of the interface represented as a node and each action on the interface (e.g., button presses) represented as a transition between the nodes. Figure 3 shows the start state of the storyboard, where buttons are placed on top of an image of the phone. Figure 4 shows a storyboard for six instances of the first task, calling a person who is already listed in the phone’s contact list. The first action at the start state is to press the down button called out in Figure 3. Because different contacts are located at different points of the phone book, the task takes different paths from the start screen to completion of the task. We will use Calling Person4 as the example in the remainder of this paper. After creating the storyboard, the next step is to demonstrate the correct actions to do the task. CogTool automatically builds a valid Keystroke Level Model from this demonstration. It creates ACT-R code that implements the Keystroke-Level Model and runs that code, producing a quantitative estimate of skilled execution time and a visualization of what ACT-R was doing at each moment to produce that estimate (Figure 5). Since mobile phones are a new domain for CogTool, we do not expect that the predictions it makes “out of the box” will be very accurate when compared to human data. We expect to have several iterations of comparing the predictions to human data and fixing the underlying theory and CogTool’s implementation of that theory, before we can make trustworthy predictions to help design. The next section presents preliminary analysis of one such iteration.
Fig. 3. Start screen of the CogTool prototype. Each button in the picture of the phone has a button "widget" drawn on top; actions on widgets follow transitions as defined in the storyboard, so tapping the down button transitions to the next frame of the storyboard (Fig. 4).
Fig. 4. Storyboard of the screens a person would pass through to accomplish Task 1 (making a call from the phone book) for six different instances of the task, i.e., calling six different people (Person1 through Person6) in the phone book. We will use the instance of calling Person4 as an example throughout this paper.
4.1 Comparing Initial Models to Human Performance Data

The first step in comparing human performance data to the predictions of models is to make sure the same metrics are used in both the data and the models. For example, CogTool models predict not only when a button will be pressed, but also the thinking time and visual perception that precede pressing the button. Only button presses were recorded in the empirical study, so we cannot directly compare the "total" time predicted by the model against the "total" time observed in the experiment. Adjusting for this difference, and comparing the time from the first key press to the appearance of "Calling Person4" on the screen, the CogTool model predicted 11.049 seconds. The mean of five skilled participants was 9.770 seconds, an over-prediction of the average by 13% and an average absolute percent error of 15% between the predicted time and each observed time. This level of prediction is within the 20% error typically claimed by KLM and is an excellent prediction for an initial foray into a new domain and device.

The next step is to go to a deeper level of comparison and look at the predictions for each individual action. We expect the quantitative comparisons to get worse, as explained by Card, Moran and Newell [1], but we are looking for patterns in behavior at this point, not an absolute quantitative match. The first types of patterns we hope to see are those predicted by the model. Consider Figure 5, a timeline of the model's predictions for the Calling Person4 task, provided by CogTool as a visualization of its behavior. The rows in the timeline represent different types of actions in the model. The changes in the phone's screens are on the top gray line, with the estimates of system wait time (between button press and when the screen can be read) in the second row (light gray). The three purple rows show activity associated with vision: Vision-Encoding, Eye Move–Execute and Eye Move–Preparation. The central gray row represents the cognition that controls behavior, both the long "Mental operators" empirically established by Card, Moran and Newell, and the short ACT-R cognitive acts that control vision and hand motions. The bottom red row shows the button presses, in this case with the right hand (the thumb). The model predicts a pattern: 1 press (at time=0), pause, 6 presses, pause, 8 presses, pause, 1 press.
Fig. 5. Timeline of a CogTool model prediction
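The accuracy figures quoted above can be reproduced with the usual error metrics; in this minimal sketch, only the two mean times come from the text, and the per-participant error function is shown for reference:

```python
def percent_error(predicted: float, observed: float) -> float:
    return 100.0 * (predicted - observed) / observed

def avg_abs_percent_error(predicted: float, observations) -> float:
    """Mean of |prediction error| over the individual observed times."""
    return sum(abs(percent_error(predicted, o)) for o in observations) / len(observations)

# Over-prediction of the mean: 11.049 s predicted vs. 9.770 s observed.
print(f"{percent_error(11.049, 9.770):.0f}%")  # 13%
```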
Consider Figure 6, where the data from five participants are placed below the model's timeline, aligned so that their first key presses all start at 0.0 sec. The top four participants display a pattern in keeping with the model's prediction (1 press, pause, 6 presses, pause, 8 presses, pause, 1 press), except for P9, who does not pause for long
before the last key. However, the bottom participant, P1, does not show this pattern at all. When we went back to the video of this participant, we found that although P1 used the same number of keys to complete the task, he did not use the same keys as the other participants or the model. Further investigation is needed to understand whether this was due to an error or whether it represents an alternative correct method for this task. Either way, in the majority of cases of this small sample, CogTool automatically predicted a pattern of behavior that was observed in human performance, even without modifying CogTool for the mobile phone domain.
Fig. 6. Timeline of a CogTool model prediction with keypress data from five participants aligned below it
Looking more closely at the data of the people who used the same keystrokes as the model (the top four), another pattern can be seen, one not predicted by the unmodified CogTool. Each participant shows two groupings of keys pressed close together in time, one of six keys at the beginning of the task and one of eight keys at the end of the task. Of these eight groupings, six show a distinct pause before the last keystroke in the group (P6, 1st group; P9, both groups; P12, 2nd group; P15, both groups). The groupings of six are repeated presses of the S3 key to move across a set of icons at the top of the screen, some of which drop down a list of items that can be selected. When the desired icon is reached and its list drops down, the user then hits the Down key eight times to move down to the desired contact and hits the Call button to complete the task. The pauses come before the last S3 key press and the last Down
key press. In both cases, the user is watching a highlight move across (or down) the screen and can anticipate when the next key press will bring the highlight to the desired item. The pause before the last key press might represent a strategy to avoid over-shooting. This monitoring activity was not included in the original systems tested by Card, Moran and Newell and therefore is not represented in the original Keystroke-Level Model. Thus, we have identified a case where we may need to develop new theory about monitoring and anticipatory keystrokes (i.e., iterate on the theory) and build it into the tool (i.e., iterate on the tool) before we can produce trustworthy predictions in this domain.

Another case where the predictions do not match the data is in the inter-keystroke times. All four users who did the task in the same way as the model pressed the same key far faster than CogTool did, as seen in the denser grouping of keystrokes in the participants' timelines than in the model timeline. In this case, we will have to iterate on the underlying theory of motor movement to allow it to produce faster keystrokes. The timeline shows us that CogTool inserts visual perception of a key between each keystroke, which is likely to be wrong for repeated keystrokes, especially given the monitoring activity described above, where the user's eyes are presumably on the screen, not the buttons.

With a model of just one instance of one task and data from five participants, the timeline visualization has suggested that the model is making reasonable predictions of the grouping of actions but is missing some important patterns of human behavior. More tasks and more data will have to be analyzed to be sure it is necessary to change the underlying theory and build it into CogTool to get trustworthy predictions. However, this small example illustrates the process of model validation this project has undertaken.
5 Future Work

In addition to following the process in Figure 1 for skilled task execution time predictions on mobile phones, this project will also examine the prediction of novice exploration behavior with CogTool-Explorer [10, 11], a version of CogTool that predicts novice behavior. As with skilled behavior, we do not expect CogTool-Explorer to be able to predict a new domain (mobile phones instead of web searches) in a new language (Japanese instead of English) without iteration on the theory and tool. We have already identified improvements to the tool required for mobile phones; for example, mobile phones have "soft keys" whose labels are displayed on the screen instead of being printed on the key, and CogTool-Explorer was not originally designed to represent that relationship.

Perhaps more interestingly, when we have succeeded in producing trustworthy predictions of behavior, we intend to correlate this behavior with subjective impressions of usability as measured by the Mobile Phone Usability Questionnaire (MPUQ) developed by Ryu [12, 13]. Unlike empirical methods, which can only correlate observed behavior, like time on task or number of errors, with questionnaire results, we can extract much more varied metrics from the models against which to correlate
subjective impressions. For example, total time on task may not correlate with subjective impressions, but time spent in cognition may. Or more complex measures may be needed, like time spent in cognition that is not in parallel with motor movements for skilled users. Or number of keys looked at by CogTool-Explorer before making a choice, for the subjective impressions of novice users. Or amount of system response time not in parallel with cognition (i.e., making the user wait). Because CogTool produces a process model of perception, cognition and motor actions necessary to do a task, many combinations of actions can be explored to see if any can explain a significant part of the variance in subjective impressions. If a significant correlation can be found, then the predictive human performance models will be extended to a subjective metric, moving the field closer to the holy grail.
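The kind of correlation analysis envisioned here is straightforward once model metrics and questionnaire scores are available. The sketch below is a minimal illustration; the metric values and MPUQ scores are hypothetical:

```python
from scipy.stats import pearsonr

# Hypothetical per-task model metrics and mean MPUQ ratings for six tasks.
cognition_time = [3.2, 4.1, 2.8, 5.0, 3.6, 4.4]  # predicted cognition time (s)
mpuq_score     = [5.8, 4.9, 6.1, 4.2, 5.5, 4.7]  # higher = better perceived usability

# Does predicted cognition time track perceived usability?
r, p = pearsonr(cognition_time, mpuq_score)
print(f"r = {r:.2f}, p = {p:.3f}")
```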
References

1. Card, S.K., Moran, T.P., Newell, A.: The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates, Hillsdale (1983)
2. Card, S.K., Moran, T.P., Newell, A.: The Keystroke-Level Model for User Performance Time with Interactive Systems. Commun. ACM 23(7), 396–410 (1980)
3. Card, S.K., Moran, T.P., Newell, A.: Computer Text-Editing: An Information-Processing Analysis of a Routine Cognitive Skill. Cognitive Psychology 12, 32–74 (1980)
4. Pirolli, P., Card, S.: Information Foraging in Information Access Environments. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 1995), pp. 51–58. ACM Press/Addison-Wesley Publishing Co., New York (1995)
5. Anderson, J.R., Bothell, D., Byrne, M.D., Douglass, S., Lebiere, C., Qin, Y.: An Integrated Theory of the Mind. Psychological Review 111(4), 1036–1060 (2004)
6. Fu, W.-T., Pirolli, P.: SNIF-ACT: A Cognitive Model of User Navigation on the World Wide Web. Human-Computer Interaction 22, 355–412 (2007)
7. Blackmon, M.H., Kitajima, M., Polson, P.G.: Tool for Accurately Predicting Website Navigation Problems, Non-Problems, Problem Severity, and Effectiveness of Repairs. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2005), pp. 31–40. ACM, New York (2005)
8. John, B.E., Prevas, K., Salvucci, D.D., Koedinger, K.: Predictive Human Performance Modeling Made Easy. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2004), pp. 455–462. ACM, New York (2004)
9. Wu, C., Liu, Y.: Usability Makeover of a Cognitive Modeling Tool. Ergonomics in Design 15(2), 8–14 (2007)
10. Teo, L., John, B.E.: Towards Predicting User Interaction with CogTool-Explorer. In: Proceedings of the Human Factors and Ergonomics Society 52nd Annual Meeting, pp. 950–954. HFES, Santa Monica (2008)
11. Teo, L., John, B.E., Pirolli, P.: Towards a Tool for Predicting User Exploration. In: CHI 2007 Extended Abstracts on Human Factors in Computing Systems (CHI 2007), pp. 2687–2692. ACM, New York (2007)
12. Ryu, Y.S.: Development of Usability Questionnaires for Electronic Mobile Products and Decision Making Methods. Doctoral dissertation, State University, Blacksburg, VA, USA (2005)
13. Ryu, Y.S., Smith-Jackson, T.L.: Reliability and Validity of the Mobile Phone Usability Questionnaire (MPUQ). Journal of Usability Studies 2(1), 39–53 (2006)
14. Luo, L., John, B.E.: Predicting Task Execution Time on Handheld Devices Using the Keystroke-Level Model. In: Proceedings of the International Conference on Human Factors in Computing Systems (CHI 2005), pp. 1605–1608. ACM Press/Addison-Wesley Publishing Co., New York (2005)
15. Luo, L., Siewiorek, D.P.: KLEM: A Method for Predicting User Interaction Time and System Energy Consumption during Application Design. In: Proceedings of the 11th International Symposium on Wearable Computers (ISWC 2007), pp. 69–76. IEEE Press, New York (2007)
16. Lin, J., Newman, M.W., Hong, J., Landay, J.A.: DENIM: An Informal Tool for Early Stage Web Site Design. In: CHI 2001 Extended Abstracts on Human Factors in Computing Systems (CHI 2001), pp. 205–206. ACM, New York (2001)
Webjig: An Automated User Data Collection System for Website Usability Evaluation

Mikio Kiura, Masao Ohira, and Ken-ichi Matsumoto

Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, Japan
{mikio-k,masao,matumoto}@is.naist.jp
Abstract. In order to improve website usability, it is important for developers to understand how users access websites. In this paper, we present Webjig, a support system for website usability evaluation designed to resolve the problems associated with existing systems. Webjig can collect users' interaction data from static and dynamic websites. Moreover, by using Webjig, developers can precisely identify users' activities on websites. Through an experiment evaluating the usefulness of Webjig, we have confirmed that developers could effectively improve website usability.

Keywords: Web usability, usability evaluation, analysis of user interactions, dynamic websites.
However, these systems are designed to collect data only from static websites. Developers cannot figure out users' interactions on a dynamic website (e.g., webpages created automatically by CGI or server-side scripts, and webpages whose contents are manipulated by JavaScript in a Web browser). By using JavaScript, developers can implement an interface that switches the displayed contents by tabs, drop-down menus, or drag-and-drop methods without URL transitions. On a website using such interfaces, the existing systems cannot obtain the previously displayed contents accessed by users, because these contents change. In this paper, we propose Webjig, a support system for website usability evaluation for both dynamic and static websites; this system records users' interactions together with the contents that are displayed in users' Web browsers. Developers can understand exactly how users interact with a website by using Webjig. Thus, they can efficiently improve website usability.
2 Related Work

The traditional approach to resolving problems of website usability is to use the Web server access logs [4]. Developers can learn various kinds of information from the Web server access log, including users' IP addresses, access times, request data, and the Web server's responses. The advantage of using the Web server access log is that it is saved automatically on the Web server and can be used by developers at low cost. While developers can easily use the Web access log to improve website usability, however, they cannot know users' interactions such as mouse motions, mouse-click positions, and mouse-click timings on a website [5]. Several systems have been proposed to automatically collect data on users' interactions with a website (e.g., MouseTrack [6], UsaProxy [7]). These systems solved the problem above by identifying users' mouse motions, mouse-click positions, and mouse-click timings through JavaScript code embedded in a webpage. They helped developers understand users' interactions on a website at a considerably low cost. Previous studies have suggested that there is a correlation between the point of gaze and the position of the mouse cursor. Chen et al. have reported a strong correlation between the point of gaze and the position of the mouse cursor; further, developers can predict the points on a website in which the user is interested and may chart a pattern of the user from users' interactions [8]. In addition, Mueller et al. reported that 35% of users traced a sentence with the mouse cursor when they read the sentence on a website [9]. These results show that developers can detect problems of website usability by studying users' interactions.
3 Webjig

In this paper, we introduce Webjig, a new system that solves the problems of the existing systems. Webjig can handle data from both static and dynamic websites. By analyzing the DOM (Document Object Model) of the HTML, Webjig can collect data on the contents clicked by users, including timings, positions, and motions. This mechanism
allows usability engineers and developers to solve the problem associated with the existing systems, i.e., that they could not precisely identify users' interactions on a dynamic website. We present the system architecture of Webjig in Fig. 1. Webjig is a client/server system. The client is implemented in JavaScript, which executes in a Web browser. The server is implemented in PHP. The system consists of Webjig::Fetch, Webjig::Analysis, and Webjig::DB. Webjig::Fetch is a subsystem that automatically collects the data of users' interactions on a website. Webjig::Analysis is a subsystem that shows the information of users' interactions to developers. Webjig::DB is a subsystem that holds the data of users' interactions and provides an API to access the data.
Fig. 1. System architecture of Webjig
3.1 Webjig::Fetch

Webjig::Fetch is a subsystem that automatically collects the data of users' interactions on the website. Table 1 shows the data collected and stored by Webjig. During the time in which a user stays on a webpage, the data may change, except for the name and version of the Web browser. The system monitors changes in the data at intervals of dozens of milliseconds, and sends the data to Webjig::DB at intervals of a few seconds and at the time when the user exits the webpage.

Table 1. Collected data using Webjig

Data type                              Timing of data collection   Timing of data transmission
Name and version of Web browser        Loaded                      Loaded
Inner size of Web browser              Changed                     Intervals and exit
Position of scroll bar                 Changed                     Intervals and exit
Position of mouse cursor               Changed                     Intervals and exit
Timing and type of mouse click         Pressed                     Intervals and exit
Timing and type of key pressed         Pressed                     Intervals and exit
Contents displayed in a Web browser    Changed                     Intervals and exit
For collecting users’ interactions data, developers have to install Webjig::Fetch in a webpage. what developers have to do is only to insert a line <script src=”URL of Webjig::Fetch”> in the HTML source code of the webpage that targets the usability evaluation using Webjig. Fig.2 is an example of Webjig installed in an HTML source code. Webjig works even if the developer may insert the script tag at the any place in the HTML source code. However, a mainstream Web browser interprets the HTML source code from the top and displays the contents. Therefore, we recommend inserting the script tag at the bottom of the HTML source code so that Webjig does not disturb the original contents. Sample Page
Sample Content
<script src=”http://example.com/webjig.js” > Fig. 2. An example of HTML source code
3.2 Webjig::Analysis

Webjig::Analysis has various features for supporting website usability evaluation. For instance, Webjig::Analysis can replay users' interactions, such as mouse motions, mouse clicks, and keyboard input, together with the displayed contents in a movie format by using the collected data. In Fig. 3, we show a screenshot of Webjig::Analysis when it replays users' interactions. The system consists of the contents displayed in a Web browser and some floating windows that control the system and show various kinds of information. Developers can control the replay of users' interactions, with play, stop, forward, and rewind available anytime through the control buttons, seek bar, or slider on the control window. In addition, the system can also generate a heat map, which shows where users often click (a sketch of this kind of click aggregation is given after the list below), and estimate the portions of a webpage that users read and do not read. By using these features, developers can examine the following questions.
• Are there any confusing graphics in links?
• Do users pay attention to the content that developers want them to read?
• Where do users look or not look?
• How do users access the website?
• What wrong operations do users perform on the way to the goal?
• How do users use a dynamic interface?
• Where do users pause when they input into forms?
• What did users view before exiting the website?
• and so forth.
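As an illustration of the heat map feature mentioned above, the following Python sketch aggregates collected mouse-click positions into a coarse grid. The cell size and the click coordinates are hypothetical, and the actual Webjig implementation (JavaScript client, PHP server) is not described by the paper at this level of detail.

    from collections import Counter

    def click_heatmap(clicks, cell=50):
        # Bucket (x, y) click positions into cells of `cell` x `cell` pixels
        # and count how many clicks fall into each cell.
        grid = Counter()
        for x, y in clicks:
            grid[(x // cell, y // cell)] += 1
        return grid

    # Hypothetical click positions collected by Webjig::Fetch.
    clicks = [(120, 340), (130, 355), (125, 348), (400, 80)]
    print(click_heatmap(clicks).most_common(2))  # hottest cells first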
Fig. 3. Screenshot of Webjig::Analysis
4 System Evaluation

4.1 Overview

We performed an experiment to evaluate the usefulness of Webjig. 54 graduate students (39 males and 15 females, average age 20) participated in the experiment as subjects. The 54 subjects were divided into three groups. Each group worked on different tasks, described in the next subsection.

4.2 Experiment Procedure and Task

We executed the experiment according to the following procedure.

Step 1. We gave 24 users (subjects of Group A) five tasks. Each task required the subjects to find a specified product from a dynamic menu implemented using JavaScript. Webjig recorded users' interactions during task execution.

Step 2. Based on the data collected in Step 1, three subjects who played the role of developers (Group B) analyzed the users' interactions during task execution using
Webjig::Analysis. The developers planned an improved structure of the menu.

Step 3. We gave 27 different users (subjects of Group C) tasks similar to those in Step 1. The difference between Step 1 and Step 3 is that the subjects of Group C used the improved menu. Webjig recorded users' interactions during task execution.

Step 4. Finally, by comparing the task execution times of Step 1 and Step 3, we checked the validity of the change to the structure of the menu.

Fig. 4 shows the dummy website for the experiment. Table 2 shows the target products and the categories where the products exist.
Fig. 4. Screenshot of the dummy website for the experiment

Table 2. Target products and category for each task

Task     Category
Task 1   Audio & visual
Task 2   Cameras
Task 3   Health
Task 4   Office
Task 5   House & appliance
4.3 Experiment Results

Developers can know where users look on the webpage by using Webjig. Table 3 shows what percentage of the subjects of Group A first clicked on each category. The grayed rectangle in Table 3 marks the correct category where the specified product
exists for each task. For example, 54% of the subjects first clicked on the category of house & appliance, though the dry cell belonged to the category of audio & visual. When using existing systems, developers cannot know such information. Table 4 shows the changed structure of the menu, which was planned by the developers based on the results in Table 3. The plan is based on the idea that if there was a category clicked by more users than the current category, the target product should be moved to the more proper category. In the case of task 1, where subjects searched for a dry cell, the dry cell belonged to the audio & visual category, but many subjects first paid attention to the house & appliance category. Therefore, the developers moved the dry cell to the category of house & appliance. Further, in the case of task 4, where subjects searched for an electronic dictionary, the electronic dictionary belonged to the office category, and the majority of the subjects first paid attention to that category. Therefore, the developers did not move it to any other category.

Table 4. Change plan for the menu of the categories

Task     Original category    Destination category
Task 1   Audio & visual       House & appliance
Task 2   Cameras              Computers
Task 3   Health               House & appliance
Task 4   Office               Office
Task 5   House & appliance    Office
We performed the experiment again after changing the website as shown in Table 4. We show the experiment result in Fig. 5. From Fig. 5, the task execution time was reduced in tasks 1, 2, and 3 by applying the change plan. Fig. 5 shows the results of the execution time for each task in Step 1 and Step 3.¹ We can confirm that the execution time in Step 3 is shorter than that in Step 1, that is, the improved menu structure based on the developers' analysis using Webjig was effective.

¹ Since the structure of the menu was not changed for Task 4, we could not confirm a significant difference between the results in Step 1 and Step 3.
Fig. 5. Result of the task execution time in Step 1 and Step 3
5 Discussion

By using Webjig, developers can obtain information that they could not have obtained with the existing systems. For this reason, developers can detect problems in website usability and create a plan for improving website usability by collecting data on users' interactions, as performed in this experiment. In the experiment where users chose items from the menu, the developers could determine the execution time for each task by using existing systems. Thus, they could detect usability problems by comparing the execution times of the tasks and pinpointing a task whose execution time is longer than that of another task. In Fig. 5, the execution times of tasks 1, 2, 3, and 5 are longer than that of task 4. For this reason, a developer can hypothesize that there remain problems of website usability. However, it is difficult to eliminate a problem if they cannot understand its cause. By using Webjig, a developer can efficiently detect the problems of website usability. In the case of task 1 (subjects find a dry cell), we show the experiment result in Table 3: the dry cell belonged to audio & visual, but many subjects paid attention to house & appliance. The developers hypothesized that "many users think that a dry cell belongs to a household appliance" and moved the dry cell from audio & visual to house & appliance. As a result, the execution time was reduced compared with that before changing the category. According to Fig. 5, the task execution time on the changed website is less than that on the original website. In tasks 1, 2, and 3, we can observe significant improvement in the execution time. However, in task 5, we did not observe any significant improvement in the execution time.
Table 5. Priority for the improvement

Task     Correct category (A)   Changed category (B)   B/A
Task 1   4%                     54%                    13.5
Task 2   13%                    46%                    3.5
Task 3   25%                    71%                    2.8
Task 5   29%                    46%                    1.6
We explain the reason for this. In Table 5, we compare the rate of users who paid attention to the correct category with the rate of users who paid attention to the changed category. In the case of task 1, 4% of users paid attention to the correct category (audio & visual) when searching for the dry cell, and 54% of users paid attention to the wrong category (house & appliance). This is a difference of 13.5 times. Similarly, task 2 has a difference of 3.5 times, task 3 a difference of 2.8 times, and task 5 a difference of 1.6 times. As a result, we can say that if there is not a big difference between the rate of users who pay attention to the original category and the rate of users who pay attention to the changed category, we cannot confirm an effect of the change. Therefore, developers have to examine whether usability will be improved by understanding users' interactions, and not merely from the fact that a task's execution time was longer than the others. By using Webjig, a developer can exactly understand users' interactions and examine whether usability is improved. With existing systems, in contrast, it is difficult to examine the improvement of website usability because exact users' interactions cannot be obtained. Webjig cannot, however, replace user testing, because in user testing developers can know the point of gaze by using an eye-tracking system and can learn the intention of the user by interviewing him/her. Still, we saw that there were points where website usability could be improved by using Webjig. Therefore, developers may efficiently improve website usability by combining user testing and Webjig.
6 Conclusion and Future Work

In this paper, we proposed Webjig, a usability evaluation support system for static and dynamic websites. As a result of the experiment, we showed that developers can improve website usability effectively by using Webjig. In the future, we are going to compare the cost of website usability evaluation between existing systems and Webjig, and to compare usability testing with Webjig, to determine the efficiency of website usability evaluation.
Acknowledgements

This study was supported by the Information-technology Promotion Agency, Japan (IPA), Exploratory IT Human Resources Project (MITOU Program) in the fiscal year 2008.
References

1. Nielsen, J., Landauer, T.K.: A Mathematical Model of the Finding of Usability Problems. In: The INTERACT 1993 and CHI 1993 Conference on Human Factors in Computing Systems, pp. 206–213 (1993)
2. Dumas, J.S., Redish, J.C.: A Practical Guide to Usability Testing. Ablex Publishing, Norwood (1993)
3. Barnum, C.M.: Usability Testing and Research. Longman, London (2001)
4. Hong, J.I., Landay, J.A.: WebQuilt: A Framework for Capturing and Visualizing the Web Experience. In: The 10th International Conference on World Wide Web (WWW 2001), pp. 717–724 (2001)
5. Etgen, M., Cantor, J.: What Does Getting WET (Web Event-logging Tool) Mean for Web Usability? In: 5th Conference on Human Factors and the Web, HFWEB 1999 (1999), http://zing.ncsl.nist.gov/hfweb/proceedings/etgen-cantor/index.html (accessed February 27, 2009)
6. Arroyo, E., Selker, T., Wei, W.: Usability Tool for Analysis of Web Designs Using Mouse Tracks. In: CHI 2006 Extended Abstracts on Human Factors in Computing Systems, pp. 484–489 (2006)
7. Atterer, R., Schmidt, A.: Tracking the Interaction of Users with AJAX Applications for Usability Testing. In: The SIGCHI Conference on Human Factors in Computing Systems (CHI 2007), pp. 1347–1350 (2007)
8. Chen, M.C., Anderson, J.R., Sohn, M.H.: What Can a Mouse Cursor Tell Us More? Correlation of Eye/Mouse Movements on Web Browsing. In: CHI 2001 Extended Abstracts on Human Factors in Computing Systems, pp. 281–282 (2001)
9. Mueller, F., Lockerd, A.: Cheese: Tracking Mouse Movement Activity on Websites, a Tool for User Modeling. In: CHI 2001 Extended Abstracts on Human Factors in Computing Systems, pp. 279–280 (2001)
ADiEU: Toward Domain-Based Evaluation of Spoken Dialog Systems

Jan Kleindienst, Jan Cuřín, and Martin Labský

IBM Research, Prague, Czech Republic
{jankle,jan_curin,martin.labsky}@cz.ibm.com
Abstract. We propose a new approach toward the evaluation of spoken dialog systems. The novelty of our method lies in the utilization of domain-specific knowledge combined with the deterministic measurement of dialog system performance on a set of individual tasks within the domain. The proposed methodology thus attempts to answer questions such as: "How well is my dialog system performing on a specific domain?", "How much has my dialog system improved since the previous version?", "How much is my dialog system better/worse than other dialog systems performing on that domain?"

Keywords: Dialog, evaluation, scoring, multimodal, speech recognition.
What is particularly missing in this area is (1) a measurement of performance for a particular domain, (2) the possibility to compare one dialog system with others, and (3) the evaluation of progress during the development of a dialog system. By the ADiEU¹ scoring presented herein we attempt to address these three cases.

1.2 The Elements of the ADiEU Metric

The ADiEU score consists of two ingredients, both of which range from 0 to 1: A) the Domain Coverage (DC) score, and B) the Dialog Efficiency (DE) score. We describe both scores in the following sections. Note that the results of domain coverage and dialog efficiency may be combined into a single compound score to attain a single overall characteristic (the eigenvalue) of the assessed dialog system. The ADiEU score relies on a good understanding of the dialog domain, which is described in the form of a domain task ontology. The more expert knowledge is projected into the domain ontology, the more reliable results we expect from the ADiEU score.
2 Capturing Domain Ontology

The cornerstone of our approach is to evaluate spoken and multi-modal dialog systems within a predefined, well-known (and typically narrow) domain. In our labs we have developed many speech and multimodal applications for various domains, such as music selection, TV remote control, in-car navigation and phone control, using grammars, language models and natural language understanding techniques. In order to compare two spoken dialog systems that deal with the same domain, we first describe the domain diligently using the task ontology. This restricted ontology represents the human expert knowledge of the domain and is encoded as a set of tasks with two kinds of relations between the tasks: task generalization and aggregation. Individual tasks are defined as sequences of parameterized actions. Actions are separable units of domain functionality, such as volume control, song browsing or playback. Parameters are categories of named entities, such as album or track title, artist name or genre. Tasks are labeled by weights, which express the relative importance of a particular task with respect to the other tasks. The ontology may also define task aggregations, which explicitly state that a complex task can be realized by sequencing several simpler tasks. Table 1 shows a sample task ontology for the music control domain. For example, the task volume control/relative with a weight of 2 (e.g. "louder, please") is considered more important in evaluation than its absolute sibling (e.g. "set volume to 5"). This may be highly subjective if scored by a single human judge, and thus a consensus of domain experts may be required to converge to a generally acceptable ontology for the domain. Once acknowledged by the community, this ontology could be used as the common etalon for scoring third-party dialog systems.
¹ We call our measurement the Automatic Dialog Evaluation Understudy, ADiEU.
Table 1. Speech-enabled reference tasks for the jukebox domain. Tasks are divided into groups. Both the groups as well as the tasks within each group are assigned relative importance points by an expert. These points are normalized to obtain the per-task contribution to the domain's functionality. ITC shows the ideal turn count range for each task.

Group (points, share)        Tasks
Volume (2, 15.50%)           relative; absolute; mute
Playback (4, 31.01%)         play; stop; pause; resume; next, previous track; next, previous album; media selection
Play mode (0.5, 3.88%)       shuffle; repeat
Media library (6, 46.51%)    browse by criteria; play by criteria; search by genre; search by artist name (up to 100 artists, more than 100 artists); search by album name (up to 200 albums, more than 200 albums); search by song title (up to 250 songs, more than 2000 songs); search by partial names (words, spelled letters); ambiguous entries; query item counts; favorites (browse and play, add items); media management (refresh from media, add or remove media); access online content
Menu (0.4, 3.10%)            quit; switch among other apps
Total                        100%
3 The Proposed Method of ADiEU Evaluation

The actual dialog system evaluation metric that is at the heart of our method consists of two indicators: Domain Coverage (DC), computed over the task ontology, and Dialog Efficiency (DE), which quantifies the outcome of user test sessions. The DC expresses how well the evaluated system covers the set of tasks in the ontology for a particular domain, while the DE indicates the performance of the evaluated system on those tasks supported by the system.
3.1 Scoring of Domain Coverage

The domain coverage (DC) is the sum of the weights of the tasks supported by the system (S) over the sum of the weights of all tasks from the ontology (O):

DC(S, O) = \frac{\sum_{t \in \text{supported tasks}(O)} w_t}{\sum_{t^* \in \text{all tasks}(O)} w_{t^*}}    (1)
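To make Equation (1) concrete, here is a minimal Python sketch of the domain coverage computation; the task names and weights are hypothetical stand-ins, not the full jukebox ontology of Table 1.

    def domain_coverage(ontology_weights, supported_tasks):
        # DC(S, O): weight of the tasks supported by the system over the
        # weight of all tasks in the ontology.
        total = sum(ontology_weights.values())
        covered = sum(w for task, w in ontology_weights.items()
                      if task in supported_tasks)
        return covered / total

    # Hypothetical ontology fragment with per-task weights.
    weights = {"volume relative": 2.0, "volume absolute": 1.0,
               "shuffle": 0.5, "search by artist": 3.0}
    print(domain_coverage(weights, {"volume relative", "search by artist"}))
    # -> 0.769..., i.e. roughly 77% of the ontology weight is covered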
Table 1 shows a sample domain task ontology for the music management domain, giving the raw points assigned by a domain expert and their normalized versions, which are used to assess the relative importance of individual tasks. The expert may control the weights of whole task groups (such as Playback control) as well as the weights of the individual tasks that comprise these groups. Generally, the ontology can have more than the two levels of sub-categorization shown in the example.

3.2 Scoring of Dialog Efficiency

The actual efficiency of a dialog is measured using the number of dialog turns [9, 10] needed to accomplish a chosen task. In spoken dialog systems, a dialog turn corresponds to a pattern of user speech input followed by the system's response. We introduce a generalized penalty turn count (PTC) that measures overall dialog efficiency by incorporating other considered factors: the number of help requests, the number of rejections, and the user and system reaction times.
PTC = TC + \lambda_{NHR} \cdot NHR + \lambda_{NRP} \cdot NRP + \lambda_{URT} \cdot URT + \lambda_{SRT} \cdot SRT    (2)

where TC is the actual dialog turn count, NHR is the number of help requests, NRP is the number of rejections, URT is the user response time, SRT is the system response time, and the lambdas represent the weights of each contributor to the final penalty turn count (PTC).²

The obtained penalty turn count is then compared to an ideal number of turns for a particular task. We define a key property, the ideal number of turns (INT), as being determined by at least the following factors. The INT is (F1) directly proportional to the number of information slots to be filled and (F2) indirectly proportional to the size of the block of information slots commonly accepted as coherent:

INT(t) = \frac{\text{number of information slots to be filled}}{\text{size of a block of information slots commonly accepted as coherent}}    (3)
For example, the concept of "date" consists of three information slots (day, month, and year) that need to be filled. Here, the number of information slots (F1) is three, which is in this case the same as the size of the coherent block expected by users. The INT for the "date" concept is thus 1 (= 3/3). In the current state of the art, the INT property is determined manually by human judgment.

² In our experiments, we set λ_NHR = 0.5, λ_NRP = 1, and λ_URT = λ_SRT = 0, since for the music domain the user reaction time was not indicative of dialog quality and both applications responded instantly.
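The following Python sketch illustrates the PTC and INT computations; the additive weighted-sum form of PTC follows the reconstruction of Equation (2) above, and the lambda defaults mirror the values reported in the footnote.

    def penalty_turn_count(tc, nhr=0, nrp=0, urt=0.0, srt=0.0,
                           l_nhr=0.5, l_nrp=1.0, l_urt=0.0, l_srt=0.0):
        # PTC = TC + l_NHR*NHR + l_NRP*NRP + l_URT*URT + l_SRT*SRT
        return tc + l_nhr * nhr + l_nrp * nrp + l_urt * urt + l_srt * srt

    def ideal_number_of_turns(slots_to_fill, coherent_block_size):
        # INT(t), Equation (3): slots to fill over the size of a block of
        # slots commonly accepted as coherent.
        return slots_to_fill / coherent_block_size

    # The "date" example: three slots (day, month, year), accepted as one
    # coherent block of size three, so INT = 1.
    print(ideal_number_of_turns(3, 3))      # -> 1.0
    print(penalty_turn_count(tc=2, nhr=1))  # -> 2.5 penalty turns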
The actual score of the dialog efficiency (DE score) for an individual task is then computed from the difference between INT and PTC as a fraction of the current PTC, i.e.:

DE(t) = 1 - \max\left(\frac{PTC(t) - INT(t)}{PTC(t)},\ 0\right)    (4)
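A small Python sketch of Equation (4), extended with the averaging over testers and trials described below; the trial data are hypothetical.

    def dialog_efficiency(ptc, int_turns):
        # DE(t) = 1 - max((PTC(t) - INT(t)) / PTC(t), 0), Equation (4).
        return 1.0 - max((ptc - int_turns) / ptc, 0.0)

    def task_dialog_efficiency(trials):
        # Average DE over all human testers and all trials for one task;
        # each trial is a (PTC, INT) pair.
        return sum(dialog_efficiency(p, i) for p, i in trials) / len(trials)

    # Hypothetical "play by artist" trials from three testers.
    trials = [(2.0, 2), (3.5, 2), (2.5, 2)]
    print(task_dialog_efficiency(trials))  # -> about 0.79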
To avoid subjective scoring, we typically use several human testers as well as several trials per task. For example, for the task "play by artist" the following set of trials can be used: "Play something by Patsy Cline", "Play some song from your favorite interpreter", or "Play some rock album, make the final selection by the artist name". Each of these trials has its own ideal number of turns assigned (this is why the INT values for tasks in the ontology are given as ranges in Table 1). The task dialog efficiency score is then computed as an average over all human testers and over the dialog efficiency of each trial. Samples of the trials used in the evaluation of the music management domain are given in Table 2.

3.3 The ADiEU Score

The ADiEU score is then computed as a sum of products of task weight and dialog efficiency over the supported tasks in the domain ontology, normalized by the total weight of the supported tasks, i.e.:
ADiEU(S, O) = \frac{\sum_{t \in \text{supported tasks}(O)} w_t \cdot DE(t)}{\sum_{t \in \text{supported tasks}(O)} w_t}    (5)
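Putting the pieces together, here is a minimal sketch of Equation (5); the task names, weights and DE values are hypothetical.

    def adieu_score(weights, de_scores):
        # Equation (5): weight-normalized sum of w_t * DE(t) over the
        # tasks supported by the system (the keys of de_scores).
        supported = de_scores.keys()
        num = sum(weights[t] * de_scores[t] for t in supported)
        den = sum(weights[t] for t in supported)
        return num / den

    weights = {"volume relative": 2.0, "shuffle": 0.5, "search by artist": 3.0}
    de = {"volume relative": 0.9, "search by artist": 0.7}  # supported tasks only
    print(adieu_score(weights, de))  # -> 0.78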
4 Case Study: ADiEU Scores for the Music Management Domain

We applied the ADiEU scoring to two of our dialog systems, developed at different times and both partially covering the music management dialog domain. Both allow their users to play music by dynamically generating grammars based on meta tags found in the users' mp3 files. The first one, named A-player, is simpler and covers a limited part of the music management domain. The second, named Jukebox, covers a larger part of the domain and also allows free-form input using a combination of statistical language models and maximum entropy based action classifiers. For both applications, we collected input from a group of 10 speakers who were asked to accomplish the tasks listed in Table 2. Each of these user tasks corresponded to a task in the domain task ontology, and there was at least one user task per ontology task that was supported by either A-player or Jukebox. The subjects were given general guidance, but no sample English phrases that could be used to control the system were suggested to them. In order not to guide users even by the wording of the user tasks, the tasks were described to them in their native language. All ten subjects were non-native but fluent English speakers.
Table 2. Specific tasks to be accomplished by speakers using A-player and Jukebox

Task                                          A-player   Jukebox   ITC
Start playback of arbitrary music             x          x         1
Increase the volume                                      x         1
Set volume to level 10                                   x         1
Mute on                                                  x         1
Mute off                                                 x         1
Pause                                                    x         1
Resume                                                   x         1
Next track                                    x          x         1
Previous track                                x          x         1
Shuffle                                       x          x         1
Play some jazz song                                      x         1
Play a song from Patsy Cline                  x          x         1
Play Iron Man from Black Sabbath              x          x         1
Play the album The Best of Beethoven          x          x         1
Play a song Where the Streets Have No Name    x          x         1
Play a song Sonata no. 11 (ambiguous)         x          x         2
Play a rock song by your favorite artist      x          x         3
Reload songs from media                                  x         1
Table 3. Computation of coverage, task completion score and ADiEU for A-player and Jukebox, over the following ontology tasks: volume relative; volume absolute; mute; play; stop; pause; resume; next, prev. track; next, prev. album; shuffle; browse by criteria; play by criteria; search by genre; search by artist (<= 100 artists, > 100 artists); search by album (<= 200 albums, > 200 albums); search by song (<= 250 songs, > 2000 songs); word part. search; ambiguous entries; media refresh
Table 3 shows the computation of the ADiEU score and its components: domain coverage (DC) and dialog efficiency (DE). For A-player, which is limited in functionality, the weighted domain coverage only reached 43.99%, whereas for Jukebox
this was 83.17%. On the other hand, A-player allowed its users to accomplish the tasks it supported more quickly than Jukebox; this is documented by the weighted dialog efficiency score reaching 82.6% for A-player and 66.7% for Jukebox. This was mainly due to Jukebox being more interactive (e.g. asking questions, presenting choices) and due to a slightly higher error rate of a dictation-based system as opposed to a grammar-based one. The overall ADiEU score was higher for Jukebox (55.4%) than it was for A-player (36.3%). This was in accord with the feedback we received from users from ongoing evaluations who claimed they had better experience with the Jukebox application. The two major reasons were the support of free-form commands by the Jukebox and its broader functionality.
5 Human Evaluation in Progress

The HCI methodology [10] advocates several factors that human judges collect in the process of dialog system evaluation. These key indicators include accuracy, intuitiveness, reaction time, and efficiency. When designing the evaluation method, we attempted to incorporate the core of these indicators into the scoring method to ensure a good correlation of the ADiEU metric with human judgment. We are currently collecting data from an evaluation test where the human judges act as personas [11]. The results of the evaluation will either confirm or reject the assumption that the ADiEU scoring correlates with human judgment.
6 Practical Considerations of the ADiEU Scoring

The application of the ADiEU scoring to an arbitrary dialog system has several practical considerations. Generally, there are two possibilities for evaluating a third-party dialog system with our metric: 1) an agreed API contract supported by the external system, or 2) rich enough tracing and logging information. Both approaches will typically require cooperation with the supplier of the measured system. The API approach assumes there exists a runtime API that supports, e.g., simulating input to the system, changing the dialog state, obtaining notifications about dialog state changes with sufficient introspection, and the possibility to read the output of the system. The logging approach demands that the application write all the required information to a log file, ideally in a format compliant with the ADiEU score measuring tool. This usually means tight cooperation with the dialog system engineers, but it is easier and more straightforward than changing the application API in case it does not provide access to all the information needed by the ADiEU metric. Having the test run in the form of a log has the advantage of the possibility to send the logs to a scoring tool hosted as a web service, and the possibility to evaluate the system against multiple domain ontologies or against multiple ontology versions of the same domain. We have experimented with both approaches while evaluating our systems.
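As an illustration of the logging approach, the sketch below parses a hypothetical line-oriented session log into the counts needed for the PTC of Equation (2). The log format and event names are invented for this example; the paper does not specify the format expected by the ADiEU measuring tool.

    def parse_session_log(lines):
        # Count dialog turns, help requests and rejections from a
        # hypothetical log with one event per line: TURN, HELP, REJECT.
        counts = {"TURN": 0, "HELP": 0, "REJECT": 0}
        for line in lines:
            event = line.strip().split()[0]
            if event in counts:
                counts[event] += 1
        return counts

    log = ["TURN user: play something by Patsy Cline",
           "REJECT low-confidence recognition",
           "TURN user: play Patsy Cline",
           "HELP user asked what can be said",
           "TURN user: play Patsy Cline greatest hits"]
    print(parse_session_log(log))  # -> {'TURN': 3, 'HELP': 1, 'REJECT': 1}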
7 Conclusion

We introduced a method for the quantitative evaluation of spoken dialog systems that utilizes the domain knowledge encoded by a human expert. The evaluation results are
described in the form of a comparison metric consisting of domain coverage and dialog efficiency scores, allowing comparison of the relative as well as absolute performance of a system within a given domain. This approach has the advantage of comparing incremental improvements on an individual dialog system, which the dialog designer may want to verify along the way. In addition, the method allows cross-checking the performance of third-party dialog systems operating on the same domain, and immediately reveals the strong and weak points in the dialog design. Human evaluations are currently being conducted to estimate the correlation between the ADiEU score and human judgment. The subjectivity of human scoring and the consensus on the ontology coverage are subjects of further investigation.
References

1. Weizenbaum, J.: ELIZA – A Computer Program for the Study of Natural Language Communication between Man and Machine. Communications of the Association for Computing Machinery 9, 36–45 (1966)
2. Allen, J., Chambers, N., Ferguson, G., Galescu, L., Jung, H., Swift, M., Taysom, W.: PLOW: A Collaborative Task Learning Agent. In: Twenty-Second Conference on Artificial Intelligence, AAAI 2007 (2007)
3. Cassell, J., Stocky, T., Bickmore, T., Gao, Y., Nakano, Y., Ryokai, K.: MACK: Media Lab Autonomous Conversational Kiosk. In: Imagina 2002 (2002)
4. Graesser, A.C., VanLehn, K., Rosé, C.P., Jordan, P.W., Harter, D.: Intelligent Tutoring Systems with Conversational Dialogue. AI Magazine 22(4), 39–51 (2001)
5. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (International edn.). Prentice-Hall, Englewood Cliffs (2000)
6. Gandhe, S., Traum, D.: Evaluation Understudy for Dialogue Coherence Models. In: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, Columbus, Ohio, June 2008, pp. 172–181. Association for Computational Linguistics (2008)
7. Walker, M., Kamm, C., Litman, D.: Towards Developing General Models of Usability with PARADISE. Natural Language Engineering 6(3-4), 363–377 (2000)
8. Hajdinjak, M., Mihelič, F.: The PARADISE Evaluation Framework: Issues and Findings. Computational Linguistics 32(2), 263–272 (2006)
9. Le Bigot, L., Bretier, P., Terrier, P.: Detecting and Exploiting User Familiarity in Natural Language Human-Computer Dialogue. In: Asai, K. (ed.) Human Computer Interaction: New Developments, pp. 269–382. InTech Education and Publishing (2008); ISBN: 978-953-7619-14-5
10. Nielsen, J.: Heuristic Evaluation. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection Methods, pp. 25–64. John Wiley & Sons, New York (1994); ISBN: 0-471-01877-5
11. Carroll, J.: Human Computer Interaction in the New Millennium. ACM Press, New York (2001)
Interpretation of User Evaluation for Emotional Speech Synthesis System

Ho-Joon Lee and Jong C. Park

Computer Science Department, KAIST, 335 Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea
hojoon@nlp.kaist.ac.kr, park@cs.kaist.ac.kr
Abstract. Whether it is for human-robot interaction or for human-computer interaction, there is a growing need for an emotional speech synthesis system that can provide the required information in a more natural and effective manner. In order to identify and understand the characteristics of basic emotions and their effects, we propose a series of user evaluation experiments on an emotional prosody modification system that can express either perceivable or slightly exaggerated emotions classified into anger, joy, and sadness, as an independent module for a general-purpose speech synthesis system. In this paper, we propose two experiments to evaluate the emotional prosody modification module according to different types of initial input speech. We also provide a supplementary experiment to understand the apparently prosody-independent emotion, joy, by replacing the re-synthesized joy speech with original human voice recorded in the emotional state of joy.

Keywords: Emotional Speech Synthesis, User Evaluation, Emotional Prosody Modification, Affective Interaction.
In order to identify and understand the characteristics of these emotions and their effects, we propose in this paper a series of user evaluation experiments on an emotional prosody modification system that can express either perceivable or slightly exaggerated emotions as an independent module for general purpose speech synthesis systems.
2 Emotional Speech Synthesis System
For the analysis of the prosody structure at a more precise level of units, we annotated the Korean emotional speech corpus, distributed by the Speech Information Technology & Industry Promotion Center [4], with the K-ToBI labeling system. This speech corpus was recorded by six professional actors and actresses in a sound-proof room, and is composed of ten emotionally neutral sentences spoken with six different emotions (joy, anger, sadness, fear, boredom, and neutral). An AKG C414-B ULS microphone was used with a 16 kHz sample rate, and each utterance was stored in 16-bit Windows wave format. We used eight sentences spoken by six speakers, as described in Table 1, considering four emotions (joy, anger, sadness, and neutral). The number of Ejeols (words separated by a space) was evenly distributed from 1 to 6.

Table 1. Eight sentences used for prosody structure analysis
Ejeol   Sentence
1       예. (Yes.)
1       아니요. (No.)
2       나도 몰라. (I don't know either.)
3       야, 이제 그만하자. (See, let's end it now.)
3       정말 그렇단 말이야. (It really is.)
4       지금 어디 가는 거야? (Where are you going now?)
5       이건 내가 원하던 게 아니야. (This is not what I wanted.)
6       난 가지 말라고 하면서 문을 닫았어. (I shut the door closed asking her not to leave.)
The Korean emotional speech corpus had passed the manufacturer's perception test, performed by twenty subjects (eighteen males, two females); Table 2 below shows the results. Among the emotions, anger turned out to be the most perceivable emotion (94.3%), and fear the most confusing one (80.3%). However, the overall acceptance rate is more than 80%. For the analysis of dominant emotional prosody patterns, we annotated eight sentences spoken by six speakers with four emotions, or 192 pieces of speech in total, with the K-ToBI labeling system [5]. For the statistical verification of the K-ToBI labeled data, we performed Pearson's Chi-square tests. As shown in Fig. 1, the results support the hypothesis that each emotion has distinct Intonational Phrase (IP) boundary patterns that can distinguish one emotional state from the rest. Then we calculated adjusted residuals to find the distinct pitch contour pattern or patterns. If the calculated value of the adjusted residual is bigger than 2, that feature can be statistically
interpreted as the dominant pattern of a certain emotion. Pearson's Chi-square tests and the adjusted residuals were computed with the SPSS software. From the statistical analyses of the pitch contour patterns, we were able to find very strong tendencies between anger and HL%, joy and LH%, sadness and H%, and neutral and L%.
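For readers without SPSS, the same chi-square test and adjusted residuals can be computed in Python as sketched below. The contingency counts here are hypothetical, since the paper reports only the resulting tendencies; the |residual| > 2 criterion follows the text above.

    import numpy as np
    from scipy.stats import chi2_contingency

    def adjusted_residuals(table):
        # Haberman's adjusted residuals for an r x c contingency table.
        table = np.asarray(table, dtype=float)
        _, _, _, expected = chi2_contingency(table)
        n = table.sum()
        rows = table.sum(axis=1, keepdims=True)
        cols = table.sum(axis=0, keepdims=True)
        return (table - expected) / np.sqrt(
            expected * (1 - rows / n) * (1 - cols / n))

    # Hypothetical counts: 4 emotions (rows) x 4 IP boundary tones (columns
    # HL%, LH%, H%, L%), shaped to echo the reported tendencies.
    counts = [[35, 4, 5, 4],   # anger   -> HL%
              [3, 36, 5, 4],   # joy     -> LH%
              [4, 5, 34, 5],   # sadness -> H%
              [4, 5, 4, 35]]   # neutral -> L%
    chi2, p, dof, _ = chi2_contingency(counts)
    print(chi2, p, dof)
    print(adjusted_residuals(counts).round(1))  # |value| > 2 marks dominance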
Fig. 1. Chi-square test and adjusted residual calculation results

Table 2. Perception test result done by twenty subjects

Speaker   Neutral   Joy    Anger   Sadness   Fear   Boredom
CWJ       89.5      93.5   88.5    85.5      59.0   93.0
KKS       62.5      90.5   92.0    80.5      85.5   82.0
LHJ       83.5      67.5   98.0    84.5      88.5   84.0
MYS       84.5      91.5   90.0    89.5      93.5   81.0
PYH       85.0      95.0   99.0    94.0      61.5   94.5
YSW       95.4      89.5   98.5    89.5      93.5   81.0
Average   83.3      87.9   94.3    87.3      80.3   85.9
To incorporate these distinct Intonational Phrase boundary patterns for the different emotional states, we propose a prosody-unit-level emotional prosody modifier that produces distinct pitch contours, intensity contours, and speech durations according to the three different emotional states: anger, joy, and sadness. The emotional prosody modifier is a simple, coarse-grained prosody re-synthesis module that consists of a pitch contour mapping function, a pitch exaggeration function, an intensity variation
function, and a duration variation function. We set the empirical value of each prosodic parameter based on previous findings in the literature [1, 2], also taking into account language-specific phenomena for Korean, including the speaker's gender information, short and long vowel sound disambiguation [6, 7], and the prosodic structure of discourse markers [8], captured from various Korean speech corpora. Equation 1 below shows the algorithm of our pitch contour modification function. This pitch contour modification function generates the base emotional pitch contour of speech, including the synthesized results of Text-to-Speech (TTS) systems and recorded human voice, for each emotion.

(1)

where t ∈ [t1, t2];
y   is the original pitch value as a function of time t;
y′  is the modified pitch value;
a   is the maximum/minimum pitch range;
b   is the initial position of the pitch contour;
c   is the final position of the pitch contour (rising tone: 0.5, rising-falling: 1); and
d   is the declination/ascent level.
After the modification of the base emotional pitch contour, we apply a pitch exaggeration function to characterize the difference in pitch variation according to the difference in emotion types. First, this module detects eight pitch points per unit. Then we exaggerate the difference in each pitch pair by adding 6 Hz for joy and anger, and 40 Hz for fear and sadness. Next, we adjust the intensity with the intensity contour modification function, which is similar to the pitch contour modification function in Equation 1, but much simpler. Then we control the duration of each unit while preserving the intrinsic value of f0. All four of these modules are implemented in a PRAAT [9] script, supporting not only commercial TTS systems but also recorded human voice. We used the Python language for the interface between the PRAAT software and the TTS output or human voice, and therefore this module supports both Linux and Windows environments.
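To make the pitch exaggeration step concrete, here is a minimal Python sketch operating on a list of eight detected pitch points. The interpretation that the 6 Hz / 40 Hz offsets widen each consecutive pitch difference in its existing direction is our reading of the description above, and the input values are hypothetical.

    # Emotion-specific offsets from the text: 6 Hz for joy and anger,
    # 40 Hz for fear and sadness.
    OFFSETS = {"joy": 6.0, "anger": 6.0, "fear": 40.0, "sadness": 40.0}

    def exaggerate_pitch(points, emotion):
        # Shift each point away from its original predecessor by the
        # emotion-specific offset, keeping the first point fixed.
        offset = OFFSETS[emotion]
        out = [points[0]]
        for prev, cur in zip(points, points[1:]):
            direction = 1.0 if cur >= prev else -1.0
            out.append(cur + direction * offset)
        return out

    # Eight hypothetical pitch points (Hz) detected for one prosody unit.
    unit = [180.0, 195.0, 210.0, 205.0, 190.0, 185.0, 200.0, 175.0]
    print(exaggerate_pitch(unit, "joy"))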
Fig. 2. Pitch and intensity traces of original speech, spoken in a neutral emotional state
Fig. 3. Pitch and intensity traces of prosody-modified speech in a sad emotional state
Fig. 2 shows the prosody trace of a recorded Korean utterance "이건 내가 원하던 게 아니야.", which means in English "This is not what I wanted.", spoken neutrally by a professional actress, and Fig. 3 shows its prosody trace modified to a sad emotional state, produced by our emotional prosody modifier. The blue line (upper line) indicates the pitch contour, and the green line (lower line) the intensity. In Fig. 3, the entire duration is lengthened from 1.753 seconds to 2.805 seconds without any side effect such as f0 contour lowering. The pitch contour is spread more widely, and the intensity is weakened.
3 Evaluation of Emotional Speech Synthesis System
For the identification and understanding of the characteristics of the three basic emotions and their effects, we prepared three stages of experiments. The first and second experiments are designed to evaluate the emotional prosody modifier according to different types of initial input speech, namely monotonous-prosody speech and excited-prosody speech. The supplementary experiment is performed to identify the apparently prosody-independent emotion. The subjects of these three experiments were fourteen kindergarten teachers, twelve of them female and two male, 29.6 years old on average. We did not carry out any prior training of the fourteen subjects, and the answers were not disclosed to the subjects after the experiments. At the beginning of the experiments, subjects were asked to choose the one most likely emotion among anger, joy, sadness, and neutral. We used five semantically neutral sentences, as shown in Table 3. For the first experiment, five neutrally recorded speech files were used as monotonous input speech, and the emotional prosody modifier produced fifteen results with the three emotional states. The test sequences of the first and second experiments were randomly organized.

Table 3. Input sentences for the evaluation of the emotional prosody modifier

야, 이제 그만하자. (See, let's end it now.)
정말 그렇단 말이야. (It really is.)
지금 어디 가는 거야? (Where are you going now?)
이건 내가 원하던 게 아니야. (This is not what I wanted.)
난 가지 말라고 하면서 문을 닫았어. (I shut the door closed asking her not to leave.)
Table 4 shows the evaluation results of the emotional prosody modification with monotonous input speech. From the analysis of the results of the first experiment, we find that anger is very sensitive to the emotional prosody structure (80% perception rate). Sadness also shows a strong relationship with the prosody structure. It is rather surprising to note that none of the subjects perceived joy from the monotonous input speech, even though we modified the prosody structure for joy based on the analyses of real speech, exactly as we did for anger and sadness.

Table 4. Evaluation result for monotonous input speech

Intended \ Perceived   Anger        Joy          Neutral      Sadness      Total
Anger                  56 (80.0%)   3 (4.3%)     6 (8.6%)     5 (7.1%)     70
Joy                    12 (17.1%)   0 (0%)       16 (22.9%)   42 (60.0%)   70
Sadness                4 (5.7%)     2 (2.9%)     23 (32.9%)   41 (58.6%)   70
For the second experiment, we used five pieces of excited voice as the input for the emotional prosody modifier, and generated fifteen randomly organized test sets. Table 5 shows the results of the second perception experiment.

Table 5. Evaluation result for excited input speech

Intended \ Perceived   Anger        Joy          Neutral      Sadness      Total
Anger                  56 (80.0%)   7 (10.0%)    6 (8.6%)     1 (1.4%)     70
Joy                    18 (25.7%)   15 (21.4%)   15 (21.4%)   22 (31.4%)   70
Sadness                3 (4.3%)     38 (54.3%)   18 (25.7%)   11 (15.7%)   70
Interestingly, anger preserved its prosody sensitivity when the type of input was changed from monotonous-prosody speech to excited-prosody speech. From the second experiment, two major changes were observed: an increase in the perception rate of joy, and a decrease in the perception rate of sadness. The decrease in the perception rate of sadness may have been caused by the sudden change of the test environment. In order to identify the cause of this sudden change, we proposed the third experiment. Moreover, the observed increase in the perception rate of joy was still very weak. To identify the characteristics of the emotional prosody structure of joy, and to validate the above hypothesis on the sudden change of sadness, we performed the third experiment with the same subjects and in the same sequence as the second experiment. The only difference between the second and third experiments was the replacement of the modified joy speech with the original human voice recordings in the emotional state of joy, which had passed the manufacturer's perception test at the rate of 91.5%.
Table 6. Evaluation result for repeated test with human voice recordings

Intended \ Perceived   Anger        Joy          Neutral      Sadness      Total
Anger                  58 (82.9%)   7 (10.0%)    4 (5.7%)     1 (1.4%)     70
Joy                    32 (45.7%)   12 (17.1%)   15 (21.4%)   11 (15.7%)   70
Sadness                10 (14.3%)   18 (25.7%)   19 (27.1%)   23 (32.9%)   70

After the third perception test, we made three interesting interpretations from the results shown in Table 6. First, the same sequence in the repeated experiment did not seem to influence the perception rate of anger. There was only a slight movement from neutral to anger. This allows us to define anger as a primarily prosody-sensitive emotion. Second, we found that some part of the decreased perception rate was due to the sudden change of the test environment. So it is a possible interpretation that there was a confusion of sadness in the second experiment. Despite the result of the second experiment, it appears that sadness is also a prosody-sensitive emotion. Third, and most important, we could not find any meaningful relationship between the prosody structure and the emotion of joy, even though we used real voice which had passed the manufacturer's perception test at the rate of 91.5%. This leads us to conclude that joy is not a prosody-sensitive emotion, which forces us to find other, effective approaches to express the emotion of joy through an emotional spoken language generation system.
4 Discussion
For the accurate understanding of each evaluation result, a quantitative comparison method that can also describe the influence of wrong answers is called for. For example, the perception rate related to anger in the first experiment is just equal to that of the second experiment, but for the same category it is very hard to figure out the influence of errors such as joy and sadness. For this kind of interpretation, including error analysis, we suggest a Euclidean distance based quantitative comparison method. Fig. 4 describes a Euclidean distance model of a tetrahedron designed for the analysis of four types of category.
Fig. 4. Euclidean distance model for tetrahedron
From this point of view, we can calculate and compare each distance for the results in Table 4, Table 5, and Table 6. When the size of n is 70, the maximum distance for each category is approximately 98.99, and the minimum distance is 0.

Table 7. Euclidean distance of Table 4

Intended \ Vertex   Anger   Joy     Neutral   Sadness
Anger               16.31   87.67   85.24     86.06
Joy                 73.38   84.05   69.46     34.41
Sadness             81.06   82.76   62.53     37.28
Table 8. Euclidean distance of Table 5

Intended \ Vertex   Anger   Joy     Neutral   Sadness
Anger               16.79   84.51   85.33     89.34
Joy                 60.32   63.70   63.70     55.48
Sadness             79.86   38.44   65.41     72.51
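The tetrahedron distances in Tables 7 and 8 can be reproduced from the perception counts, as the following Python sketch shows; the input row below uses the anger counts from Table 4 and reproduces the first row of Table 7.

    import math

    def distance_to_vertex(counts, target, n=70):
        # Euclidean distance from a response vector to the ideal vertex
        # where all n subjects chose `target` (cf. the tetrahedron model).
        ideal = {c: (n if c == target else 0) for c in counts}
        return math.sqrt(sum((counts[c] - ideal[c]) ** 2 for c in counts))

    anger_row = {"anger": 56, "joy": 3, "neutral": 6, "sadness": 5}  # Table 4
    for target in ("anger", "joy", "neutral", "sadness"):
        print(target, round(distance_to_vertex(anger_row, target), 2))
    # -> anger 16.31, joy 87.67, neutral 85.24, sadness 86.06 (Table 7, row 1)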
Considering both correct answers and errors, we conclude that synthesized anger based on the monotonous input speech is slightly closer to the position of anger than that based on the excited speech, even though they have the same perception rate. And for the synthesis of anger, the change of the initial input speech from monotonous to excited decreases the distance to joy by 3.16, but increases the distance to neutral by 0.09 and the distance to sadness by 3.28.
5 Conclusion
In this paper, we proposed an emotional prosody modification system, and evaluated the performance of the system in order to find a relationship between prosody structures and emotions. First, we proposed a prosody-unit-level emotional prosody modification system that produces distinct pitch contours, intensity contours, and speech durations according to three different emotional states: anger, joy, and sadness. During the evaluation process, anger and sadness were identified as prosody-sensitive emotions, whereas joy was not. Consequently, this difference led us to systematically discover the possibilities and limitations of prosody modification for the generation of emotional spoken language expressions. Further analyses of emotional speech data are necessary, taking into account various speakers, speaking environments, and speaking styles. More organized evaluation and interpretation strategies are also needed for further work.

Acknowledgments. This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs, funded by the Korea Ministry of Knowledge Economy.
References

1. Schröder, M.: Emotional Speech Synthesis: A Review. In: Eurospeech 2001, vol. 1, pp. 561–564 (2001)
2. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.G.: Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18(1), 32–80 (2001)
3. Lee, H.-J., Park, J.C.: Customized Message Generation and Speech Synthesis in Response to Characteristic Behavioral Patterns of Children. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4552, pp. 114–123. Springer, Heidelberg (2007)
4. SiTEC Emotional Speech Corpus, http://www.sitec.or.kr/English/index.asp
5. Jun, S.-A.: K-ToBI (Korean ToBI) Labeling Convention. Korean Journal of Speech Science 7 (2000)
6. Lee, H.-J., Park, J.C.: Lexical Disambiguation for Intonation Synthesis: A CCG Approach. In: Korean Society for Language and Information, pp. 103–118 (2005)
7. Lee, H.-J., Park, J.C.: Vowel Sound Disambiguation for Proper Intonation Synthesis. In: 19th Pacific Asia Conference on Language, Information and Computation, pp. 131–142 (2005)
8. Lee, H.-J., Park, J.C.: Characteristics of Spoken Discourse Markers and their Application to Speech Synthesis Systems. In: 19th Annual Conference on Human and Cognitive Language Technology, pp. 254–260 (2007)
9. PRAAT, http://www.praat.org
Multi-level Validation of the ISOmetrics Questionnaire Based on Qualitative and Quantitative Data Obtained from a Conventional Usability Test

Jan-Paul Leuteritz¹, Harald Widlroither¹, and Michael Klüh²
Abstract. Qualitative and quantitative data, collected during a usability evaluation of two innovative prototypes of a small-display touch-screen device, have been used to perform a multi-level assessment of the questionnaires used within the trial. The use of different validation methods is depicted and discussed concerning their advantages and disadvantages. The conclusions from the validation study are presented, revealing that using the ISOmetrics for testing uncommon prototypes may result in insufficient validity of the instrument.

Keywords: Validity, questionnaire, ISOmetrics, AttrakDiff, small display devices, shower control.
under repeating conditions; they work, for example, with the same user group or a similar test pattern, or they usually evaluate prototypes from a certain line of products. Hence, they could use data from their own tests to cross-validate their survey instruments and see what kind of information these yield. This solution is fine, as long as the cross-validation procedure does not consume too much effort. In order to find out whether such an approach could be recommendable, the Fraunhofer Institute of Industrial Engineering (Fraunhofer IAO) conducted the study described in this article. An evaluation project commissioned by the German shower technology manufacturer Hansgrohe AG served as the basis of the multi-level validation approach. A usability test design was developed that would not just answer the respective evaluation questions but would also provide data for multi-level validation procedures of the questionnaires used. Attention was paid to keeping the additional effort that served only the validation task as low as possible. This article presents the outline of the evaluation study and the detailed results of the multi-level validation approach. It aims at inviting other usability professionals to use and/or refine this method.
1.2 The Evaluation Project

The devices to be tested were two prototypes of a wall-mounted device for controlling the different functions of a modern comfort shower: hand showers, overhead-mounted shower plates offering various combinations of water rays, wall-mounted shower heads, steam-bath functions, coloured lighting, and a music player. The designs, including the interaction concept, had been created by Phoenix Design GmbH & Co. KG, Stuttgart. Prototype A (Fig. 1) was a touch-screen device that featured two additional buttons and a pusher-and-rotator switch. Prototype B (Fig. 2) had a smaller screen that did not respond to touch input. It was instead controlled by a number of buttons, including a
306
J.-P. Leuteritz, H. Widlroither, and M. Klüh
set of four arrow-buttons, an “OK”-button, a “menu”-button, a back-button in form of a u-turn-arrow. Prototype B also featured the pusher-rotator switch. The usability test was meant to identify the prototype with the better usability, which would then be finalised, while the other prototype would be discarded. Furthermore, the test had to provide information on how to improve the better prototype in the next design development phase.
2 Theoretical Background
2.1 Definition of Usability
The definition of usability on which this validation study is based was taken from ISO 9241-11. The main advantage of ISO 9241 is that it is an international standard and therefore widely accepted. Furthermore, other definitions of usability (such as Nielsen's, see Nielsen 1993) seemed less adequate for a validation study, as it was suspected that their subordinate constructs might not be independent factors and would hence increase the preparatory effort to be undertaken. ISO 9241-11 defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.”
2.2 Measuring Usability
According to ISO 9241, effectiveness and efficiency are best measured by so-called objective data, i.e., behavioural data such as error rates or the time needed to complete a task. Such data can be collected during a standardised experiment. The measurement of satisfaction is more difficult, because satisfaction is the user's subjective reaction to the interaction with the product (ISO 9241). Hassenzahl (2004) states that user satisfaction is an emotion resulting from the user comparing his expectations of the system to his actual experiences with it. Satisfaction can therefore only be measured by asking the user about his feelings towards the system. Based on this argumentation, it was assumed that
• the most valid measure or criterion for effectiveness of use would be the number of tasks people were not able to finish by themselves;
• the most valid measure or criterion for efficiency of use would be either the number of mistakes people made during the trial or the time they needed to complete all tasks;
• the most valid measure or criterion for the user's satisfaction with the interface would be either the result of a questionnaire, most probably a semantic differential, or a quantified item on their preference or choice of prototype after the test.
2.3 Selection and Purpose of the Questionnaires
After collecting information about the available psychometric instruments, it was decided to use two questionnaires within the study:
1. The ISOmetrics (Gediga & Hamborg, 1999), which is intended to measure usability using the set of seven dimensions for the design of dialogue systems defined in ISO 9241-10. It is a five-point Likert-scale questionnaire. As the experiment focused on the dialogues of the shower system, the ISOmetrics seemed adequate, and as it is based on the ISO standard, it was expected to fit well into the theoretical approach chosen. Given the definitions of criteria above, the ISOmetrics was in this study not the main source of usability measures but rather an additional instrument whose validity was to be examined. It was planned to compare the questionnaire results with the criteria for effectiveness and efficiency and with the qualitative data collected during the test.
2. The second questionnaire, the AttrakDiff (Hassenzahl et al., 2003), is a seven-point semantic differential questionnaire intended to measure the attractiveness of a system to a user. Although Hassenzahl et al. (2003) do not directly state that the AttrakDiff measures satisfaction, the construct of attractiveness seems to reflect quite well the whole range of expectations a user can have. Hence, this was the instrument selected for the measurement of satisfaction. Validating the AttrakDiff in this context was more difficult, because there is hardly a better criterion for users' emotions towards a technical system than their responses to an emotion-focused questionnaire. The only other criterion is the subsequent behaviour towards the system after the test – the motivation to carry on interacting with the system. This is reflected in a quantitative preference judgement, which was therefore selected as the criterion for the AttrakDiff.
3 Method
3.1 Sample
22 users (12 women, 10 men) participated in the study, each providing both quantitative and qualitative data for the validation project. Their mean age was 39.1 years (SD = 14.5 years). The sample consisted of 10 potential customers, 4 elderly users (60+, selected for their lack of experience with information technology) and 8 additional users from the Fraunhofer Institute.
3.2 Experimental Setting
The prototypes were simulated on a touch-screen monitor mounted in the wall of a trade-fair mock-up of a shower cabin. The test was done without water pouring from the showers, and the users wore normal clothing. A video of the shower's functions was therefore shown at the beginning of the test. Each participant tested both prototypes; the sequence was matched according to person characteristics. Each prototype test consisted of a set of tasks the participants had to complete and a questionnaire given after completion of the task set. The experiment ended with final questions asking for a comparison between the tested devices. It was ensured that every participant completed all tasks. Whenever a participant was unable to complete a task by himself, the experimenter provided the information for the next step and placed a marker in the log file, indicating that help had been given. If the participant was able to continue by himself after receiving a hint, no further advice was given. Otherwise, all assistance needed to complete the task was rendered. Participants were instructed to complete each task as fast as possible, without thinking aloud or giving comments; this was to guarantee the reliability of the time measures. The test was conducted in German, including instructions and questionnaires. Each test lasted between 90 and 120 minutes. All tests were conducted by the same instructor, using written instructions. The first trials were supervised.
3.3 Variables Collected
For each participant's interaction with each prototype, the number of tasks was counted that he/she could not complete without the help of the test instructor (number of hints). For every task of each participant, the number of errors1 they committed was counted and the time to complete the task was measured, using an automatic logging technique. The questionnaire given to the participants after each of their two trials contained:
1. The ISOmetrics in a shortened version. Items that did not apply to shower controls had been deleted. The subscale “suitability for individualization” had been removed entirely, as none of its items fit. This shortened version is referred to as ISOmetricsSDD (ISOmetrics for small display devices) in the text below.
2. The AttrakDiff in its full version.
3. Additional items, including
• one item to determine which of the two prototypes the user would prefer in the end, and
• one item asking to quantify the superiority of the preferred prototype on a five-point scale.
Qualitative data were taken from the participants' statements and comments during and after each task. All test sessions were videotaped in order to allow a thorough analysis of all the statements the users gave and all their actions, including errors that did not appear in the log files (e.g., touching the screen of prototype B).
4 The Validation Procedure and Its Results
4.1 Reliability
As the instruments were not new but commonly used ones, no attention was paid to the factorial structure of the answers. The reliability of the results was calculated mainly to exclude a reliability problem that would render all validation attempts useless. Cronbach's α was chosen because a correct factorial structure of the instruments had been assumed.
1 “Errors” were all intended button pushes that did not contribute to the solution of the task. Due to the specifications of the log file, special exceptions were phrased to exclude, for example, unnecessary rotating of the pusher-and-rotator switch from the error count.
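For reference, Cronbach's α for a scale of k items follows from the item variances σi² and the variance σX² of the summed scale score:

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right) \]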
Table 1. Reliability Estimation of the ISOmetricsSDD subscales, using Cronbach's α

Scale | No. of items | Prototype A | Prototype B
Suitability for the task | 7 | .70 | .90
Self-descriptiveness | 4 | .75 | .87
Controllability | 4 | .68 | .81
Conformity with user expectations | 5 | .76 | .78
Error tolerance | 3 | .48 | .74
Suitability for learning | 4 | .79 | .90
4.2 Content Validity
A survey of three usability experts from Fraunhofer IAO, conducted before the usability evaluation of the shower prototypes, did not yield any majority vote calling for the deletion of a specific item from, or the addition of a specific item or aspect to, the ISOmetricsSDD. The lowest mean estimation of a subscale's validity was 82% (see Table 2). Additionally, it has to be noted that the interviewed specialists did not know the shower control prototypes and hence demanded the inclusion of items that would generally be useful but had no application in this study.
Table 2. Consolidated ratings of the content validity of the ISOmetricsSDD
Scale | No. of evaluators requesting a change | Mean estimation of validity | Number of items to eliminate | Number of aspects missing
Suitability for the task | 1 | 83% | 0 | 2
Self-descriptiveness | 2 | 82% | 0 | 1
Controllability | 1 | 90% | 2 | 2
Conformity with user expectations | 0 | 88% | 0 | 1
Error tolerance | 0 | 95% | 0 | 0
Suitability for learning | 2 | 85% | 1 | 2
4.3 Criterion-Based Validity
Extreme-group validation
The ISOmetricsSDD questionnaire was clearly able to identify the “better” prototype, preferred by 20 of 22 participants. Prototype A yielded a significantly higher sum score (4.12, SD = 0.50) than Prototype B (3.29, SD = 0.32); t(21) = 5.90, p < .001. Hence, using the ISOmetricsSDD would have led to the correct decision about which prototype to discard.
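Since each participant rated both prototypes, this is a paired comparison. A minimal sketch of the test in Python follows; the score vectors are illustrative placeholders, not the study's data:

import numpy as np
from scipy import stats

# ISOmetricsSDD sum scores per participant (N = 22), one value per
# prototype; the numbers here are placeholders, not the study data.
scores_a = np.array([4.1, 4.5, 3.9, 4.3, 4.0, 4.6, 3.8, 4.2, 4.4, 4.1, 3.7,
                     4.5, 4.0, 4.3, 4.2, 3.9, 4.6, 4.1, 4.0, 4.4, 3.8, 4.2])
scores_b = np.array([3.3, 3.1, 3.5, 3.2, 3.4, 3.0, 3.6, 3.3, 3.2, 3.5, 3.1,
                     3.4, 3.3, 3.0, 3.6, 3.2, 3.4, 3.3, 3.1, 3.5, 3.2, 3.4])

# Paired (dependent) samples: each participant contributes one score
# per prototype, giving df = N - 1 = 21, as in the reported t(21).
t, p = stats.ttest_rel(scores_a, scores_b)
print(f"t(21) = {t:.2f}, p = {p:.4f}")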
Correlation of problem counts and subscale means
Another validation method applied here repeated a procedure that had already been used in a study reporting satisfactory validity of the ISOmetrics questionnaire (Ollermann, 2004). A category system was created for all the usability problems encountered. Sources were the statements of the participants, the notes of the test instructor, and the log files. For each problem category, the number of occurrences was counted. Each problem category (40 for prototype A and 31 for prototype B) was then assigned to one ISOmetricsSDD subscale. Afterwards, for each subscale the numbers of appearances of each assigned problem category were summed. This way, the whole sum of all usability problems encountered was split between the questionnaire subscales. Finally, the Pearson correlation between the number of problems and the mean score of the subscale was calculated for both prototypes (a computational sketch of this step is given at the end of this subsection). The application of Ollermann's method yielded less promising results: for prototype A the correlation between the usability problems encountered and the arithmetic mean scores of the ISOmetricsSDD subscales was r = -0.259 (N = 6, p = .310). For prototype B this correlation was r = 0.020 (N = 6, p = .485).2
2 N in this case is not the number of participants but the number of subscales used.
Correlation of ISOmetricsSDD and metric criteria
The Pearson correlation between the score differences of the ISOmetricsSDD and the differences in errors committed was statistically not significant, with r = -.11 (N = 22, p = .66). The Pearson correlation between the score differences of the ISOmetricsSDD and the differences in the time needed to complete all tasks was statistically not significant, with r = -.29 (N = 22, p = .19). The Pearson correlation between the (A-B) difference in the number of hints (the number of tasks that could only be completed with the instructor's help) and the score differences of the ISOmetricsSDD was r = .386 (N = 22, p = .076).
Correlation of AttrakDiff and the preference item
A single item was intended to provide a criterion for the validity of the AttrakDiff questionnaire. The item asked the participant to describe the degree of superiority of the better prototype over the weaker one using a five-point Likert scale. The score was Pearson-correlated with the difference of the AttrakDiff sum scores (not-preferred prototype minus preferred prototype). The result was r = -.44, statistically significant with p = .04 (N = 22), which, due to the value coding, shows that those participants who perceived their favourite to be superior to a great extent also yielded a higher difference in the AttrakDiff sum scores, pointing in the same direction.
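The subscale-level step in Ollermann's procedure reduces to a small computation. A sketch in Python follows; the problem counts and subscale means are hypothetical placeholders, not the study's data:

from scipy import stats

# Hypothetical problem counts mapped to the six ISOmetricsSDD
# subscales and the corresponding mean subscale scores; replace
# with the real category counts and questionnaire means.
problem_counts = [11, 6, 9, 5, 4, 5]
subscale_means = [4.1, 4.3, 3.9, 4.2, 4.0, 4.4]

# If the questionnaire were valid, r should be clearly negative:
# more problems on a subscale should go with lower mean scores.
r, p = stats.pearsonr(problem_counts, subscale_means)
print(f"r = {r:.3f} (N = {len(problem_counts)}), p = {p:.3f}")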
5 Discussion of the Results
The results of the survey among usability experts show that there are no severe problems concerning the content of the ISOmetricsSDD items. They apparently represent quite well the ISO definition of the different constructs describing the usability of dialogue systems. However, the correlation between the numbers of problems assigned to each subscale and the mean scores of the subscales does not support these validity assumptions. Correlations based on N = 6 should not be over-interpreted, and significance cannot be expected in every case. However, looking at the whole correlation matrix, one finds that the Pearson correlation between the ISOmetricsSDD scores of prototype A and prototype B was r = 0.416 (N = 6, p = .206) and that the problem counts of A and B correlated at r = 0.627 (N = 6, p = .091). This indicates that the data are not totally random: there are coherences between the ISOmetrics scores and between the problems found for the two prototypes. So the question is: why do the sum scores of the subscales not correlate with the problem counts, and why do the overall sum scores not correlate with the most objective measures of usability – user mistakes and time to complete? There is simply no match between the usability problems and the questionnaire results. In Ollermann's study, the first correlation coefficient found was r = 0.277. As one subscale seemed to be responsible for this low result, it was eliminated, causing the correlation to jump to r = 0.756 (p = .019) (Ollermann, 2004). This procedure did not seem acceptable in the present study, because for the two prototypes different subscales depressed the correlation. Even more disappointing were the correlations of the ISOmetricsSDD scores with the number of errors and with the time to complete.
The AttrakDiff questionnaire yielded promising results. Given that it was validated using just one item, resulting in a possibly low reliability of the criterion, a correlation of r = -.44 can be considered sufficiently high to indicate that the results of the questionnaire do more or less reflect the constructs named in the respective theory (see Hassenzahl et al., 2003).
As a consequence of these findings, it was assumed that the ISOmetricsSDD instrument had in this case not been measuring the system's usability. What did it measure instead? It was presumed that the ISOmetricsSDD had failed because it tried to make usability experts out of the users. Even for the authors of the study, assigning the encountered usability problems to the questionnaire's subscales was a difficult task. Expecting a user to remember all the problems he encountered and to correctly map them to questionnaire items seems impossible, especially if the user is asked to do so after testing an unknown system for 90 minutes. Most probably, the test participants rather rely on their general perception of the system, on the emotional substrate of their recent experiences. Two findings support this presumption:
1. The mean scores of the different subscales were quite similar. For prototype A the standard deviation of the subscale means is SD = 0.15, for prototype B it is SD = 0.32, which seems small for a five-point Likert scale. Ives, Olson and Baroudi (1983, as cited in Hartson et al., 2000) report that participants tend to fill in satisfaction questionnaires quite homogeneously. This might also apply to questionnaires like the ISOmetricsSDD.
2. The correlation between the differences of the AttrakDiff scores (A-B) and the differences of the ISOmetricsSDD scores (A-B) was r = 0.81 (N = 22, p < .001). This suggests that the ISOmetricsSDD primarily measured the emotional value that the participants assigned to the system, closely linked to what is called “satisfaction”.
6 Conclusions
6.1 Concerning the Findings of the Study
When confronted with a system for the first time, users are probably unable to remember the usability problems they encountered and to cluster them correctly, producing a valid score on all the subscales of a questionnaire like the ISOmetrics. Participants rather seem to use the instrument to convey their overall satisfaction with the system to the test instructor. Therefore, the use of questionnaires focusing on different categories of usability problems is not advisable in certain test designs. According to the findings of this study, questionnaires like SUMI, QUIS and ISOmetrics need to be used carefully.
6.2 Concerning Multi-level Validations
The aim of this article and the work presented here is to encourage usability experts to evaluate their measurement instruments with a method similar to this multi-level approach. This approach of course has a downside, which is the small number of participants. In the case described above, only three usability experts were interviewed, only two prototypes were used, only 22 participants went through the evaluation process, and only six subscales of the ISOmetrics were taken into account. Furthermore, aspects like the assignment of the encountered usability problems to certain scales can always be questioned. Finally, it could be argued that the changes made to the questionnaire (e.g., the deletion of items) had a detrimental effect on the validity of the whole instrument. The results of such a study may hence seem less apt for publication than the results of large validation studies carried out with hundreds of participants. The advantage of this method is that, without incurring unreasonable costs in money or time, it combines different forms of validation and collects information that is usually simply lost. Ultimately, the question is whether a usability practitioner's primary interest is to win a scientific argument and publish results, or just to get a hint on whether a certain tool is recommendable for the planned task. In the latter case, the common perception of usability evaluation itself would also apply to the evaluation of the assessment tools: little and possibly unreliable information is better than none (see Nielsen, 1993). So if it is true that not just the validity of a questionnaire in the strict sense but, more generally, the value gained from its results depends on the product tested, the users, and other context parameters, then the method promoted here becomes recommendable.
References
1. ISO 9241, Ergonomics of human-system interaction. International Organization for Standardisation (1998)
2. Gediga, G., Hamborg, K.-C.: IsoMetrics: Ein Verfahren zur Evaluation von Software nach ISO 9241/10. In: Holling, H., Gediga, G. (eds.) Evaluationsforschung, pp. 195–234. Hogrefe, Göttingen (1999)
3. Hamborg, K.-C.: Gestaltungsunterstützende Evaluation von Software: Zur Effektivität und Effizienz des IsoMetricsL Verfahrens. In: Herczeg, M., Prinz, W., Oberquelle, H. (eds.) Mensch & Computer 2002, pp. 303–312. B.G. Teubner, Stuttgart (2002)
4. Hartson, H.R., Andre, T.S., Williges, R.C.: Criteria for Evaluating Usability Evaluation Methods. International Journal of Human-Computer Interaction 13(4), 343–349 (2001)
5. Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität. In: Ziegler, J., Szwillus, G. (eds.) Mensch & Computer 2003, pp. 187–196. B.G. Teubner, Stuttgart (2003)
6. Hassenzahl, M.: Interaktive Produkte wahrnehmen, erleben, bewerten und gestalten. In: Thissen, F., Stephan, P.F. (eds.) Knowledge Media Design – Grundlagen und Perspektiven einer neuen Gestaltungsdisziplin. Oldenbourg Verlag, München (2004)
7. Nielsen, J.: Usability Engineering. Morgan Kaufmann, San Francisco (1993)
8. Ollermann, F.: Verhaltensbasierte Validierung von Usability-Fragebögen. In: Keil-Slawik, R., Selke, H., Szwillus, G. (eds.) Mensch & Computer 2004: Allgemeine Interaktion, pp. 55–64. Oldenbourg Verlag, München (2004)
What Do Users Really Do? Experience Sampling in the 21st Century
Gavin S. Lew
User Centric, Inc.
2 Trans Am Plaza Dr, Ste 100, Oakbrook Terrace, IL 60181, USA
glew@usercentric.com
Abstract. As practitioners we spend a great deal of effort designing and testing products within the confines of usability testing labs when we know that a rich user experience lies outside. What is needed is more research in “the wild” where people use the very interfaces we take so much time to design, test, iterate, and develop. Through innovative advancements in mobile technology, we can expand upon the tried and true “experience sampling” research techniques, such as diary or pager studies, to effectively solicit, monitor and receive data on users’ interactions at given points in time. This paper describes various research methodologies and recent advancements in mobile technology that can provide practitioners with improved research techniques to better assess the user experience of a product. The conference presentation will also include results from a pilot experience sampling method study focused on collecting data on usage and satisfaction of a product. Keywords: Experience sampling, in-situ research, mobile device research, pager study, diary study, mobile research, SMS studies.
2 Common Research Techniques
There are a number of common research techniques employed to understand the user experience of a product. These methods range in difficulty from easy to challenging, but each provides insight into different aspects of the user experience.
2.1 Usability Testing
Usability testing with users is a critical component of any user-centered design process. Traditional usability testing involves task-based research in the lab where designs can be tested, iterated and validated. Within the confines of this controlled environment, this methodology is ideally suited to assess usability in a highly tactical and specific manner. Outcomes include answers to specific design questions. Usability testing is critical to product success because we must ensure that the core features are usable. However, the focus of usability testing on tasks is also a limitation, because the lens tends to target the “walk-up-and-use” user experience of the product. Session time is often limited, and the user experience typically does not involve a user interacting with a device that he/she actually owns. As practitioners and designers, we accept the lack of external validity because of the benefits of usability testing to formative and iterative design. We apply the insights uncovered in the lab to the design and hope that they generalize to how the product is actually used in the real world. However, we understand that the usability, usage, and usefulness of a product are determined over time and not necessarily in the first hour of use in the lab setting.
2.2 Surveys and Focus Groups
Often the data provided to describe the “real world” user experience are obtained through survey or focus group methodologies. While these research methods are quite useful for early-stage questions of feature importance, pricing, or intent to purchase, using this information for design is challenging. Results tend to be at a high level, and we often need more tactical direction to meaningfully influence some of our design decisions. Even when these methods are directed toward answering design questions, the obtained data are largely retrospective in nature. We know that asking users to reflect on tasks done in the past is not as robust or credible as asking the same question during or immediately following the completion of the task. Satisfaction metrics can be obtained in surveys, but they would be much more useful when captured as close to the actual usage instance as possible (e.g., gathering satisfaction data after completing a task rather than asking in a focus group or survey months after the experience occurred). The benefit of a short latency between the action and the satisfaction request is more than simply measurement integrity: specific feature and functionality questions can be asked immediately after use to acquire more insightful and relevant feedback with direct impact on design.
2.3 Ethnographic Research
One method that avoids retrospection, and any associated confabulation due to the long latency between action and question, is ethnography. It involves observing user behaviors in a natural environment. However, there are obvious challenges that prevent its widespread use as a research technique. The setup and logistics necessary to observe natural behaviors are difficult (consider, for example, trying to observe mobile devices, where screens are small and interactions are very rapid). Fieldwork and analysis can be time-consuming. Sample sizes are often small. And most importantly, the likelihood that the output of the study will be actionable is low relative to more direct and tactical techniques such as usability testing. Because ethnography is best suited to uncovering insight that drives ideation rather than answering direct design questions, securing authorization and budget to conduct ethnographic research can be difficult. What cannot be refuted, however, is that ethnographic research collects data in the environment where interactions occur and with products used by the users.
2.4 Longitudinal Research
Longitudinal research captures data from users over time. With its foundations in developmental psychology, this methodology has been largely observational in nature, using correlational analysis to assess phenomena. However, the longitudinal approach has applicability to user experience research. While usability testing can be seen as tapping the user experience just once, a study could be extended to make multiple, repeated assessments of the same set of users over time. The study could have users perform tasks and provide feedback; thus, learning can be an area of interest. Moreover, the methodology can assess how the user adapts to and uses the product during critical periods of its lifecycle. Longitudinal research is compelling, as it often involves fieldwork in a naturalistic environment with the benefit of a more structured data collection technique. Questions, tasks, and observations can also be very design-focused and tactical. Moreover, it fills the post-walk-up-and-use gap left open by a usability testing methodology. In short, longitudinal research offers access to the daily user experience of a product. Consider a mobile phone. Usability testing can assess the usability of core functions, such as the ability to add a contact, or determine whether or not there is sufficient affordance to use a specific keypad button to complete tasks. The problem is that when usability issues are uncovered, it is impossible to know whether a feature that was difficult in usability testing can be learned and become second nature over time, or will be left unused because users could not learn it. Information about how users interact with products over time is thus extremely valuable. Longitudinal methods can provide information about a product in the hands of users. Because assessments are made over time, the technique can capture how the user learns to use the product. Given the potential of longitudinal research, why is it NOT widely used? At the 2007 CHI (ACM-SIGCHI) conference in San Jose, a new special interest group (SIG) on longitudinal research was formed. Most interestingly, only 25% of the attendees of this SIG had actually conducted a longitudinal research study in the last couple of years. Possible reasons why longitudinal research is rare include:
A. Long timelines: The business challenge of a research project where data collection is stretched over time makes longitudinal research compete with “just in time” or “we need the data last week” research alternatives.
B. Cost: Building a user panel where users are tapped for an extended period has a high cost and high panel attrition. Since timelines can extend across multiple product releases, with benefits to different business groups, it is unclear which group should be charged for the study. Securing funding is inherently more difficult.
C. Complex logistics: Study design and execution have a high initial setup cost, because every aspect of the study must be coordinated. Any repeated measures technique requires allocating resources to manage the study activities for an extended period of time.
D. High effort: Data collection requires high effort from both researchers and users, who must participate across multiple data sessions. Alternatively, data come in the form of written diaries, where the coding process is non-trivial.
E. Difficult analysis: Analyzing the large amount of data collected can be time consuming, as data are essentially multiplied by the number of repeated measures.
2.5 Need for an Alternative Method
If usability testing captures walk-up-and-use usability, ethnographic research gets us into the field, and longitudinal research can reveal how users learn, what still seems to be lacking is usage and motivation. Consider the mobile phone example again. Manufacturers and mobile service providers know that a call was made and how long it lasted. What is unknown, however, is whether the user called “John” from their contacts or dialed the number directly. In terms of designing features, researchers and designers are blind as to whether the user ever entered John into their contact list or what motivates the user to even use the feature. All too often, once launched, the product becomes a mysterious “black box,” and we do not know how users use the product or feature that took so much effort to design.
3 Experience Sampling Method
Experience sampling method (ESM) refers to in-situ (Latin for “in place”) research, where the phenomenon is examined in the place where it occurs. The methodology was developed in 1977 at the University of Chicago by Csikszentmihalyi, Larson and Prescott [1] to understand the experience of adolescent activities, but its applicability to other areas of user experience is clear. ESM is more commonly referred to as a “pager study,” where users are asked to provide information via a diary. Users are prompted to enter information by a “page” sent to a device (e.g., “What are you doing now?”), and participants enter data into a paper diary. Prompting can be controlled by a researcher or scheduled at specific intervals. The data can be analyzed to understand user activity, motivation, and other cognitive and social dimensions. This methodology can be used to assess how users use products.
3.1 ESM Coupled with Advanced Mobile Technologies
It would be great if the product could tell us how it is being used, but that is not necessarily practical, nor does it provide the rich user experience as interpreted and provided by users. Imagine if a technology could retain the tactical and rigorous elements of “in-lab” research while capturing the richness and environmental cues associated with more natural settings. What if the satisfaction data were not retrospective, but closely tied to user behavior and actions? Through innovative advancements in mobile technology, researchers can now expand upon longitudinal and experience sampling research techniques to effectively solicit, monitor and receive data on users' interactions at given points in time. These advancements tap directly into both the application and the operating system to provide the building blocks to take user experience research to new levels.
3.2 Using Mobile Technology to Capture Data
Mobile device technology has advanced to a level where research can be more complex than simply paging users to ask them to write passages in a diary. The mobile device itself can be the conduit between the user and the researcher. Imagine what research areas would open up if practitioners could conduct studies on a robust platform that prompts the user, collects data both from the user and from the device itself, and handles logistics (e.g., compensation). Moreover, what if the device is the participant's own personal mobile device? With full QWERTY keyboards on mobile phones, one can readily imagine feedback in the form of free-form text responses. Considering the abilities of the youth of today, who can type 40 words per minute on a 12-key numeric keypad, the tremendous data collection benefit of a phone over diary input is easy to envision. In addition, the device can be leveraged as a powerful remote data collection tool, where the areas under investigation can be anywhere a user could go with a mobile device at their side. This opens up novel forms of research never before possible without specialized equipment designed specifically for the study. Using mobile devices, user input and feedback extend beyond making a simple selection or answering a series of questions. Users could speak their response and have it recorded. They could also respond by taking a picture or recording a video of their experience. The remote capabilities of a mobile device as a research tool create a wealth of research opportunities. LEOtrace Mobile™ is a mobile technology that uses ESM to obtain data [2]. It runs on Windows Mobile 6, Symbian Series 60, and RIM BlackBerry devices. The user input and device information that can be collected are shown in Table 1.
Table 1. Types of data that can be collected from ESM using LEOtrace Mobile™
User-provided data:
A. Open-ended feedback
B. Scaled feedback (binary, Likert-scale, slider ratings)
C. Image selection
D. Voice recording
E. Camera image
F. Video clip
Device-provided data:
A. Task completion (success/fail)
B. Event (app start/end, SMS sent, picture taken, etc.)
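As a sketch of how such records might be modeled on the client side, the following Python data class mirrors the two categories in Table 1; all class and field names are our own illustration, not LEOtrace Mobile's actual schema:

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class Source(Enum):
    USER = "user"      # feedback entered by the participant
    DEVICE = "device"  # events logged by the handset itself

@dataclass
class ESMRecord:
    participant_id: str
    timestamp: datetime
    source: Source
    prompt: Optional[str] = None         # question shown, if any
    text_response: Optional[str] = None  # open-ended feedback
    rating: Optional[int] = None         # scaled feedback value
    media_path: Optional[str] = None     # voice, image, or video file
    event: Optional[str] = None          # e.g., "sms_sent", "app_start"
    task_success: Optional[bool] = None  # task completion outcome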
3.3 Event or Behavior Triggers
Research using new mobile technologies can be further enhanced by analyzing user behaviors and feature usage to trigger prompts for user feedback. In this case, the user's own actions are of interest, and the behavior itself prompts the device to ask specific questions about the behavior captured. This differs from contrived tasks set up by a researcher for the user to complete. Algorithms can be designed to watch for specific situations that trigger research questions, so feedback can be obtained very close to when the behavior happened (a sketch follows at the end of this section).
3.4 Other Mobile Technologies
This paper describes various research methodologies and recent advancements in mobile technology that can provide practitioners with improved research techniques to better assess the user experience of a product. Besides LEOtrace Mobile™, several other technologies are available – from those that sit on old Palm Pilots to those that run on the latest mobile devices, and from techniques involving simple SMS text messages asking for feedback to web surveys solicited via phone-based email or messaging – that can be used to solicit data from users. As practitioners, the potential of remotely capturing user interactions in an ecologically valid manner, while extending beyond walk-up-and-use usability, is compelling. Experience sampling techniques can further our design practice by yielding more insight into user motivation, usage, and learning. The implications for future research are vast, given the capability to more efficiently and remotely monitor user behavior and perception “as it happens.”
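A minimal sketch in Python of the event-triggered prompting described in section 3.3 is given below; the event names, trigger rules, and function signatures are purely illustrative assumptions and do not reflect LEOtrace Mobile's actual API:

import time
from typing import Callable, Iterable

# Map device events of interest to the question they should trigger.
TRIGGERS = {
    "sms_sent": "You just sent a text message. How easy was that?",
    "picture_taken": "You just took a picture. What prompted it?",
    "call_ended": "Did you dial directly or pick the number from your contacts?",
}

def sample_on_events(events: Iterable[dict],
                     prompt_user: Callable[[str], str]) -> list:
    """Watch a stream of device events and prompt on each trigger."""
    responses = []
    for event in events:
        question = TRIGGERS.get(event.get("name"))
        if question is None:
            continue  # not a behavior of interest
        # Ask immediately, so the answer is tied to the behavior
        # rather than being retrospective.
        answer = prompt_user(question)
        responses.append({"event": event, "question": question,
                          "answer": answer, "asked_at": time.time()})
    return responses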
4 ESM Study Findings
The conference presentation will include findings from an ESM study. Device usage and satisfaction data will be presented from a four-week study with a participant sample size of 100. Participants will use mobile devices they presently own. Software will be loaded on the devices to passively monitor usage. Users will also be asked to perform specific tasks. Success and failure will be reported, with user feedback on their experience and satisfaction using device features.
Acknowledgments. Many thanks to the development team at Nurago (www.nurago.com) for developing the LEOtrace Mobile™ software used in this research. The user experience research teams at both User Centric, Inc. (www.usercentric.com) and SirValUse Consulting GmbH (www.sirvaluse.com) deserve credit, as their insight was essential to the research approach and execution of this study.
References
1. Csikszentmihalyi, M., Larson, R., Prescott, S.: The ecology of adolescent activity and experience. Journal of Youth and Adolescence 6, 281–294 (1977)
2. Lew, G.S.: The truth is out there: Using mobile technology for experience sampling. User Experience 7(3), 8–10 (2008)
Evaluating Usability-Supporting Architecture Patterns: Reactions from Usability Professionals
Edgardo Luzcando, Davide Bolchini, and Anthony Faiola
Indiana University Purdue University Indianapolis, School of Informatics - HCI
{eluzcand,dbolchin,afaiola}@iupui.edu
Abstract. Usability professionals and software engineers approach software design differently, which creates a communication gap that hinders effective usability design discussions. An online survey was conducted to evaluate how usability professionals react to Usability-Supporting Architecture Patterns (USAPs) as a potential way to bridge this gap. Members of the Usability Professionals Association (UPA) participated in a pretest-posttest control group design experiment in which they answered questions about USAPs and software design. Results suggest that participants perceived USAPs as useful to account for usability in software architectures, recognizing the importance of the USAPs' stated usability benefits. Additionally, results showed a difference in perception of the USAPs' stated usability benefits between US and European participants. A better understanding of what the usability community thinks about USAPs can lead to their improvement as well as increased adoption by software engineers, which can lead to better integration of usability and HCI principles into software design. Keywords: Architecture Patterns, HCI, Usability, Usability Professionals, Software Design, USAP.
“covering up ill-suited infrastructure features with interface veneer, but there are limits to how far this can take us.” [6] He argues that infrastructure and interaction features need to be designed jointly, not ad hoc. To address this challenge, Usability-Supporting Architectural Patterns (USAPs) have recently been proposed as a strategy to systematically embed usability requirements in the early design of software architectures [5]. USAPs are a blend of Human-Computer Interaction (HCI) and Software Engineering (SE) principles that provide a framework for designing recurrent software and user requirements (e.g., providing the user a way to undo or cancel operations; see the sketch at the end of this section). USAPs are enriched with indications of how these requirements may impact the components of the system architecture, and with examples of how to deal with them at this level. The foreseen benefits of leveraging USAPs in software design are many, including: (a) the opportunity for software engineers to consider and take into account the needs of the user experience when making strategic architectural decisions; (b) a shared language for usability professionals and software engineers to discuss design decisions in the light of both system and user requirements; and (c) reusable solutions (patterns) that capitalize on previous design expertise. Initial studies suggest that USAPs are effective when applied by software engineers [7]. However, little is known about the understanding and acceptance of USAPs by usability professionals [8, 9]. Acknowledging the proposal of USAPs as an important step towards bridging the communication gap between software engineers and usability professionals, we conducted a study aimed at assessing the perceived value of USAPs among the community of usability professionals. There is a risk, in fact, that the original value of USAPs (improving mutual understanding) may be weakened amongst usability professionals by the way USAPs are proposed and described: still using concepts, terminology and notation familiar only to software engineers. The study consisted of a focused online survey administered to usability professionals and was based on the following multi-part hypothesis:
H.1 - Usability professionals can perceive Usability-Supporting Architecture Patterns as relevant in their everyday work.
H.2 - Usability professionals consider the usability benefits of Usability-Supporting Architecture Patterns important for their everyday work.
H.3 - If Usability-Supporting Architecture Patterns are communicated in more natural HCI terminology to usability professionals, they can better appreciate the value of Usability-Supporting Architecture Patterns in their everyday work.
The remainder of the paper is organized as follows. Section 2 describes the methods and instrument used to conduct the experiment. Section 3 presents the qualitative and quantitative results. Section 4 covers the discussion of the findings, and Section 5 summarizes the paper with concluding statements.
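As a purely illustrative sketch (not taken from the USAP specification itself), the following Python fragment shows the kind of architectural responsibility a "cancel a command" requirement makes explicit: long-running work must be interruptible from the UI and must restore the prior state.

from threading import Event

class CancellableCommand:
    def __init__(self):
        self._cancel = Event()

    def run(self):
        for step in range(100):
            if self._cancel.is_set():
                self.rollback(step)   # restore the pre-command state
                return
            self.do_step(step)        # one unit of interruptible work

    def do_step(self, step):
        pass  # application-specific work

    def rollback(self, completed_steps):
        pass  # application-specific undo of completed work

    def cancel(self):
        # Callable from the UI thread, so the user always has a
        # responsive way out - the usability requirement at stake.
        self._cancel.set()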
2 Methods and Instruments
2.1 Participants
This study surveyed a convenience sample of usability professionals from the Indianapolis Usability Professionals Association (UPA) and the Swiss UPA. The sample included approximately 80 participants who have academic training in HCI, HCI professional experience, or both. The study did not differentiate between professionals and students, but it was expected that most participants would have some degree of professional experience in HCI or related fields, given their involvement with the UPA.
2.2 Survey Design
The study is based on a mixed-methods research design to analyze an area where little research has been conducted, following a Concurrent Triangulation Strategy [10], with the quantitative data given higher priority during the analysis. The quantitative portion of the experiment used a Pretest-Posttest Control Group Design [11] with a classic between-subjects design in which participants are randomly assigned to one of two groups during the data collection phase. Participants in the experiment group received a treatment in the form of specific USAP materials consisting of a software design scenario and a USAP example. Participants in the control group did not receive the treatment. A questionnaire format was used for the pretest as well as the posttest, including both quantitative and qualitative questions. Demographic information was solicited after the questionnaire, in addition to the opportunity to provide additional comments. The survey questions were created leveraging survey design techniques from Dillman [12], and several questions were constructed based on previous questions from Schuman and Presser [13] used to survey attitudes. The online survey was built from scratch with a combination of PHP and MySQL technologies available at IUPUI. All data were collected and stored on university infrastructure.
2.3 Procedure
The survey introduction provided a brief history of the desire to improve usability in software products. Participants were then given pretest questions to record their existing knowledge and experience. Following the pretest, the treatment introduced USAPs (to the experiment group) and explained how leveraging USAPs could facilitate the communication between usability professionals and software engineers. The treatment provided a software design scenario (canceling a command) describing the communication challenges regarding usability in software design and presented a USAP example. During the posttest, participants were asked to rate the importance of USAP usability benefits from an HCI perspective using a Likert scale. This was done with the nine original USAP usability benefits as well as with a newly worded set of usability benefits, meant to find out whether different terminology would improve acceptance. Although all nine USAP usability benefits were rated, the study focused on two USAP usability benefits, Accelerates error-free portion and Reduces the impact of slips, as shown in Table 1.
Table 1. USAP Usability Benefit Comparison
Original Wording | New Wording
Accelerates error-free portion | Increases efficiency
Reduces the impact of slips | Reduces the impact of errors
An initial pilot study and conversations with HCI peers had suggested that these two benefits used terminology that was confusing to a usability professional. Additional posttest questions explored further perceptions about USAPs and software design, asking participants to state their opinions about USAPs and their potential applications in practice. The survey was designed to flow as one continuous questionnaire, with participants unaware of the distinction between the pretest questions and the posttest questions.
3 Results
From the convenience sample of 80 usability professionals, 67 participants began the survey, 49 completed the pretest, and 45 completed the posttest. Of the 45 participants who completed both the pretest and the posttest, only the results of 35 participants were complete; these are summarized in this section. There were 17 participants in the experiment group and 18 in the control group; 15 were from the Swiss UPA (Region 1) and 20 from the Indianapolis UPA (Region 2). Of the 34 participants who provided demographic information, 20 had a master's, doctorate or post-graduate degree, 12 had a bachelor's degree, and 2 did not have any degree. Of these, 25 reported six or more years of experience. When asked to what extent they agreed that usability is an important aspect of software design, all 35 participants agreed, and when asked if they had worked in close contact with software engineers, 28 of 35 participants agreed. When asked to what extent they agreed that USAPs would assist usability professionals in identifying usability concerns that impact the architecture of a software system, 23 of 35 participants agreed (66%). When asked if they found it challenging to apply usability principles in software design projects, 30 of 35 participants answered yes, and when asked if there is a communication gap between usability professionals and software engineers, 33 of 35 participants answered yes. Additionally, participants volunteered comments about the existence of a communication gap between usability professionals and software engineers, as summarized in Table 2. When participants were asked if they were familiar with any methodologies that would improve communication between usability professionals and software engineers, 21 of 35 answered yes (60%). Those participants who answered yes were asked to list the known methodologies to substantiate their quantitative answer; their responses are summarized in Table 3. Participants were asked to rate the importance of the original USAP usability benefits as well as the newly worded versions using the following scale: Very Important = 1, Important = 2, Somewhat Important = 3, Not Important = 4, and Don't Know = 5. The Don't Know answers were filtered out. The results are summarized in Fig. 1 using the following weighted average: Very Important = 16, Important = 12, Somewhat Important = 8, and Not Important = 4.
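One plausible reading of this weighted average, in which each response category is recoded to the stated weight and the recoded values are averaged over the valid answers, can be sketched as follows; the response counts are placeholders, not the study data:

# Recode each rating category to its stated weight and average
# over the valid (non-"Don't Know") responses.
WEIGHTS = {"very_important": 16, "important": 12,
           "somewhat_important": 8, "not_important": 4}

# Placeholder response counts for one usability benefit.
counts = {"very_important": 14, "important": 12,
          "somewhat_important": 6, "not_important": 3}

total = sum(counts.values())
weighted_avg = sum(WEIGHTS[k] * n for k, n in counts.items()) / total
print(f"weighted average = {weighted_avg:.1f}")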
Table 2. Identified Reasons for the Communication Gap between Groups1

Answer | Identified Issue | Participants
Yes | Knowledge: software engineers only know software development and usability professionals only know usability; they don't know each other's disciplines. | 5
Yes | Core focus in project: software engineers focus on getting all system parts to work, and usability professionals focus only on system parts that impact the user interface. | 7
Yes | Mutual understanding: both groups struggle to understand each other's needs. | 4
Yes | Awareness: software engineers have not been exposed to usability and usability professionals have not been exposed to software engineering. | 2
Yes | Process: the software design process may or may not include usability. | 1
Yes | Availability of usability people: not all projects benefit from the participation of usability professionals. | 2
Yes | Stated there is a gap, but did not elaborate on the reason. | 2
No | No gap | 1
Table 3. Reported Methods to Improve Communication2

Listed Methods | Participants
MILE+ | 2
Open communications (e.g. meetings, workshops) | 10
AWARE | 1
HCI-driven methodologies | 1
Using prototypes and mockups | 3
Software development methodologies | 6
Conceptual Comics | 1
An independent groups t test was used to test the difference in the mean rated importance of the target USAP usability benefits Accelerates error-free portion and Reduces impact of slips. Respondents from Region 2 (M = 1.76) showed a lower mean response than those from Region 1 (M = 2.29); t(30) = 2.09, p < .05, r = .36. The rating of USAP usability benefits also collected qualitative data by asking participants to provide comments if any of the USAP usability benefits were not clear to them. The targeted USAP usability benefits Accelerates error-free portion and Reduces Impacts of Slips received the most comments, mostly about ambiguous meaning and unfamiliar language. The other (non-targeted) USAP usability benefits did not receive similar comments.
1 Included five additional responses from the pretest that were not part of the 35 clean data sets.
2 Included three additional responses from the pretest that were not part of the 35 clean data sets.
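For reference, the reported effect size follows directly from the t statistic: for an independent groups t test,

\[ r = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{2.09^2}{2.09^2 + 30}} \approx .36 \]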
When asked whether they found that leveraging USAPs would be useful for their software design activities, 24 of 35 participants agreed. However, there was a directional difference between the control group and the experiment group: of the 24 that agreed, 15 were from the control group and 9 were from the experiment group. Compared to the control group, the experiment group showed an increase from 0 to 6 participants selecting the no-opinion choice. When asked if there is a communication gap between usability professionals and software engineers, 29 of 35 participants agreed. When participants were asked how likely it would be for them to go and learn more about USAPs after completing the survey, 25 of 35 agreed.
4 Discussion
H.1 predicted that usability professionals expect to benefit from Usability-Supporting Architecture Patterns in their everyday work. During the pretest, 66% of the participants agreed that USAPs could enable usability professionals to identify usability concerns that impact the architecture of a software system. However, it is unclear why 66% agreed, because no participants reported a priori knowledge of USAPs, and of the 60% that reported knowing methodologies to improve this gap, none reported USAPs. One possible explanation for this result could be that the term “usability-supporting” along with “architecture patterns” could lead to an implicit belief that USAPs are beneficial. In the posttest, 68% of the participants reported USAPs as useful for software design activities based on what they had learned in the survey. However, agreement was directionally different between the control group (62%) and the experiment group (38%). This difference could stem from the participants' comfort in selecting the no-opinion choice, which could itself be an effect of receiving the treatment. It is possible that after participants received the treatment and were exposed to the USAP scenario, they did not understand its purpose or were perhaps confused by the presentation of the materials. For example, it could be that the USAP scenario of canceling a command did not easily apply to their experience and therefore did not add clarity about the usefulness of USAPs. Conversely, it is possible that participants who did not receive the treatment and did not see the USAP materials were able to imagine (or construct) their own idea of what USAPs are, which in their view might be more effective than the actual USAPs. However, no effect of the treatment was found in the difference between pretest and posttest (p > .10).
H.2 predicted that usability professionals can perceive the importance of using Usability-Supporting Architecture Patterns for their everyday work. During the pretest, 100% of the participants acknowledged that usability is an important aspect of software design, and 86% of participants acknowledged that they had previously found it challenging to apply usability principles in a software design project. This suggests that participants understood the importance of usability in software design and the challenges of applying usability principles therein. Hence, the fact that 71% of participants responded that they would likely investigate USAPs further and learn more about them is a potential indication of their usefulness. However, it is possible that the perceived importance of USAPs results from recognizing that any technique to improve usability is innately important to usability professionals. This study did not analyze this further.
H.3 predicted that if Usability-Supporting Architecture Patterns are communicated to usability professionals in more natural HCI terminology, they can better appreciate the value of Usability-Supporting Architecture Patterns in their everyday work. We predicted that participants who received the treatment would rate USAP usability benefits as more important, since they had (in the treatment) been exposed to a positive introduction of USAP usability benefits and their potential use in software design. The effect of the treatment on the ratings was non-significant (p > .10). When contrasting the control group with the experiment group, the targeted USAP usability benefits Accelerates error-free portion and Reduces impact of slips exhibited an 18% reduction in rated importance when compared to their newly worded counterparts, Increases efficiency and Reduces the impact of errors. However, no significant effect was found for the treatment (p = 0.63). An unexpected yet interesting result of the experiment was that participants in Region 1 (Europe) responded differently than those in Region 2 (US) when rating the importance of the target USAP usability benefits Accelerates error-free portion and Reduces impact of slips.
US usability professionals rated the target USAP usability benefits as more important than European usability professionals did, a potential indication that USAPs are more difficult for European usability professionals to understand than for their US counterparts.
5 Conclusion
This study suggests that usability professionals' initial perception of USAPs is positive. Participants agreed that USAPs are relevant for considering usability concerns in software design, and that usability professionals recognize there is a communication gap with software engineers. However, exposure to USAP materials did not conclusively affect their perception of USAPs. The study suggests that usability professionals generally accept the notion of USAPs without understanding USAP details. This effect was more prominent for US participants in the study, in contrast with their European counterparts. More studies would need to be performed to evaluate additional characteristics of USAPs and their potential acceptance by usability professionals.
Acknowledgments Thanks to Dr. Mark Pfaff for his guidance in conducting the statistical analysis for several parts of this study.
Heuristic Evaluations of Bioinformatics Tools: A Development Case
Barbara Mirel and Zach Wright
University of Michigan
{bmirel, zwright}@umich.edu
Abstract. Heuristic evaluations are an efficient, low-cost method for identifying usability problems in a biomedical research tool. Combining the results of these evaluations with findings from user models based on biomedical scientists' research methods guided and prioritized the design and development of these tools and resulted in improved usability. Incorporating heuristic evaluations and user models into the larger organizational practice led to increased awareness of usability across disciplines.

Keywords: Usability, heuristic evaluation, biomedical research, organizational learning, user models.
problems in the results, set priorities for fixes, and raise developers’ awareness of user needs beyond surface fixes to better build for users’ cognition in scientific analysis. Our outcomes have been positive. We argue that for our bioinformatics tools, positive results hinge on combining domain-based, user-informed heuristic evaluations with organizational processes that break down boundaries isolating usability from development, modification request decisions, and UI design.
2 Relevant Research
Heuristic evaluations involve "evaluators inspect[ing] a user interface against a guideline, be it composed of usability heuristics or cognitive engineering principles, in order to identify usability problems that violate any items in the guideline" [8]. This method is known to produce many false positives and to likely omit problems related to users' cognitive tasks. It is nonetheless one of the most popular usability assessment methods due to its low cost and efficiency [2]. Thus it is important to improve the effectiveness of HEs without diminishing their benefits. Researchers have found several ways to achieve these improvements: conducting heuristic evaluations with many evaluators, and combining them with evaluator training and reliability testing, increases the effectiveness of HEs [10,12]. Heuristic evaluation results also improve when evaluators have prior knowledge of usability and the tool; when heuristics are adapted to domain tasks and knowledge; and when HE findings are compared with results from user performance studies [3]. Finally, improvements come from using sets of heuristics that are "minimal" (not overlapping) yet inclusive [10]. For example, some researchers have evaluators jointly consider heuristics and problem areas, thereby assessing against a "usability problem profile" [2]. Establishing an optimal set of heuristics, however, is still a black box. To compensate for elusive "ideal heuristics," many usability researchers advocate integrating findings from user performance studies with HE. Demonstrably, heuristic and user performance evaluations combined uncover more problems than either method does alone. Yet the quality, not just the quantity, of problems is critical. For better quality, some researchers argue that what is missing in Nielsen's standard set of heuristics is that they are not "related to cognitive models of users when they interact with system interfaces" [8]. Cognitively oriented heuristics are especially important when tools support complex tasks. Recent attempts to construct heuristics that address cognition include Gerhardt-Powals' [5] cognitive engineering principles and Frokjaer and Hornbaek's [4] principles based on metaphors of thinking. So far, findings about the superiority of such heuristics have been mixed [4,8]. Running in parallel with these academic efforts, some studies by specialists in production contexts aim to improve the effectiveness of HEs by advantageously combining them with organizational processes. Hollinger [7], for example, reports on positive efforts at Oracle, against great organizational resistance at first, to combine bug reporting processes with HE findings, thereby "mainstreaming" reviews of outcomes. This mainstreaming increased usability awareness across different teams and functional specialties, incited interactive team discussions about usability, initiated tracking of the costs and benefits of usability improvements, and resulted in fixing more usability defects. Moreover, results included "significant improvements in the quality of the user interface" [7].
Exploiting organizational processes is promising but, to the best of our knowledge, few production-context studies report on combining HE with even more organizational processes than Hollinger [7] describes, or on combining organizational processes with the established methods of improving HE outcomes: comparing them with user performance findings, assuring evaluator familiarity with the tools, and adapting heuristics to the task domain.
3 Methods
Our methods aim to achieve the same effectiveness with HE that other researchers seek by combining it with other factors. Due to resource constraints we could not conduct extensive training of evaluators or involve numerous evaluators. We could, however, get several evaluators familiar with the tools, adapt and pilot-test heuristics for our domain and tools, and introduce several new organizational processes. We also introduced the novel process of reframing surface problems found by HEs into more substantial problems based on user models.

3.1 Tools
We report on heuristic evaluations of one open source, web-based query and analysis tool. The tool is the front end for querying our center's innovatively integrated protein interaction database. The query and analysis tool lets users query by gene(s), keyword(s), or gene list and provides tabular query results of relevant genes, attributes, and interactions. The tool is non-visual but links to visualization tools.

3.2 User Task Models
User models were derived from longitudinal field studies of 15 biomedical researchers using our tools and others to conduct their systems biology analyses [9]. These models directed both our adaptations and our interpretations of the heuristic evaluations. The user models are unique in bioinformatics because they capture scientists' higher-order cognitive and analytical flow for research aimed at hypothesizing, not only the lower-level tasks typically studied in usability tests, cognitive walkthroughs, or cognitive task analysis. Specifically, the user models capture moves and strategies for verifying accuracy, relevance, and completeness and for uncovering previously unknown relationships of interest. These tasks involve manipulating data to turn it into knowledge through task-specific combinations of sorting, selecting, filtering, drilling down to detail, and navigating through links to external knowledge bases and literature. Additionally, to judge whether genes and interactions are interesting and credible, scientists analyze high-dimensional relationships and seek contextual cues from which to draw explanatory inferences. Ultimately, they examine conditions and causes in interactive visualizations, tools outside the scope of this article. This empirically derived model of higher-order cognition was critical to adapting standard Nielsen heuristics to our domain and tool.
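Purely as an illustration of the narrowing-down moves these models capture, the following sketch filters and sorts a small, invented result table; the column names and data are not from our tools:

```python
# Illustrative sketch (not from the paper's tools) of the narrowing-down
# moves the user model describes: filter a tabular gene result set by a
# meaningful attribute, then sort to surface interesting relationships.
import pandas as pd

results = pd.DataFrame({
    "gene": ["APP", "MAPT", "SNCA", "GRN"],
    "interactions": [42, 17, 35, 8],
    "pubmed_citations": [120, 45, 98, 12],
})

# Keep well-supported genes, then rank by interaction count.
interesting = (results[results.pubmed_citations > 40]
               .sort_values("interactions", ascending=False))
print(interesting)
```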
3.3 Adapted Heuristics
We adapted Nielsen's standard set of 10 usability heuristics to our domain and to the uses of our tools, accounting for: the presence of external links to multiple data sources and internal links to details; the large amounts of heterogeneous data in result sets; the core need for statistics and surrogates for confidence; and the variety of interactions needed for validating, sensemaking, and judging results.

3.4 Heuristic Evaluations and Evaluators
Three evaluators pilot-tested the adapted heuristics with other query and analysis tools developed by our center to refine their applicability to the domain and users' tasks. One evaluator is trained in usability and visualizations; the other two specialize, respectively, in portal architecture and systems engineering, and in web administration and marketing communications. All were knowledgeable about the tools and moderately aware of users' tasks and actual practices through discussions with the usability director about field study findings. No reliability testing was done due to time constraints. Instead, inter-evaluator differences were analyzed by examining comments entered in the comments field of the instrument. After the heuristic evaluations were conducted, outcomes and comments were summarized and grouped by agreement and severity, and relevant design changes were suggested.

3.5 Integration of Additional Processes
Concurrent with the heuristic evaluations, the following organizational and software development life cycle processes were instituted with enhanced usability in mind:
• Usability categories and severity levels were built into the modification request (MR) system. The levels were minor, serious, major, critical, and failure, and they were coordinated with a newly instituted Technical Difficulty ranking (see the sketch after this list).
• Operational processes were put into place for turning MRs into long-term development priorities and for raising awareness of user models and their requirements. These included forming a new priority-setting committee composed of the directors of computer science, life sciences, and usability, along with the lead developer and project manager.
• Informal and highly collaborative processes between developers, web designers, usability evaluators, and scientists were implemented to assure rapid prototyping and feedback.
• A research project was initiated into design requirements based on heuristic evaluation findings and user models.
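As a minimal sketch of how MR records carrying usability severity and Technical Difficulty rankings could drive triage, assuming invented field names and rankings (this is not the actual MR system):

```python
# Hypothetical sketch of a modification request (MR) record carrying the
# usability severity and Technical Difficulty rankings described above.
# All field and function names are invented for illustration.
from dataclasses import dataclass

SEVERITY_RANK = {"failure": 5, "critical": 4, "major": 3, "serious": 2, "minor": 1}

@dataclass
class ModificationRequest:
    title: str
    usability_severity: str    # one of SEVERITY_RANK
    technical_difficulty: int  # e.g., 1 (trivial) .. 5 (hard)

def triage(mrs):
    """Order MRs so severe, low-effort problems (e.g., broken links) surface first."""
    return sorted(mrs, key=lambda mr: (-SEVERITY_RANK[mr.usability_severity],
                                       mr.technical_difficulty))

mrs = [ModificationRequest("Broken 'Top of page' link", "major", 1),
       ModificationRequest("No query reformulation hints", "serious", 3)]
for mr in triage(mrs):
    print(mr.title)
```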
4 Results
4.1 Evaluation Outcomes
Conducting the heuristic evaluations took on average two hours per evaluator. Summarizing added another few hours to the effort. Sample summary outcomes are shown in Table 1; those with agreed-upon high severity are highlighted.
Table 1. Sample of results summarized from heuristic evaluations

| Heuristic | Problem severity / agreement | Problem(s) | Design change |
| 1. Currency of the tool web pages | High / agreement | No date present | Indicate last update to web pages |
| 2. Readable text | High / agreement | Small font | 12 point font |
| 3. Hints for formulating a query for better results | High / agreement | No hints available | Need query hints when the query fails |
| 4. Able to undo, redo, go back | High / agreement | No history tracking | Provide history tracking |
| 5. Broken links | High / agreement | "Top of page" is broken | Fix [list of broken links] |
| 6. Examples included and prominent | Range / no agreement (high to low) | Could use more examples and better emphasis | Add 1-2 (bolded) examples under the search box |
| 7. Currency of the data; data sources cited | Range / no agreement (high to low) | Versions of dbs are listed but no dates of latest updates | Add a date for last update to our database |
| 8. Clearly shows if no results occur | Range / no agreement (high to low) | Shows, but the message isn't clear | Change message to: [suggestion] |
| 9. Able to change result organization | Range / no agreement (high to low) | Sort is available but not apparent | Need note that columns are sortable |
| 10. Vital statistics are available | Range / no agreement (high to low) | What would those stats be? | No agreement |
| 11. Information density is reasonable | Range / no agreement (high to low) | A lot of whitespace; too many sections | Get rid of the 5 nested boxes; no agreement |
| 12. Clear what produced the query results | Range / no agreement (high to low) | Not clear where the search term is "hitting" | Should redisplay search term so user ties it to results |
| 13. Clear why results seem high or low | Middle / agreement | No explanations; I assume informed user knows why | No agreement |
| 15. Can access necessary data for validating | Low / agreement | Not sure what the data would be | |
As Table 1 shows, highly ranked problems involved broken or missing features and web-page omissions that could be remedied without programming. Middle-ranked problems were tied more to user task needs and to subjective issues such as what constitutes "enough" support or the criteria scientists use for judging reliability and validity. Problems with little agreement about severity level were tied even more to evaluators having to project and evaluate the importance of scientists' task needs in this domain. For example, evaluators varied widely in judging the importance of validation in scientists' ways of reasoning and knowing. Some actual problems were not caught by the heuristic evaluations, especially those involving novel and unexpected ways users might interact with the interface; these findings were provided by the field studies. Additionally, evaluators' comments and the summarized design changes ranged from precise to vague. Typically, design changes for familiar problems in
generic UI design were precise; those tied to user task models for systems biology and complex exploratory analysis were not.

4.2 Integrating Organizational Processes
Interpretations of and actions taken on the outcomes of the heuristic evaluations took the following organizational course. As noted in Methods, design changes were entered into the MR system and ranked for severity and degree of development effort. Low-cost problems at the levels of failure, critical, major, and serious (e.g., broken links) were delegated and fixed immediately. Concurrently, we examined areas where the heuristic evaluation outcomes combined with problems pertinent to scientists' demonstrated practices in the field (as captured in the user models). From these analyses, important combinations of problems found by the HE surfaced, combinations that implied problems related to higher-order cognitive task needs. For example, problems 3, 6, 8, 9, 12, and 13 in Table 1 were observed as a recurrent cluster in the field observations as part of scientists' higher-order task of locating interesting genes and relationships expediently. For this task, scientists progressively narrow down result sets based on several meaningful attributes and on validity standards, such as genes/gene products enriched for neurodegenerative processes. Once combined, this set of HE problems revealed scientists' difficulty manipulating queries and output sufficiently to uncover potentially interesting relationships. Thus, beyond easy fixes (e.g., column cues for sorting), deeper implications for a tool's actual usefulness were uncovered by the combined HE problems and user model. Shaped by the user models developed at our center and by ongoing research in our center into design requirements, issues like the example above were presented to the usability and development teams and then brought to the priority-setting committee. For example, problems related to users being able to narrow down to interesting results led to realizations that the tool needed to provide a more powerful search mechanism, extensive indexing, and interfaces that allowed users to construct and revise queries across multiple dimensions. Another priority-setting issue suggested by the HE outcomes and better understood through the user models was the need for specific types of additional content for users' validation purposes. Both needs received high priority. Additionally, as the software developers became more aware of the value of these usability techniques, we started to get requests for the heuristic evaluation instrument itself, so that programmers could keep the criteria in mind while developing their software.
5 Discussion
Developing the heuristic evaluation instrument was an iterative process, as the evaluators discovered its weaknesses and strengths during the course of the evaluations. Many of the heuristics turned out to be redundant and were either combined or discarded. Close inspection of the tools also engendered new heuristics as evaluators noticed additional usability problems. Accompanying comments proved to be crucial and were made mandatory for any problems found in later evaluations. The severity
numbering system also proved to be too abstract and will be replaced by ratings that mirror the ones used in the MR system. Finally, some heuristics in the instrument proved too theoretical or complex to be useful (e.g., "salient patterns stand out") and had to be removed or refined. Some of these difficult heuristics were less concrete and were often better suited to incorporation and analysis within the user model. In tool assessments, heuristics alone identified isolated problems and a few inaccuracies. Combined with the user model, the heuristic evaluations enabled us to uncover problems related to the integrated tasks associated with scientists' higher-order analysis and reasoning. Evaluators' written comments, omissions, imprecision in some proposed design changes, and lack of agreement about certain items were vital in cuing us to further examine particular problems, or combinations of problems, in light of the user models. Had time and resources permitted, reliability testing would have diminished disagreements. A positive unintended consequence of these disagreements, however, was that they revealed where developers' awareness of user tasks was incomplete. For example, in the heuristic evaluations, comments about "the ability to change the organization of results" indicated that the tool did not make it obvious that columns could be sorted. The user model revealed, however, that the non-obvious sorting was only one shortcoming related to this specific heuristic. In actual practice, scientists' analysis and judgments required tools to provide a combined set of sorting-and-filtering interactions to rearrange results into multidimensional groupings, i.e., interesting relationships. Reframed to account for this need, this problem led to high-priority, enhanced functionality. Unlike in Hollinger's study, many usability problems, framed in ways that join heuristic evaluation outcomes and user models, were given high-priority status. For such achievements, collaborations across specialties were critical, both formally and informally. Developers, web specialists, project managers, scientifically expert bioinformatics specialists, and the usability, scientific, and computer science directors all played distinct roles in shaping the perspectives needed for strategically determining, and then implementing, a better match between tools and systems biology tasks. In the process, people across specialties grew increasingly aware of each other's perspectives and began slowly evolving a shared language for articulating them. This process is often termed "double-loop learning" and is essential for innovation [1]. One example of this cross-organizational learning is the software developers' requests for the heuristics to help guide software development. Vital to this learning, and to the common grounding on which it rests, is the perennial challenge of assuring that heuristics are expressed in the right grain size and language. As with other research focused on this goal, our center's efforts have highlighted places to make heuristics more concrete and ways to join outcomes with user models.
6 Conclusions
In our center's case, collaborative communication, shared language, and greater awareness, i.e., double-loop organizational learning, were integrated into and developed from heuristic evaluations. We found a way to use this discount usability inspection method, combined with user models and newly implemented organizational processes,
to reframe problems and to gain buy-in for short- and long-term usability improvements aimed at scientists' cognitive task behaviors. Heuristic evaluations coupled with user modeling revealed problems related to the higher-order cognitive flow of analysis. Combined with organizational and software development processes that encouraged attention to usability, heuristic evaluations produced results and recommended changes that received high priority. Moreover, developers and directors who previously had not considered usability in choices they made about knowledge representations or functionality grew increasingly sensitive to the implications of their choices from a user perspective. Our center continues to refine the instrument and apply it to other tools, and is simultaneously creating a complementary instrument for heuristic evaluation of interactive visualizations in bioinformatics tools.
References
1. Argyris, C., Schön, D.: Organizational Learning II: Theory, Method and Practice. Addison-Wesley, Reading (1996)
2. Chattratichart, J., Lindgaard, G.: A comparative evaluation of heuristic-based usability inspection methods. In: Proceedings of ACM CHI 2008 Conference, pp. 2213–2220. ACM Press, New York (2008)
3. Cockton, G., Woolrych, A.: Understanding inspection methods: lessons from an assessment of heuristic evaluation. In: Blandford, A., Vanderdonckt, J. (eds.) People & Computers XV, pp. 171–192. Springer, Berlin (2001)
4. Frokjaer, E., Hornbaek, K.: Metaphors of human thinking for usability inspection and design. ACM Transactions on Computer-Human Interaction 14, 1–33 (2008)
5. Gerhardt-Powals, J.: Cognitive engineering principles for enhancing human-computer performance. International Journal of Human-Computer Interaction 8, 189–211 (1996)
6. Hartson, H., Andre, T.S., Williges, R.: Criteria for evaluating usability evaluation methods. International Journal of Human-Computer Interaction 13, 373–410 (2001)
7. Hollinger, M.: A process for incorporating heuristic evaluation into a software release. In: Proceedings of AIGA 2005 Conference, pp. 2–17. ACM Press, New York (2005)
8. Law, E.L.-C., Hvannberg, E.T.: Analysis of strategies for improving and estimating the effectiveness of heuristic evaluation. In: Proceedings of ACM NordiCHI 2004, pp. 241–250. ACM Press, New York (2004)
9. Mirel, B.: Supporting cognition in systems biology analysis: findings on users' processes and design implications. Journal of Biomedical Discovery and Collaboration (forthcoming)
10. Nielsen, J.: Heuristic evaluation. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection Methods. John Wiley, Chichester (1994)
11. Nielsen, J.: Enhancing the explanatory power of usability heuristics. In: Proceedings of ACM CHI 1994 Conference, pp. 152–158. ACM Press, New York (1994)
12. Schmettow, M., Vietze, W.: Introducing item response theory for measuring usability inspection processes. In: Proceedings of CHI 2008, pp. 893–902. ACM Press, New York (2008)
First Impression
• Does the tool fit the overall NCIBI look and feel? Does it look professional?
• Is the tool appropriately branded with funding source and NCIBI, CCMB, and UM logos? Does the tool link back to UM, CCMB, and NCIBI?
• Is it clear what to do and what to enter? (Limitations are clear, how to format the query is clear, what options the user has; if a user needs to enter terms from some taxonomy/ontology, access to those terms is available for the user to choose from.)
• Are there examples shown, and are they prominent?
• Is the display consistent with user conventions for web pages/apps?
• Is it clear why to use the tool and to what purpose?
• Does it require minimal steps to get started quickly? Is the cursor positioned in the first field that requires entry?
• Is help readily available?
• Is it clear how current the data are? Is it clear how current the website is?
• Are data sources cited and identified? Are appropriate publications cited?
• Are there any broken links?
• Are the page titles (displayed at the top of the browser) meaningful, and do they change for different pages?
• Are page elements aligned (e.g., in a grid) for readability?
• Is the site readable at 1024x768 resolution? Is the text readable (e.g., size, font, contrast)?
• Does the page have appropriate metadata tags for search engines?

Search / Results
• Is the length of processing time acceptable? Do adequate indicators show system status and how long it may take?
• Does it clearly show if there are no query results? Does it clearly show how many results the query produces?
• Is it clear what produced the query results?
• Is it easy to reformulate the query if necessary? Are there hints/tips for reformulating the query for better results?
• If the query results seem high or low, is it clear why?
• Are the results transparent as to what is being shown and how to interpret it?
• Are the results displayed clearly and not confusingly?
• Is there an ability to detect and resolve errors?

Interaction with Results
• Is there an ability to filter or group large quantities of data?
• Is there an ability to change the organization of results?
• Is there an ability to undo, redo, or go back to previous results?
• Are the mechanisms for interactivity clear? Is the logic of the organization clear?
• Are different data items (e.g., rows) kept clearly separate or delineated?
• If there are links, is it clear where they go? If there are icons, is it clear what they do? Do the link-outs provide reliable return?
• Are the vital statistics/counts of information available?
• Do the names/labels adequately convey the meaning of items/features?
• Are data items kept short? Is there too much/little information? Is the density of information reasonable?
• Can you access the necessary data to assure validity (e.g., sources)?
• Can results be saved? Are the results available for download in other formats? Can the pages be easily printed?
• Is vertical scrolling kept to a minimum? Is there horizontal scrolling?

Comments
Additional comments go here.
A Prototype to Validate ErgoCoIn: A Web Site Ergonomic Inspection Technique
Marcelo Morandini 1, Walter de Abreu Cybis 2, and Dominique L. Scapin 3
1 School of Arts, Science and Humanities, University of Sao Paulo, Sao Paulo, Brazil
m.morandini@usp.br
2 Ecole Polytechnique Montreal, Canada
walter.cybis@polymtl.ca
3 Institut National de Recherche en Informatique et en Automatique, Rocquencourt, France
dominique.scapin@inria.fr
Abstract. This paper presents current actions, results, and perspectives concerning the development of the ErgoCoIn approach, which allows non-expert inspectors to conduct ergonomic inspections of e-commerce web sites. An environment supporting inspections based on this approach was designed, and a tool is being developed in order to carry out its validation plan. Beyond this validation, the planned actions will allow us to analyze the task of applying checklists and to specify an inspection support environment especially fitted to it. This is of great importance, as the environment is intended to be an open web service supporting ergonomic inspections of web sites from different domains. A wiki environment for the tool's development is also proposed.

Keywords: Usability, Evaluation, Web Sites, Inspection, Web 2.0.
usability it can afford to its users [6]. Considering the software product quality model proposed by ISO 9126¹, ergonomics may be understood as an external quality of the software, while usability is the quality of its use [8]. Methods aimed at measuring usability (usability tests) are known to be usually expensive and complex [13]. Alternatively, the ergonomics of user interfaces can be evaluated or inspected faster and at lower cost. A simple differentiation between evaluations and inspections can be established based on the type of knowledge applied to the judgments involved in each technique: evaluators apply mainly the implicit knowledge they have accumulated from study and experience, while inspectors apply primarily explicit knowledge supported by documents such as checklists. Inspectors cannot produce fully elaborated or conclusive diagnoses, but their diagnoses are comparatively coherent and generally obtained at low cost. ErgoCoIn [5] is an approach designed to support inspectors in performing objective ergonomic inspections of web sites. With the goal of improving the quality of the diagnoses, this approach postulates several considerations about the web site's context of use, including user, task, and environment attributes; among these must be considered the attributes of the interface of the web site under evaluation [9]. The content of interviews and questionnaires, as well as of the other contextual data-gathering activities, is based on the information demands presupposed by the approach's knowledge base. This strategy allows specific, objective ergonomic inspections: only pertinent information gathering is proposed to the inspectors in the context-of-use analysis, and only applicable questions are presented to them while inspecting the web site. The ErgoCoIn checklists can support inspectors by providing more homogeneous results than those produced by ergonomic experts. This is an obvious consequence of having inspectors apply the same set of checklist questions and share decisions about their relative importance. This approach is interesting to web site designers and evaluators because the questionnaires and checklists can be applied by the design staff, who are not necessarily experts in usability evaluation. Thus, inspections can usually be performed quickly and at low cost. It can also be seen as a way to introduce ergonomic concepts to designers and to stimulate them, in their daily work, to question human factors specialists when facing potentially serious ergonomics problems. In this paper we present details of both the ErgoCoIn logical architecture and the tool built to validate the approach against its main requirements: (i) low cost, (ii) objectivity, and (iii) homogeneity of inspection diagnoses. The other requirements identified include the variety and novelty of the knowledge base. In order to fulfill these requirements, we propose a collaborative effort aimed at ensuring that the ErgoCoIn knowledge base can be enriched continuously. We believe that inspections supported by an environment incorporating these features can be more efficient and reliable.
¹ In fact, ISO 9241:11 and ISO 9126:1 do not agree completely about the terminology concerning the "a priori" and the "a posteriori" perspectives of usability. While the first standard employs "ergonomics" and "usability," the second employs "usability" and "quality in use" to denote these perspectives.
This paper contains five sections. Section 2 presents an overview of the ErgoCoIn approach. Section 3 presents the logical architecture of the support environment and introduces the tool being developed to validate the approach. Section 4 presents the motivation and proposal for developing a cooperative Wiki-ErgoCoIn. Finally, Section 5 presents conclusions regarding the environment's future development and use.
2 The ErgoCoIn Approach
The development of the ErgoCoIn approach has been motivated by four considerations: (1) web site development has become achievable for a large spectrum of designers (through easily available design tools), not necessarily skilled in computer science or in ergonomics; (2) web sites are often designed in a fast, low-cost design process supported by inexpensive tools, which may lead designers to include numerous and sometimes obvious ergonomic flaws; (3) usability evaluations using "traditional" methods can be expensive; and (4) their results may lack homogeneity [5]. The approach is divided into two main phases: web site Contextual Analysis and Ergonomic Inspection of the components and their attributes (see Figure 1). The Co-Description Phase is based mainly on surveys. Before conducting questionnaires and interviews, inspectors must identify the components of the user interface that will be inspected. The reason is to guarantee that, during the surveys, the inspectors will collect only the contextual data appropriate to inspections of the actual user interface components. Surveys are conducted with both users and designers. From users, inspectors gather data concerning their profile, work environment, and the strategies they apply to accomplish tasks using the web site; task strategies are described simply as the sequence of pages that users may access when accomplishing their goals. Satisfaction issues should also be gathered in the user surveys. From designers, inspectors gather information about the expected context of use, including data concerning the user profile and task strategies. Results from the surveys are examined in order to compare context-of-use elements and particular task strategies as prescribed by users and by designers. The second phase of the approach consists of ergonomic inspections based on checklists. These checklists are distinguished by their organization and content; specifically, they are defined as a set of checklist items organized according to the Ergonomic Criteria [13] and basically related to the ergonomics of web sites supporting e-commerce initiatives. This question-based approach was built from the examination of a large collection of ergonomic recommendations compiled by INRIA researchers [1,14]: each selected recommendation was reformulated as a question and associated with one ergonomic criterion. Like any other inspection dynamic, the application of each ErgoCoIn inspection question follows three decision phases: applicability, weighting, and adherence. For objectiveness, the checklists should propose only questions that are applicable to the actual web site context of use and interface components. This is ensured by having all questions in the ErgoCoIn knowledge base properly indexed to the context-of-use aspects (user, task, environment, and interface) gathered from both users and designers.
Fig. 1. The ErgoCoIn Approach Framework
Further, each applicable question has to be weighted in order to allow the production of properly ranked results. Particular decisions about what is most important to consider when inspecting e-commerce web sites were taken by the ErgoCoIn designers, but they can be modified by inspectors when inspecting web sites from different application domains. For simplicity, the level of importance of an ergonomic criterion may define the level of importance of each individual question associated with it. Finally, the user interface's adherence to a question (or requirement) must be judged by the inspectors. They do so based on the information concerning the ergonomic requirements or questions (explanations, examples, and counter-examples) and on the data describing the web site's context of use (concerning users, tasks, and environment). The ErgoCoIn approach also presupposes that information about the context of use is collected directly from users and designers with the support of questionnaires and/or interviews. As a consequence, the approach can only be applied to web sites that are being used regularly; furthermore, some designers and users must be available for interviews or, at least, able to answer questionnaires. The ErgoCoIn approach was designed to allow extensions and instantiations. The question base can be extended to cover other perspectives beyond e-commerce, such as e-learning. Ergonomic Criteria and their associated questions can be ranked differently in order to weight the questions in accordance with the context of use of the web site under inspection. Another kind of extension being considered concerns the integration of results from the analysis of usage log data. Such data can be collected using specific software tools for this purpose. In fact, a usability-oriented
web analyzer called UseMonitor is being developed and associated with the ErgoCoIn approach [4]. This tool can present warnings about the "a posteriori" perspective on usability problems, i.e., interaction perturbations occurring while users interact with the web site to accomplish their goals. Basically, UseMonitor can indicate when the observed efficiency rate is particularly low. Detailed efficiency indications concern the rates of, and time spent on, unproductive user behaviors such as error recovery, help seeking, hesitation, deviation, and repetition. Further, UseMonitor can indicate the web pages related to these perturbations. A logical architecture based on the integration of (i) a typology of usability problems, (ii) the ergonomic criteria/recommendations, and (iii) a model of interface components is also being defined. This will allow UseMonitor to warn inspectors about the detailed interface aspect causing an actual usability perturbation (an a posteriori result), while ErgoCoIn helps inspectors identify the user interface component responsible for the perturbation and indicates how to fix it (an a priori result). The integration of ErgoCoIn and UseMonitor defines the ErgoManager environment [4]. As a tool for usability evaluation, this environment will automate both processes: failure identification (by log analysis) and failure analysis (by guidelines processing) [1]. Details of this architecture are being defined and will be presented in future publications.
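As a minimal sketch of the kind of efficiency indicator described above, assuming an invented log format and behavior labels (UseMonitor's actual implementation is not described here):

```python
# Hypothetical sketch of an efficiency indicator of the kind UseMonitor is
# described as computing: time on unproductive behaviors vs. total task time.
# Event names and the log format are invented for illustration.
UNPRODUCTIVE = {"error_recovery", "help", "hesitation", "deviation", "repetition"}

def efficiency_rate(events):
    """events: list of (behavior, seconds) tuples parsed from a usage log."""
    total = sum(t for _, t in events)
    wasted = sum(t for b, t in events if b in UNPRODUCTIVE)
    return 1 - wasted / total if total else None

log = [("navigation", 40), ("hesitation", 15), ("error_recovery", 25), ("form_fill", 20)]
print(f"efficiency = {efficiency_rate(log):.0%}")  # low values flag pages to inspect
```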
3 The ErgoCoIn Environment and Validation Tool
A computerized environment was designed to support mainly the data capture involved in the inspection and inquiry techniques proposed by the current configuration of the ErgoCoIn approach [10]. Contextual analysis will be supported by two collectors, each consisting basically of a series of forms. The Contextual information collector is aimed at guiding inspectors while gathering information from designers and users. The Web site description collector will collect data describing the web site's functions and interface components. The description questions used by these collectors are extracted from the environment's Knowledge base. The data gathered in this phase (contextual data and site description) is stored in a Context of use database. The support for Ergonomic Inspections starts with an Analytic evaluator, a system component that compares users' and designers' information concerning the intended and actual context-of-use features. This component verifies the existence of designers' misconceptions about users' features and, if necessary, sends warnings to the Checklist builder. The main function of this builder is to create checklists covering the overall web site and its pages according to the task strategies described by users and designers. It can highlight questions that could reveal ergonomic flaws due to a lack of correspondence between users' and designers' views of the context of use. These checklists will propose only applicable questions, arranged according to their level of importance. A default order of importance is suggested, but it can be modified by the inspectors in light of the characteristics of the current web
site site's context of use. The inspectors' judgments will also be supported by the Ergonomic judgment support tool, which will supply them with data about the context of use as well as information about the questions. In order to validate the ErgoCoIn approach, we are developing a tool that follows the general architecture presented in Figure 2. The validation strategy consists of employing this tool to support different inspectors in accomplishing inspections of different web sites, and of analyzing measures of the effectiveness and efficiency of their actions as well as the homogeneity of their results.
Fig. 2. Overview of the Logical Architecture of the ErgoCoIn Environment
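To make the Checklist builder's applicability-and-weighting logic concrete, here is a minimal sketch; the question records, tags, and criterion weights are invented for illustration and do not reflect ErgoCoIn's actual knowledge base:

```python
# Hypothetical sketch of the Checklist builder logic described above:
# keep only questions whose context-of-use tags match the site at hand,
# then rank them by the importance assigned to their ergonomic criterion.
questions = [
    {"text": "Are data sources cited?", "criterion": "Guidance",
     "tags": {"e-commerce", "results-page"}},
    {"text": "Can the user undo a purchase step?", "criterion": "User control",
     "tags": {"e-commerce", "checkout"}},
]
criterion_weight = {"User control": 3, "Guidance": 2}  # inspector-adjustable ranking

def build_checklist(site_tags):
    applicable = [q for q in questions if q["tags"] & site_tags]
    return sorted(applicable, key=lambda q: -criterion_weight[q["criterion"]])

for q in build_checklist({"e-commerce", "results-page", "checkout"}):
    print(q["criterion"], "-", q["text"])
```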
Based on the ErgoCoIn logical architecture, we have modeled data entities and created Entity-Relationship Models. We have also designed a Use-Case Map as well as Sequence Diagrams for the main tasks. Figure 3 presents the Use Case Diagram for several registering tasks. Interactions for registering almost all kinds of data defined in the Entity-Relationship Model were designed according to the CDU (Create, Update & Delete) model. They include the registering of inspectors, users, designers, web sites, tasks, web pages, interface components, ergonomic criteria, and questions, among other entities (see Figure 4). In doing so, we ensure that interactions are quite homogeneous across the tool's interface. An exception is the interaction aimed at changing the relative importance of the ergonomic criteria (see Figure 5). The first cycle of ErgoCoIn's implementation took place immediately after the conclusion of the design activities mentioned above. The first prototype is mainly concerned with ergonomic inspections; this version features a total of 182 registered questions linked to the 18 properly ranked Ergonomic Criteria.
Fig. 3. Use Case Diagram for the ErgoCoIn Validation Tool
Fig. 4. ErgoCoIn’s Users Storing Screen
The next step of development will focus on the functions supporting the activities of the other phases: Co-Description (screens concerned with the user and designer questionnaires) and Inspection Reports (see Figure 1). The Ergonomic judgment support tool will be developed in the future as well. Once the tool is completed, we will begin cycles of validation studies focusing not only on the tool but also on the underlying approach. These cycles will consist of phases of (i) planning, (ii) inspection execution, (iii) results analysis, and (iv) revision proposals. In each cycle, a number of inspectors will be invited to use the tool to perform inspections of a given e-commerce web site. The results from all inspectors, as well as the logs of their actions, will be gathered and analyzed from the homogeneity and objectiveness points of view [3]. The goal behind the revision proposals is to make inspections more objective and reports more coherent. Validation cycles will be repeated until the expected objectiveness and homogeneity criteria have been reached.
The inspection cycles will allow us to better understand how ergonomic inspections of web sites are accomplished, and to specify a tool especially fitted to those tasks. Indeed, we intend to specify an ErgoCoIn user interface able to support inspectors all over the world in performing ergonomic inspections of web sites from different domains, not only e-commerce. The idea is to offer the tool to those who want to perform inspections and who want to contribute to the enrichment of the ErgoCoIn knowledge base and programming code.
4 The Wiki-ErgoCoIn
We propose to change the scope of the ErgoCoIn development in order to support a collaborative initiative. Indeed, this kind of initiative is among the most interesting phenomena observed in the recent history of the web. Collaboration is enabled by special functions offered by web sites that allow users to create, share, and organize content by themselves. The best examples of socially constructed web sites are Facebook, YouTube, Flickr, Digg, del.icio.us, and Wikipedia. In particular, Wikipedia is the most successful example of collaboration concerning scientific content on the web. This socially constructed encyclopedia features remarkable internet traffic, being the 9th most visited web site on the whole Web. From 2001 to now, 7.2 million articles have been posted on Wikipedia, produced by 7.04 million editors following style and ethics rules [16]. Wilkinson and Huberman [17] performed a study of 52.2 million edits to 1.5 million articles in the English-language Wikipedia posted by 4.79 million contributors between 2001 and 2006. They split out a group of 1,211 "featured articles," whose accuracy, neutrality, completeness, and style are assured by Wikipedia editors. Comparisons between featured and normal articles showed a strong correlation among article quality, the number of edits, and the number of distinct editors. In the same study, the authors could associate the attractiveness of the articles (number of visits) with the novelty of edits.
The goal of making ErgoCoIn a collaborative web initiative is to increase the generality and attractiveness of its contents as well as the quality of the results the approach can afford. Indeed, the Wiki-ErgoCoIn is being designed to allow ergonomic inspectors all over the world to share efforts and responsibilities concerning the extension and generalization of the ErgoCoIn knowledge base. In doing so, we can expect that the Wiki-ErgoCoIn will always feature newly proposed questions concerning the ergonomics of web sites from different application domains, interface styles, and components. Contributions should fulfill a basic requirement: follow free-content collaboration rules like those developed by Wikipedia. We believe that the results obtained by such a cooperative approach can be much more efficient and reliable than those that would be obtained solely by individual initiatives.
5 Conclusions
ErgoCoIn is an inspection approach strongly based on knowledge about the ergonomics of web site user interfaces. This knowledge is intended to guide inspectors in undertaking contextual data gathering and analysis, checklist-based inspections, and reporting. In this paper we described the details of this approach and the environment designed to support it. We also introduced the tool under development to validate its structure and contents. We will perform the validation activities in cycles of application, analysis, and revision until the approach reaches the expected objectiveness and homogeneity goals. The success of the ErgoCoIn initiative, however, depends basically on the variety and the novelty of its knowledge. At present, the approach is tied to the ergonomics of current e-commerce web applications and interface technologies, styles, and components. All these aspects evolve continuously, and e-commerce alone may be a very limited scope. Consequently, actions must be undertaken to face the challenge of continuously keeping ErgoCoIn's contents up to date and varied enough to support the production of inspection reports in different web site domains. An open initiative is being proposed by which anybody knowledgeable will be authorized to contribute to the enrichment of the Wiki-ErgoCoIn knowledge base. Consultative and executive boards will be created to define strategies and policies concerning the implementation of this ergonomics inspection wiki. Requests for participation are planned to be addressed directly to the authors.
References
1. Brajnik, G.: Automatic Web Usability Evaluation: What Needs to be Done? In: 6th Conference on Human Factors and the Web, Austin, Texas, USA (2000)
2. Cybis, W.A., Scapin, D., Andres, D.P.: Especificação de Método de Avaliação Ergonômica de Usabilidade para Sites/Web de Comércio Eletrônico. In: Proceedings of the 3rd Workshop on Human Factors in Computer Systems, Gramado, vol. I, pp. 54–63. Sociedade Brasileira de Computação, Porto Alegre (2000)
3. Cybis, W.A., Tambascia, C.A., Dyck, A.F., Villas Boas, A.L.C., Pagliuso, P.B.B., Freitas, M., Oliveira, R.: Abordagem para o desenvolvimento de listas de verificação de usabilidade sistemáticas e produtivas. In: Proceedings of the Latin American Congress on Human-Computer Interaction, Rio de Janeiro, vol. I, pp. 29–40 (2003)
4. Cybis, W.A.: UseMonitor: suivre l'évolution de l'utilisabilité des sites web à partir de l'analyse des fichiers de journalisation. In: Actes de la 18eme Conférence Francophone sur l'Interaction Humain-Machine, Montréal, vol. 1, pp. 295–296. ACM, New York (2006)
5. Cybis, W.A.: ErgoManager: a UIMS for monitoring and revising user interfaces for Web sites. Institut National de Recherche en Informatique et en Automatique, Rocquencourt, Research report (2005)
6. Cybis, W.A., Betiol, A., Faust, R.: Ergonomia e usabilidade: conhecimentos, métodos e aplicações. Novatec Editora, São Paulo (2007)
7. Farenc, C., Bastide, R.: Towards Automated Testing of Web Usability Guidelines. In: Tools for Working with Guidelines, pp. 293–304. Springer, London (2001)
8. ISO/DIS 9126: Software Engineering – Product Quality – Part 1: Quality Model. International Organization for Standardization (1997)
9. ISO/DIS 9241: Ergonomic Requirements for Office Work with Visual Display Terminals – Part 11: Guidance on Usability. International Organization for Standardization (1997)
10. Ivory, M.Y., Hearst, M.A.: The State of the Art in Automating Usability Evaluation of User Interfaces. ACM Computing Surveys 33(4) (December 2001)
11. Leulier, C., Bastien, J.M.C., Scapin, D.L.: Compilation of Ergonomic Guidelines for the Design and Evaluation of Web Sites. Commerce & Interaction (EP 22287), INRIA Report (1998)
12. Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., Kirakowski, J.: Comparative Evaluation of Usability Tests. In: Proceedings of the Usability Professionals' Association Conference (1998)
13. Scapin, D.L., Bastien, J.M.C.: Ergonomic Criteria for Evaluating the Ergonomic Quality of Interactive Systems. Behaviour and Information Technology 16(4/5) (1997)
14. Scapin, D.L., Leulier, C., Vanderdonckt, J., Mariage, C., Bastien, C., Palanque, P., Farenc, C., Bastide, R.: Towards Automated Testing of Web Usability Guidelines. In: Tools for Working with Guidelines, pp. 293–304. Springer, London (2001)
15. WAMMI: Website Analysis and MeasureMent Inventory (Web Usability Questionnaire) (2005), http://www.ucc.ie/hfrg/questionnaires/wammi (accessed 2009)
16. Wikipedia, http://www.wikipedia.org (accessed February 2009)
17. Wilkinson, D., Huberman, B.: Assessing the value of cooperation in Wikipedia. First Monday 12(4) (2007), http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1763/1643
Mobile Phone Usability Questionnaire (MPUQ) and Automated Usability Evaluation
Young Sam Ryu
Ingram School of Engineering, Texas State University-San Marcos,
601 University Drive, San Marcos, TX 78666, USA
yryu@txstate.edu
Abstract. The mobile phone has become one of the most popular products amongst today’s consumers. The Mobile Phone Usability Questionnaire (MPUQ) was developed to provide an effective subjective usability measurement tool, tailored specifically to the mobile phone. Progress is being made in the HCI research community towards automating some aspects of the usability evaluation process. Given that this effort is gaining traction, a tool for measurement of subjective usability, such as MPUQ, may serve as a complement to automated evaluation methods by providing user-centered values and emotional aspects of the product. Furthermore, experimental comparison of MPUQ assessments and automated usability analysis may enable researchers to determine whether automated usability tools generate metrics that correlate with user impressions of usability. Keywords: Usability, mobile user interface, subjective measurement, questionnaire, automating usability.
and 115 applicable to PDA/handheld PCs) were retained from an original pool of 512 items. To increase the reliability and validity of this draft questionnaire, follow-up studies employing psychometric theory and scaling procedures were performed. To evaluate the items, the draft questionnaire was administered to a representative sample of approximately 300 participants. The findings revealed a six-factor structure: (1) Ease of learning and use, (2) Assistance with operation and problem solving, (3) Emotional aspect and multimedia capabilities, (4) Commands and minimal memory load, (5) Efficiency and control, and (6) Typical tasks for mobile phones. The 72 items with the greatest discriminative power relating to these factors were chosen for inclusion in the Mobile Phone Usability Questionnaire (MPUQ), which evaluates mobile phones for the purpose of making decisions among competing variations in the end-user market, among alternative prototypes during the development process, or among evolving versions during an iterative design process.

Table 1. Development procedure of MPUQ
| Phase | Goal | Approach |
| I | Generate and judge measurement items for the usability questionnaire for electronic mobile products | Consider construct definition and content domain to develop the questionnaire for the evaluation of electronic mobile products, based on an extensive literature review: • Generate potential questionnaire items based on essential usability attributes and dimensions for mobile phones • Judge items by consulting a group of experts and users, focusing on the content and face validity of the items |
| II | Design and conduct studies to develop and refine the questionnaire | Administer the questionnaire to collect data in order to refine the items by: • Conducting item analysis via factor analysis • Testing reliability using the alpha coefficient • Testing construct validity using known-group validity |
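Phase II's reliability check uses the alpha coefficient. As a minimal sketch, assuming made-up Likert responses (rows are respondents, columns are items), Cronbach's alpha can be computed as follows; this is an illustration, not the study's actual analysis:

```python
# Minimal sketch of a Phase II reliability check (Cronbach's alpha),
# using invented Likert responses: rows = respondents, columns = items.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                       # number of items
    item_vars = scores.var(axis=0, ddof=1)    # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

responses = [[5, 4, 4, 5], [3, 3, 2, 3], [4, 4, 5, 4], [2, 3, 2, 2]]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```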
2 Automated Usability Evaluation and MPUQ
Subjective usability measurements focus on an individual's personal experience with a product or system. According to Ivory and Hearst [5], automation of usability evaluation does not capture important qualitative and subjective information. However, it is not yet known whether subjective impressions of usability are in fact correlated with the metrics that automated usability approaches can capture. By conducting a subjective usability evaluation, using a questionnaire, of the same interface that has been modeled with an automated usability prediction tool such as CogTool [6], we can perhaps determine whether a metric such as time taken to complete tasks correlates with subjective impressions of usability. One of the greatest advantages of using questionnaires in usability research is that they can quickly and economically provide evaluators with feedback from the users' point of view [7-9]. Since user-centered and participatory design is one
of the most important aspects of the usability engineering process [10], questionnaires, applied with or without other more ambitious methods, can be a valuable tool, provided that the respondents are validated as representative of the whole user population. There are many usability aspects or dimensions for which no established objective measurements exist; these may only be measured by subjective assessment. New usability concepts suggested for the evaluation of consumer electronic products, such as attractiveness [11], emotional usability [12], sensuality [13], and pleasure and displeasure in product use [14], seem to be quantified effectively only by subjective assessment, and these concepts are increasingly important. The MPUQ incorporates these dimensions; most of them fall under factor (3), Emotional aspect and multimedia capabilities. While items in the other factor groups can be covered by other usability evaluation methods, the emotional aspects cannot presently be captured by any practical approach other than subjective measurement.
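As a rough sketch of the comparison proposed in this section, the snippet below correlates per-phone MPUQ scores with task times predicted by an automated tool such as CogTool; all values and variable names are invented for illustration:

    # Hypothetical sketch of the proposed comparison: correlate subjective
    # MPUQ scores with automated task-time predictions. A strong negative
    # correlation would suggest that faster predicted tasks go with better
    # subjective usability ratings. All numbers are invented.
    from scipy import stats

    mpuq_scores = [4.2, 3.1, 5.5, 2.8, 4.9, 3.7]            # mean MPUQ rating per phone
    predicted_times = [41.0, 58.5, 33.2, 63.1, 38.7, 50.4]  # modeled task time (s)

    r, p = stats.pearsonr(mpuq_scores, predicted_times)
    print(f"r = {r:.2f}, p = {p:.3f}")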
References

1. Väänänen-Vainio-Mattila, K., Ruuska, S.: Designing Mobile Phones and Communicators for Consumers' Needs at Nokia. In: Bergman, E. (ed.) Information Appliances and Beyond: Interaction Design for Consumer Products, pp. 169–204. Morgan Kaufmann, San Francisco (2000)
2. Sacher, H., Loudon, G.: Uncovering the new wireless interaction paradigm. ACM Interactions Magazine 9(1), 17–23 (2002)
3. Ketola, P.: Integrating Usability with Concurrent Engineering in Mobile Phone Development. Tampereen yliopisto (2002)
4. PrintOnDemand: Popularity of Mobile Devices Growing (2003), http://www.printondemand.com/MT/archives/002021.html (cited February 5, 2003)
5. Ivory, M.Y., Hearst, M.A.: The state of the art in automating usability evaluation of user interfaces. ACM Comput. Surv. 33(4), 470–516 (2001)
6. John, B.E., et al.: Predictive human performance modeling made easy. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2004. ACM, New York (2004)
7. Kirakowski, J.: Questionnaires in Usability Engineering: A List of Frequently Asked Questions [HTML] (2003) (cited November 26, 2003)
8. Annett, J.: Target paper. Subjective rating scales: science or art? Ergonomics 45(14), 966–987 (2002)
9. Baber, C.: Subjective evaluation of usability. Ergonomics 45(14), 1021–1025 (2002)
10. Keinonen, T.: One-dimensional usability – Influence of usability on consumers' product preference. University of Art and Design Helsinki, UIAH A21 (1998)
11. Caplan, S.H.: Making Usability a Kodak Product Differentiator. In: Wiklund, M. (ed.) Usability in Practice: How Companies Develop User-Friendly Products, pp. 21–58. Academic Press, Boston (1994)
12. Logan, R.J.: Behavioral and emotional usability: Thomson Consumer Electronics. In: Wiklund, M. (ed.) Usability in Practice: How Companies Develop User-Friendly Products, pp. 59–82. Academic Press, Boston (1994)
13. Hofmeester, G.H., Kemp, J.A.M., Blankendaal, A.C.M.: Sensuality in product design: a structured approach. In: CHI 1996 Conference (1996)
14. Jordan, P.W.: Human factors for pleasure in product use. Applied Ergonomics 29(1), 25–33 (1998)
Estimating Productivity: Composite Operators for Keystroke Level Modeling Jeff Sauro Oracle, 1 Technology Way, Denver, CO 80237 jeff@measuringusability.com
Abstract. Task time is a measure of productivity in an interface. Keystroke Level Modeling (KLM) can predict experienced user task time to within 10 to 30% of actual times. One of the biggest constraints to implementing KLM is the tedious aspect of estimating the low-level motor and cognitive actions of the users. The method proposed here combines common actions in applications into high-level operators (composite operators) that represent the average error-free time (e.g. to click on a button, select from a drop-down, type into a text-box). The combined operators dramatically reduce the amount of time and error in building an estimate of productivity. An empirical test of 26 users across two enterprise web-applications found this method to estimate the mean observed time to within 10%. The composite operators lend themselves to use by designers and product developers early in development without the need for different prototyping environments or tedious calculations.
way to estimate time-on-task benchmarks and to inform designers about the productivity of their designs as early as possible during product development.

1.2 Cognitive Modeling

Rather than observing and measuring actual users completing tasks, another approach for estimating productivity is cognitive modeling. Cognitive modeling is an analytic technique (as opposed to the empirical technique of usability testing). It estimates task completion time from generalized estimates of the low-level motor operations. Breaking up the task that a user performs into millisecond-level operations permits the estimation of task completion times for experienced users completing error-free trials. The most familiar of these cognitive modeling techniques is GOMS (Goals, Operators, Methods and Selection rules), first described in the 1970s in research conducted at Xerox PARC and Carnegie Mellon and documented in the still highly referenced text The Psychology of Human Computer Interaction by Card, Moran and Newell (1983) [1]. GOMS itself represents a family of techniques, the most familiar of which is Keystroke Level Modeling (KLM). In its simplest form, a usability analyst can estimate user actions using KLM with only a few operators (pointing, clicking, typing and thinking); see [2], p. 72, for a simple introduction.

KLM, probably because of its simplicity, has enjoyed the most usage by practitioners. It has been shown to estimate error-free task completion time to within 10 to 30% of actual times. These estimates can be made from either live working products or prototypes. It has been tested on many applications and domains such as maps, PDAs, and database applications [3][4][5][6][7][8][9]. One major disadvantage of KLM is the tedious nature of estimating time at the millisecond level. Even tasks which take a user only two to three minutes to complete are composed of several hundred operators. One must remain vigilant in making these estimates: changes are inevitable, and errors arise from forgetting operations (Bonnie John, personal communication, October 12, 2008). In our experience, two- to three-minute tasks took one to two hours to create the initial model in Excel, then an additional hour for making changes.

1.3 Software to Model KLM Operators: CogTool

A better way of building the estimates comes from a software tool called CogTool, built and maintained at Carnegie Mellon [10]. CogTool itself is the result of dissatisfaction with manual GOMS estimating [7]. CogTool is free to download and, after some familiarity, can be a powerful and certainly more accurate cognitive modeling tool than hand-tracked estimates. CogTool builds the task time estimates by having the analyst provide screenshots or graphics from the application and then define each object the users interact with (e.g., a button, a drop-down list, etc.). There is a bit of overhead in defining all the objects and defining the sequence of steps the users take during a task. Once completed, however, CogTool provides an easy way to get updated estimates of the productivity of a task. User-interface designers can actually do the prototyping within CogTool, and this in fact exploits the tool's functionality, since changes made within the prototyping environment will immediately lead to a new task-time estimate. If prototyping is done in another environment (which it is in our
organization), then the analyst will need to import, define, and update the objects and task flows for each change made.

1.4 Consolidating the Operators

Our organization has a rather complicated infrastructure of prototyping tools for designers, so shifting our prototyping efforts into CogTool, while possible, would be a large undertaking surely met with resistance. We wanted a method to create estimates using KLM that, like CogTool, automated the tedious estimation process, while allowing designers to generate prototypes in whatever environment they preferred. Many requests for productivity data come from the Marketing and Strategy teams, who can use this information to support sales. We also wanted a method by which we could allow product managers and product strategists to generate their own estimates with little involvement from the usability team.

1.5 Looking to Industrial Engineering

Some of the inspiration for GOMS (see [1], p. 274) came from work-measurement systems in Industrial Engineering, which began in the early 1900s (e.g., Frederick Taylor) and evolved into systems like MTM (Methods-Time Measurement; see [11]). Just like GOMS, these systems decompose work into smaller units and use standardized times based on detailed studies. These estimating systems evolved (MTM-2, MTM-C, MTM-V, etc.) to reflect different domains of work and more sophisticated estimates. Generating task times with these systems, while accurate, is often time consuming. A modification was proposed by Zandin [12], called the Maynard Operation Sequence Technique (MOST). MOST, also based on the MTM system, uses larger blocks of fundamental motions. Using MOST, analysts can create estimates five times faster than with MTM without loss of accuracy [13]. Similar to the MOST technique, we wanted to describe user actions at a higher level of work. Instead of building estimates at the level of hand motions and mouse clicks, we wanted to estimate at the level of drop-down selections and button clicks. Each of these operations is still composed of the granular Card, Moran, and Newell operators, but the low-level details, which caused the errors and were time consuming, could be concealed from analysts.
2 Method

To refine the KLM technique to a higher level of abstraction, we first wanted to see if these higher-level composite operators could predict task times as well as the low-level operators. We used the following approach:

1. KLM Estimation: Estimate task times using the KLM technique with low-level operators for a sequence of tasks.
2. Generate Composite Operators: Generate an estimate of the task times for the same tasks using the composite operators by identifying larger operational functions.
3. Empirically Validate: Validate the new composite operators by testing users completing the same tasks repeatedly.
4. Refine Estimates: Use empirical data to refine composite estimates (such as updating the system response time) and modify the mental operators to account for concurrent processing.

2.1 KLM Estimation

Using the method defined in [1] and [5], we estimated the times. For example, the operators for the initial operations of the task "Create an Expense Report" are:

1. M: Mental Operation: User decides where to click (1.350 s)
2. H: Home: User moves hand to mouse (0.350 s)
3. P: Point: User locates the Create Expense Report link target (1.100 s)
4. K: Key: User clicks on the link (0.250 s)
5. R: System response time as new page loads (0.750 s)
The system response time was updated based on taking some samples from the applications.

2.2 Generate Composite Operators

Using the granular steps from above, the logical composite operator is clicking on a link, so the five steps above are replaced with: Click on Link/Button. The time to complete this operation is modeled as 1.350 + 0.350 + 1.100 + 0.250 + 0.750 = approximately 3.8 seconds. This process was repeated for all steps in the 10 tasks. While not a complete list, we found that a small number of composite operators was able to account for almost all user actions in the 10 tasks across the two web applications. The most commonly used actions are listed below (a small sketch of the underlying summation follows the list):

1. Click a Link/Button
2. Typing Text in a Text Field
3. Pull-Down List (No Page Load)
4. Pull-Down List (Page Load)
5. Date-Picker
6. Cut & Paste (Keyboard)
7. Scrolling
8. Select a Radio Button
9. Select a Check-Box
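A minimal sketch of this aggregation, using the operator times quoted in Section 2.1 (the dictionary and function names are illustrative, not from the paper):

    # Build a composite operator by summing the granular Card, Moran, and
    # Newell operator times quoted above. Names are illustrative only.
    GRANULAR = {
        "M": 1.350,  # Mental operation: decide where to click
        "H": 0.350,  # Home: move hand to the mouse
        "P": 1.100,  # Point: locate the target
        "K": 0.250,  # Key: click
        "R": 0.750,  # System response time (sampled from the application)
    }

    def composite_time(sequence):
        """Sum the granular operator times for one composite operator."""
        return sum(GRANULAR[op] for op in sequence)

    # "Click a Link/Button" = M + H + P + K + R, approximately 3.8 seconds
    print(round(composite_time("MHPKR"), 2))  # -> 3.8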
2.3 Empirical Validation

We tested 26 users on two enterprise web-based applications (hereafter Product O and Product P). The products were two released versions of a similar travel and expense reporting application, allowing users to perform the same five tasks. The participants regularly submitted reports for travel and expenses and were experienced computer users. Ten of the participants had never used either of the applications, while 16 of them had used both. To reduce the learning time and to provide a more stable estimate of each operator, each participant was shown a slide show demonstration of how to perform each task. This also dictated the path the user should take through the software. They then attempted the task.
The participants were not asked to think aloud. They were told that we would be recording their task times, but that they should not hurry; rather, they should work at a steady pace, as they would when creating reports at work. If they made an error on a task, we asked them to repeat the task immediately. To minimize carry-over effects, we counterbalanced the application and task order. We had each participant attempt the five tasks three times on both systems. The training was shown to them only prior to their first attempt. From the 30 task attempts (5 tasks × 2 systems × 3 trials = 30), we had hundreds of opportunities to measure the time users took to click the dozens of buttons and links, make drop-down selections, and type into text boxes. These applications were selected because they appeared to provide a range of usable and unusable tasks and exposed the user to most of the interface objects they would likely encounter in a web application.

The goal of this test setup was to mimic the verification methods Card, Moran, and Newell used in generating their granular estimates. They, however, had users perform actions hundreds of times. Comparatively, our estimates were more crudely defined. We intended to test the feasibility of this concept and were most interested in the final estimate of the task time as a metric for the accuracy of the model.

2.4 Concurrent Validation

When estimating with KLM, one typically does not have access to user data on the tasks being estimated. It is necessary to make assumptions about the system response time and the amount of parallel processing a user does while executing a sequence of actions. System response time understandably will vary by system and is affected by many factors; substituting a reasonable estimate is usually sufficient for estimating productivity. In estimating parallel processing, there are some general heuristics ([2], p. 77), but these will also vary with the system. For example, as users become more proficient with a task, they are able to decide where to click and move the mouse simultaneously. The result is that the time spent on mental operators is reduced or removed entirely from the estimate. In the absence of data, one uses the best estimate or the heuristics.

Because our goal was to match the time of users and we had access to the system, we needed to refine the operators with better estimates of actual system response time and of the parallel processing. To do so, we measured to the hundredth of a second the time it took users to complete the composite operations (e.g., clicking a button, selecting from a pull-down list) as well as waiting for the system to respond. We adjusted the composite operators' total time by reducing the time spent on mental operations, in some cases eliminating them entirely (see also [14] for a discussion of this approach). The final empirically refined estimates appear in Table 1 below.

Table 1. Composite Operators and the refined time from user times
Composite Operator               Refined Time (seconds)
Click a Link/Button              3.73
Pull-Down List (No Page Load)    3.04
Pull-Down List (Page Load)       3.96
Date-Picker                      6.81
Cut & Paste (Keyboard)           4.51
Typing Text in a Text Field      2.32
Scrolling                        3.96
Some of the operators need explanation. The Date-Picker operator will vary depending on the way the dates are presented. The Cut & Paste (Keyboard) option includes the time for a user to highlight the text, select CTRL-C, home in on the new location, and paste (CTRL-V); the estimate would be different if using context menus or the web-browser menu. Typing Text in a Text Field only represents the overhead of homing in on a text field, placing the cursor in the text field, and moving the hands to the keyboard; the total time depends on the length and type of characters entered (230 ms each). Finally, the refined times above contain a system response time which will vary with each system. That is, it is unlikely that clicking a button and waiting for the next page to display will always take 3.73 seconds. Future research will address the universality of these estimates across more applications.
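Putting the refined values to work, the following sketch estimates a whole task by summing its composite operators; the task steps shown are hypothetical, not one of the paper's ten tasks, and the per-character typing cost is the 230 ms noted above:

    # Estimate a task with the empirically refined composite operators of
    # Table 1. The example steps are hypothetical.
    COMPOSITE = {                       # refined times in seconds (Table 1)
        "click_link_button": 3.73,
        "pulldown_no_load": 3.04,
        "pulldown_page_load": 3.96,
        "date_picker": 6.81,
        "cut_paste_keyboard": 4.51,
        "text_field_overhead": 2.32,    # homing/cursor placement only
        "scrolling": 3.96,
    }
    PER_CHAR = 0.230                    # typing: 230 ms per character entered

    def estimate_task(steps):
        """steps: list of (operator_name, characters_typed) tuples."""
        return sum(COMPOSITE[op] + chars * PER_CHAR for op, chars in steps)

    steps = [("click_link_button", 0),
             ("text_field_overhead", 12),   # type a 12-character description
             ("pulldown_page_load", 0),
             ("date_picker", 0)]
    print(f"{estimate_task(steps):.1f} s")  # -> 19.6 s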
3 Results and Discussion

Table 2 below shows the results of the KLM estimates using the "classic" Card, Moran, and Newell operators and the new composite operators for all 10 tasks. Both the number of operators used and the total task times are shown. (A short sketch reproducing the comparisons discussed below follows the table.)

Table 2. Comparison between Classic and Composite KLM Time & Operators

                                         Classic KLM                 Composite KLM
Product  Task                        # of Operators  Time (sec)  # of Operators  Time (sec)
O        Create Meeting Rprt         81              62          23              98
O        Update a Saved Rprt         51              52          21              46
O        Edit User Preference        43              26          15              35
O        Find an Approved Rprt       32              18          6               26
O        Create Customer Visit Rprt  149             88          32              55
P        Create Meeting Rprt         169             134         36              156
P        Update a Saved Rprt         93              74          21              82
P        Edit User Preference        65              46          13              60
P        Find an Approved Rprt       48              31          11              43
P        Create Customer Visit Rprt  131             118         23              111
         Mean                        86.2            64.9        20.1            71.2
         SD                          48.1            38.9        9.3             40.5
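The comparisons discussed below can be sketched with standard routines. The sketch copies the task-time and operator-count columns from Table 2 and uses SciPy's paired t-test and Pearson correlation; the paper does not specify its exact test variants, so the printed statistics approximate rather than reproduce the reported values:

    # Sketch of the classic-vs-composite comparisons using the Table 2
    # columns. SciPy routines stand in for the original analysis.
    from scipy import stats

    classic_time   = [62, 52, 26, 18, 88, 134, 74, 46, 31, 118]   # seconds
    composite_time = [98, 46, 35, 26, 55, 156, 82, 60, 43, 111]
    classic_ops    = [81, 51, 43, 32, 149, 169, 93, 65, 48, 131]
    composite_ops  = [23, 21, 15, 6, 32, 36, 21, 13, 11, 23]

    t, p = stats.ttest_rel(classic_time, composite_time)        # mean-time difference
    r, _ = stats.pearsonr(classic_time, composite_time)         # agreement of estimates
    t_ops, p_ops = stats.ttest_rel(classic_ops, composite_ops)  # operator counts

    print(f"time: t = {t:.2f}, p = {p:.2f}; r = {r:.2f}")       # r is about .89
    print(f"operators: t = {t_ops:.2f}, p = {p_ops:.4f}")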
The data in Table 2 show a difference of six seconds between the composite and classic KLM estimates of the mean task completion time, but this difference is not significant [t(17) = .727, p > .7]. The correlation in task time estimates between the two methods is strong and significant (r = .891, p < .01). The average number of operators used
per task differed substantially: 66 (86.2 vs. 20.1), representing a 75% reduction. This difference was significant [t(9) = 4.27, p < .01]. This reduction in the number of operators per task suggests estimates can be made four times faster using composite operators.

3.1 Do the Applications Differ in Their Composite KLM Times?

Next we used the composite operators to estimate which product had better productivity (allowed users to complete the tasks faster), as this would be one of the primary aims of estimating productivity. Table 3 shows the average of the KLM times for the sum of the operations for the five tasks for both applications.

Table 3. KLM Composite Estimates between applications

Task                          Product P (sec)  Product O (sec)  Diff. (sec)  % Diff.
Create Meeting Report         156              98               58           37
Update a Saved Report         82               46               36           44
Edit User Preference          60               35               25           42
Find an Approved Report       43               26               17           40
Create Customer Visit Report  111              55               56           50
Average                       90               52               38           42
Table 3 above shows the KLM composite estimates predict Product O to be approximately 42% more productive ((90 − 52)/90) than Product P. To validate these estimates, we used the 3rd error-free completed task from each user for the empirical estimates. Table 4 below shows the means and standard deviations for both products.

Table 4. Mean Task Times in Seconds for All Participants for Their Last Trial (Completed & Error-Free Attempts Only)

#   Task                    Prod. P (SD)  Prod. O (SD)  Diff.  n   % Diff.  t     p-value
1   Create Meeting Rpt      157 (24)      105 (14)      52     16  33       9.5   <.001
2   Update a Saved Rpt      81 (19)       54 (9)        26     13  32       6.1   <.001
3   Edit User Preference    52 (13)       34 (6)        18     15  35       5.0   <.001
4   Find an Approved Rpt    38 (10)       33 (11)       5      18  13       1.6   >.12
5   Create Cust. Visit Rpt  123 (19)      61 (14)       62     15  50       11.9  <.001
    Ave                     89 (49)       57 (29)       32     15  36
The third error-free trial data show the Product O application to be approximately 36% more productive ((89 − 57)/89) than Product P. This difference represents an error of 14% ((.42 − .36)/.42). The estimates of 89 seconds and 57 seconds represent errors of 1% and 9%, respectively. In assessing the accuracy of these estimates, we are using the mean time from a set of users, which is itself an estimate of the unknown population mean time of all users. There is therefore error around our estimate, which varies with the standard deviation and sample size of the task (just as in any usability test that uses a sample of users to estimate the unknown mean time). Some tasks have fewer users since not all of the 26 users were able to complete the third trial on both systems without error. The means and 95% confidence intervals around the empirical estimates are shown in Figure 1 below. Also on the graph are the predicted KLM estimates using both the classic and composite methods.
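Before turning to the figure, the percentage comparisons above are simple enough to verify directly; this worked check copies the averages from Tables 3 (predicted) and 4 (observed):

    # Worked check of the productivity comparison, using the average task
    # times from Tables 3 and 4.
    predicted_p, predicted_o = 90, 52    # composite KLM estimates (s)
    observed_p, observed_o = 89, 57      # 3rd error-free trial means (s)

    predicted_gain = (predicted_p - predicted_o) / predicted_p      # ~0.42
    observed_gain = (observed_p - observed_o) / observed_p          # ~0.36
    model_error = (predicted_gain - observed_gain) / predicted_gain # ~0.14

    print(f"predicted {predicted_gain:.0%}, observed {observed_gain:.0%}, "
          f"error {model_error:.0%}")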
[Figure omitted: plot of 3rd-trial error-free times by task (Tasks 1-5, Products O and P), showing 95% confidence intervals for the mean against task time in seconds (20-180 s).]

Fig. 1. Means and 95% Confidence Intervals for 3rd Error-Free Trial by Task and Product (blue circles and error bars). The black triangles are the Composite estimates and the black squares are the granular "classic" estimates.
Figure 1 shows visually the variability in the users' mean times. When a KLM estimate is within the range of the blue error bars, there is not a significant difference between the KLM estimate and the likely population mean time. For example, both KLM estimates are less accurate on Task 1 (especially the classic KLM estimate), as both fall outside the range of the error bars. On Task 2 the estimates are more accurate, as three of the four KLM estimates are within the likely range of the actual user time.
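The "within the error bars" judgment can be sketched as a t-based 95% confidence interval for a mean task time; the sample values and the KLM estimate below are hypothetical:

    # Sketch of the confidence-interval logic used to judge a KLM estimate
    # against observed mean times. All sample values are hypothetical.
    import numpy as np
    from scipy import stats

    times = np.array([150, 162, 149, 171, 158, 144, 165, 155])  # one task (s)
    mean = times.mean()
    half_width = stats.t.ppf(0.975, len(times) - 1) * stats.sem(times)
    lo, hi = mean - half_width, mean + half_width

    klm_estimate = 148.0  # a hypothetical composite estimate for this task
    verdict = "inside" if lo <= klm_estimate <= hi else "outside"
    print(f"95% CI: [{lo:.1f}, {hi:.1f}] s; estimate {verdict} the interval")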
3.2 Limitations While the refined times of the operators displayed in Table 1 above estimated our total task time well, actual composite times will vary with each system. A major factor in each composite operator is the system response time. For desktop applications there might be little if any latency compared to the typical network delays one gets with a web-application. For each system, an analyst should define the composite operators, which would likely include many of the ones defined here.
4 Conclusion

The data from this initial exploration into combining the granular operators into composite operators show that KLM estimates can be made four times faster with no loss in accuracy. The estimates made with the composite KLM operators are within 10% of the observed mean time of error-free tasks. Composite task times were not significantly different from those of classic KLM estimates (p > .7), and task-level times correlated strongly (r = .89). While the composite operators and their times will vary with the interface, the method of combining low-level operators into a higher grain of analysis shows promise. When productivity measures need to be taken and cognitive modeling is used as a more efficient alternative, composite operators similar to those defined here promise estimates that are faster to build and more approachable than millisecond-level KLM estimates.
References

1. Card, S., Moran, T., Newell, A.: The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates, Hillsdale (1983)
2. Raskin, J.: The Humane Interface. Addison-Wesley, Reading (2000)
3. Baskin, J.D., John, B.E.: Comparison of GOMS analysis methods. In: CHI 1998 Conference Summary on Human Factors in Computing Systems, Los Angeles, California, April 18-23, 1998, pp. 261–262. ACM, New York (1998)
4. John, B.: Why GOMS? Interactions 2(4), 80–89 (1995)
5. Olson, J.R., Olson, G.M.: The growth of cognitive modeling in human-computer interaction since GOMS. Hum.-Comput. Interact. 5(2), 221–265 (1990)
6. Gray, W.D., John, B.E., Atwood, M.E.: Project Ernestine: A validation of GOMS for prediction and explanation of real-world task performance. Human–Computer Interaction 8(3), 207–209 (1993)
7. John, B., Prevas, K., Salvucci, D., Koedinger, K.: Predictive Human Performance Modeling Made Easy. In: Proceedings of CHI 2004, Vienna, Austria, April 24-29, 2004. ACM, New York (2004)
8. Gong, R., Kieras, D.: A validation of the GOMS model methodology in the development of a specialized, commercial software application. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 1994, Boston, Massachusetts, pp. 351–357. ACM, New York (1994)
9. Haunold, P., Kuhn, W.: A keystroke level analysis of a graphics application: manual map digitizing. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 1994, Boston, Massachusetts, pp. 337–343. ACM, New York (1994)
10. John, B.: The CogTool Project (2009), http://www.cs.cmu.edu/~bej/cogtool/ (accessed January 2009)
11. Maynard, H., Stegemerten, G., Schwab, J.: Methods-Time Measurement. McGraw-Hill, New York (1948)
12. Zandin, K.: MOST Work Measurement Systems. Marcel Dekker, New York (1980)
13. Niebel, B., Freivalds, A.: Methods, Standards, and Work Design. McGraw-Hill, New York (2004)
14. Mayhew, D.: Keystroke Level Modeling as a Cost-Justification Tool. In: Bias, R., Mayhew, D. (eds.) Cost-Justifying Usability, 2nd edn., pp. 465–488 (2004)
Paper to Electronic Questionnaires: Effects on Structured Questionnaire Forms Anna Trujillo* NASA Langley Research Center, MS 152, Hampton, VA 23681 USA anna.c.trujillo@nasa.gov
Abstract. With the use of computers, paper questionnaires are being replaced by electronic questionnaires. The formats of traditional paper questionnaires have been found to affect subjects' ratings; consequently, the transition from paper to electronic format can subtly change results. The research presented here begins to determine how electronic questionnaire formats change subjective ratings. For formats where subjects used a flow chart to arrive at their rating, starting at the worst and middle ratings of the flow chart was the most accurate, but subjects took slightly more time to arrive at their answers. Except for the electronic paper format, starting at the worst rating was the most preferred. The paper and electronic paper versions had the worst accuracy. Therefore, for flowchart-type questionnaires, flowcharts should start at the worst rating and work their way up to better ratings.

Keywords: Electronic questionnaires, Cooper-Harper controllability rating, questionnaire formats.
For this experiment, subjects used the Cooper-Harper (CH) Controllability Rating Scale [6, 7] on a control task that required them to keep a randomly moving target centered. Subjects were told that desired performance was maintaining the target in the inner portion of the screen, while adequate performance was maintaining the target in the middle portion of the screen (Fig. 1). Each rating was also described to the subjects with respect to the control task.

[Figure omitted: target-tracking display showing the moving target with the inner (desired performance) and middle (adequate performance) regions.]

Fig. 1. Target Tracking Task with Indicated Desired and Adequate Performance
1.2 Objective

The objective of this research was to determine whether electronic formats of paper questionnaires change subjects' ratings, and in particular how electronic formats may affect responses to a structured, flowchart-type questionnaire.
2 Experimental Variables

2.1 Subjects' Piloting Experience

Twenty people participated as subjects. Ten were certificated pilots holding at least a current Private Pilot license [8]; the rest were non-pilots. The average age of the pilots was 48 years, and the average age of the non-pilots was 40 years. The pilots averaged 22 years of piloting experience and an average of 7,314 hours of total piloting time.

2.2 Cooper-Harper (CH) Controllability Rating Scale Formats

Each subject saw five CH controllability rating scale formats: the standard paper format and four electronic formats. The electronic formats were: (1) electronic paper, (2) forced choice bottom, (3) forced choice middle, and (4) forced choice top.

Paper CH Format. The Paper CH format was the standard CH format [6, 7].
Electronic Paper CH Format. The Electronic Paper CH format mimicked the paper version but on a touch screen (Fig. 2). In order to choose a rating, subjects had to touch the appropriate rectangle (e.g., Major deficiencies … 8).
[Figure omitted: touch-screen rendering of the complete Cooper-Harper decision tree, showing aircraft characteristics, demands on the pilot in selected tasks, the decision questions (Is it controllable? Is adequate performance attainable with a tolerable pilot workload? Is it satisfactory without improvement?), and the pilot ratings 1-10.]

Fig. 2. Electronic Paper CH Format
Forced Choice Bottom CH Format. The Forced Choice Bottom CH format expanded depending on the choices selected by the subject. The flow chart started from the bottom (Is it controllable?) and worked its way up in ratings (Fig. 3). When the subject reached the ratings, only the ratings of the path taken were available, and the path the subject had taken to get to those ratings remained visible.

Forced Choice Middle CH Format. The Forced Choice Middle CH format also expanded depending on the choices selected by the subject. The flow chart started from the middle (Is adequate performance attainable with a tolerable pilot workload?) and worked its way up or down in ratings. As before, when the subject reached the ratings, only the ratings and their associated path were visible.

Forced Choice Top CH Format. The Forced Choice Top CH format expanded depending on the choices selected by the subject, but the flow chart started from the top (Is it satisfactory without improvement?) and worked its way down in ratings. As with the other two forced choice CH formats, when the subject reached the ratings, only the ratings of that path and the path itself were visible.
[Figure omitted: expanding flowchart that begins at the bottom question (Is it controllable?) and works upward through the decision questions to the applicable pilot ratings.]

Fig. 3. Forced Choice Bottom CH Format
2.3 Control Task Difficulty

Each subject attempted to keep a moving target centered for 1 minute using a right-handed side stick. The control task difficulty levels ranged from a CH rating of 1 to a CH rating of 10. Each scenario had a preset control task difficulty level, accomplished by linearly changing the speed of the target and the inceptor gain. A pretest was conducted to verify that the control task difficulty levels matched an operator's CH rating. The average difference between the control task difficulty level and the three subjects' CH ratings was -0.07 ± 1.4, with a median of 0. A linear regression of the data was significant (F(1,59) = 1161.58; p ≤ 0.01); the slope was 0.94 with an R² = 0.95.

2.4 Dependent Variables

The primary dependent variable was the subjects' CH ratings compared to the control task difficulty. The time taken to complete the CH ratings and the workload incurred in completing the CH ratings were also analyzed. At the end of the experiment, subjects completed a final questionnaire. This questionnaire asked subjects to rate on a continuous scale how easy the CH formats were
for rating the control task difficulty and the associated workload to complete the various CH formats. The questionnaire also asked for subject preferences, and likes and dislikes by display type.
3 Procedure

When subjects first arrived, they signed a consent form before being given a verbal briefing on the experiment tasks. Subjects then moved to the simulator, where they completed two practice runs with the first CH format. After the practice runs, subjects completed 10 data runs. During each run, subjects had to keep a randomly moving target centered for 1 minute using a right-handed side stick. They also had to indicate when a frequency changed and answer a question that required basic multiplication skills. At the end of each run, subjects completed the CH controllability rating scale and rated the workload of determining a CH controllability rating. At the end of the 10 data runs with the first CH format, subjects completed at least one practice run with the next CH format and then the 10 data runs with that format. This was repeated until subjects had seen all five CH formats. At the end of the simulation runs and questions, subjects completed the final questionnaire.

3.1 Apparatus

The simulations ran on two PCs running Windows™ XP Professional¹. These had a redraw refresh rate of 60 Hz and a graphics update rate of 30 Hz. The target tracking task was displayed on a 30-inch LCD screen in front of and slightly above the subject's eye level. The information indicating the frequency change and for answering the multiplication question was on a screen to the right of the subject. The questions were answered using a touch screen to the subject's left; the CH questionnaire was also presented on this left screen at the end of the run. These two touch screens were 19-inch LCD screens with an Elo Touchsystems IntelliTouch overlay for touch-screen capability. The side stick used was a Saitek Cyborg evo joystick; subjects used their right hand to manipulate it.

3.2 Data Analysis

Data were analyzed using SPSS® for Windows v16. In most cases, the data were analyzed using a 3-way ANOVA with CH format, control task difficulty, and pilot status (pilot vs. non-pilot) as the independent variables. To determine the accuracy of the CH formats, the control task difficulty level was subtracted from the subject's CH rating; a subject was therefore most accurate when this difference was 0 and least accurate when the absolute value of this difference was 9. Furthermore, the CH ratings were on an integer scale; in the ANOVA analysis, the CH rating was treated as a continuous scale even though it is ordinal [9]. The final questionnaire responses were on continuous 100-point scales.
¹ The use of trademarks or names of manufacturers in this report is for accurate reporting and does not constitute an official endorsement, either expressed or implied, of such products or manufacturers by the National Aeronautics and Space Administration.
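A minimal sketch of the accuracy measure defined in Section 3.2 (hypothetical ratings and difficulty levels, not the study's SPSS analysis):

    # Accuracy = subject's CH rating minus preset control task difficulty;
    # 0 is perfectly accurate, negative values indicate underestimation.
    import numpy as np

    ch_ratings = np.array([3, 5, 7, 9, 2])    # subject CH ratings (1-10)
    difficulty = np.array([4, 5, 8, 10, 3])   # preset difficulty levels

    accuracy = ch_ratings - difficulty
    print(accuracy.mean())                    # -> -0.8 (underestimation)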
4 Results

4.1 Accuracy of Subjects' CH Ratings
When subtracting the control task difficulty from subjects' CH ratings, pilot status by CH format was significant (F(4, 900) = 3.21; p ≤ 0.02) (Fig. 4). In general, both pilots and non-pilots underestimated the control task difficulty, with non-pilots underestimating it somewhat more than pilots, especially for the Forced Choice Middle and Forced Choice Top CH formats. Subjects using these two formats typically underestimated the control task difficulty by a full rating.

[Figure omitted: bar chart of mean subject CH rating minus control task difficulty (0.00 to -2.00) by CH format, split by pilot status, with SE of the mean.]

Fig. 4. Mean Subject CH Rating – Control Task Difficulty by Pilot Status and CH Format
A linear regression estimating the subjects' CH rating from the control task difficulty was performed in order to compare the effects of pilot status and CH format. As can be seen in Figure 4 and Table 1, subjects typically underestimated the control task difficulty by 15%. For pilots, the most accurate CH formats were the flowcharts, while the Forced Choice Bottom CH format was the most accurate format for non-pilots.

Table 1. Linear Regression Statistics of Estimating Subject CH Rating with Control Task Difficulty by Pilot Status and CH Format

Pilot Status  CH Format             Slope  R²
Non-Pilot     Paper                 0.80   0.86
              Electronic Paper      0.80   0.89
              Forced Choice Bottom  0.87   0.89
              Forced Choice Middle  0.84   0.87
              Forced Choice Top     0.68   0.86
Pilot         Paper                 0.82   0.91
              Electronic Paper      0.82   0.88
              Forced Choice Bottom  0.84   0.91
              Forced Choice Middle  0.85   0.89
              Forced Choice Top     0.85   0.89

4.2 Time to Complete CH Ratings

The CH format was significant for the time it took subjects to complete the CH ratings (F(4, 900) = 31.98; p ≤ 0.01) (Table 2). Not surprisingly, the Paper CH format took the longest to complete, with the Forced Choice Bottom CH format taking the second longest, probably because this format typically requires a greater number of button pushes. The other formats were not significantly different from one another.

Table 2. Time to Complete CH Rating by CH Format

CH Format             Mean (sec)  SE of the Mean
Paper                 18.27       0.58
Electronic Paper      10.34       0.71
Forced Choice Bottom  13.16       0.62
Forced Choice Middle  10.99       0.43
Forced Choice Top     10.80       0.46
4.3 Subjective Data

Subjects' preference for the CH formats depended on CH format (F(4, 87) = 2.95; p ≤ 0.03) and on pilot status by CH format (F(4, 87) = 4.36; p ≤ 0.01) (Fig. 5). In general, subjects preferred the Electronic Paper and Forced Choice Bottom CH formats.

[Figure omitted: bar chart of preference (0 = low, 100 = high) by CH format, split by pilot status, with SE of the mean.]

Fig. 5. CH Format Preference by Pilot Status and CH Format
Pilot status by CH format was also significant for subjects' reported workload in completing the CH ratings (F(4, 90) = 2.51; p ≤ 0.05) (Fig. 6). Workload for the Electronic Paper CH format was the same for both pilots and non-pilots. For pilots, the Forced Choice Bottom CH format had a slightly higher workload than the Electronic Paper CH format, but the workload was on par with the Paper version; the other two flow chart methods had even higher workloads for pilots. For non-pilots, the electronic versions of the CH format did not really affect the workload, but their workloads were lower than that of the Paper CH format. Subjects indicated that the CH format affected their ability to arrive at their desired rating (F(4, 83) = 4.26; p ≤ 0.01) (Table 3). In general, subjects felt that the Paper, Electronic Paper, and Forced Choice Bottom CH formats allowed them to arrive at an accurate CH rating.

[Figure omitted: bar chart of workload to enter the CH rating (0 = low, 100 = high), compared to the Paper CH format, by CH format and pilot status, with SE of the mean.]

Fig. 6. Workload to Enter CH Rating by Pilot Status and CH Format

Table 3. Ability to Arrive at Desired Rating by CH Format

CH Format             Mean (0 = low, 100 = high)  SE of the Mean
Paper                 65.32                       6.89
Electronic Paper      77.37                       6.42
Forced Choice Bottom  65.94                       5.09
Forced Choice Middle  48.17                       5.21
Forced Choice Top     52.41                       4.55
Additionally, subjects indicated that on the Paper version, they specifically went step by step through the flow chart only about half of the time even though they were instructed to arrive at their ratings via sequentially answering the questions in the flow
chart: specifically, 45% of the time for non-pilots and 64% of the time for pilots. This may be because the Paper and Electronic Paper CH formats allow subjects to "cut to the chase" and choose a number without going through the flow chart (Table 4).

Table 4. Subject Comments on the CH Formats

Subject Comment Categories and Example Comments                       Number
All choices are available on Paper and Electronic Paper CH formats   18
  ("like to see all options"; "easier to compare measures")
Too much information on Paper and Electronic Paper CH formats        8
  ("hard to sort all information"; "information overload")
Like the mechanics of flowcharts                                     8
  ("like flowchart with its logical sequence")
Do not like the mechanics of flowcharts                              5
  ("takes longer")
Do not like the mechanics of Paper CH formats                        9
  ("more cumbersome"; "required most time to answer")
Specific comments on where to start in flow chart                    16
  ("flow chart pulls you in the direction of where you started";
   "liked starting at the bottom because it was the worst case")
Many subjects commented that they liked having all the information available to see at once, though some said the Paper and Electronic Paper CH formats induced "information overload" because "there was too much information." Subjects who liked flowcharts said it was because they had a "logical sequence" which helped "produce a more reasoned rating." As for where to start on the flowchart, most subjects commented that they liked to start at the bottom because it was the "most intuitive" and "ask[ed] the most important question first." Comments relating to other starting points in the flowcharts indicated that the "flow logic was counter intuitive." Generally, subjects liked having all the information available to them at once, but they also felt that the flow chart formats produced a logical thought process. Of the flow chart sequences, the Forced Choice Bottom CH format had the most preferred logic sequence.
5 Discussion

Electronic questionnaires are replacing paper formats. The formats of traditional paper questionnaires have been found to affect a subject's rating; consequently, the transition from paper to electronic format can subtly change results. This research had subjects use five different formats of the CH Controllability Rating Scale, which requires respondents to give their ratings by answering questions posed in a flowchart. Results indicated that while all formats were reasonably accurate, the Electronic Paper and Forced Choice Bottom CH formats produced the most accurate ratings
while being the most preferred. In general, subjects underestimated the difficulty of the control task with all CH formats. Workload in inputting answers was somewhat higher for pilots when using the Forced Choice Bottom CH format, but it was on par with the workload of the Paper version. Subjects indicated that they went through the Paper flow chart questions only about half the time, even though they were instructed to arrive at their ratings only after answering the flow chart questions. Therefore, moving questionnaires from paper to electronic media could change respondents' answers. Specifically, the above results suggest that when using a flow chart type of questionnaire, it is best to have subjects directly answer each decision point while starting at the worst rating. Although this inflicts a slight penalty in time and workload, it ensures that subjects make decisions at each point while minimizing underestimation of the difficulty of the task.
References

1. Trujillo, A.C., Bruneau, D., Press, H.N.: Predictive Information: Status or Alert Information? In: 27th Digital Avionics Systems Conference, St. Paul, MN (2008)
2. Trujillo, A.C., Pope, A.T.: Using Simulation Speeds to Differentiate Controller Interface Concepts. In: 52nd Annual Meeting of the Human Factors and Ergonomics Society. HFES, New York (2008)
3. Noyes, J.M., Bruneau, D.P.J.: A Self-Analysis of the NASA-TLX Workload Measure. Ergonomics 50(4), 514–519 (2007)
4. Riley, D.R., Wilson, D.J.: More on Cooper-Harper Pilot Rating Variability. In: 8th Atmospheric Flight Mechanics Conference, Portland, OR (1990)
5. Wilson, D.J., Riley, D.R.: Cooper-Harper Pilot Rating Variability. In: AIAA Atmospheric Flight Mechanics Conference, Boston, MA (1989)
6. Cooper, G.E., Harper, R.P.: The Use of Pilot Rating in the Evaluation of Aircraft Handling Qualities. Technical Report 567, AGARD, p. 52 (1969)
7. Harper, R.P., Cooper, G.E.: Handling Qualities and Pilot Evaluation (Wright Brothers Lecture in Aeronautics). Journal of Guidance, Control, and Dynamics 9(6), 515–529 (1986)
8. Federal Aviation Administration: Electronic Code of Federal Regulations – Title 14: Aeronautics and Space, Subpart E – Private Pilots, Section 61.103 (August 28, 2008), http://ecfr.gpoaccess.gov/cgi/t/text/text-idx?c=ecfr&tpl=%2Findex.tpl (cited September 2, 2008)
9. Bailey, R.E.: The Application of Pilot Rating and Evaluation Data for Fly-by-Wire Flight Control System Design. In: AIAA Atmospheric Flight Mechanics Conference, p. 13. AIAA, Portland, OR (1990)
Website Designer as an Evaluator: A Formative Evaluation Method for Website Interface Development Chao-Yang Yang College of Management, Industrial Design Department, Chang Gung University, 259 Wen-Hwa 1st Road, Kwei-Shan, Tao-Yuan, Taiwan, R.O.C. dillon.yang@mail.cgu.edu.tw
Abstract. Commerce plays a fundamental part in many websites, so their goals may differ from those of conventional computer system design, e.g., to increase the user base or encourage repeat visits. With limited budgets, website designers are unlikely to involve their users during the design process, and not all website designers have access to an evaluator, appropriate testing facilities, or the evaluation knowledge to support their design. This research develops a low-cost, tailorable, formative evaluation method for web designers. The method addresses both HCI goals and commercial website goals such as the encouragement of repeat visits. The research first investigates contemporary evaluation methods and the users' and designers' needs from websites and from website evaluation methods. Finally, the method was developed as a set of guidelines and verified in the evaluation of a website. The potential usefulness, practicality, and necessity of the method were then confirmed by website designers.

Keywords: Website usability, Engagement, Formative Evaluation.
Attracting new users and retaining them through good design and usability are of greater importance as competition increases on the Internet. However, because of the inherent differences between website usability and conventional computer systems usability, it may be inappropriate to directly apply standard HCI methods to evaluate websites. Spool et al. [13] and Nielsen [9] suggest that website designers should pay more attention to enhancing functions and information to make users like a web site. Furthermore, Nielsen [9] has established that users have a low tolerance for complex designs or slow sites; people don't want to wait, and they don't want to learn how to use a home page. As there are no training sessions or manuals for a web site, people have to be able to grasp the function of the site immediately upon scanning the home page. On the other hand, this research has identified a need for designers to also consider the site's ability to retain users and attract regular repeat visits. Hence, the users' needs from the website and the designers' needs from website evaluation should be clarified.
They need to be presented with well specified, detailed problems from which they can
generate effective solutions. Such methods would enhance the efficiency of the redesign process;
• Decomposing the website into visual, informational, navigational, and functional sections enables the evaluator to systematically test for and determine problems.
2 Methods

The research method adopted was that of problem-solving: given that a problem has been identified, requirements for the solution generation are collected; a solution is proposed and finally tested. In order to achieve our aims and objectives, a number of methods were used as appropriate to an "understand-propose-realise-evaluate" lifecycle [17] (as shown in Fig. 1). An understanding of the problem was achieved through a literature review, an analysis of current methods, and an attitudinal survey. From the requirements identified in the survey, the literature review, and the analysis of the applicability of existing evaluation methods, a web evaluation framework was developed that would meet both users' and designers' needs. An action learning approach was taken to the development of the method, whereby the researcher iteratively designed, tested, and selected methods based on their usefulness. Following evaluation (see below), the method was formalized for use by practicing designers. The method was subjected to three forms of evaluation: iterative interface design and evaluation; formalisation of the method; and evaluation by web designers.
[Figure omitted: four-stage cycle (Understand, Propose, Realise, Evaluate) with feedback loops.]

Fig. 1. Design Research Model with Feedback Loops [17]
3 Literature Review

A website consists of navigation, information, and visual elements. The purpose of a site may be viewed as a conjunction of satisfying what users are trying to accomplish (e.g., doing research, buying products, or downloading software) and the designer's and client's goals [13]. These goals are related to issues such as HCI, pleasure, security, technical issues, and accessibility. Typically, the website designer is in charge of the look, ease of use, and content of a website [3], with the intention of providing a clear marketing message, trust, frequently updated information, aesthetics, and functionality, so as to achieve the client company's goals. The following elements have been identified as important in terms of the user's and designer's needs.

1. Navigation. HCI plays an important role in website navigation design. This has been addressed in website design in terms of effective and efficient information structure, user interface, page design, content authoring, cognitive process, linking strategy, and task design.
2. Information. A commercial website should provide useful, helpful (e.g., to help make purchase decisions), updated, and individualized information. In addition, it should include clear marketing messages and effective privacy statements that show it is following privacy and consumer protection guidelines, making the security of customer data a priority, and using independent certification bodies.

3. Visual design. The likeability and attractiveness of the visual design rely on the aesthetics of the layout, colour scheme, animation, etc. Accessibility for colour-blind users also needs to be considered.

4. Other issues. In particular, technical and accessibility issues should be considered.

This section has introduced website design, considered the role of HCI, and identified a new set of web design considerations. Evaluation has been identified as crucial in supporting the design process; hence it follows that website evaluation methods will need to accommodate the identified design goals and the issues associated with website design. The diversity of website users, their purposes, individual characteristics, information-seeking behaviour, and cultural issues are complicated and affect the design of different types of websites. Having reached this point, the question is, "To what extent do website design and evaluation methods handle the factors discussed above?" With this in mind, website design methodologies will be reviewed next, prior to an examination of website evaluation methods. The goal of this analysis will be to assess the extent to which current website design and evaluation is fit for purpose.

Effective usability evaluation methods for website redesign need to be fast, low-cost, and easy to learn, and to provide high confidence and high impact. This section reviewed conventional usability methods and those that have been employed in website design. It can be concluded that most website usability evaluation in the late development stages requires real user testing, i.e., observing users completing website tasks. These methods provide reliable and useful information for redesign. However, given that existing methods have been adapted from those used in HCI, there are several shortcomings in the extent to which website features and issues can be evaluated using such methods, and in the extent to which such methods can be applied to rapid website development with short product life cycles, small development teams, and limited budgets. In such an environment, methods that require specialists or specific equipment may not be used. Further, the evaluation methods have also been discussed from a marketing perspective, in which it has been shown that marketing goals have not been properly addressed. In the next section, the issues identified from this review will be investigated through a study of selected user testing and data analysis tools.
4 Study of Current Usability Testing Methods

To gain more insight into the usefulness of current evaluation methods, a representative set of different types of methods was used to evaluate a website prior to launch. These included observation, Meaning in Mediated Action (MIMA) [15], the Website Analysis and Measurement Inventory (WAMMI) [5], and Breakdown Analysis [14, 18]. The UNITE (Ubiquitous and Integrated Teamwork Environment) website, which had not
been launched, was selected as the test object. This site was designed to promote an EU-funded project which aimed to develop an environment for virtual teamwork. Overall, the evaluation was time consuming (especially in the task completion section), and each method provided both useful and not so useful information for redesign. By observing the process and the usefulness of the results, WAMMI was identified as being useful in assessing the participants' preferences through its rating system; however, some of its questions were irrelevant, and its unclearly defined problem statements were not useful. The designer indicated that the MIMA and Task Completion sections were helpful for redesign of the navigational elements, as they provided details of specific navigation problems. Further, the designer indicated that WAMMI provided information about the site's engagement and time-based issues, which was lacking in MIMA and the user testing. To summarize, each method had strengths and weaknesses, and using more than one technique can help ensure that the findings are reliable [14]. The comprehensive information needed for redesigning the site can be generated through multiple methods, although the process could be shortened by discarding less useful elements, simplifying them, and concentrating on the elements and information needed by the designer for the task in hand. In summary, the results from the existing methods support each other; the results from user testing and MIMA are useful for redesign; marketing issues need to be considered in the evaluation; and marketing goals can be addressed in a questionnaire, but this should provide more specific information. The next stage of the research will employ a questionnaire to gather further opinions about website usability from users and designers. This will help to establish the requirements for website evaluation, enabling us to construct a new method geared to the needs of designers.
5 Internet Surveys of Designer and User Needs

Through testing existing usability methods on the UNITE website it was found that different methods favour different aspects of web usability. Through triangulation and the selective use and adaptation of different methods, a more complete picture of usability issues can be established. However, to be useful outside the experimental situation, such a combination of methods has to provide sufficiently detailed information for designers to concentrate on the elements that are important from the user's perspective. Therefore the design of an effective method should recognize, on the one hand, the needs of the designers (to produce usable sites quickly) and the requirements of the design task, and on the other hand, the needs of the user – to find the information they need efficiently. Taking the findings of the previous studies into account, the important elements for website design may be summarized as: adherence to best practice in HCI, usefulness, pleasure in using the site, user retention, and the ability to attract new users. This part of the research details a study undertaken to establish whether there are any differences in the way in which website designers and users perceive usability, and what type of information designers would like for redesign (i.e., formative as opposed to summative evaluation). The method chosen was an online Internet questionnaire, which would help reach the large key target populations – web designers and site visitors. This
questionnaire focused on the users' and web designers' opinions of web usability. Web designers whose sites were included on www.coolhomepages.com were invited. The user participants were recruited by posting an invitation on professional message boards and discussion areas (for experienced users) such as www.coolhomepages.com and www.msn.com, as it was believed these users would be more web-savvy. The results confirmed that all five general goals should be given equal prominence in website design. In addition, several participants felt that a website should invite its users' input (e.g., e-mail addresses or product purchases) as this can bring benefits to the website. These goals can be attained by improving design requirements such as ease of navigation, helpful information, and good visual design. Helpful, updated, or interesting information is the user's primary need from a website. These features also affect the likeability of, and degree of user engagement with, a site. Clear and attractive visual design mainly affects likeability. Ease of navigation is a primary requirement, a feature emphasized in conventional HCI. In addition, as described previously, ease of navigation has been indicated as an important predictor of recommendability. Functionalities such as a message board and search engine have been indicated as key ingredients of a website. Therefore, it is reasonable to propose that providing useful functionality that meets users' needs could improve the degree of engagement with, and likeability of, a website. The designers generally pay attention to a site's usability before launching the site. Although the designers stated that they understood typical usability statements and could act on them appropriately, when their answers were considered in detail, different solutions were proposed by different designers to the same problem, indicating that the statements might be ambiguous and lack sufficient detail for reliable decisions to be made. Designers prefer user feedback in the form of a clearly stated problem report relating to the site's information structure, image, colour, compatibility, font, symbols, logo, etc. Further, they were concerned about the cost and complexity of evaluation. However, given the differences between designers and users, it is still necessary to evaluate websites with real users. To summarize, a good site is designed with consideration of adherence to best practice in HCI, usefulness, user retention, likeability, and the ability to attract new users. Designers and users were shown to have slightly different views on website usability. In terms of evaluation, the existing problem statements were not detailed enough. Following Newman and Landay's [7] categorization, websites can be broken down into navigation, information, and visual elements to provide a clear view of the site. In addition to these elements, functionality has also been identified as a fundamental element that may affect website goals and usability. Applying this categorization to the development of an evaluation method may provide a clearer view of the website and lead to the development of more designer-friendly methods. Collecting data is necessary but not sufficient for a usability test [4]. A clearly defined problem report is also necessary, and the need for a usable output should be remembered; for example, detailed problem identification is required to avoid ambiguity, and questions should relate to the actual website rather than overall features.
Such problems may arise especially in areas of overlap between knowledge domains. For example, a problem such as "this web site is a waste of time in every respect" could relate either to poor-quality information or to difficulty of navigation. To avoid
such ambiguity, the method will relate statements more clearly to the domain they refer to, such as aesthetics or navigation. The results of this study have also shown that, depending on the site, the target users may differ and may have different perceptions when using the site. For example, a human resources site may aim to provide an easy-to-use interface, whereas a Disney site may aim to achieve high likeability. Without meeting the target users' needs, a site may fail to retain its users. Therefore, a more tailorable approach to evaluation is needed, one which is based around the site, the expectations of the site's owners, the designers and the users.
6 Composing the Website Evaluation Method

Previous research has shown that increasing the user base, the likeability and efficiency of the website, engaging the users, and identifying and meeting user needs are important for website development. For users, as shown in Fig. 2, helpfulness of information, ease of use, attractiveness of the visual design, and functionality all play a part in determining whether a site achieves these goals. An evaluation method is therefore proposed that can be used by designers. The method is composed of four evaluation techniques: MIMA interview, card sorting, user testing and structured interview. It takes into account time, cost, learning time, degree of confidence in the method, and the potential for impact on redesign. The evaluation method was designed to provide an effective and efficient formative evaluation that designers could use to obtain information for redesign.

MIMA interview. The elements to be tested are shown to the participant, first in isolation and then in the context of the web page they appear on. The participant is required to interpret the representation. This interpretation is recorded in the format of Table 1 against that of the designer. Where necessary, the evaluator should ask for clarification of the interpretation, so that its nature is fully understood.

Table 1. An example of function key assessment

Function | Intended action | Participant's interpretation | Assessment of IM
Search | Start searching the given keyword in the database | Search information related to the keyword |
Card sorting. The participant is asked to associate cards with the most relevant main navigational links (which may be textual or graphic). The cards containing the navigational elements to be tested are placed separately on the desk in front of the participant, with the main navigation cards set out at the top. The participant needs to assign the sub-navigational elements or contents to these by placing them underneath the element to which they appear most relevant. If the site contains sub-sub-navigation/contents, the evaluator should then ask the participant to assign these under the sub-navigations determined by the designer.
Fig. 2. Relationship among the design elements

Table 2. The card sorting result format
[Columns: Main navigation; Designer's categorization; Participant's categorization; Assessment of participant's categorization. The example rows cover a "Superb-cards" main navigation with cards such as the "Saver" and "Bubble" telephone cards, 30% off Swifty telephone card, Advantage buying from Superb Call, download mobile ring tones and pictures (yahoo.com.tw, 2005), and Web-telephone, each assessed as ○, + or -.]
The participant's categorization is recorded and assessed as shown in Table 2. Those instances where the participant categorizes information in a similar way to the designer are of no interest and are marked "○". A "-" is given when the participant fails to sort a card into the correct place under a navigational link, and a "+" is given when a card is placed on an incorrect link. This information may assist the designer in re-organizing the navigational structure.
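As an illustration of this marking scheme, the sketch below scores a participant's card sort against the designer's categorization. The card names and placements are hypothetical, and treating "-" as a card the participant could not place at all is an interpretive assumption, not a detail given in the paper.

```python
# A minimal sketch of the card-sorting assessment described above.
# Cards placed under the designer's intended link are marked "O",
# unplaced cards "-", and misplaced cards "+".

designer = {                      # hypothetical designer categorization
    '"Saver" telephone card': "Superb-cards",
    '"Bubble" telephone card': "Superb-cards",
    "Web-telephone": "Services",
}

participant = {                   # hypothetical participant placements
    '"Saver" telephone card': "Superb-cards",   # matches designer
    '"Bubble" telephone card': "Services",      # misplaced
    # "Web-telephone" was never placed
}

def assess(designer, participant):
    marks = {}
    for card, intended in designer.items():
        placed = participant.get(card)
        if placed is None:
            marks[card] = "-"   # participant failed to sort the card
        elif placed == intended:
            marks[card] = "O"   # agrees with designer; no redesign interest
        else:
            marks[card] = "+"   # placed on an incorrect link
    return marks

for card, mark in assess(designer, participant).items():
    print(f"{mark}  {card}")
```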
User testing. Direct observation provides more objective information than surveys [6]. The user should now have some familiarity with the site, and is provided with a set of tasks to assess how efficiently the website lets them perform common tasks, where problems occur, and the reasons for these. The tasks are fully explained to the users, but no other assistance is provided. The participants are required to verbalize their thoughts and feelings during the task, as this can generate valuable usability information [8]. The time, path, actions, and verbalizations are recorded as in Table 3.
Table 3. The task completion data analysis format

Time (s) | Path | Actions | Think-aloud protocol
5 | Home | Moving the cursor around all links in this page | Still looking, still looking. Haven't seen anything saying "subscribe" at the moment.
22 | News | | I am going to the News area as it looks most related.
The path and completion times are later compared to those provided by the designer. Where errors occur, the verbalizations and video are used to provide a rationale for them.

Structured interview. The structured interview is conducted to assess information, visual, and function design. The interview is structured around a questionnaire, with the participant being required to provide a rationale for their ratings. As the questions are closely aligned to the contents of the website, this necessitates the participants using the website in some detail. The ratings for each question are combined across all the participants, and average scores are used to determine the severity of the problems. Examples of the rationale are also presented so the designer can achieve a greater understanding of the design problem.
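A minimal sketch of this rating aggregation, assuming a hypothetical 1-5 rating scale, question texts, and severity cut-off, none of which are specified in the paper:

```python
from statistics import mean

# Hypothetical ratings per question (1 = very poor, 5 = very good),
# one entry per participant.
ratings = {
    "The colour scheme is attractive":        [4, 5, 3, 4],
    "The product information is up to date":  [2, 1, 2, 3],
    "The site search returns relevant pages": [3, 2, 2, 2],
}

SEVERITY_THRESHOLD = 3.0  # illustrative cut-off, not from the paper

# Average each question's ratings and flag low-scoring design areas.
for question, scores in ratings.items():
    avg = mean(scores)
    flag = "PROBLEM" if avg < SEVERITY_THRESHOLD else "ok"
    print(f"{avg:.2f}  {flag:7s}  {question}")
```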
7 Conclusions

This research has considered the requirements for supporting commercial website design based on users' and designers' needs. Typically, HCI plays an important role in this domain, yet recent research shows that websites require more aspects to be taken into account. Without addressing specific issues such as marketing and pleasure, current usability methods support the design poorly. Hence, the appropriateness of applying standard usability measures to website design was investigated. By incorporating the users' and designers' opinions, it was confirmed that websites not only have to meet usability criteria, they also have to increase the user base, likeability, etc. It was also shown that these goals can be achieved through improvements to the design components in the navigation, information, visual, and functional aspects. Each aspect can be assessed efficiently and precisely by different evaluation techniques. Therefore, a multi-technique method has been produced which is tailorable to different websites and advances the use of existing usability evaluation in commercial website design. In addition, the research has formalized the method into one which a designer can use. The studies undertaken have shown the validity, practicability and usefulness of this approach for website designers. In conclusion, the research has contributed to knowledge by identifying and filling a gap in the current use of evaluation methods, providing a method that practicing web designers can use with representative end users.
References

1. Benyon, D., Davies, G., Keller, L., Preece, J., Rogers, Y.: A Guide to Usability. The Open University, UK
2. Berkun, S.: The role of flow in web design. Microsoft Corporation (1990), http://msdn.microsoft.com/library/en-us/dnhfact.html/hfactor10_1.asp?frame=true (accessed 24/11/2001)
3. Brinck, T., Gergle, D., Wood, S.D.: Designing Web Sites that Work – Usability for the Web. Academic Press, USA (2002)
4. Dumas, J.S., Redish, J.C.: A Practical Guide to Usability Testing. Intellect (1999)
5. Kirakowski, J.C.N., Whitehand, R.: Human Centered Measures of Success in Web Site Design. In: 4th Conference on Human Factors & the Web, New Jersey, USA (1998), http://www.research.att.com/conf/hfweb/proceedings/kirakowski/ (accessed 06/04/2000)
6. Moseley, B.: Test the Usability of Your Web Site. Folio:PLUS 30(Part 4), 9–10 (2001)
7. Newman, M., Landay, J.A.: Sitemaps, Storyboards, and Specifications: A Sketch of Web Site Design Practice. In: DIS 2000, New York (2000)
8. Nielsen, J.: Usability Evaluation and Inspection Methods. Addison-Wesley, Reading (1993)
9. Nielsen, J.: Designing Web Usability: The Practice of Simplicity. New Riders Publishing, USA (2000)
10. Nielsen, J., Mack, R.L. (eds.): Usability Inspection Methods. John Wiley & Sons, New York (1994)
11. Norman, D.A.: The Design of Everyday Things. Doubleday, New York (1990)
12. Rubin, J.: Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. John Wiley & Sons, USA (1994)
13. Spool, J.M., Scanlon, T., Snyder, C., Schroeder, W., DeAngelo, T., et al.: Web Site Usability: A Designer's Guide. Academic Press, San Diego (1999)
14. Urquijo, S.P., Scrivener, S.A.R., Palmen, H.: The Use of Breakdown Analysis in Synchronous CSCW System Design. In: Proc. of the Third European Conference on Computer-Supported Cooperative Work, Milan, Italy (1993)
15. Waldegg, P.B.: Handling Cultural Factors in Human Computer Interaction. Doctoral thesis, Derby, UK (unpublished, 1998)
16. Winckler, M., Pimenta, M., Palanque, P., Farenc, C.: Usability Evaluation Methods: What is Still Missing for the Web? In: 8th International Conference on HCI International, New Orleans, USA, August 5-10 (2001)
17. Woodcock, A.: Supporting Ergonomics in Concept Design. Doctoral thesis, Loughborough University, Loughborough (unpublished, 2001)
18. Woodcock, A., Scrivener, S.A.R.: Breakdown Analysis. In: McCabe, P. (ed.) Contemporary Ergonomics, pp. 271–276. Taylor and Francis, Edinburgh, UK (2003)
Building on the Usability Study: Two Explorations on How to Better Understand an Interface Anshu Agarwal and Madhu Prabaker salesforce.com, The Landmark @ One Market St. Suite 300, San Francisco, CA 94105 {aagarwal,mprabaker}@salesforce.com
Abstract. In this paper, we describe two separate studies that improved our ability to understand our users’ experience of our products at salesforce.com. The first study explored a methodology of combining expert and novice performance data to yield a measure of intuitiveness. The second study created a methodology that combines both verbal and nonverbal emotion scales to better understand the emotional effect our products have on our users. We present both these methods as expansions on the standard usability study and examples of ways to better understand your users within an industry environment.
1.2 Study Two: Defining Emotional Response

The topic of emotion has recently attracted increased research attention in HCI studies [1]. Numerous authors have proposed that emotion may play an important role in user performance and user experience. However, very few "real world" case studies have been conducted on the role of emotion in an HCI context. It is important to first define the often vague term "emotion". However, coming up with a precise and scientifically respectable definition of the term is notoriously difficult. As one might imagine, there are many definitions of "emotion" in the relevant literature [4]. Nevertheless, there are two generally agreed-on aspects of what actually constitutes human emotion [1]. First, emotion is a psychological reaction to events relevant to the needs, goals, or concerns of an individual. Second, emotion comprises physiological, affective, behavioral, and cognitive components [1].
2 Study One: Measuring Intuitiveness

Although it is often advisable to ensure that designs work equally well with both novice and expert users, not all systems need to be evaluated with this range of expertise. For most "walk up and use" systems, like movie kiosks, or one-time-use systems like installation programs, it may not be necessary to test expert users. However, for most consumer and enterprise software systems, the system must allow experienced users to perform their tasks efficiently and novice users to complete tasks effectively without requiring extensive training or practice.

2.1 Measuring Novice Performance

By performing an empirical usability study and measuring the average task completion time across a group of novice users, we can begin to understand how well a particular design performs. Although task completion times allow us to say, "it took a user x seconds to complete a task", they fail to help us understand whether this time is too long or acceptable. To provide a more comparative understanding, we often visually compare time across all tasks (Fig. 1). From this we can say, "it took an average novice user x seconds more to complete Task 3 than Task 4". Although this is a more meaningful statement, it is still difficult to understand how long a task should take.
Fig. 1. An example of a visualization of the average task times for novice users on a system
2.2 Measuring Expert Performance

When designing interfaces, we are often concerned with making tasks as efficient as possible. Although we can gauge expert performance by conducting usability studies with expert users and recording task completion time, this is often challenging in practice. It is difficult to find users who are experts in all aspects of an interface – experts in one functional area are often novices in others. Additionally, an expert may not yet exist for a new design. For these reasons, practitioners have utilized human performance modeling methods to create reliable estimates of task performance time for skilled users. A particular model for expert user performance that has proven to produce highly useful and scientifically valid results is Keystroke-Level Modeling (KLM) [2]. When provided with a description of a task being performed, the model applies human performance estimates to produce a predicted task completion time. For the purposes of this paper it is not essential to understand exactly how KLM is derived, but rather that KLM is a relatively quick and low-cost way to get expert performance task time data. Plotting these values results in a chart that shows expected task completion times for expert users (Fig. 2). Another way to look at this is that these values represent the efficiency limit of a particular design. Using this we can make statements about the minimum task times imposed by the design; for example, "we expect that Task 5 will take at minimum x seconds".
Fig. 2. An example of a visualization of the average task times for expert users on a system
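To make the modelling step concrete, here is a minimal sketch of a keystroke-level estimate. The operator durations are the commonly cited KLM estimates from Card, Moran and Newell [2]; the task script (pointing and clicking through a menu, then typing a file name) is a hypothetical example, not one of this study's tasks.

```python
# Minimal KLM sketch: predicted expert time = sum of operator durations.
# Operator times (seconds) are the commonly cited KLM estimates [2].
KLM_OPERATORS = {
    "K": 0.28,  # keystroke (average skilled typist)
    "P": 1.10,  # point with the mouse to a target
    "B": 0.10,  # press or release a mouse button
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(script: str) -> float:
    """Sum operator times for a task encoded as a string of operators."""
    return sum(KLM_OPERATORS[op] for op in script)

# Hypothetical task: think, point at the File menu, click; think, point
# at Save As, click; home to keyboard; type an 8-letter name plus Enter.
task = "MPBB" + "MPBB" + "H" + "K" * 9
print(f"Predicted expert completion time: {predict_time(task):.2f} s")
```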
2.3 Deriving a Measure of Intuitiveness

Revisiting Our Definition. Earlier we defined intuitive as being able to "interact effectively, not consciously using previous knowledge". We also showed how we could get measurements of novice and expert performance across a set of tasks using a particular design. Mapping these two concepts onto each other yields a more measurable definition – an intuitive interface can be thought of as one that "minimizes the difference between expert and novice task performance". When an expert is using the system, they are not consciously thinking about how to use the system, but rather about how to solve the task at hand. In this way, the closer the novice user's performance resembles expert performance, the more intuitive the interface can be regarded.
Combining Expert and Novice Performance. In Figure 3, shown below, we have plotted the task completion time for both novice and expert users. The intuitiveness line shows the difference between the novice user performance and the efficiency limit of the design. This is a more meaningful metric than the novice or expert measures alone because it enables us to make statements like, "novice users took x seconds longer on Task 8 than our design called for." It is important not to underestimate how much more powerful this statement is in driving design changes; it allows us to explicitly recognize and dissociate the limits imposed by the design (the efficiency limit) from the observed performance data.
Fig. 3. The expert visualization (fig. 2) has been superimposed on the novice visualization (fig. 1). The difference between the times is shown as a line.
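The derivation itself reduces to a per-task subtraction. A minimal sketch, using illustrative times rather than the study's actual data (which appear in Table 1 below):

```python
# Intuitiveness sketch: gap between observed novice time and the KLM
# "efficiency limit" of the design, per task (seconds). Values are
# illustrative placeholders.
novice = {"Task 1": 153.9, "Task 2": 187.8, "Task 3": 118.1}
expert = {"Task 1": 19.8,  "Task 2": 64.8,  "Task 3": 19.5}  # KLM estimates

intuitiveness = {t: novice[t] - expert[t] for t in novice}

# Rank tasks by the gap: the largest gaps demand redesign attention,
# regardless of how long the task takes in absolute terms.
for task, gap in sorted(intuitiveness.items(), key=lambda kv: -kv[1]):
    print(f"{task}: novices exceeded the design's limit by {gap:.1f} s")
```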
Additionally, because this visualization quantitatively takes into account the inherent difficulty differences between tasks, it enables us to notice phenomena that are hidden in the novice performance data. For example, although it initially appeared that Tasks 5, 2, 8, and 9 might be problematic because of their relatively long task performance times, Tasks 8, 2, and 9 are the ones that really demand attention – the design actually performed well for Task 5. While it is quite common in industry to invest the time and money to gather quantitative metrics for novice users, expert efficiency analysis is not always done. By using accurate expert task prediction models, we can achieve deeper insight in our analysis without requiring significant additional resources in the testing phase.

2.4 Empirical Validation of the Method

In order to validate that this method of deriving "Intuitiveness" yielded valuable insight into the performance of a design that was not achieved through the standard usability study, we employed the technique in a comparative study between two versions of a Customer Relationship Management (CRM) application. To understand the novice user experience, we employed a between-subjects study design in which we recruited 18 experienced salespeople to perform a set of 10 common, representative sales tasks (e.g., adding tasks, converting a lead, sharing an opportunity,
etc.) as quickly as possible without committing any errors on one of two CRM applications (Application A and Application B). All participants reported familiarity with each of the sales tasks, but none of them had prior experience with the application they were assigned to. For each session, participants were presented with the tasks in a randomized order, and among the dependent metrics collected were Time on Task and Number of Assists. Because we were focused on a more natural assessment of the time it took novice users to complete a task, we chose to provide assists instead of capturing the number of errors committed.¹ This methodology ensured that all participants completed each task and that our Time on Task metric captured the inherent difficulty novice users had. To understand the expert performance times for each of the 10 tasks, we performed KLM analysis using the software application CogTool [7].

Table 1. The 10 task times (seconds) for Novice Users (Empirical), Expert Users (KLM), and the difference between the two (Intuitiveness), for Applications A and B; lower times indicate the better-performing design.

Task | Novice A | Novice B | Expert A | Expert B | Intuitiveness A | Intuitiveness B
1. Complete a Task | 153.91 | 91.22 | 19.82 | 21.74 | 134.09 | 69.48
2. Add a few tasks | 187.75 | 197.19 | 64.81 | 55.46 | 122.94 | 141.73
3. Edit a contact | 118.06 | 118.71 | 19.47 | 21.15 | 98.59 | 97.57
4. Convert a lead | 114.26 | 184.51 | 17.26 | 23.75 | 97.00 | 160.77
5. View reports on leads | 82.93 | 105.57 | 6.93 | 8.82 | 76.00 | 96.75
6. Share an opportunity | 150.35 | 152.65 | 25.63 | 25.27 | 124.73 | 127.38
7. Manipulate a calendar entry | 231.57 | 203.36 | 30.10 | 33.65 | 201.46 | 169.71
8. Manipulate a forecast | 77.68 | 93.05 | 6.82 | 6.48 | 70.87 | 86.58
9. Create a campaign with leads | 292.28 | 237.77 | 27.16 | 54.72 | 265.12 | 183.05
10. Search using help | 58.215 | 86.33 | 13.65 | 9.07 | 44.56 | 77.25
Analysis and Results. Although no statistical difference was found across the overall task performance of novice users, users were statistically faster on Task 1 using Application B (p = 0.008).² Application A had a lower expert performance time for six out of ten tasks.³
¹ An assist was provided when the participant ceased making progress towards the completion of the task. The assist was given such that it only provided the user with enough direction to make it to the next step in the task, and only when it became clear that the user was unable to advance to the next step.
² We performed a two-sample t-test on the novice (empirical) performance data.
³ Since the KLM values are not empirically derived, we can consider any difference between the designs as significant.
The value of this method can be seen in how the conclusions might differ based on the data at hand. Armed with only the traditional, empirical usability study data, we might conclude that both applications performed equally well with novice participants, though Application B had a more efficient interface for Task 1. Therefore, if we are redesigning Application A, we should focus our effort on improving our design for Task 1; since the other nine tasks performed statistically similarly, it is unclear whether the designs of both are equally good or equally poor. However, once we add the expert performance metric and derive the Intuitiveness metric, we start to see a more interesting and insightful picture. Application B's faster time for Task 1 cannot be attributed to an overall more efficient design – in fact, Application A's design allowed expert users to complete the task faster than Application B's design. Therefore, for Task 1, although Application A was more efficient than Application B, it was less intuitive. In this way we have changed the focus of our redesign efforts from efficiency to making the task easier for the novice user to accomplish. With this insight, if our task is to redesign Application A, we cannot help but notice that, in addition to Task 1, Tasks 7 and 9 should be the focus of our efforts.
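For readers who want to reproduce this style of analysis, here is a minimal sketch of the two-sample t-test mentioned in footnote 2, using scipy. The per-participant times are fabricated placeholders, since the paper reports only per-task means.

```python
# Two-sample t-test comparing novice Time on Task between applications,
# as in footnote 2. The arrays below are fabricated placeholders; the
# paper does not publish per-participant times.
from scipy import stats

times_app_a = [148.2, 160.5, 139.9, 171.0, 151.3, 144.7, 158.8, 162.1]
times_app_b = [88.4, 95.1, 101.2, 79.8, 92.6, 85.3, 99.0, 90.7]

t_stat, p_value = stats.ttest_ind(times_app_a, times_app_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Novice times differ significantly between the applications.")
```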
3 Study Two: Measuring Emotional Response

Emotion is an inherently complex construct to study. As such, researchers have created many different emotion measurement tools, including verbal, nonverbal, and physiological measurement tools, in an effort to meet this challenge. In this study, our research challenge was to develop an emotion measure that would be quick to utilize, easy to understand, deployable remotely, and easy to incorporate into an empirical usability study. Given the nature of emotion, it would seem that "fuzzy" nonverbal measures would be most apt to assess emotion. However, most of the nonverbal measures in the HCI literature are either impractical in a "real world" setting or of unknown validity. We therefore decided to combine an extensively used and validated verbal scale with a more experimental non-verbal emotion measure to improve the strength of our methodology.

3.1 Verbal and Non-verbal Emotion Measurement

For the verbal component, we chose to utilize the PAD (Pleasure, Arousal, and Dominance) Semantic Differential Scale developed by Mehrabian and Russell [5]. By rating a set of bipolar adjective pairs along a nine-point range, this scale was shown to measure three important aspects of emotion: Pleasure, Arousal, and Dominance. Pleasure may be defined as a positive affective state, which is separate from feelings such as preference and reinforcement. Arousal refers to an emotional state ranging from sleepy to very excited. The final dimension, Dominance, refers to the extent to which a person feels unrestricted or free from outside control. We reviewed Mehrabian and Russell's original adjective sets to ensure that the pairs were relevant to interface emotional responses (Table 2).
Table 2. Although we maintained most of the original adjective word pairings of the PAD scale, we revised some pairings to ensure that the scale was concise and relevant to software interface assessment
[Table body: the adjective pairs for each PAD dimension (Pleasure, Arousal, Dominance); only the "PAD Dimension" column heading and the Pleasure row label are recoverable from the source.]
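Scoring such a semantic differential scale is straightforward: the item ratings within each dimension are averaged. The adjective-pair-to-dimension assignment in the sketch below is a hypothetical stand-in for the revised pairings of Table 2, which the extraction did not preserve.

```python
from statistics import mean

# Hypothetical assignment of bipolar adjective pairs to PAD dimensions;
# the study's actual revised pairings (Table 2) are not reproduced here.
PAD_ITEMS = {
    "Pleasure":  ["unsatisfied-satisfied", "annoyed-pleased"],
    "Arousal":   ["calm-excited", "sluggish-frenzied"],
    "Dominance": ["controlled-controlling", "influenced-influential"],
}

# One participant's ratings on a 1-9 scale (9 = right-hand adjective).
ratings = {
    "unsatisfied-satisfied": 7, "annoyed-pleased": 8,
    "calm-excited": 4, "sluggish-frenzied": 3,
    "controlled-controlling": 6, "influenced-influential": 5,
}

# Average the item ratings within each dimension to get the PAD scores.
pad_scores = {dim: mean(ratings[item] for item in items)
              for dim, items in PAD_ITEMS.items()}
print(pad_scores)  # e.g. {'Pleasure': 7.5, 'Arousal': 3.5, 'Dominance': 5.5}
```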
We selected the Emocard tool by Desmet for the non-verbal component of our measure (Fig. 4) [3]. The Emocard tool consists of sixteen cartoon-like faces, half male and half female, in which each face represents a combination of Pleasure and Arousal. We interpreted results in the Calm-Pleasant and Excited-Pleasant quadrants as positive feedback.
Fig. 4. The Emocard tool was an effective nonverbal measurement of emotional response which used human-like representations of emotion
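Interpreting Emocard selections amounts to mapping each face onto the Pleasure x Arousal plane and classifying its quadrant. The face identifiers and their quadrant coding below are a hypothetical encoding for illustration; Desmet's actual sixteen-face layout [3] is not reproduced here.

```python
# Hypothetical Emocard coding: each selected face maps to a signed
# (pleasure, arousal) pair. Desmet's actual face layout is not shown here.
EMOCARD_CODES = {
    "face_01": (+1, -1),  # calm-pleasant
    "face_02": (+1, +1),  # excited-pleasant
    "face_03": (-1, +1),  # excited-unpleasant
    "face_04": (-1, -1),  # calm-unpleasant
}

def is_positive(face: str) -> bool:
    """Calm-pleasant and excited-pleasant quadrants count as positive."""
    pleasure, _arousal = EMOCARD_CODES[face]
    return pleasure > 0

selections = ["face_01", "face_02", "face_02", "face_03", "face_01"]
positive_share = sum(is_positive(f) for f in selections) / len(selections)
print(f"Positive immediate reactions: {positive_share:.0%}")
```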
3.2 Empirical Validation of the Method

In order to validate this methodology, we performed a comparative study between two versions of a CRM application interface. We collected traditional usability measures (time on task and number of errors), as well as the new dual emotion measure we constructed. This measure utilized both the non-verbal Emocard and the verbal PAD scale methods in a linear fashion.
Twenty-two participants, thirteen male and nine female, were assigned to assess one of the two versions of the interface. Although participants had experience with CRM, they had no prior experience with the interface they were evaluating. Seven comparable CRM tasks were created for the two interfaces (e.g., manipulate a calendar entry, view a report of leads by source, create a new marketing campaign, etc.). These tasks were representative of typical sales use of CRM interfaces. Tasks were randomized and participants were assigned to one of three task list versions. As in a standard usability study, the traditional measures of time on task and number of errors were collected during each task. This was followed by an online survey in which participants selected the Emocard that best represented their initial emotional reaction to each task. Participants then continued on to the PAD scale, and were asked for their qualitative feedback. This procedure was repeated for each task.

Analysis and Results. No significant differences were found between the interfaces using the usability measures collected in the study.⁴ Neither time on task nor number of errors was significantly different between the interfaces, whether analyzed overall across all tasks or by individual task (p > .05). Analysis of the PAD scale, however, did show significant differences in participants' emotional responses between the interfaces. Overall, Interface A was rated by participants as significantly more Satisfying and Friendly (p < .05). When analyzed by task, users rated Interface A as more Pleasing and Relaxing for three out of seven tasks (p < .05). Participants therefore found that Interface A elicited a more positive emotional experience than Interface B, even though users' performance levels in the usability studies were almost identical. Emocard responses were then compared between the two interfaces for each of the seven tasks (Fig. 5). As can be seen in the figure, clear differences and patterns in how users immediately reacted to the interfaces can be identified. Interface A elicited a more consistently positive response than Interface B, whose responses included the selection of a few Emocards representing negative emotions.
Fig. 5. Emocard selections for a sample task between Interface A (left) and Interface B (right) show clear differences in users’ immediate emotional responses
⁴ An independent-sample t-test was used for the analysis of the data to compare the two interfaces.
Qualitative feedback was also collected for each task. Two sample participant quotes are provided below: “It took me a while to find the [content]… I chose the slightly perplexed face… after exploring I found the [content] but initially it was a bit frustrating.” “I absolutely hate when I see something red that pops up and doesn't tell me anything... It makes me feel stupid. It drives me up the wall. I put a sad face, because it makes me kind of sad… I had a strong negative reaction to that. It was kind of unexpected, [Interface B] had a nice clean interface then this red blinking error popped up out of nowhere. It made me kind of tense.”
As indicated in these quotes, the qualitative data we collected was both rich in content and often emotionally charged.

3.3 Studying Emotional Response: Considerations

Practitioners might assume that positive emotional response is adequately indicated through usability metrics. However, the results of this study suggest that this may not be the case. If we had utilized only the usability metrics of time on task and number of errors as measures of user experience – and believed these measures to be comprehensive indicators of user experience – we would have concluded that the quality of the user experience for the two interfaces was nearly identical. This conclusion, however, would have been incorrect, or at the very least incomplete. The differing emotional responses to the two interfaces demonstrated that there were significant distinctions between them beyond usability alone. Additionally, these emotions may not only be central to how a user judges the overall product experience, but may also affect how a user perceives its usability. The goal of this study was to demonstrate the value of studying emotion and to test metrics for this purpose. Utilization of these metrics may help open up opportunities for HCI practitioners to incorporate fruitful and insightful emotional study into their process. Moreover, interaction designers of software interfaces may be best placed to utilize the results of emotion studies to enhance their interface designs.
4 Conclusion

The two studies outlined in this paper demonstrate how studying emotion and measuring intuitiveness can add value to traditional user experience research. Both studies utilize new methods that practitioners can use to build upon the traditional usability study. Both explorations also yielded significant insight into our understanding of our users' experience with marginal additional effort. The research efforts discussed here were only initial exploratory studies that merit further research. The intuitiveness measure still demands more empirical testing to validate its ongoing value and accuracy. Although emotional response has been shown to be a valuable aspect to study, further exploration of how interfaces might be improved based upon the results should be conducted. In the end, we hope these methodologies benefit the user experience community by encouraging practitioners to extend their everyday usability research in search of greater insights.
Acknowledgments. We thank the User Experience team at salesforce.com for all their help, support, and interest in this research.
References

1. Brave, S., Nass, C.: Emotion in human-computer interaction. In: Jacko, J., Sears, A. (eds.) Handbook of Human-Computer Interaction, pp. 251–271. Lawrence Erlbaum Associates, Mahwah (2002)
2. Card, S.K., Moran, T.P., Newell, A.: The keystroke-level model for user performance time with interactive systems. Communications of the ACM 23(7), 396–410 (1980)
3. Desmet, P.M.A.: Emotion through expression; designing mobile telephones with an emotional fit. Report of Modeling the Evaluation Structure of KANSEI 3, 103–110 (2000)
4. Kleinginna Jr., P.R., Kleinginna, A.M.: A categorized list of emotion definitions, with suggestions for a consensual definition. Motivation and Emotion 5(4), 345–379 (1981)
5. Mehrabian, A., Russell, J.A.: An Approach to Environmental Psychology. MIT Press, Cambridge (1974)
6. Naumann, A., Hurtienne, J., Israel, J.H., Mohs, C., Kindsmüller, M.C., Meyer, H.A., Husslein, S.: Intuitive Use of User Interfaces: Defining a Vague Concept. In: Harris, D. (ed.) Engineering Psychology and Cognitive Ergonomics, HCII 2007, vol. 13, pp. 128–136. Springer, Heidelberg (2007)
7. The CogTool Project: Tools for Cognitive Performance Modeling for Interactive Devices. Carnegie Mellon University (April 16, 2006), http://www.cs.cmu.edu/~bej/cogtool/index.html
Measuring User Performance for Different Interfaces Using a Word Processor Prototype Tanya R. Beelders, Pieter J. Blignaut, Theo McDonald, and Engela H. Dednam Department of Computer Science and Informatics, University of the Free State, South Africa {beelderstr,pieterb,theo,dednameh}.sci@ufs.ac.za
Abstract. Usability tests were conducted in order to establish the effect on user performance of different icon sets in a word processor. Both a set of alternative pictorial icons and text buttons were developed for a subset of word processor functions for comparison with the standard icons. In order to accommodate users in their home language the interface was available in English, Afrikaans and Sotho to determine whether usability of a product is increased when the users are allowed to interact with the product in their mother tongue rather than having to use the commonly available English interface. The scores obtained for completed tests as well as the time taken to complete tasks successfully were evaluated. Results indicate that neither icons nor language play a significant part in the usability of a product. In fact, the only significant contributor to user performance was the word processor expertise of the user. Keywords: Usability, word processor, icons, text buttons, localization.
to the user, when used under specified conditions". This definition is expanded upon in ISO 9241-11, where usability is defined as "the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" [5]. In terms of these definitions, four distinct components of usability can be identified, namely effectiveness, efficiency, satisfaction and learnability. Shneiderman [6] lists five measurable objectives that can be used to determine the usability of a product, and several usability models are also available that provide a number of measurements which developers can use to comprehensively test the usability of a product [7].

1.2 Icons

Any attention that is devoted to the interface detracts from the concentration of the user and constitutes interference with the primary task [8]. Since processing of text is considered to be a cognitive task, and the user is typically focused on some cognitive task when using a computer, more interference will be caused when using a text-based interface, as it draws on the same cognitive resources as those required during completion of the task [8]. Icons are common interface components that employ images to represent an object or an action that can be carried out by the user [8]. The continued use of icons has been attributed to the fact that they are easier for users to learn and to use [1]. Their use also increases the productivity of the user, since recognition is generally faster for a picture than for text [1], [8]. One disadvantage of icons is that they may be misinterpreted by users if the chosen image invokes unintended associations [8] – the picture that "speaks a thousand words may say a thousand different words to different viewers" [9]. No visible advantages have been detected when using pictorial icons rather than a text-based interface [8], while it has also been found that neither pictorial nor text icons were always immediately recognizable to users [9]. Furthermore, the tooltips that appear below the icons as an expanded explanation did not always assist the user in determining what an icon represents [9].

1.3 Language

The issue of translation into the home language of the user has proven to be a fairly contentious one, with many researchers determining that translation increases the usability of a product [10], whilst others advocate caution when considering translation, as not all users show a preference for carrying out tasks in their mother tongue [11], [12] and performance is often hampered by translation [11]. These studies did, however, focus on translation of web content, which typically contains large amounts of text to be read by the user, much more so than the single commands found in a word processing environment. Users of an interface that is not in their first language do, however, encounter a number of inherent problems, one of which is verbal context – where surrounding words serve to place a word in context, allowing users to identify the actual meaning of the word rather than the potential meaning thereof [13]. Many interfaces do not include verbal context in menus, toolbars or buttons, which is clearly disadvantageous
to second-language users [13] and could also lead to difficulty for novice or first-time users who are unfamiliar with the domain terminology and concepts.
2 Methodology

A small word processor application was developed which possessed minimal capabilities, while still being representative of a fully-fledged word processor or advanced text editor. Functions incorporated into the word processor prototype included document handling (e.g., open and close), text formatting (e.g., font size and style) and text manipulation (e.g., copy, cut and paste). Users were required to complete a number of simple tasks representative of common word processor tasks, such as font formatting. The tasks were displayed sequentially and individually at the bottom of the word processor window (Fig. 1) and could be completed solely by making use of either a toolbar shortcut (icon) or a menu option. The prototype allowed for real-time evaluation of the tasks; that is, once the user had completed a task, the application immediately determined whether or not the task had been completed successfully, and captured certain measurements such as the time required to complete the task. Each task was assigned a difficulty index based on the number of actions or inferences which had to be carried out by the user in order to complete the task successfully. Tasks had difficulty indices ranging from 3 to 8.
Fig. 1. Word processor prototype with alternative icons and English menu
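As a concrete illustration of this real-time evaluation, the sketch below pairs each task with a difficulty index and a success predicate. The task text, the predicate, and the timing value are hypothetical; the paper does not describe the prototype's internal checks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    difficulty: int                  # difficulty index, 3-8 in the study
    check: Callable[[str], bool]     # hypothetical success predicate

# Hypothetical task and check; the actual task texts and internal success
# detection are not given in the paper.
task = Task(description="Make the word 'usability' bold",
            difficulty=3,
            check=lambda doc: "<b>usability</b>" in doc)

# State of the user's document when they signal task completion.
document_state = "... the <b>usability</b> of a product ..."
completion_time = 12.4  # seconds, as captured by the prototype's timer

# Real-time evaluation: success is determined immediately on completion;
# the measurements later feed the weighted score and 1/time analyses.
success = task.check(document_state)
print(f"success={success}, time={completion_time}s, "
      f"difficulty={task.difficulty}")
```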
2.1 Subjects

The test subjects consisted of first-year university students who were taking a basic computer literacy course. The test was conducted during the first practical session of the course, before the subjects had received any instruction in word processor packages. Test subjects spoke a variety of languages, including English, Afrikaans, Sotho, Tswana, Xhosa and Zulu. All subjects were conversant in either English or Afrikaans, as these are the tuition languages of the university. The participants provided for different levels of word processor expertise. Of the participants, 403 were female and 283 male.

2.2 Languages

As mentioned above, a wide range of languages was spoken amongst the test subjects. Since the interface was only available in English, Afrikaans and Sotho, the participants were divided into one of these three groups according to their first language (L1). Afrikaans users completed the test on either an Afrikaans (L1) or an English (their second language, L2) interface. Sotho and Tswana users completed the test either in Sotho (L1) or in English (L2). The remainder of the users completed the test in English, where English was either their L1 or L2.

2.3 Icon Sets

Three sets of icons were used in the different interfaces, namely (i) the standard icons found in the Microsoft Office package, (ii) an alternative set of icons obtained from previous studies [14] and via two brainstorming sessions (see Fig. 1), and (iii) text-based icons. The set of icons obtained during the first brainstorming session was distributed amongst potential word processor users. Respondents were required to indicate which icon they would choose for each of a number of listed word processor functions. Alternative icons for Open, Close, Save, Cut, Copy and Paste were determined in this way. The remainder of the icons were developed during a second brainstorming session, and these were included in the design without confirmation by non-computer-literate users. The icons were developed to provide more context for novice users, in the hope that these users would easily be able to relate to the concepts depicted by the icons. For example, the icons used for Bold, Italic and Underline consisted of a bold, italic or underlined capital letter "F" respectively. This was done in an effort to convey to the user the font changes that would occur if the function were invoked. By using the same letter throughout and by placing the icons adjacent to one another on the toolbar, easier visualization of the font styling (Fig. 1) was ensured. The textual word icons had no images; instead they displayed the name of the function they represented, and were available in English, Afrikaans and Sotho.

2.4 Menus and Tooltips

The menu structure, when available in the interface, was the same as the standard menu found in the Microsoft Office 2003 package, and the toolbar situated at the top of the screen was divided into the standard and formatting toolbars.
To enable the effect of the icons to be tested without interference from other interface components, each pictorial icon set was included in an interface with neither menus nor tooltips. This ensured that the user had to rely entirely on interpretation of the icon when using this interface. The next group of test interfaces used the same pictorial icon sets, but with tooltips added in English, Afrikaans or Sotho. To complete the set of interfaces for testing, the afore-mentioned interfaces were used as a base to which a menu was added in the same language as the tooltips for that particular interface. The interface using the text-based icons had no menu, although the tooltips were still used. This was to compensate for the fact that the entire function name often did not fit on the button, in particular for the Sotho translations. To ensure legibility of the icons, a shortened version of the function name was placed on the button and the full-length version was displayed in the tooltip to provide verbal context.
3 Analysis

Taking all of the above-mentioned considerations into account, there were seven possible interface configurations (Table 1). Interfaces 3 and 6 (Table 1) have no language component to speak of, since they have neither a menu nor tooltips, but the remainder of the interfaces were available in either the users' L1 or L2, resulting in a total of 12 different interfaces. The two interfaces without a language component (3 and 6) were removed from the initial analysis to be evaluated separately from the remainder of the interfaces. The subjects who completed the test on the interfaces that contained a language component are designated as group A, and the rest of the subjects are categorized as group B.

Table 1. User distribution

Group   | Interface                                 | Language | Novice | Expert
Group A | 1 Standard icons, menu, tooltips          | L1       | 24     | 20
        |                                           | L2       | 26     | 23
        | 2 Standard icons, no menu, tooltips       | L1       | 22     | 26
        |                                           | L2       | 24     | 13
        | 4 Alternative icons, menu, tooltips       | L1       | 13     | 17
        |                                           | L2       | 26     | 23
        | 5 Alternative icons, no menu, tooltips    | L1       | 21     | 23
        |                                           | L2       | 25     | 26
        | 7 Text icons, no menu, tooltips           | L1       | 15     | 20
        |                                           | L2       | 33     | 25
Group B | 3 Standard icons, no menu, no tooltips    | –        | 19     | 15
        | 6 Alternative icons, no menu, no tooltips | –        | 17     | 21
Total   |                                           |          | 265    | 252

(Group A total: 445 users; Group B total: 72 users.)
Each user was classified as a novice, intermediate or expert word processor user based on their level of experience with a word processor application and the frequency with which they had made use of such an application prior to the test. The frequency and experience levels were rated on scales of 0 to 4 and 0 to 5 respectively. These individual ratings were then cross-multiplied to obtain a scale consisting of fourteen distinct expertise ratings. In order to eliminate the effects of an individual's uncertainty regarding expertise, the intermediate group was not included in the analysis of the results. The final distribution of users is shown in Table 1.

As an effectiveness measurement, each user was assigned a weighted score, calculated as the sum of the cognitive loads of all the tasks completed successfully by that user. The time taken to complete each task, which measures efficiency, was measured in seconds and then converted to 1/time for further analysis. A factorial ANOVA was used to test the following hypotheses:

1. H0,1: The word processor expertise of the user has no effect on the test score.
2. H0,2: The interface used has no effect on the test score.
3. H0,3: An interface in the user's L1 or L2 has no effect on the test score.
4. H0,4: The word processor expertise of the user has no effect on the time taken to complete the task.
5. H0,5: The interface used has no effect on the time taken to complete the task.
6. H0,6: An interface in the user's L1 or L2 has no effect on the time taken to complete the task.

3.1 Analysis of Group A

Group A consisted of those users who used any one of the interfaces 1, 2, 4, 5 or 7. These interfaces could be in either the L1 or the L2 of the user.

Table 2. 1/Time ANOVA results for Group A and the consolidated group
Language (Group A) | 0.577 | 0.940 | 0.743 | 0.886 | 0.867 | 0.468 | 0.697 | 0.802 | 0.624 | 0.567 | 0.579
[Only the Language row of Table 2 is recoverable; the per-task p-values for the Expertise and Interface factors, and the consolidated-group rows, were lost in extraction.]
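To make the analysis step concrete, here is a minimal sketch of such a factorial ANOVA using the statsmodels library. The data frame is a fabricated placeholder shaped like the study's design (expertise x interface x language, with the weighted score as response); the effect sizes and cell counts are illustrative, not the study's.

```python
# Factorial ANOVA sketch for hypotheses H0,1-H0,3:
# score ~ expertise * interface * language. All data are fabricated.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
for expertise, interface, language in itertools.product(
        ["novice", "expert"], ["1", "2", "4", "5", "7"], ["L1", "L2"]):
    base = 40.0 if expertise == "expert" else 25.0  # fabricated effect
    for _ in range(3):  # three participants per cell, for residual df
        rows.append({"expertise": expertise, "interface": interface,
                     "language": language,
                     "score": base + rng.normal(0, 4)})
data = pd.DataFrame(rows)

model = ols("score ~ C(expertise) * C(interface) * C(language)", data).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p per factor and interaction
```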
H0,1 was rejected (FExpertise(1, 425) = 27.73, p < 0.001), indicating that the word processor expertise of the user did indeed have an effect on the score achieved. Neither H0,2 (FInterface(4, 425) = 1.10, p = 0.356 > 0.05) nor H0,3 (FLanguage(1, 425) = 0, p = 0.947 > 0.05) could be rejected, leading to the conclusion that neither the interface nor the language had an effect on the score achieved by the user.

The results of the ANOVA for the time analysis, which included only correctly completed tasks, are summarized in Table 2 (italicized font). An α level of 0.05 was used throughout to distinguish between significant and non-significant differences. For the sake of brevity, the results of the interactions between the variables have been excluded, as they all had p-values above 0.05. As would be expected, expert users performed significantly better than novice users, with H0,4 being rejected for all of the tasks. H0,5 could be rejected for the tasks that required a single word to be made bold (p = 0.002, task 2) and a phrase to be italicized (p = 0.041, task 9). H0,6 could not be rejected for any of the tasks at an α level of 0.05; it can therefore be concluded that the interface language had no effect on the time needed by the user to complete a task successfully.

As discussed previously, the group B users were removed from the initial analysis since their interfaces had neither a menu nor tooltips, and thus contained no language component. Users of this group had to rely entirely on interpretation of the icons. Since it was shown that language had no effect on either the score achieved by the user or the time taken to complete the tasks, the need to separate the groups no longer existed, and the two groups were consolidated into a single group for the remaining analysis.

3.2 Analysis of the Consolidated Group

Groups A and B were amalgamated into a single user group in which language no longer played a role. The analysis of this group included users of all seven interfaces listed previously. Hypotheses H0,3 and H0,6 were no longer applicable. H0,1 was rejected (FExpertise(1, 503) = 26.47, p < 0.001), again indicating that the word processor expertise of the user has an effect on the score achieved. The interface used had no effect on the score achieved by the users, so H0,2 (FInterface(6, 503) = 2.01, p = 0.063 > 0.05) could not be rejected.

The results of the ANOVA for the time analysis are summarized in Table 2 in non-italicized font. Once again, only tasks that were completed successfully were included. H0,4 could be rejected for all of the completed tasks, since expert users performed significantly better than novice users in all of them. H0,5 could be rejected for three of the eleven tasks. Possible reasons for these observations are discussed below.

• Task 2: Change font style of a single word to bold (p = 0.001). A significant difference was found between the users of interface 2 and interface 7, as well as between users of interface 2 and interface 6. This indicates that the icons contribute significantly to the performance of the user. In both cases, users of the standard icon had a shorter completion time than users of the other two interfaces. This indicates that the standard icon for Bold is extremely intuitive and succeeds in
conveying the concept of bold to the user, even more so than the word "Bold" on a button. Since only those users of the alternative icons with no tooltips showed a significantly longer completion time than users of the standard icons, it would seem that the tooltips assisted the remaining alternative-icon users in deciphering the functions linked to the icons.

• Task 6: Close a document (p = 0.001). With an average completion time of 53 seconds, users of the alternative icons with no menu and no tooltips (interface 6) took longer to close a document than all other users, who completed the task in times ranging from 18 seconds to just marginally longer than 20 seconds. The number of correct answers to this task was also the second lowest of all the tasks. Although the alternative icon for the Close function was chosen by questionnaire respondents, the results of this task show that it does not successfully communicate the concept of Close when used in an interface without any tooltips or menus to assist the user. In fact, the icon chosen by the respondents was actually designed as an alternative for an electronic mail interface. The icon appears to be acceptable when used in conjunction with a tooltip.

• Task 9: Italicize a phrase (p = 0.006). Post-hoc tests indicated that the most significant differences occurred between users of interface 6 and interface 2, as well as between users of interface 6 and interface 4. Alternative-icon users with no tooltips and no menu (interface 6) had a significantly longer average completion time than the users of the other two interfaces. These results indicate that the alternative italic icon does not succeed in conveying the function to the user. However, once again, the inclusion of a tooltip indicating the icon's purpose assisted the users in determining the functionality linked to the icon.
4 Discussion

Overall, the most significant contributing factor to the performance of the users was word processor expertise. The interface used appears to have minimal effect on user performance. The only difference between the pictorial and text icons occurred in the task which required users to make a single word bold. Thus, there is very little performance difference between users of pictorial icons and those using text icons, a finding which supports those of [8]. The majority of the performance differences detected existed between users of the alternatively designed icons on interfaces without tooltips and users of one of the other interfaces. The attempt to place the set of styling icons in a concrete context, by using the same lettering and simply changing the styling effect, had mixed results. Users of these icons with no tooltips showed a remarkably slower completion rate than users of other interfaces. The icons did, however, not seem to impede user performance when used in combination with tooltips or a menu. The only other icon that was unsuccessfully implemented was that of Close. Even though this icon was chosen as the preferred icon by non-computer-literate users, it did not succeed in conveying the function concept to the user. This finding motivates the need for usability testing of interfaces even where the interface is designed with the assistance of end users, since preferred interface choices do not always increase the proficiency of the users.
Whether users work in their home language or not also has no effect on their productivity. These findings show that although users may not prefer to work on an interface in their L1 [11], [12], a translated interface does not hamper their performance, failing to corroborate the assertions that translation does increase user productivity [10] and that translation may adversely affect user performance [11]. The failure to confirm the results of previous studies may be attributed to the fact that the mentioned studies tested user performance on a translated website which contained large amounts of text [11], as opposed to single words or short phrases such as those used in this study. Also, where appropriate, word processing commands were placed in context, for example, the Sotho for Close and Open were translated as “close document” and “open document” respectively.
5 Conclusion

All indications are that user performance is not adversely affected by different interfaces, be they textual or pictorial, or by different languages. Rather, it is the experience of the user which dictates the effectiveness and efficiency of user performance. Differences between users of the standard and alternative interfaces were minimal, indicating that there is no need for the development of an alternative interface. From these results it appears evident that once users have been provided with enough training and have gained enough experience to be confident within the task and the application domain, they will easily adapt to changes in the interface.
References

1. Abran, A., Khelifi, A., Suryn, W., Seffah, A.: Usability meanings and interpretations in ISO standards. Software Quality Journal 11, 325–338 (2003)
2. Benbasat, I., Todd, P.: An experimental investigation of interface design alternatives: icon vs. text and direct manipulation vs. menus. International Journal of Man-Machine Studies 38, 369–402 (1993)
3. Blignaut, P.J., McDonald, T.: The implications of reading and writing language preference for Internet access in a multilingual South Africa. S.A. Tydskrif vir Natuurwetenskap en Tegnologie (2006) (in Afrikaans)
4. Bodley, G.J.H.: Design of computer user interfaces for Third World users. M.Com. Dissertation, University of Port Elizabeth, South Africa (1993)
5. Cyr, D., Trevor-Smith, H.: Localization of Web Design: An Empirical Comparison of German, Japanese, and U.S. Website Characteristics. Journal of the American Society for Information Science and Technology 55(13), 1–10 (2004)
6. De Wet, L., Blignaut, P., Burger, A.: Comprehension and usability variances among multicultural web users in South Africa. In: Proceedings of CHI 2002, Minneapolis (2002)
7. ISO 9241-11: Ergonomic requirements for office work with visual display terminals. Beuth, Berlin (1997)
8. Johns, S.M.: Colors, buttons, words and culture: Designing software for the global community. In: CODI Conference, April 9-11, Mesa, AZ (1997)
9. Kacmar, C.J., Carey, J.M.: Assessing the usability of icons in user interfaces. Behaviour and Information Technology 10(6), 443–457 (1991)
10. Kukulska-Hulme, A.: Communication with users: insights from second language acquisition. Interacting with Computers 12, 587–599 (2000)
11. Nielsen, J.: International Web Usability. Alertbox (August 1996)
12. Shneiderman, B.: Designing the user interface: Strategies for effective human-computer interaction, 3rd edn. Addison-Wesley, Reading (1998)
13. Teklebrhan, R., Blignaut, P.: A study on the effect of Western designed metaphors in some culture groups in South Africa. Technical Report, 2005/02, University of the Free State, South Africa (2005)
14. Zammit, K.: Computer icons: a picture says a thousand words or does it? Journal of Educational Computing Research 23(2), 217–231 (2000)
Evaluating User Effectiveness in Exploratory Search with TouchGraph Google Interface

Kemal Efe and Sabriye Ozerturk

Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA 70504
{efe,sxo7344}@cacs.louisiana.edu
Abstract. TouchGraph Google Browser displays connectivity of similar pages around search results returned by Google. A major research question is: to what extent does this graph help improve user effectiveness during exploratory search? This paper reports on our user study with TouchGraph visualization. This study has interesting implications for designing user interfaces of search applications.
1 Introduction

Search engines generally do a good job of finding information that is easy to label, like "calories in apples." However, information needs are not always easy to articulate. In some cases users may not even know what the correct answer looks like until they have seen it. Learning, navigation and exploration play important parts in information seeking. A user generally starts with an initial query and successively refines it until the desired information is found. Each new query in this progression reflects something learned along the way. Exploration by successive query refinement is difficult, and users frequently give up searches in frustration. A key question that has become a hot research topic in recent literature [6,7] is determining the right set of tools to help users during exploratory search. It is not known what interface tools would best support exploration. User-interface research [1] suggests that the use of symbols and graphics improves cognition and perception. Visualizations can highlight aspects of information not comprehensible with plain text. For example, the TouchGraph Google Browser displays connectivity of similar pages around search results. An interesting research question is to what extent this graph helps improve user effectiveness during exploratory search. This paper reports on our user study with TouchGraph visualization.
2 Related Work

Earlier work has focused on the display and navigation of search results. Koshman [5] studied the TouchGraph interface for Amazon.com and examined users' ability to select similar items. The user test with 17 participants showed that there was a high overlap between system-suggested similar items and user-discovered similar items.
Efe, Asutay, and Lakhotia [2] evaluated a visualization interface that allows following links forward and backward. Additional tools allowed changing the scope of the displayed graph, orientation support and backtracking. User tests with 50 participants showed that, given equal time, users were able to successfully complete twice as many search tasks as users of a traditional interface. Heo and Hirtle [3] investigated different methods for visualizing category information with 80 participants, using distortion, zoom, and expanding outline. The study showed that performance did not improve with visualization tools; however, the expanding outline was shown to be more useful to users than the other visualizations. Hightower et al. [4] studied visualizations of visited paths displayed in a tree structure. User tests based on 37 participants showed that users with the visualization tool completed the given set of search tasks nearly twice as fast as the control group without the visualization tool.
3 Research Question

In this research we hypothesized that visualization can support exploratory search by displaying relationships among documents and by enhancing user interaction with the system. The major research question in this context is how well users can navigate a graphical presentation of related documents to reach desired information. Two aspects of this question are: a) how well can users navigate visualizations to reach multiple sources of related information on a subject, and b) to what extent does this ability translate to reaching desired information.
4 Study Design

4.1 System

TouchGraph is well suited to the research questions we consider. It supports exploratory search by displaying connections between related web sites. Web sites are displayed as nodes of a graph related to the search results. Clicking on a node retrieves other related nodes and displays them around the selected node. Additional information about a node is displayed by moving the mouse over the node. The display area also contains a list of search results in a vertical box on the left side of the screen. Selecting an item from the list highlights the corresponding graph node. It also displays the site description in a special text area. An example screen generated in response to the query "book sellers" is shown in Figure 1.

4.2 Participants

Thirty-five computer science graduate students participated in the experiments. All of them said that they used Google on a daily basis, but none of them had knowledge of TouchGraph before the experiment. We randomly divided participants into two groups based on the parity of their student ID numbers: students with even ID numbers were told to use Google and students with odd ID numbers were told to use TouchGraph. It turned out that there were 16 Google users and 19 TouchGraph users. Before the experiment, participants were given five minutes of training on TouchGraph to demonstrate the different functionalities available.
Fig. 1. TouchGraph visualization
4.3 Experiments

Two experiments were designed. In both experiments, participants were free to use any queries they wished. User effectiveness is measured by the number of successful searches completed within a fixed period of time. The first experiment was designed to measure the ability of users to reach a multiplicity of related documents. Participants were provided the URLs of 20 web sites on used books (sales, exchange, collector clubs, etc.) and were asked to find as many of these sites as possible by entering queries to the system and by exploring related pages. Table 1 shows the list of URLs the participants were required to find.

Table 1. The list of related pages used in the user test [table content largely lost in extraction; first entry: Addall.com]
The second experiment contained 10 search tasks specifically designed to require exploration before reaching a document with the correct answer. Topics of search were selected to be specialized enough that a layperson is not likely to have independent knowledge. Participants were asked to record the URL of the required
document when they found it. For most search tasks, we tried to make sure that the required web source described in the search question was unique in its information content. When in doubt, we provided page-specific information as part of the search task specification. An example question in this test is as follows:

Find a web site that offers thematic online maps relating to various topics. The upper-right corner of the page contains links for interesting maps available there. Examples of maps include population maps, economic maps, airport locations, travel maps, and others. This site also has detailed information about various topics of interest, such as top-10 countries in the world (in the sense of various criteria), a map of the top 100 hotels in the world, a map of the top 100 wonders of the world, etc.

The expected answer for this question was http://www.mapsofworld.com. When queried on Google with "thematic world maps," this particular URL comes up as the top item. (It should be noted that Google rankings of pages may vary over time; generally, this variation is less for well-established web sites. Page rankings mentioned in this paper were true as of the date of the user experiments.) The specification about the upper-right corner of the page was intended to distinguish the "correct" document from others. Among the 16 Google users, all but one found this URL as the correct answer. Of the 19 TouchGraph users, only four missed the correct answer.

Another question, which had close to a fifty percent success rate for both groups, was the following:

Find a web site that provides a URL-based search over the billions of archived pages on the web. Here a user can enter the URL of a page and retrieve different versions of the same page as it was on different dates. The site doesn't support keyword searching like search engines; the user must enter the URL of a page as input.

The expected answer was http://www.archive.org. The "correct" answer is unique since there is no other web site providing this service. It shows up as the top result when searched on Google with the query "internet digital library." Five Google users and six TouchGraph users gave the correct answer.

Another question with a somewhat lower success rate was:

Find a web site that sells prehistoric monuments like fossils, meteorites, and other items that are related with dinosaurs. The monuments are displayed in a gallery. There is a diversity of prehistoric monuments in this exhibition, like a two hundred million year old petrified wood slice, a dinosaur claw, eggs, toys, etc.

The expected answer for this search task was http://www.dinosaurstore.com/. This URL shows up as the top item in the list of search results when searched with "dinosaur teeth fossils." Notice that the word "teeth" does not appear in the page description we provided. This may have made it harder to find the page, and the search may have required a good deal of exploration. We had found this web site by searching with
"fossils meteorites dinosaurs" and clicking on the "similar pages" link of the second item on the list (which was www.arizonaskiesmeteorites.com/Dinosaur_Fossils_For_Sale). The page appears as the sixth item on the similar pages list. The "correct" answer was unique as of the date of the tests because of the two hundred million year old petrified wood slice that none of the similar pages had. While six Google participants found this URL as the correct answer, among TouchGraph users only one participant found it. This was surprising because we had expected TouchGraph users to be more successful on this question, given the way we had reached it.

The only question with no correct answer from any participant in either group was the following:

Find a web site whose primary purpose is to maintain a comprehensive listing of African-Diaspora-related Web pages. The site provides a directory and a full text search of the indexed pages at a central site.

Here, any one of several possibilities could be considered the correct answer, such as http://www.ubp.com/, www.blackpages.com/, www.blackpgs.com/, and possibly others. All three of these pages are returned on the first page of Google when queried with "black pages." Our original expectation was that most participants would suggest http://www.ubp.com/ as the correct answer, because it is the only site that explicitly states its mission, as "The primary purpose of the UBP is to maintain a comprehensive listing of African-diaspora-related Web pages at a central site." Moreover, it comes up as the second item when searched with "African Diaspora pages," or as the fifth item in response to "diaspora related pages." This was one of the few search tasks with multiple acceptable answers in our experiments. Yet, it turned out to be the only one with no correct answer from any participant. After close inspection, we found that in the Google search engine, "diaspora" seemed to retrieve (mostly academic) pages that study the concept and history of diaspora. Using the keywords "black pages" was essential for Google to retrieve sites with directory or search facilities, but participants did not derive these keywords from the provided description of the search task.
5 Statistical Evaluation of Results

5.1 Related Items Test

In finding related items, TouchGraph users were more successful than Google users. This result concurs with Koshman's findings [5], where users of TouchGraph were highly effective in finding related items on Amazon.com. This is expected, since the TouchGraph user interface graphically displays relationships between pages. Figure 2 shows the user scores obtained, and Table 2 shows the group statistics. We performed a Mann-Whitney significance test on these results using the SPSS statistical package. Significance measures are reported in Table 3 below. As can be seen, the difference between user performances was highly significant.
Fig. 2. User scores in searching for related pages (chart: number of related pages found by each participant, TouchGraph vs. Google)

Table 2. Group Statistics in finding related pages

Interface     N    Mean     Std. Deviation   Std. Error Mean
Google        16   3.5000   2.03306          .50827
TouchGraph    19   5.5263   1.38918          .31870

Table 3. Significance Statistics (URL found count)

Mann-Whitney U                    54.500
Wilcoxon W                        190.500
Z                                 -3.274
Asymp. Sig. (2-tailed)            .001
Exact Sig. [2*(1-tailed Sig.)]    .001
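The same comparison can be checked outside SPSS; the sketch below uses SciPy's Mann-Whitney implementation. The score lists are invented for illustration (only the group sizes, 16 and 19, follow the paper), so the output will not reproduce Table 3 exactly.

from scipy.stats import mannwhitneyu

# Hypothetical per-participant counts of related pages found;
# only the group sizes (16 Google, 19 TouchGraph) match the study.
google = [2, 3, 1, 5, 4, 3, 6, 2, 3, 4, 5, 1, 7, 3, 4, 3]
touchgraph = [5, 6, 4, 7, 5, 6, 8, 5, 4, 6, 7, 5, 6, 4, 5, 7, 6, 5, 4]

# Two-sided test, mirroring the "Asymp. Sig. (2-tailed)" row above.
u, p = mannwhitneyu(google, touchgraph, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, two-tailed p = {p:.3f}")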
5.2 Exploratory Search

User scores in exploratory search are plotted in Figure 3 below. Table 4 shows the corresponding group statistics.
Contrary to the first test, Google users were more effective in exploratory search than TouchGraph users. A Mann-Whitney test on these results showed the performance difference to be significant, rejecting the null hypothesis with 95 percent confidence. Mann-Whitney significance measures are reported in Table 5 below.
Fig. 3. User scores in exploratory search (chart: number of successful searches by each participant, Google vs. TouchGraph)

Table 4. Group Statistics

Interface     N    Mean     Std. Deviation   Std. Error Mean
Google        16   3.5000   1.59164          .39791
TouchGraph    19   2.4211   1.26121          .28934

Table 5. Significance Statistics (correct count)

Mann-Whitney U                    89.000
Wilcoxon W                        279.000
Z                                 -2.138
Asymp. Sig. (2-tailed)            .033
Exact Sig. [2*(1-tailed Sig.)]    .037
6 Discussion

The purpose of our experiments was to find answers to two key questions: a) how well can users navigate visualizations to reach multiple sources of related information, and b) to what extent does this ability translate to reaching desired information in exploratory search. The first test showed that the TouchGraph interface positively helps users reach a multiplicity of related pages. However, the second test showed that this ability did not necessarily translate to more effective searches.
The results appear counter-intuitive at first, because we would expect that an enhanced ability to reach a multiplicity of related documents would translate to an enhanced ability to reach desired information. However, our interviews with users showed that they had difficulty in comprehending the graphical display. When the question only asked about a page URL, they performed well because they could readily see the URL associated with each graph node displayed on the screen. However, there was a semantic gap between the URL of a page and its information content. They couldn't see enough content description to meaningfully navigate their way toward the required information. Consequently, they made very little use of related-page information. Google users also admitted that they didn't make much use of the "similar pages" facility. Both groups tried to use query refinement as the primary mechanism for exploration. One month after the test, we asked participants: "Now that you are aware of its existence, do you use TouchGraph instead of Google for search?" None of the participants acknowledged using it even occasionally.
Effective tools that help users explore related pages are essential in web search. In real life, learning and reasoning about related information is a primary method of discovering the unknown. There is no reason why this should not be the case for exploratory search on the web. We only know that TouchGraph is not the right tool for this purpose.
References

1. Aspillaga, M.: Perceptual foundations in the design of visual displays. Computers in Human Behavior 12(4), 587–600 (1996)
2. Efe, K., Asutay, A.V., Lakhotia, A.: A User Interface for Exploiting Web Communities in Searching the Web. In: WEBIST 2008, Proceedings of the Fourth International Conference on Web Information Systems and Technologies, Funchal, Madeira, Portugal, May 4-7 (2008)
3. Heo, M., Hirtle, S.: An empirical comparison of visualization tools to assist information retrieval on the Web. Journal of the American Society for Information Science and Technology 52(8), 666–675 (2001)
4. Hightower, R.R., Ring, L., Helfman, J.I., Bederson, B.B., Hollan, J.D.: Graphical Multiscale Web Histories: A Study of Padprints. In: Proceedings of the 9th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 1998), Pittsburgh, PA, USA, June 20-24, 1998, pp. 58–65 (1998)
5. Koshman, S.: Web-based visualization interface testing: similarity judgments. Journal of Web Engineering 3(3/4), 281–296 (2004)
6. White, R.W., Kules, B., Drucker, S., Schraefel, M.C.: Supporting exploratory search: Introduction. Special issue of the Communications of the ACM 49(4) (2006)
7. White, R.W., Muresan, M., Marchionini, G.: Proceedings of the ACM SIGIR 2006 Workshop on Evaluating Exploratory Search Systems (EESS 2006), Seattle, USA, August 10 (2006)
What Do Users Want to See? A Content Preparation Study for Consumer Electronics

Yinni Guo (1), Robert W. Proctor (2), and Gavriel Salvendy (1, 3)

(1) School of Industrial Engineering, Purdue University, W. Lafayette, IN, USA 47907
(2) Department of Psychological Science, Purdue University, W. Lafayette, IN, USA 47907
(3) Department of Industrial Engineering, Tsinghua University, Beijing, China 100084
{guo2,salvendy}@purdue.edu, proctor@psych.purdue.edu
Abstract. To investigate what users want to see from consumer electronic devices, a content preparation study was conducted. A questionnaire was constructed based on the results from web site content research and traditional usability studies on consumer electronics, and was completed by 401 Chinese participants. The statistical results reveal nine major factors of cell phone content. Users of different ages and genders also have different requirements for cell phone content, especially concerning accessory and multimedia functions. This study suggests guidelines for cell phone designers targeting the Chinese market, as well as a basis for content studies of other consumer electronics.

Keywords: Content preparation, factor structure, consumer electronics.
culture-related content preparation study, Savoy and Salvendy [4] found that seven factors captured the survey structure: general product description, member transaction, shipping, secure customer service, company, durability and price. Comparison of the three studies reveals similar factors of web site content and provides evidence that the quality of content plays an important role in usability.
Given the importance of content preparation for web sites, the study of content preparation should be extended to non-web-based products like consumer electronics. Compared to web-based products, consumer electronics are similar in being used to display a large amount of information to the users. However, unlike web-based products, consumer electronics like mobile phones and PDAs impose the limitations of small screen size and cumbersome input mechanisms [5]. Therefore, indications about ways to control the devices may be essential in the content structure. We chose cell phones as representative of consumer electronics because cell phones are nowadays being developed as multi-functional devices; therefore, the factor structure of cell-phone content may be applicable to other appliances.
A related study of content for consumer electronics was performed by Caus et al. [6]. They pointed out that reasons for the low market penetration of mobile applications included a lack of standardization concerning the handling of information and high technical complexity. Caus et al. proposed that one possible way to reduce the problem of representing and selecting content in mobile Internet use was to offer users only content relevant to their particular situation, through context-aware information processing.
2 Methodology

When users use a certain function of a cell phone, they face a series of tasks. Therefore, the essential issues for cell phone content are what content should be provided so that users can operate the functions easily, and what types of functions are necessary. One way to figure out what content is needed is to ask the customers themselves; the three previous studies [2, 3, 4] validate the efficiency of surveys. Therefore, a questionnaire was developed based on previous content preparation studies, cell phone usability studies [7, 8, 9, 10], multimedia studies on cell phones [6, 11] and observation of current advanced cell phones. Content questions covered seven major categories: function (18 questions), menu (8 questions), instruction and status (15 questions), file (6 questions), input and search (8 questions), service (4 questions), and phone call features (5 questions). The questionnaire included these 64 questions and 4 questions asking about participants' feelings concerning how much cell phone content would influence their satisfaction, operation efficiency, and effectiveness, as well as whether current cell phone content is enough. There were also seven demographic questions to investigate users' backgrounds. Of the 68 questions, four were repeated questions to test internal consistency.
3 Procedure and Participants

The survey was conducted in Xiamen, China, in May 2008. A paper-based questionnaire was used due to distribution convenience. A total of 401 participants filled out
the survey, of which 375 yielded usable results. Twelve participants did not finish the whole questionnaire, and 14 participants answered the questionnaire with low internal consistency. 42% of the participants were female. The ages of the subjects ranged from 18 to 60 years, with 95% of them being under 40 years of age. About 96% of the subjects had education higher than an associate college degree. The participants had a diverse range of occupations. Most had experience of using cell phones for 2 years or longer and with 2 or more models. A detailed description of the subjects' demographic information can be found in Table 1.

Table 1. Demographic characteristics of survey participants [table flattened in extraction; recoverable fragments: Female: 158; High school: 6; Under 20: 33; Manager: 28; Technician: 22; 0 models: 7; 0-1 years: 10]
4 Results and Discussion

4.1 General Results and Factor Analysis

We used a 7-point Likert scale to record users' attitudes. The mean answers for each question ranged from 4.05 to 6.22, with standard deviations ranging from 0.93 to 1.73. Some questions yielded extremely high means with low standard deviations, which indicates that these items are considered very important across all participants. These questions concern the main or basic cell phone features, like the calendar, message status, the search-by-name function, the time of a missed call, and the number of missed calls. On the other hand, some items, like a sequential shooting camera, mobile television, a dual time zone function, and animation of power on/off, had low means and high standard deviations. This result shows that participants' preferences concerning accessory functions differ considerably, probably due to different backgrounds. The survey also reveals that participants agree that the quality of cell-phone content would influence their satisfaction (mean rating, M = 5.56), as well as their operation efficiency (M = 5.23). There was no agreement on "current content is enough" (M = 4.46; SD = 1.73), which indicates that for many current cell-phone models, but not all, the necessary contents are not all included.
The survey showed an acceptable overall internal consistency of 0.82. To uncover the hidden structure of the information content, maximum likelihood factor analysis with varimax and promax rotations was conducted. By examining the scree plot and eigenvalues, we found that 9 factors would explain 85.54% of the total variance. Under each factor, items with loadings lower than 0.50 were considered insignificant and eliminated. The factors were named according to the loading questions.
Factor 1 includes the content items "current input method", "the input 'pinyin' letters", "what content has been input", "search by name" and "search by initial", and is therefore named "Input and search". Factor 2 covers questions about "number of each function", "name of each function", "all options of each function on any menu", "scroll bar" and "cursor", which are all related to assistance with functions; Factor 2 is therefore named "Functions". Items under Factor 3 are all related to the indication of keys or functions, like indication of "back to previous menu", "confirm key" and "which keys are in use"; Factor 3 is therefore named "Operation". Factor 4 includes the three most widely used multimedia functions (digital camera, sequential shooting camera, video camera), and is named "Multimedia functions". Factor 5 covers the items "file size", "photo size", "file properties" and "storage"; it is named "Stored files" since all four items are related to cell phone storage space and stored file attributes. Questions loaded under Factor 6 are all about phone calls, like "missed call times", "time of a missed call" and "length of each call"; it is named "Phone calls". Factor 7 is named "Help and service" because the loaded questions are about how to get more information about the signal carrier and manufacturer, as well as help information for cell phone functions. Factor 8 covers a large range of questions, from reminder icons to an emergency key, and is named "Accessorial functions". Factor 9 is named "Messages" since it contains the two items "icon of message box status" and "icon of voice mail status".
Of the original 64 questions on cell-phone content, 27 items did not load on a factor; the questionnaire could therefore be simplified for future use. Of these nine factors, four (Factors 4, 6, 8 and 9) concern specific cell-phone functions. These factors and the items they cover can be applied to the design of cell-phone content. The other five factors are related to general functions and operation. These factors are universal and can be applied to the content design of most consumer electronics. For instance, Factor 5 can be used for devices that store files, like music players, digital cameras, PDAs and GPS devices; Factors 2, 3 and 7 need to be applied to every information appliance.
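The extraction pipeline just described (internal consistency, maximum likelihood factoring with rotation, and the 0.50 loading cut-off) can be sketched in Python. The factor_analyzer package is one option; the file and column layout below are assumptions for illustration, not details from the study.

import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical: 'items' holds the 64 content questions as numeric
# 7-point Likert columns, one row per usable respondent.
items = pd.read_csv("survey_items.csv")

# Cronbach's alpha computed directly (the paper reports 0.82 overall).
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")

# Maximum likelihood extraction with varimax rotation; nine factors,
# matching the scree plot and eigenvalue criteria described above.
fa = FactorAnalyzer(n_factors=9, rotation="varimax", method="ml")
fa.fit(items)

# Keep only items whose absolute loading reaches 0.50 on some factor.
loadings = pd.DataFrame(fa.loadings_, index=items.columns)
retained = loadings[loadings.abs().ge(0.50).any(axis=1)]
print(retained.round(2))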
4.2 Analysis of Users of Different Backgrounds

The survey included seven demographic questions to classify participants with different backgrounds. By checking the differences, we can give guidelines on whether different designs should be considered for different user groups. Duncan's multiple range test was used to compare all pairs of means, with an alpha level of 0.05 set for statistical significance. A difference in means of over 10% was considered a practically significant difference; items revealed to be practically significantly different are listed in Table 2.
For the comparison between females and males, 23 items show statistically significant differences. However, only one item, "Message of break out incident", shows a practical difference. Females tend to agree more than males on getting a
message when there is any break-out incident like an explosion, hurricane, or earthquake. This is probably because women are more averse to risk taking [12] and perceive greater danger than men [13].
The comparison of the three age groups showed many differences. Although most of the subjects were no more than 40 years of age, differences still exist among participants of different age ranges. Thirty questions show statistically significant differences between the age groups "under 23 years old", "23 to 29 years old" and "above 29 years old". Twelve questions show practically significant differences, and 11 of them show a decrease in mean as age increases (Fig. 1). These questions are all about whether a certain accessorial function is necessary, like an mp3 player, instant messenger, memo, etc. It can be concluded that older users do not want accessorial functions as much as younger users do.
Fig. 1. Mean response for different product features as a function of age (chart: mean 7-point responses for the age groups Under 23, 23-29 and Above 29 across features such as mp3 player, digital camera, sequential shooting, digital video, e-book, customization, online surfing, instant messenger, memo and marked-days reminders; most means decrease with age)
The education level of the participants varies from associate degree to Ph.D. degree. However, there are many more subjects at the undergraduate degree level than at the associate degree level. Therefore, we decided to compare only two groups, undergraduate degree level and graduate degree level. By checking the results we can see that there are 8 items showing statistical significance, 6 of which show practical significance. Similar to what we found in the age comparison, these 6 items are questions about whether a certain multimedia or accessorial function is necessary. The results suggest that participants with higher degrees pay less attention to these features. However, this tendency might also interact with the age factor, since people with graduate degrees tend to be older than those with undergraduate degrees.
The demographic table shows that the numbers of sales personnel, technicians and managers are not comparable to the number of students. Therefore, we decided to combine the participants with jobs and compare them with the students.
Table 2. Practically significant differences between subjects of different backgrounds

Gender:            Message of break out incident
Age:               Mp3 function; Digital camera function; Sequential shooting; Video camera; E-book function; Customization function; Mp3 function; Online surfing; Instant messenger; Indication of "back to previous menu"; Icon of "memo" status; Icon to remind marked days
Education:         Digital camera function; Video camera; Cell phone game; Customization function; Mp3 function; Instant messenger
Occupation:        Video camera; E-book function; Customization function; Online surfing; Instant messenger; Icon of "memo" status; Icon to remind marked days
Model experience:  Animation of power on/off
Years of use:      Video camera; Mp3 function; Indication of "confirm" key; Animation of power on/off
Usability opinion: Digital camera function; Number of each function
The results reveal that there are 9 items that show statistically significant differences, 7 of which are practically significant. As with the difference between undergraduate and graduate degree holders, the 7 items are all about accessorial functions. Non-student participants do not pay as much attention to these functions as students. This result might also interact with the age factor, since people with jobs tend to be older than students.
There is only one item showing a significant difference between participants with different cell phone model experience: participants who have used more cell phone models prefer less animation of power on/off than less experienced users. This trend does not, however, apply to participants with no experience at all. Compared to other effects like age, education level and job category, the effect of model experience is much weaker. In contrast, the number of years a participant has used cell phones shows more significant differences: of 11 statistically significant items, 4 are practically significantly different. Experienced users were prone to pay less attention to accessorial functions. This might also interact with the age factor, since people who have used cell phones longer tend to be older.
The comparison of participants who hold different opinions about cell phone usability shows six items as significantly different, two of which are slightly practically significant. Participants who consider usability "very important" or "median" do not think that the digital camera function is as important as do participants who consider usability "not important". But for "number of each function", the result is the opposite. This might be because "number of each function" is a way to support cell phone usability.
After finalizing the factor structure, we compared participants with different backgrounds on the nine factors. Results from a MANOVA showed that only cell phone model experience has no influence on any factor. The age effect and the usability effect cause the most differences. For all effects, the differences always involve the need for information about accessorial or multimedia functions. By checking the main effects of the demographic characteristics, we found that Age and Gender are the two major characteristics, showing significance on 27 and 20 items, respectively (p < 0.05). All the other characteristics have fewer than 7 significant items. Therefore, we can conclude that designers should make different models for users of different target age groups and target genders. In the current market, designs for different genders are more common than designs for different age groups. Older users complain that they cannot find cell phones that they can use [14].
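The MANOVA referred to above is also straightforward to reproduce in outline. A minimal sketch with statsmodels follows, assuming hypothetical column names (factor scores f1 through f9 and coded demographic variables); none of these names come from the paper.

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical file: nine factor scores per respondent plus demographics.
df = pd.read_csv("factor_scores.csv")

mv = MANOVA.from_formula(
    "f1 + f2 + f3 + f4 + f5 + f6 + f7 + f8 + f9 ~ C(age_group) + C(gender)",
    data=df,
)
# Prints Wilks' lambda, Pillai's trace, etc. per demographic effect.
print(mv.mv_test())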
5 Conclusions and Guidelines

There are basically four conclusions from the above analysis. First, content with higher quality will benefit customer satisfaction, and there is room for current cell phones to improve their content. Second, there are different content needs for users with different backgrounds, especially of different ages. Younger users, especially students, rely a lot on multimedia functions and content, while older users and working populations do not consider them important. Designers need to take this into consideration when designing cell phone content. For example, offering elderly and business users cell phones with stable, high-quality phone call functions, an easily accessed phone book and an easy-to-use input keypad is more important than providing the most advanced multimedia functions. Third, 27 questions in the original questionnaire did not load on any of the factors; the questionnaire can therefore be simplified for future use. Fourth, information about input and search shows its importance in how much variance it explains and in its factor mean. This factor is essential for the Chinese
population because it is more difficult to input Chinese using the small cell phone panel, even though text messaging is more widely used in China than in the U.S. Compared to existing studies of cell phone interfaces and cell phone usability, this study addresses the lack of consideration given to content and content structure [7, 8, 9, 10]. Compared to the study of Caus et al. [6], which discussed context-adaptive information for cell phones, this study provides a straightforward structure of the necessary information.
References

1. Proctor, R.W., Vu, K., Salvendy, G.: Content Preparation and Management for Web Design: Eliciting, Structuring, Searching, and Displaying Information. International Journal of Human-Computer Interaction 14, 25–92 (2002)
2. Liao, H., Proctor, R.W., Salvendy, G.: Chinese and U.S. Online Consumers' Preferences for Content of E-commerce Web Sites: a survey. Theoretical Issues in Ergonomics Science 10, 19–42 (2009)
3. Guo, Y., Salvendy, G.: Factor Structure of Content Preparation for E-business Web Sites. Behaviour and Information Technology (in press)
4. Savoy, A., Salvendy, G.: Foundations of Content Preparation for the Web. Theoretical Issues in Ergonomics Science 9, 501–521 (2008)
5. Venkatesh, V., Ramesh, V., Massey, A.P.: Understanding Usability in Mobile Commerce. Commun. ACM 46, 53–56 (2003)
6. Caus, T., Christmann, S., Hagenhoff, S.: Hydra – An Application Framework for the Development of Context-Aware Mobile Services. In: Business Information Systems, vol. 7, part 14, pp. 471–481. Springer, Heidelberg (2008)
7. Smith-Jackson, T.L., Nussbaum, M.A., Mooney, A.M.: Accessible Cell Phone Design: Development and Application of a Needs Analysis Framework. Disability and Rehabilitation 25, 549–560 (2003)
8. Kaikkonen, A., Kekäläinen, A., Cankar, M., Kallio, T., Kankainen, A.: Usability Testing of Mobile Applications: a Comparison between Laboratory and Field Testing. Journal of Usability Studies 1, 4–17 (2005)
9. Zhang, D., Adipat, B.: Challenges, Methodologies, and Issues in the Usability Testing of Mobile Applications. International Journal of Human-Computer Interaction 18, 293–308 (2005)
10. Ji, Y.G., Park, J.H., Lee, C., Yun, M.H.: A Usability Checklist for the Usability Evaluation of Mobile Phone User Interface. International Journal of Human-Computer Interaction 20, 207–231 (2006)
11. Miyauchi, K., Sugahara, T., Oda, H.: Relax or Study? A Qualitative User Study on the Usage of Mobile TV and Video. In: Changing Television Environments, pp. 128–132. Springer, Heidelberg (2008)
12. Byrnes, J., Miller, D., Schafer, W.: Gender Differences in Risk Taking: A Meta-Analysis. Psychological Bulletin 125, 367–383 (1999)
13. LaGrange, R., Ferraro, K.: Assessing Age and Gender Differences in Perceived Risk and Fear of Crime. Criminology 27, 697–720 (1989)
14. Guo, Y., Proctor, R.W., Salvendy, G.: Development and Validation of Axiomatic Evaluation Method (working paper)
"I Love My iPhone … But There Are Certain Things That 'Niggle' Me"

Anna Haywood and Gemma Boguslawski

Serco Usability Services, London, United Kingdom
anna.haywood@serco.com, gemma.boguslawski@serco.com
Abstract. Touchscreen technology is gaining sophistication, and the freedom offered by finger-based interaction has heralded a new phase in mobile phone evolution. The list of touchscreen mobiles is ever increasing as the appeal of ‘touch’ moves beyond the realms of the early adopter or fanboy, into the imagination of the general consumer. However, despite this increasing popularity, touchscreen cannot be considered a panacea. It is important to look beyond the promise of a more direct and intuitive interface, towards the day-to-day reality. Based on our independent research, this paper explores aspects of the touchscreen user experience, offering iPhone insights as examples, before presenting key best practice guidelines to help design and evaluate finger-activated touchscreen solutions for small screen devices.
solutions. Although focused rather than exhaustive, the guidelines aim to optimise the user experience by bringing qualities such as simplicity and ease of use, as well as consistency and responsiveness to the fore.
2 Perhaps 'Cool' But Not a Panacea

Within the touchscreen arena, Apple's iPhone is often the first device that springs to mind when people are asked to name a touchscreen mobile phone, despite manufacturers such as Samsung, Motorola and LG also being very strong contenders in the touch marketplace. Especially since the advent of Apple's 3G iPhone, interest in touchscreen mobiles has received a boost. A wealth of competitor products are hitting the market in order to ride the iPhone wave, each aiming towards the large screen size and aesthetic appeal of the iPhone, while looking to distinguish themselves sufficiently so as not to attract a 'me-too wannabe' label or be so iPhone-esque as to risk a costly legal battle.
Despite its increasing popularity and the promise of a more intuitive interface, touchscreen is not a panacea. While finger-activated touchscreens can arguably be considered a progression over stylus manipulation, views promoting touchscreen as a natural progression for mobile phones in general are on 'shaky ground'. All things considered, it cannot be seen as the 'cool solution' that waves goodbye to the usability issues typically associated with traditional, non-touch handsets. In addition, touchscreen devices can bring their own usability problems. At least for now, even Apple's iPhone, which is often heralded as a touchscreen success story, is by no means perfect. Indeed, tales of 'iPhone love' often have user experience issues in the subtext, upon further investigation.
3 Exploring the 'iPhone Experience'

3.1 The Transition to 'Touch'

The iPhone is frequently touted as being more intuitive than other mobile devices, not merely by virtue of its touchscreen interface, but also because it relies on one top-level menu and a single physical button. Additionally, the device is often considered to offer a good balance between flashy design and practical functionality. This 'balance' is often cited as adding to its emotional appeal, especially in the consumer rather than the business market.
While there is a degree of truth in the ability of novice users to adapt relatively quickly to its use, this cannot be held true across the interface. There are aspects of the iPhone's interface that still have a 'learning curve', requiring familiarisation and patience before an acceptable degree of performance is attained. For example, users sometimes struggle to discover and perform gesture-style interactions such as zooming.
Typically, 'mastery' and the overall user experience are measured in comparison to previous mobile phone use, including non-touchscreen devices. Although the transition to touch isn't always 'rosy', the iPhone is often heralded as revolutionary in instances where prior mobile use was constantly fraught with difficulties and building the mental model necessary to use the device was not easy. As an example from our research, after persistently struggling with non-touch devices over several
years (and multiple handsets), one 73-year-old respondent reported being a total iPhone devotee who regularly texts, downloads applications, and is addicted to playing games on her beloved iPhone.
The interaction paradigm offered by the iPhone is often considered to add to its intuitive nature. Here, rather than adopting a computer-based model of scrolling (like some competitors) where a scroll bar sits to the right of the screen, the iPhone's physical interaction model (i.e. scroll up to access content further down the page and vice versa) encourages users to freely scroll anywhere on the screen. This model allows users to focus on page content and maximises screen real estate. In our studies, many users indicated that dragging a list or page up or down felt very smooth and very much like interacting with a real, physical object. In particular, the ability to flick a list in order to scroll it with momentum was appreciated once mastered.

3.2 Touchscreen Responsiveness

Working to minimise the touch-response lag is imperative to the usability of touchscreen interfaces, as delays will frustrate and confuse users, encouraging repeated selection of target elements. Optimising responsiveness will dissuade users from pounding the screen and/or attempting to use their fingernail or a pen like a stylus.
When users' reactions to the responsiveness of the iPhone were probed, responses typically signalled a high degree of satisfaction. Responsiveness was thought to be extremely good, with a negligible amount of lag between selection and launch. Indeed, with its underlying capacitive technology, the iPhone was generally considered more responsive than competitor devices that relied on direct pressure (resistive): only the lightest touch was required. However, where novice users were concerned, the iPhone's high degree of sensitivity sometimes fostered niggling concerns about accidental interactions, for example, where overall finger size exceeded the target's dimensions or when hitting the target off-centre. Also, due to its responsiveness, an ongoing problem with the iPhone was that it sometimes confused navigation with selection if users scrolled too slowly across a webpage full of links. This issue was then compounded by the inability to stop a new page opening.

3.3 Screen Size Matters

In addition to the inherent novelty appeal, the 'no-button' design of touchscreen phones lends itself to a large screen size and the potential for a more sleek aesthetic design not 'burdened' by the need to accommodate physical buttons. When it comes to touchscreens, screen clarity and size matter. Large, good quality screens are considered essential to provide space for key elements, as well as affording comprehension of the elements presented. Users need to feel that icons and other screen elements are large enough to select without accidentally selecting adjacent items.
The hardware design of the iPhone is seen to bolster its emotional appeal. As noted by our respondents, the large 3.5-inch screen size, the clarity of the touchscreen, and its 'unfussy' single-button design positively combine, contributing to perceptions of the iPhone as a high-end phone. Indeed, such factors were cited as reasons why current iPhone users had chosen the iPhone over the competition in the first place. For some novice users, however, positive reactions to the large screen were sometimes pitted against concerns that the screen may be vulnerable to damage, which may
render the device inoperable. With its reliance on finger activation, the large iPhone screen was also viewed as a 'fingerprint trap', and novice users sometimes questioned whether the screen would depreciate in sensitivity, especially for scrolling activations, due to a build-up of dirt and grime. Also considered an issue for the iPhone's capacitive touchscreen was the requirement for users' fingers to be bare, since activation relies on electro-connectivity in the user's fingers. Where discussed, this was seen as a potential burden during winter months, as gloved users would find their fingers rendered useless unless fingerless or specialist gloves were worn.

3.4 Form Factor

Referring to issues such as size, weight and shape, the ergonomic aspects of both touch and non-touch handsets have a notable impact on the user experience. For practical reasons, the ideal mobile phone should not impose constraints on the user's clothing or accessories. The desired size and shape need to fit comfortably, not only in the hand, but also in pockets and/or the user's choice of bag. Accordingly, during our studies some participants wanted to place the iPhone in a pocket, to try it out for size. Typically, reactions highlighted that the handset's physical design achieved a good balance between being a suitable size and weight to be accommodated in bags or clothing with relative comfort, while still offering a screen size optimised for touchscreen interaction, especially when it comes to web browsing.
In terms of handling the device, there was a modicum of concern that the iPhone's overall form factor may be uncomfortable and potentially awkward to use for voice calls, particularly lengthy ones, especially if protected by an attached casing. Also mixed with positive reviews of the iPhone's 'sleek' hardware design was a degree of mourning that the days of wedging one's handset between shoulder and ear, in order to free up the hands, would be at an end with this handset. However, even where considered a little heavier than traditional handsets, the iPhone was largely considered 'weighty' in a positive way, with this being perceived as a mark of quality.

3.5 Navigation – The Importance of Simplicity and Consistency

Like their non-touchscreen siblings, touchscreen interfaces must aim for simplicity and consistency throughout, in order to minimise potential frustration and allow user expectations to be appropriately managed. If users have problems finding, selecting and using the most basic functionality, they will feel negative about the product. With a mobile phone, it is vitally important to support key functions such as answering or ending a call, creating and accessing text functionality (and email, if available), listening to music (and altering the volume), and accessing the internet. In terms of accessing functionality, the steps involved should be minimised by keeping access points at a high level. To support users' navigation, there also needs to be a clear and direct path to the Main Menu or 'Home' area. In this respect, the iPhone was typically praised. All applications are accessible from the home screen, creating a shallow menu structure that is practically impossible to get lost in, and the single hard key provides a constantly visible route home.
On the negative side, secondary functions of the Home key, such as the ability to double-press it to access Favourite Contacts and its role in exiting the menu customisation mode, were generally only discovered by accident or word of mouth.
The iPhone interface follows the Apple philosophy of achieving ease of use through simplicity, limiting the number of options and functions available to make menus as simple as possible. One notable example of a key function where performance was marred was instigating a call. Here, fuelled by anticipation of a dedicated call 'button', new users often overlooked the need to press the actual phone number once on the contact details page. Perhaps surprisingly, there were some key functions that even presented difficulties for existing iPhone users – e.g., setting an alarm and discovering the 'pinch' gesture to zoom.
In the pursuit of simplicity, it is noteworthy that several functions cited as important by mobile phone users are omitted from the iPhone. Here, our respondents commonly complained about the lack of an MMS facility, the inability to forward received messages, no communication concerning the number of characters remaining in an SMS (resulting in recipient frustration over multiple texts), the inability to cut and paste, and the lack of flexibility in displaying SMS messages (we have observed a love-hate relationship with the chat-style view). Functionality increasingly provided on mobile phones, such as a radio, a camera facility complete with flash and zoom abilities, and an (official) way to record video clips (using the built-in camera), was also missed. Accompanying this last point, several comments highlighted reluctance to 'tinker' with the handset in order to explore 'unofficial' solutions for core functionality, given the perceived high cost of the device.
Consistency is, largely, a key attribute of the iPhone interface. For example, once users learn to tap in a text entry field to access the virtual keyboard, this works in the same way across applications. Overall, users reported that elements for onward navigation could be distinguished with relative ease, despite some inconsistency in the interface being noted. Also, the consistent positioning of back buttons throughout the interface was welcomed. However, consistency does not guarantee good usability. Noteworthy here is that the meaning of the '+' button, which is widely used throughout the iPhone interface to, for example, add configuration set-ups or new notes pages, was not immediately visible or understood by all. For example, users often stumbled when setting an alarm: the 'edit' and/or '+' button was often overlooked, with users expecting to select the field of an existing alarm in order to edit the time. In addition, in some places users need to save their changes explicitly, whereas in other places alterations are saved automatically – e.g. while ringtone settings are automatically saved, setting the alarm requires users to select 'Save' to commit their settings. Where encountered, there was occasional uncertainty and confusion about whether or not the performed action had been accepted and confirmed. Even existing iPhone users were sometimes surprised that they needed to save their changes within certain areas of the interface and not others. There seems to be a movement towards automatic saving in mobile interfaces; however, to reinforce this model, it needs to be applied consistently across the interface.

3.6 Visual Design

As with non-touchscreen devices, it is important for users to readily understand, at a glance, any iconography presented, especially if it is not supplemented with a label descriptor. Where icons are relatively abstract or their visibility is reduced (through
either their visual design and/or a cluttered display), users will become frustrated if they continually struggle to locate target features. Considering the iconography on the iPhone's Home screen, the colourful array of default items, as well as downloaded applications, tended to attract positive reviews amongst our respondents, with icons largely regarded as depictive rather than abstract. Here, the size and relative spacing of the application icons, and the provision of supplementary labels (under each item), were considered to support both selection and an understanding of the functionality presented. In terms of its graphical look and feel, despite adopting a rather limited colour palette, the iPhone tended to attract praise, especially amongst Apple consumers, with a more 'jazzy, colourful look' only being requested by a minority.

3.7 The Virtual Keyboard

If devices exclusively rely on an on-screen keyboard, the aim should be to mirror the levels of speed and accuracy offered by traditional handsets as far as possible. Also, without a permanently presented physical keypad, clear access to the virtual keyboard is vitally important: users must not be left wondering how to enter text using the touchscreen. Additionally, it is important to ensure that users can readily change between different text input modes, to support the creation of messages that involve punctuation, numbers and special characters.
Writing and sending text messages and, increasingly, emails represents a common task for mobile users (for some, even more important than making or receiving calls), and it is one aspect where touchscreen mobile devices often come under fire, typically amid concerns that virtual keys are not adequately sized for accurate finger selection. In this regard, the iPhone is no exception. As our findings suggest, those who use their mobile phone extensively, especially for text entry (e.g. heavy texters or business users), may have a less smooth transition to touchscreen devices than more ad hoc or light text users. For this latter group, especially in instances where multi-tap text entry was considered a chore, there were indications that performance may even be enhanced, at least in the users' perception.
During our research, the iPhone's on-screen QWERTY keyboard was largely appreciated, as the layout (if not the experience) was familiar from using a computer keyboard. However, although the iPhone's keyboard fills approximately half the screen, the size of the keys tended to attract mixed reactions, and there were concerns over selection accuracy. Especially for novice users, the keys were often considered too small and there were worries that fingers would span more than one key, increasing input errors. Aiming to negate such concerns, the 'magnification bubble' of the selected key was popular, both aesthetically and in terms of supplying feedback, as users' fingers occluded their selection when using the keyboard. Although users often reported improvements in keyboard comfort as their familiarity with the on-screen keyboard increased, there was not widespread confidence that performance could ever match that exhibited on a physical keypad. Reports of being able to type more efficiently and with more accuracy on conventional non-touch devices abounded. In particular, those who used their mobile device heavily for email or text messaging perceived a deficit in their performance when using the iPhone's on-screen keyboard.
Indeed, both users of conventional numeric 12-key keyboards (multi-tap and predictive users) and users of hard-key QWERTY
keyboards reported frustration at a higher perceived level of entry errors, as well as a reduction in perceived speed, when using the iPhone as compared to prior experience with physical keypads. Also, for respondents who indicated proficiency at multitasking when texting on a traditional keypad (e.g. texting whilst watching TV or even while driving!), the need to always attend more closely to the virtual keypad was anticipated and considered a bind. Also, existing touchscreen users (iPhone and competitor) were seen to lament the loss of the ability to enter text one-handed without looking at the screen. Similarly, single-handedly balancing the device and taking a photo using the on-screen button was also found to be tricky if the desired composition was to be maintained. As frequently found with touch interfaces, the iPhone’s keyboard also came under fire for not being ‘thumb-friendly’, and those with long fingernails often experienced difficulties, especially when inputting text. In this latter case, users’ attempts to initiate their selections using their fingernail, much like a stylus, were thwarted. At least initially, many users attempted to carry over text entry techniques from physical keypads. For example, numeric keypad users often attempted a one-thumb approach to typing, while some hard-QWERTY users tried a two-thumb technique. However, after frequently mis-keying due to miscalculating which bit of their digit(s) hit the keyboard first and which character was being selected as a result, it was often reasoned that, until familiarity had grown, single index finger interaction was probably the most efficient method, given the width and spacing of the keys. Due to the above concerns, the ability to change the orientation of the device from portrait to landscape, where available, was very welcome. Some anticipated being able to access the landscape keyboard universally across the interface and expressed surprise when they realised that horizontal text entry was only available in the Safari browser. With the perception that an increase in horizontal width would improve both single-digit and thumb performance, this facility was often requested across applications. Interestingly, even where respondents were iPhone users themselves, none made use of, or reported awareness of, the fact that letters are only registered once the selected ‘key’ has been released. Additionally, when this strategy was prompted, participants commented that it didn’t feel natural to remove one’s finger from the screen in order to make a selection, and it was doubted whether adopting this strategy would improve performance. The technique was still seen to rely strongly on feedback via the ‘magnification bubble’, and it didn’t remove the need to divide one’s attention between the keying area and the characters accruing in the text field. When considering the iPhone’s text correction facilities, despite potential benefits being acknowledged, the auto-correct (and auto-complete) facility was initially considered to be part of the problem. Optimal use of it can take a little getting used to, as it requires users to divide their attention between what they are doing with their fingers (i.e. typing), what is being registered on the screen, and what is being suggested in the ‘pop-up fields’. Indeed, reports of sending text messages with ‘odd sentences’ in them, because a suggested word had surreptitiously entered the message, pepper our research.
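The release-to-commit behaviour described above can be made concrete with a small event model. The following sketch is our illustration of the technique, not Apple's implementation; the class and callback names are hypothetical. It shows why sliding off a key before lifting cancels a mis-aimed press:

```python
class VirtualKey:
    """One on-screen key that commits its character on touch-up, not touch-down."""

    def __init__(self, char, show_bubble, commit):
        self.char = char
        self.show_bubble = show_bubble   # e.g. draws the 'magnification bubble'
        self.commit = commit             # appends the character to the text field
        self.pressed = False

    def on_touch_down(self):
        # Finger lands on the key: only preview feedback is given, nothing is typed.
        self.pressed = True
        self.show_bubble(self.char)

    def on_touch_moved_away(self):
        # Sliding off the key cancels it -- the character was never committed.
        self.pressed = False

    def on_touch_up(self):
        # The character registers only when the finger is lifted while still
        # on the key, which lets users correct a mis-aimed press before it lands.
        if self.pressed:
            self.commit(self.char)
        self.pressed = False


# Minimal usage: type 'a', then start 'b' but slide away before lifting.
typed = []
key_a = VirtualKey('a', show_bubble=print, commit=typed.append)
key_a.on_touch_down(); key_a.on_touch_up()
key_b = VirtualKey('b', show_bubble=print, commit=typed.append)
key_b.on_touch_down(); key_b.on_touch_moved_away(); key_b.on_touch_up()
assert typed == ['a']   # the mis-aimed 'b' press was cancelled
```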
The iPhone’s ‘magnifying glass’ editing feature, which supports corrections earlier in a word or line, was not considered intuitive, even amongst some current iPhone users. Accordingly, when faced with the task of correcting input errors, unless users were already aware of this feature, many expressed frustration that, without an
obvious alternative, they needed to delete strings of correct characters in order to reach the place where editing was needed. Once this feature was acknowledged, however, it tended to be widely praised for both its actual function and the aesthetic appeal of the magnifier.

3.8 Missing the ‘Tactility’ of the Keyboard

Novice users and iPhone users alike lamented the loss of the natural haptic response of a physical keypad. In particular, comments highlighted concerns that locating specific ‘buttons’ on the hard, uniform touchscreen required users to stare at the on-screen keys while they typed, which took attention away from the text in the message field. This loss of tactility was cited as a factor contributing to deficits in both speed and accuracy, especially amongst novice users. Comments revealed that the undulating feel of a physical keypad, and the ‘click’ offered upon selection, supported selection without the need to focus undue amounts of attention on the keypad. Especially for numeric keypad users who regularly use predictive text, the need to make selections from a virtual QWERTY keyboard, without any haptic support to signal the relative positioning of the keys, was considered daunting by some. Whilst this discussion has focused on the keyboard, problems associated with a lack of tactile feedback extend to other direct selections. In the absence of a tactile response, the careful design and placement of visual feedback become more important. The problem with visual feedback on small screens is that fingers occlude parts of the screen (including the elements under selection). With this in mind, users welcomed that, in Safari, feedback appears at the top of the screen when a page is loading. Similarly, the magnification bubble that provides feedback as users explore the keyboard was often thought to be both ‘funky’ and informative. With touchscreen interfaces, there is obviously scope for attempting to replicate the haptic feedback offered by a physical keypad by introducing tactile sensations (potentially synchronized with sound). However, despite the argument that providing vibro-tactile effects corresponding to the user's exploration and selection will provide a more satisfying user experience, at this stage much seems to depend on the sophistication of the haptic technology. Indeed, our research suggests that offering a range of discrete sensations, such as the realistic feel of buttons depressing and releasing, in addition to the feel of a screen populated with icons and the sensation of an undulating keypad, may hold more appeal than a coarse ‘buzz sensation’ whenever a selection is made. Furthermore, regardless of the treatment offered, it is important that users are given the freedom to turn this facility on or off, as preferred.
4 Best Practice Guidelines

Based on our research in this arena, this paper concludes with a range of best practice guidelines for finger-operated touchscreen interfaces. Being focused rather than exhaustive in scope, the guidelines indicate factors that are important to consider when evaluating or designing touchscreen solutions for small-screen devices.
Table 1. Best practice guidelines for finger-touch interfaces

Screen size matters
• When it comes to touchscreens, screen clarity and size matter: large, good-quality screens are essential to provide space for key elements.
• As larger screens can foster concerns over vulnerability, the hardware design needs to support notions of robustness and quash any concerns over screen fragility.
Touchscreen responsiveness
• Aim towards high system responsiveness, as delays will frustrate and confuse users. Minimising response lag will dissuade users from pounding the keys to repeatedly select target elements, and/or using their fingernail or a pen like a stylus.
• To minimise keying errors, ensure that sensitivity and screen alignment (calibration) are optimised. Maximise sensitivity levels, uniformly, across all areas of the screen. Particularly where a scroll bar draws the users’ focus, sensitivity at the perimeter needs to be optimised.
Towards a tactile experience
• As the tactile experience offered by conventional keypads may have a positive effect on efficiency, error rates, and user satisfaction, consider options to support a more tactile user experience – e.g. tactile output for the identification of controls and/or vibro-tactile sensations in response to selections.
• Aim towards an array of discrete sensations, rather than just a coarse ‘buzz sensation’ upon selection.
• If provided, tactile feedback should be an optional rather than a default feature, with a means to easily switch between the two modes that is understood and clearly visible.
Navigation & efficiency of use
• If users have problems with the most basic functionality, then they will feel negative about the product. Support key functions such as answering or ending a call, instant messaging, listening to music, viewing messages, accessing the internet, etc.
• Minimise the steps needed to access or perform core functions by keeping access points at a high level.
• Allow clear and direct navigation to return Home and to the Main Menu. This is especially important where the device doesn’t offer a physical button dedicated to this.
• To reassure users and allow ease of navigation, ensure consistency throughout the interface.
• As users’ fingers may occlude parts of the screen (including selected items), carefully consider the design and placement of visual feedback. Ideally, feedback should appear above the item selected.
• Consider ways to ensure navigation and selection are easily discernible, so that users don’t accidentally make selections when they scroll.
• Allow actions to be readily reversible, so that if an error is made, it can be easily rectified.
• As appropriate, consider providing on-screen buttons that can be readily selected and hidden when not required, and ensure that the existence of (and access to) these buttons is understood.
• Although a help facility mustn’t be seen as the solution to a poor user interface, if feasible, consider options to provide an on-device Help system that is both easy to find and easy to use.
The virtual keypad
• Aim towards mirroring the levels of speed and accuracy offered by traditional handsets as far as possible.
• Without a permanently presented keypad, clear access to the virtual keyboard is vitally important. Consider using a consistent convention across the interface (e.g. tapping the field).
• Consider presenting a virtual QWERTY keyboard instead of a multi-tap configuration (where characters are shared on individual keys). Without the tactile cues familiar on a conventional keypad, a QWERTY layout may be easier to use than a multi-tap design – with the latter, lag and precision issues may come to the fore.
• Consider the option to offer the keyboard in a horizontal view, and allow this to be consistently available across the interface.
• Ensure users can change between different text input modes with ease. Options to enable and disable predictive text and switch between letter, number or symbol inputs must be clearly presented and quick to use, with any shortcuts being clearly understood.
• As users need to feel confident that selections will have the desired effect, ensure the selectable area (icon/button) is larger than the target or of an acceptable size. Remember people will want to reach for a stylus if things go wrong or if they don’t feel confident that their selection will be accurate.
• As well as being sized to accommodate finger input, keys and other screen elements need to be perceived by users as adequately sized for accurate selection. Explore ways to minimise concerns about finger size relative to key size. Maximise the perceived size of elements through visual design, ensuring that a good delineation of keypad elements is presented.
• Also, to minimise mis-selection, ensure that there is sufficient space between entries in a vertical list.
Icons & labeling
• Carefully consider iconography. Make use of familiar icons (and colour conventions) so users can associate with them.
• Consider colour icons that have detail to them, to make the most of graphical capabilities.
• Aim towards high contrast between discrete touch elements, text, and background colours. Also, to enhance visibility, controls and text should not be placed over an image or patterned background.
• Where icons are relatively abstract, users will become frustrated if they continually struggle to locate and use target features (e.g. without a physical key, ensure that the means to end a call is highly visible). While preserving a non-cluttered display, consider supplementing graphical symbols (such as icons) with labeling or other textual cues.
• To aid legibility on small screens, especially across lighting conditions, consider adopting a sans serif font for all text and labels.
• Labels and instructions should be short and simple, with abbreviations avoided if possible.
• Allow icons to be suitably sized and spaced, so they can be readily selected without worrying about accidental selection of nearby icons/screen elements.
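One way to operationalize the sizing guidelines above is to translate a physical target size into pixels for a given display density. The sketch below is illustrative only; the 9 mm target is an assumed value in the commonly cited 7-10 mm fingertip range, not a figure taken from this paper:

```python
import math

MM_PER_INCH = 25.4

def min_target_px(target_mm: float, screen_ppi: float) -> int:
    """Pixels needed so an on-screen element spans target_mm on this display."""
    return math.ceil(target_mm / MM_PER_INCH * screen_ppi)

# Example (assumed values): a 9 mm square target on a 163 ppi display
# needs roughly 58 px per side.
print(min_target_px(9.0, 163))   # -> 58
```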
Acceptance of Future Technologies Using Personal Data: A Focus Group with Young Internet Users
Fabian Hermann, Doris Janssen, Daniel Schipke, and Andreas Schuller
Fraunhofer Institute of Industrial Engineering, Nobelstr. 12, D-70569 Stuttgart, Germany
{fabian.hermann,doris.janssen,daniel.schipke,andreas.schuller}@iao.fraunhofer.de
Abstract. Future technologies in smart and social environments are expected to use personal data extensively. As young users of today’s social web platforms already take risks of privacy loss, the question of the acceptance of technology using personal data, and of its influencing factors, appears to be of strong relevance. We present results from a focus group with ten young internet users which indicate different attitudes towards privacy and different aspects of social influence on use decisions. Implications for technology acceptance theories are discussed. Keywords: Technology acceptance, smart environments, social web, privacy.
1 Introduction

Ubiquitous computing systems are described as complex systems that use situational and personal data, derive conclusions from them, and adapt the system UI and behavior partly autonomously (see e.g. IST Advisory Group, 2001, 2003). These functionalities rely on highly integrated data on the physical environment and situation, but also on the user’s location, preferences, interaction behavior, etc. In this respect, future ambient and mobile social systems bear similar or even higher risks than the currently discussed social web platforms: users risk a loss of privacy because of permanent storage of personal data, profiling, address trading by hosts, etc. (Hildebrandt, 2008). Nevertheless, social web media are broadly accepted in the markets (Universal McCann, 2008), and a hitherto unexpected frankness is spreading, in particular among younger users (The National Campaign, 2008). Against this background, factors influencing technology acceptance appear to be of high relevance.
2 Models of Technology Acceptance

A classical model of technology acceptance, the Technology Acceptance Model TAM (Davis, 1989), traces the intention to use a system back to two central variables:
• usefulness: the perceived or expected practical advantages of the system
• effort: the expected effort to use the system
Other variables were not included in this original model, as it was assumed that the influence of other important factors like individual abilities, tasks, system type, situational constraints etc. was mediated by the perceived usefulness and effort. In order to model these external variables explicitly, a new version of TAM, the Unified Theory of Acceptance and Use of Technology (UTAUT), was proposed (Venkatesh, Morris, Davis, & Davis, 2003). It assumes that system use is influenced by “facilitating conditions” like system accessibility, training support etc. According to UTAUT, the intention to use a system depends not only on the expected system “performance” and effort expectancy but also on social influence. Social influence measures the perception of social pressure to use a system. These models were mainly intended to predict system acceptance in organizational contexts and professional use. They were also applied and partly adapted to describe consumer decisions for private technology purchase and use (e.g. Carlsson et al., 2006; Kwon, 2000; van Biljon et al., 2007). While these studies worked on more classical technologies, the acceptance of emerging technologies using personal data was investigated by acceptance models of ubiquitous computing services. Beier, Rothensee, and Spiekermann (2006) used the following predictors for the acceptance of such technologies (together with the already introduced “usefulness”):
• Risks: e.g. loss of time or financial risks that a user perceives may result from system use
• Control: perceived controllability of system behavior by the user
Both variables were expected to influence the usage intention via the emotional attitude towards a system as a mediating variable. Spiekermann (2008) added another variable:
• Privacy: the necessity to provide private data and the user’s concerns about them being given away.
This variable was expected to have a negative impact on usage intention via the mediating variable “affective attitude”, i.e. the general emotional attitude towards the system. Taken together, acceptance research has found stable effects for usefulness as well as for practical issues like effort or expected risks. Newer results on future systems stress the influence of perceived control on usage intention. Interestingly, concerns about private data were hypothesized to have an impact, but this could not be shown to be significant (Spiekermann, 2008).
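The predictors reviewed above are often operationalized as a weighted linear model of usage intention. The sketch below only illustrates that structure; the weights and the inclusion of privacy as a direct (negative) predictor are our own assumptions, not estimates from any of the cited studies:

```python
def usage_intention(performance, effort, social_influence, privacy_concern,
                    weights=(0.4, -0.2, 0.3, -0.1)):
    """Toy linear operationalization of a UTAUT-style acceptance model.
    All predictors are assumed to be scaled to [0, 1]; the weights are
    illustrative, not empirical estimates."""
    w_p, w_e, w_s, w_pr = weights
    return (w_p * performance + w_e * effort
            + w_s * social_influence + w_pr * privacy_concern)

# A useful, low-effort system under strong peer pressure, despite privacy worries:
print(round(usage_intention(0.9, 0.2, 0.8, 0.7), 2))   # -> 0.49
```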
3 A Focus Group with Adolescents

To get a picture of young internet users’ privacy-related behavior, their use of internet platforms and applications, and their acceptance of technology, we conducted a focus group with young internet users. Further discussion topics, such as the participants’ ideas and wishes for future technology trends, yielded no relevant results and are therefore ignored here.
3.1 Procedure

Sample. Two sessions were carried out with altogether 10 participants. In one session, 6 participants aged 14 to 17 were invited. In a further session, 4 adolescents aged 17 to 19 participated. Participants were recruited through a chain email advertisement initially sent to employees of a university institute. The sample can be characterized as follows: 6 male, 4 female, aged 14-19, internet use on average since the age of 10, mobile phone use on average since the age of 8, average online time 3.7 hours per day. Participants used instant messengers like ICQ, MSN or Skype very frequently as their main online communication medium. They stated that they send 25 mobile short messages (SMS) and 16 emails per week on average. Stationary computers are mainly used for instant messaging, gaming, and music. School tasks also play an important role on the PC. Mobile phone games do not play an important role for any participant. This pattern of frequent use of online communication was quite homogeneous amongst participants. No participant used online media rarely.

Open Discussion. After an initial questionnaire on internet behavior and general communication patterns (like mobile phone use and instant messengers), a creativity method (the 6-3-5 method) was used to initiate the discussion. The following discussion of the generated ideas was moderated by one session leader with the goal of fostering a vivid, open exchange of thoughts. The topics included current use of technology, acceptance of new technologies and privacy behavior, and ideas and expectations about future life and its support through computers, the internet, artificial intelligence etc. In some cases, the moderator directly posed open questions on the topics of interest to direct the discussion and to encourage statements on issues like privacy or social pressure. Many issues were addressed repeatedly, while others were discussed only once, depending on the argument line of the open discussion. The analysis of the discussion was done by transcribing parts of a session video. A rater clustered discussion statements related to the issues of acceptance, privacy, and social influence. Statements are qualitatively interpreted in the following section.

Questionnaire with Open Items. The last step of the session was a questionnaire with several questions on particular issues of communication technology use. The following open items directly addressed the participants’ attitude towards data privacy and related behavior:
• How do you safeguard your personal data using the internet (in general, when using blogs, communities, chats)?
• Are you using pseudonyms?
• Are you feeling watched when surfing the net?
• Do you think it’s good if companies use your data (e.g. which sites you’re visiting) to give you personalized offers and advertisements?
• How important is the fact to you that no one knows which sites you visited? Why?
• If you are chatting or surfing the internet, is it important to you that no one can watch your monitor and see what you’re doing?
On the basis of the individual answers to these questions, a rater classified the participants into types of attitudes towards privacy. It was possible to synthesize categories that describe the main direction of the answers of each participant. For most participants, the answers to the different questions appeared to be quite
homogeneous. In many cases, participants referred to their own answers to previous questions. However, some answers of two users were inconsistent. The rater then decided to assign these users to a category based on their most prevalent answers.

3.2 Results: Different Attitudes towards Protection of Private Data

The following categories were derived to characterize the users’ privacy-related behavior and attitudes:
• Naive users aren’t aware of any problems regarding data protection. These users don’t think that anybody would be interested in their particular actions or personal data, so nobody would try to find out about them. A characteristic statement of one participant (14, male) was: “I don’t feel observed and nobody can see what I have made, because nobody knows my password”.
• Frank users don’t worry about privacy and are willing to let anyone know things about themselves. Typical statements here expressed that one has nothing to hide. For example, one participant (male, 18 years old) answered: “Many people can see what I’m doing [in the internet]. But I don’t care about it. I don’t have anything to hide.“
• Sensitized users are aware of the risks and potential problems arising from publishing private data, and are willing to live with them as best they can. They adopt strategies to protect private data or identity, for example by using different personas, trying to act anonymously, or avoiding tools they don’t trust. A typical statement here was given by a participant (15, male): “I usually use different nick names and change my identity.”
Figure 1 shows the distribution of the different user characteristics amongst the ten participants of our focus group.

3.3 Results: Social Influence on Technology Acceptance

During the open discussions, participants stressed the fact that not participating in communication technologies, in particular social internet platforms, would result in alienation from the peer group. One of the participants said that one would feel like a loner if one were the only one not using a certain technology. Another participant said that if you don’t share information in a social network, then no one “would like to chat with you via IM”. Another set of arguments addressed social facilitation: participants said that if everyone got used to a technology or interaction mode, public behavior would become acceptable even if it had appeared awkward before (like using speech commands or gesture interaction on a mobile phone in public). Participants also mentioned they would use communication media like instant messengers or social community platforms when interacting with younger people, whereas e-mail was used to send job applications or to communicate with older people. A variant of this argument appeared when the issue of embarrassing content, like party photos posted in social communities, was raised. One of the older participants said that a prospective employer searching for an applicant’s web information would wonder if there were only well-behaved pictures in the profile. Participants
discussed that people might reason about why someone has no party pictures: that they are being held back on purpose, or that he has no social contacts whatsoever.
Fig. 1. Frequency of users’ attitudes (absolute numbers)
4 Conclusions

The results of the focus group show two aspects that may have an impact on acceptance research:

4.1 User Attitudes on Privacy

We found indications that users have different attitudes towards privacy. According to the statements users made during the session, they have adopted different behavior styles, from naive and unaware use of profiles to the strategies sensitized users follow to protect private and identity information. The age distribution in our small sample supports the assumption that privacy issues become more important the older users get and the more knowledge and media competence they acquire. This suggests that privacy and other factors influencing acceptance may not only depend on features of the system and the user’s evaluation of them, but also on person characteristics such as knowledge about information use and risks, experiences with concrete consequences of publishing private data, etc. However, it does not seem plausible that interindividual differences in privacy concerns are stable. In particular for the participants in our focus group, we assume some of the different attitudes to represent different stages of knowledge about risks and possible consequences of risky behavior. More stable differences might be found when looking at a broader age range. The general idea of different attitudes towards technology and resulting user types of acceptance can also be found in the field of innovation adoption (Rogers, 2003), where different typical styles of adopting new technologies are characterized. This may have implications for the modeling of privacy perception in acceptance models: even if no significant influence of privacy concerns on acceptance as a subjective measure could be found (Spiekermann, 2008), there might be
an influence on use behavior and strategies. Also, subgroups of differently sensitized users who do care about technology risks might be found.

4.2 Peers Influencing Usage Decisions

Several statements of our focus group highlight the social influence on decisions to use or avoid technologies. Statements imply direct peer pressure from the adolescents’ friends and peers, as well as informal comparisons with the cohort of comparable age and social group, that seem to have an impact on personal decisions to use a technology. Communication media used in the age group of adolescents seem to be much more attractive than “old-fashioned” means of communication like e-mail. Generalized expectations by others and norms of technology use are taken into consideration when deciding on usage. They also seem to influence the acceptance of possible (known or unknown) risks, like giving up parts of one’s privacy. Although it did not occur in our focus group, we expect a further interesting aspect of social influence to have an impact on voluntary usage decisions: peers may serve as trusted behavioral models that facilitate purchase decisions by reducing the complexity of research on advantages and risks. An integrated model of user acceptance should cover the known influences like usefulness, risks and privacy, but also different types of social influence. The construct of social influence already investigated in UTAUT (Venkatesh et al., 2003) was found to be moderated by other variables, in particular usage experience (Li, Kishore, 2006). The concept of self-identity and related internalized norms (a well-established variable from the theory of reasoned action; Fishbein & Ajzen, 1975) was shown to have an impact on technology acceptance also for voluntary decisions, but its relation to other constructs remains open (Lee, Lee, & Lee, 2006). Considering the privacy risks of current social media and future technologies, the clarification of social influence on technology use and risky behavior is still of high importance, as is the investigation of related beliefs and attitudes among users and of the underlying social mechanisms.
References
1. Beier, G., Rothensee, M., Spiekermann, S.: Die Akzeptanz zukünftiger Ubiquitous Computing Anwendungen. In: Heinecke, A.M., Paul, H. (eds.) Mensch & Computer 2006, pp. 145–154. Oldenbourg Verlag, München (2006)
2. Carlsson, C., Carlsson, J., Hyvonen, K., Puhakainen, J., Walden, P.: Adoption of Mobile Devices/Services – Searching for Answers with the UTAUT. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences, vol. 6 (2006)
3. Davis, F.: Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly 13(3), 319–334 (1989)
4. Davis, F., Bagozzi, R., Warshaw, P.: User Acceptance of Computer Technology: A Comparison of Two Theoretical Models. Management Science 35(8), 982–1003 (1989)
5. IST Advisory Group: Scenarios for ambient intelligence in 2010. Final Report. European Commission (2001), ftp://ftp.cordis.lu/pub/ist/docs/istagscenarios2010.pdf (21.2.2009)
6. IST Advisory Group: Ambient Intelligence: from vision to reality. Draft Report (2003), ftp://ftp.cordis.lu/pub/ist/docs/istag-ist2003_draft_consolidated_report.pdf (21.2.2009)
7. Fishbein, M., Ajzen, I.: Belief, attitude, intention, and behavior: An introduction to theory and research. Addison-Wesley, Reading, MA (1975)
8. Hildebrandt, M.: Profiling and the Rule of Law. Identity in the Information Society Journal (2008)
9. Kwon, H.S.: A Test of the Technology Acceptance Model: The Case of Cellular Telephone Adoption. In: Proceedings of the 33rd Hawaii International Conference on System Sciences, vol. 1 (2000)
10. Lee, Y., Lee, J., Lee, Z.: Social influence on technology acceptance behavior: Self-identity theory perspective. SIGMIS Database 37(2-3), 60–75 (2006)
11. Rogers, E.M.: Diffusion of Innovations, 5th edn. Free Press, New York (2003)
12. Spiekermann, S.: User Control in Ubiquitous Computing: Design Alternatives and User Acceptance. Shaker Verlag, Aachen (2008)
13. The National Campaign: Sex and Tech: Results from a Survey of Teens and Young Adults (2008), http://www.thenationalcampaign.org/ (21.2.2009)
14. Universal McCann: Power to the People. Social Media Tracker Wave 3 (2008), http://www.universalmccann.com/Assets/wave_3_20080403093750.pdf (21.2.2009)
15. van Biljon, J., Kotzé, P.: Modelling the factors that influence mobile phone adoption. In: Proceedings of the 2007 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries (2007)
16. Venkatesh, V., Davis, F.: A Theoretical Extension of the Technology Acceptance Model: Four Longitudinal Field Studies. Management Science 46(2), 186–204 (2000)
17. Venkatesh, V., Morris, M.G., Davis, G., Davis, F.: User Acceptance of Information Technology: Toward a Unified View. MIS Quarterly 27(3), 425–478 (2003)
Analysis of Breakdowns in Menu-Based Interaction Based on Information Scent Model Yukio Horiguchi, Hiroaki Nakanishi, Tetsuo Sawaragi, and Yuji Kuroda Graduate School of Engineering, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan {horiguchi,nakanishi,sawaragi}@me.kyoto-u.ac.jp
Abstract. High communicability of a menu-based system rests on a consistent vision and clear policy in designing the system of menus, which should then be perceivable to the users. In this light, failures in menu-based interactions can be explained as emerging from a lack of information in the users’ available cues for identifying the design vision. This study focuses on communicative breakdowns in menu-based human-computer interactions from this perspective, and investigates their causes in ill-organized structures of the menu hierarchy in terms of the user’s interpretation of the menu items. Pirolli’s information scent model is extended and utilized as an analytical tool for describing the meaning system of menus from the users’ point of view, and their decision making in search of particular menu items is analyzed by use of information scent. Keywords: Menu-based interaction, information scent model, communicative breakdowns, human-computer interaction.
1 Introduction

The usability of a hierarchical menu system is characterized by the structure in which the menu items are organized as well as by the terminology with which they are written, and both of these characteristics should be designed in sensible, comprehensible and convenient forms relevant to the user’s task [3]. On the other hand, high communicability of a menu-based system rests, by definition, on a consistent vision and clear policy in designing such systems of menus, which should then be perceivable to the users. In this light, failures in menu-based interactions can be explained as emerging from a lack (or inconsistency) of information in the users’ available cues for identifying the designer’s vision. In this study, we focus on breakdowns in menu-based interactions from this perspective, and investigate their causes in ill-organized structures of the menu hierarchy in terms of the user’s interpretation of the menu items. Pirolli’s information scent model [4-6] is extended and utilized as an analytical tool for describing the meaning system of menus from the users’ point of view, and their decision making in search of particular menu items is analyzed by use of information scent. The scent measure, which can estimate the strength of each option to attract the user’s attention relevant to a particular goal, is applied to specify possible discrepancies between the designer’s intended usage and the user’s actual decisions.
2 Information Scent of Menu Relevant to User’s Goal

Two different activation patterns derived from one common spreading activation network are compared to measure the scent value of a menu item. One of the patterns represents the activities of concepts (to be precise, indexing words) induced by the user’s goal, whereas the other simulates the activities induced by the menu texts the user has encountered on the UI. The network of words was built from a text corpus, i.e., a large collection of documents, whose subject is to provide descriptions of the usage of the product’s functions. In this network, every directed arc has a weight derived from the conditional probability at which its source word would appear in a document containing its destination word, and each node has a base-level activation derived from the probability at which the corresponding word would appear in a document. To calculate these probabilities, we utilize the product’s instruction manual as the corpus, decomposed into documents in accordance with its functional units, because it provides sufficient statements about all the functions of the product in terms of both quality and quantity. A detailed description of this calculation is given in [7]. As shown in Fig. 1, the scent value of a menu item is calculated according to the following procedure:
1. Each word’s activation level induced by the user’s goal, i.e., L = (L_1, L_2, ...), is derived after all words’ activities in the user’s task description Q have spread in the network.
2. Each word’s activation level induced by the menu texts, i.e., R = (R_1, R_2, ...), is derived after all words’ activities in the target menu description C have spread in the network.
3. The scent value of the menu item C in relation to the task description Q is given by the inverse Euclidean distance between the two activity patterns, i.e., L and R.
As is clear from its definition, the more similar the two activity patterns in response to the different activation sources, the larger the menu’s information scent.
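The network construction just described (arc weights from conditional word probabilities, base-level activations from document probabilities) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; it assumes each functional unit of the manual has already been reduced to a set of indexing words, and the toy documents are invented:

```python
from collections import defaultdict

def build_network(documents):
    """documents: list of sets of indexing words, one set per functional
    unit of the instruction manual (the corpus used in the paper).
    Returns base-level activations and directed arc weights."""
    n_docs = len(documents)
    df = defaultdict(int)     # document frequency of each word
    co = defaultdict(int)     # co-occurrence counts for (src, dst) pairs
    for doc in documents:
        for w in doc:
            df[w] += 1
        for src in doc:
            for dst in doc:
                if src != dst:
                    co[(src, dst)] += 1
    # Base-level activation: probability that the word appears in a document.
    base = {w: df[w] / n_docs for w in df}
    # Arc weight src -> dst: P(src appears | document contains dst).
    weight = {(s, d): c / df[d] for (s, d), c in co.items()}
    return base, weight

# Invented toy corpus of three functional units:
docs = [{"timer", "recording", "program"},
        {"caption", "display", "digital"},
        {"timer", "program", "extension"}]
base, weight = build_network(docs)
print(base["timer"])                    # 2/3: 'timer' appears in 2 of 3 documents
print(weight[("recording", "timer")])   # 1/2: P(recording | timer)
```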
[Fig. 1 (diagram): the task description Q and the menu description C each trigger spreading activation over the common network of indexing terms trm1 ... trmN, yielding the activation patterns L and R, which are compared to give the information scent:]

g(C, Q) = \ln\left( \frac{1}{\sqrt{\sum_i (L_i - R_i)^2}} \right)

Fig. 1. Diagrammatic illustration of how the information scent of a menu item is calculated. Two different activation patterns derived from one common spreading activation network are compared for measuring the scent value of each menu item.
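Given the two activation patterns L and R, the scent measure itself reduces to a few lines. A minimal sketch of the formula above, with invented activation values; producing L and R by spreading activation over the word network is assumed to have happened already:

```python
import math

def information_scent(L, R):
    """g(C, Q) = ln(1 / Euclidean distance between the two activation
    patterns): the more similar the patterns, the larger the scent.
    (Identical patterns would need a zero-distance guard, omitted here.)"""
    dist = math.sqrt(sum((l - r) ** 2 for l, r in zip(L, R)))
    return math.log(1.0 / dist)

# Activation induced by the task description vs. by two candidate menus.
L = [0.8, 0.1, 0.4]          # goal-induced pattern (illustrative values)
R_close = [0.7, 0.2, 0.4]    # menu text semantically close to the goal
R_far = [0.1, 0.9, 0.0]      # menu text far from the goal
print(information_scent(L, R_close) > information_scent(L, R_far))  # True
```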
3 Breakdowns in Menu-Based Interaction

3.1 Experiment

A DVD recorder, one of the typical multifunctional electric appliances that have hierarchical menus, was employed as the target application system. Twelve female users from three different age groups (30’s, 40’s and 50’s; each age group contained four people) participated in the experiment. None of them had any experience with DVD recorders, whereas all had some with VCRs. Four different tasks, listed below, were prepared for this experiment; they are related to programming or configuring the recorder:
• Task 1: “Program the recorder for timer recording of a television show on which a particular on-screen talent will appear.”
• Task 2: “This recorder has a capability to display closed captions on the television screen for terrestrial and BS digital broadcasts. Configure the recorder to display the captions.”
• Task 3: “Configure the recorder for recording the second audio programs provided by multichannel broadcasting services.”
• Task 4: “This recorder has a capability to adjust timer recordings automatically to any airtime changes of the scheduled programs when some extension or delay of the prior programs has occurred. Configure the recorder to enable this function.”
The participants performed these tasks in order, from Task1 to Task4. The later tasks were expected to be more difficult for the users, because their goals are peripheral functions that are rarely used and are thus located in ‘out-of-the-way’ corners of the menu hierarchy. Each task was specified on a sheet of paper (the descriptions of all the tasks were given in Japanese), which was presented to the users immediately before a measurement session. After the experimenter had confirmed the user’s sufficient understanding of the task without the sheet, he gave her a cue to start operation. During each session, the users were not allowed to refer to the sheets. A base time limit of four minutes was set and used to judge the exit state of the participants’ performances.
[Figure 2: stacked bar chart (0-100%) of session outcomes per task (Task1-Task4), classified as Success, Give-up, Time-out and Mistake.]
Fig. 2. Summary of the participants’ performances. The results of individual sessions are classified into four different classes: “success”, “give-up”, “time-out” and “mistake”.
3.2 Results

Fig. 2 presents the summary of the participants’ performances, where the results of their individual sessions are classified into four different classes: “success” represents the state that the user successfully completed her task in time; “give-up” represents the state that the user gave up her task; “time-out” represents the state that the user was interrupted by the experimenter and made to abandon her operations because there seemed to be no chance for her to complete the task; and “mistake” represents the state that the user could not find the correct menus although she declared she had finished the task for herself. The result indicates that Task4 is the most difficult while Task1 is the easiest of all.

3.3 Analysis of Failures during Menu-Based Interaction

Low communicability of an interactive system can be evaluated by the numerous patterns of slips, mistakes and failures spotted during interaction between the user and the system.
The concept of communicative breakdown is used to capture instances of such problematic interactions [2]. A communicative breakdown appears during interaction between the user and the computerized system when the effects on the state of affairs induced by his/her operations do not coincide with what was meant to be the case. From this perspective, failures during menu-based interactions are analyzed here. After all measurement sessions, the experimenter interviewed every participant about the reasons for her menu selections while watching playback videos together. Using their answers and comments as reference, failures of interactions between the users and the menu system were associated with the categories of communicative breakdowns. In accordance with de Souza’s method [2], problematic portions of user-artifact interaction were tagged with one or more virtual “utterances” of the users corresponding to the categories of communicative breakdowns, such as
• “What's this?” — the user is unable to interpret what a certain interface element means,
• “Where is it?” — the user is not finding where his/her expected element is,
• “I can't do it this way.” — the user is abandoning a path of interaction composed of many steps,
and so on. Fig. 3 illustrates an example of the analyzed discourse between a participant user and the DVD recorder where the user was performing Task1. In this figure, tags of communicative breakdowns are represented in the dialogue balloons.
[Figure 3 reproduces a session transcript with columns for elapsed time, the user’s operation (selection of menu item), the behavior of the menu system (display of menu screen or icon), the communicative breakdown tags (e.g. “Where is it?”, “I can't do it this way.”, “What's this?”, “What now?”), and the participant’s comments, from session start through selections such as [BANGUMIHYO], [SAISEI NAVI], [KINOU-SENTAKU] and [SUBMENU] to reaching the goal via the [JINMEI-KENSAKU] screen.]
Fig. 3. An example of an analyzed discourse where the user was performing Task1 with the use of the DVD recorder. The dialogue balloons represent the tags of communicative breakdowns.
As clarified in Fig. 2, Task1 and Task4 differ considerably in their success rates. Fig. 4 compares these two tasks in terms of the frequencies of the communicative breakdowns. The bar chart indicates that breakdowns tagged with “I can’t do it this way.” occurred in Task4 more than twice as frequently as in Task1. This type of breakdown involves the user becoming aware of a need to reform her search strategy, since a series of her operations seemed incompatible with what the designer intended. Before it comes up in the user-system interaction, repetitions of “Where is
it?” were observed when the user did not find a certain expected element (i.e., “it”) among her selectable options. We can see significantly more utterances of “Where is it?” in Task4 than in Task1, which suggests that the design intent of the menu hierarchy was distant from the users’ assumptions for interpreting the interface signs. This hypothesis is also supported by the high frequency of “What's this?”, because it corresponds to the breakdown where the user is looking for any other cue about what a particular interface sign means. In addition, both of these breakdowns should induce another type of utterance, “What now?”. The latter indicates the situation where the user could not make sense of the interaction the designer intended and thus was temporarily clueless about what to do next.
[Figure 4: bar chart of breakdown frequencies (0-35) for Task1 vs. Task4, over the breakdown categories Ia-IIIb (tags such as “I give up.”, “Looks fine to me.”, “Where is it?”, “What happened?”, “What now?”, “Where am I?”, “Oops!”, “I can't do it this way.”, “What's this?”, “Help!”, “Why doesn't it?”, “I can do otherwise.”, “Thanks, but no, thanks.”), grouped into complete, temporary and partial failures.]
Fig. 4. Frequency distribution of communicative breakdowns. Task1 and Task4 are compared in terms of frequencies of breakdowns because they differ considerably in their success rates.
[Figure 5: histogram of selection frequency (0-35) against the rank order (1-7) of information scent.]
Fig. 5. Frequency distribution of the participant users’ menu selections with respect to the rank order of information scent. The users selected menu items of higher rank order more frequently.
Both the designer and the user have their own distinctive assumptions for generating or interpreting the interface signs. The above result from the communication analysis suggests that there is a large difference between them, especially for peripheral functions
like the goal of Task4. In order to visualize this difference, the information scent analysis is applied to the user’s interaction with the menu system in the next section.
4 Analysis of Breakdowns Based on Information Scent

The decision strategy of the participant users can be explained from the perspective of information scent. Fig. 5 shows the frequency distribution of the menu items that the users actually selected with respect to the rank order of information scent. The histogram illustrates that the higher the rank order of a menu item, the more frequently the users selected it. The scent distribution thus has the power to explain and predict the users’ menu-selection behaviors. On the basis of this finding, the organization of menus was analyzed through the scent distribution. Table 1 shows the scent of each menu in relation to the four different tasks, where an asterisk marks the highest value in each menu list; in the original table, the correct options the users should select to reach the goals were additionally underlined. Table 1(a) presents the scent distribution in the portal menu screen while Table 1(b) presents the distribution in the menu screen after MENU 7 is selected in this portal. These two tables show a significant tendency: the more successful tasks, like Task1, have more manifest scent in their correct paths. Conversely, in the less successful tasks, like Task4, menu options competing with the correct one have stronger scent toward the goals. Such menus are not compatible with the users’ decision strategy explained above: the analysis indicates they can easily misdirect and confuse the users’ search for the goal items by attracting more of their attention. There is a lack of information in the users’ available cues, and the users have difficulty identifying how they may or must interact with the system, i.e., the design vision. This menu hierarchy can be said not to have a well-organized structure, especially for the peripheral functions of the product.

Table 1. Scent distribution among menu items in two different menu screens. Scent values of individual menu items are listed with respect to each task.
(a) Portal menu screen

         MENU 1   MENU 2    MENU 3   MENU 4   MENU 5   MENU 6   MENU 7
Task1    1.168    2.704*    2.135    1.331    1.373    1.139    0.976
Task2    0.898    1.539*    1.530    1.260    1.304    0.839    1.292
Task3    1.100    1.535*    1.511    1.218    1.283    1.032    1.305
Task4    1.052    2.137*    2.106    1.478    1.564    1.062    1.218
(b) ‘MENU 7’ screen

         MENU 7-1   MENU 7-2   MENU 7-3   MENU 7-4   MENU 7-5   MENU 7-6
Task1    1.048      2.513*     0.983      1.179      1.078      1.175
Task2    0.825      1.066      1.099      1.671      2.036*     0.873
Task3    0.958      1.284      1.136      1.860*     1.695      1.069
Task4    0.951      1.528      1.074      1.797*     1.659      1.059
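Under the selection strategy suggested by Fig. 5 (users tend to pick the option with the highest scent), Table 1 predicts the first choice on each screen. The sketch below illustrates that prediction using values transcribed from Table 1(a); passing in a "correct" menu number is hypothetical, since the correct options were marked only typographically in the original table:

```python
PORTAL_SCENTS = {  # Table 1(a): scent of MENU 1..7 per task
    "Task1": [1.168, 2.704, 2.135, 1.331, 1.373, 1.139, 0.976],
    "Task4": [1.052, 2.137, 2.106, 1.478, 1.564, 1.062, 1.218],
}

def predicted_choice(scents):
    """1-based menu number a scent-following user is predicted to select first."""
    return max(range(len(scents)), key=scents.__getitem__) + 1

def misdirects(scents, correct_menu):
    """True if a competing option outranks the correct one, i.e. the menu
    is predicted to misdirect the user's search."""
    return predicted_choice(scents) != correct_menu

for task, scents in PORTAL_SCENTS.items():
    print(task, "-> predicted first choice: MENU", predicted_choice(scents))

# Hypothetical check: if Task4's correct portal option were MENU 7,
# the stronger-scented competitors would misdirect the search.
print(misdirects(PORTAL_SCENTS["Task4"], correct_menu=7))   # -> True
```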
5 Conclusion

This paper discussed breakdowns in menu-based interactions between users and computerized systems from the perspective of perceivable structures of menu systems. The information scent model was utilized to compare the meanings of menus from the users’ point of view and then to analyze the users’ decision making in search of particular menu items. The communicative breakdown analysis confirmed that there is a large difference between the designer and the users in their assumptions for signifying or interpreting the menus (i.e., menu items and their organization for listing), especially in the case of the product’s peripheral functions. On the other hand, the information scent analysis confirmed that the distribution of information scent in a menu list provides a powerful clue for predicting the user’s menu selection. This result supports the finding that the success rate of the users’ search will decrease if menu options competing with the correct one are designed to have stronger scents toward the goal. Menu designs not compatible with the users’ naturalistic decision making can easily misdirect their search. The latter analysis specified the discrepancy between the designer and the users which was suggested by the former analysis.
Acknowledgments This work has been partially supported by the Grant-in-Aid for Creative Scientific Research No.19GS0208 of the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We are grateful for their support.
References
1. Norman, D.A.: The Psychology of Everyday Things. Basic Books (1988)
2. de Souza, C.S.: The Semiotic Engineering of Human-Computer Interaction. MIT Press, Cambridge (2005)
3. Shneiderman, B.: Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd edn. Addison-Wesley Longman, Amsterdam (1998)
4. Pirolli, P., Card, S.K.: Information Foraging. Psychological Review 106, 643–675 (1999)
5. Pirolli, P.: The Use of Proximal Information Scent to Forage for Distal Content on the World Wide Web. In: Kirlik, A. (ed.) Adaptive Perspectives on Human-Technology Interaction: Methods and Models for Cognitive Engineering and Human-Computer Interaction, pp. 247–266. Oxford University Press, Oxford (2006)
6. Pirolli, P.: Information Foraging Theory: Adaptive Interaction with Information. Oxford University Press, Oxford (2007)
7. Horiguchi, Y., et al.: Analysis and Proposal of Hierarchical Menu Design from the Perspective of Communicative Breakdown. The Transactions of Human Interface Society 10(3), 21–34 (2008) (in Japanese)
E-Shopping Behavior and User-Web Interaction for Developing a Useful Green Website
Fei-Hui Huang 1, Ying-Lien Lee 2, and Sheue-Ling Hwang 3
1 Oriental Institute of Technology, Department of Marketing and Distribution Management, Pan-Chiao, Taipei County, Taiwan, R.O.C., 22061, Fn009@mail.oit.edu.tw
2 Chaoyang University of Technology, Department of Industrial Engineering and Management, Wufong Township, Taichung County, Taiwan, R.O.C., 41349, yinglienlee@gmail.com
3 National Tsing Hua University, Institute of Industrial Engineering and Engineering Management, Hsinchu, Taiwan, R.O.C., 30013, slhwang@ie.nthu.edu.tw
Abstract. In recent years there has been increasing attention to green issues, which have been addressed in various products and services as well. However, there is still no website to support green customers’ decision process in electronic commerce (EC). The aim of this study is to understand users’ EC needs and expectations in order to elicit the design requirements of a useful interface. A questionnaire and an experiment were conducted to assess users’ green knowledge and to observe users’ external behaviors when interacting with a computer while e-shopping. The study is centered on electronic green products, including computers, communication devices, and consumer electronics. The results are used to produce an online-shopping process flowchart and several suggestions for improving e-shopping. Suggestions concerning information search, information display, and web site features are addressed. Building on this, further research will focus on the design of web sites supplying consumers with green product information. Keywords: User-centered design, User-Web Interaction, Green product, E-commerce.
establishing an international image and reinforcing people's environmental values, e.g. conservation, preservation, a world of beauty. However, there has been a tremendous increase in the number of web sites, and most of them have been designed without respect to the user’s cognitive thinking, leading to a frustrating and disappointing experience when searching for information. In addition, rarely can one find a web site catering to the needs of consumers searching for green products. This study is designed to observe how users search for specified product information and then make a purchase decision online. This project is an initial phase focusing on obtaining consumers’ needs for green information to guide their buying behavior towards being more environmentally friendly. Therefore, consumers’ information seeking processes and needs have to be considered during the interface design. An experiment was developed to understand users’ needs and interaction with Web resources in order to elicit the design requirements of a useful interface for dealing with information overload and usability problems. Here, the study is centered on electronic products, including computers, communication devices and consumer electronics. The aims of this study are: (1) to acquire target consumers’ preferences, purchase intention, and acceptance level of electronic green products (EGP) by using a questionnaire; (2) to develop an experiment for studying user-web interaction and the information searching process based on e-purchase behavior; and (3) to provide several suggestions for developing an EGP Web site with user-centered interfaces in an EC environment, based on the results of the questionnaire and the e-purchase behavioral experiment.
2 Relevant Literature

Green information is very important for people in protecting our environment. The speed at which electronic technology is improving has shortened the life cycle of products, contributing to increased pollution. The internet has already become a powerful tool for people to search for information, evaluate alternatives, and make decisions before making purchases. Developing a web site for green products is not easy, but it is important. Effective website design is necessary for improved customer satisfaction and an enhanced consumer experience. Electronic commerce (EC) may exchange large amounts of product information between users and sites. Given the large amounts of information available at a site, user interaction with web sites requires considerable effort. Improving users’ operations in EC requires understanding their behavior online. The user’s external behavior is important to understand because it may correlate with their cognitive needs. What the user considers success can be seen from their behavior when interacting with the interface design. A user-centered interface design for web-based systems may assist the user in receiving the right information in the right way and in an acceptable time before making any purchases. It may also support user knowledge in a specific domain, minimize the cost of interaction, and reduce information load. An increasing number of consumers make purchases online; however, there is currently no model for EC purchasing decision-making. For traditional offline shopping there exist multiple proposed models; one of the most popular is the EBM model for the purchasing decision-making process. It was abstracted
from the EKB model. The consumer purchasing decision-making process can be divided into five stages: need recognition, information search, alternative evaluation, purchase, and after-purchase evaluation [3]. This study aims to develop a flowchart of the e-shopping process for users in the web environment based on the real shopping process, including the information seeking and decision making observed in e-shopping behavior. The focus is on the information search and alternative evaluation stages of the EBM model. Information is an important tool for growing public support for environmental issues and for developing environmentally responsible behavior in many ways. With the right kind of information it is possible to influence consumers’ value priorities, and to persuade them to change their priorities [4]. Information is accessible in various forms, and nowadays people find it easier to search for information online. The internet is searched both when a consumer’s objective is specific product or service information in anticipation of a purchase and when the objective is to obtain general information about a brand or a product or service category [1]. Two types of internet-based consumer information search behavior have been characterized along six dimensions [5]. Specific information search was characterized as being extrinsically motivated, having an instrumental orientation, reflecting situational involvement, seeking utilitarian benefits, consisting of directed search, and focusing on goal-directed choices. General information search was characterized as being intrinsically motivated, having a ritualized orientation, reflecting enduring involvement, seeking hedonic benefits, consisting of non-directed search, and focusing on navigational choices. The complexity of consumer information search behavior is inherent. In order to develop a user-centered web-based system, anticipated needs, requirements, and expectations from EC need to be identified. To build effective and efficient human-centered electronic information systems, developers need to ground systems in a comprehensive understanding of the information-foraging process in context [6], [7]. Collecting quantitative data on thoughts and feelings from user-web interactions, in addition to physical movements, is important for developing a user-centered web-based system. User-web interaction can be seen as (1) communication consisting of a series of transactions between the user and the web, and (2) information processing and problem-solving in which the user makes decisions based on the interpretation of information presented to him/her via an interface [8]. The store front for an EC transaction is the web site, and online retailers invest in its design improvements. According to the human-computer interaction (HCI) literature, the usability of a website has been a focus in determining its success or failure. In the standard ISO 9241 Part 11, usability is defined as ‘the extent to which a system can be used by specified users to achieve a specified goal with effectiveness, efficiency and satisfaction in a specified context of use’. Usability and HCI criteria are important in making the customer’s interaction with the website a satisfying one through the web site interface. An interface is the layer between the user and the system that facilitates human-computer communication [8]. This study researches e-shopping behaviors for improving user-web interaction via web site interface design.
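The five EBM stages can be written down as an explicit sequence, which also makes this study's scope easy to state: only the two middle stages are instrumented here. A minimal sketch of that framing (stage names from the text; the 'observed' set encodes this study's focus, not the EBM model itself):

```python
from enum import Enum

class EBMStage(Enum):
    NEED_RECOGNITION = 1
    INFORMATION_SEARCH = 2
    ALTERNATIVE_EVALUATION = 3
    PURCHASE = 4
    AFTER_PURCHASE_EVALUATION = 5

# Stages this study observes directly during the e-shopping sessions.
OBSERVED = {EBMStage.INFORMATION_SEARCH, EBMStage.ALTERNATIVE_EVALUATION}

for stage in EBMStage:
    marker = "observed" if stage in OBSERVED else "out of scope"
    print(f"{stage.value}. {stage.name.title().replace('_', ' ')} ({marker})")
```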
3 Method

According to the latest report from the Taiwan Network Information Center (TWNIC, http://www.twnic.net.tw/), as of January 31, 2008, people aged 12 to 35 accounted for about 90% of internet usage; in particular, the percentage of internet users in the 16-20 age group rises to 96.95%. The major internet user group in Taiwan is thus young adults, especially those aged 16 to 20. This age group has the highest online presence and will be the next generation of online consumers. Here, an initial questionnaire and an experiment were conducted to investigate user needs, captured through consumers' external and mental patterns, for application to interface design.

3.1 Collection and Analysis of Questionnaire Data

To elicit responses from consumers in Taiwan, a questionnaire was designed with survey questions concerning the preference for, and purchase intention toward, green products. Most questions provided multiple-choice items and allowed multiple answers. A total of 291 questionnaires (a 97% response rate) were retrieved from active web-using students, male (n = 197) and female (n = 94), aged 18 to 21 at the Oriental Institute of Technology (OIT) in Taiwan, and analyzed using descriptive statistics. The results indicated that 73.5% of students had some form of knowledge about electric green products, but only 1.3% had bought related goods. Among students willing to buy green products, 60.9% would choose computers, 45.7% electric appliances, and 43.6% communication devices. The reasons for buying green products were protecting the environment (78%), marketing (40.1%), and friends' recommendations (31.4%). The reasons for not purchasing green products were having no idea about the products (49.3%), price (49%), and the limited selection of products (38.6%).

3.2 Experiment

The subsequent experiment was designed based on the results of the initial questionnaire and was conducted to detect users' external behaviors when interacting with web sites during e-shopping and to collect users' mental patterns through experimental questionnaires.

Participants. Forty undergraduate students at OIT in Taiwan, 20 males and 20 females aged 18 to 21 years, were paid to participate in the experiment. All had online shopping experience, averaging 30.83 (SD 20.4) months.

Apparatus. A computer with internet access was provided for online information search and shopping. Input devices were a keyboard and a mouse; the output device was a 15-inch liquid crystal display (LCD) screen.

Procedure. The experimenter introduced the task: searching for laptop computer product information on the Web and deciding which one to buy. All participants used the same computer and were given the same task. After filling out a pre-experiment questionnaire, each participant was instructed to find the item that they would most
likely purchase online, in any way that they preferred. During the experiment, the participant was videotaped so that the user-web interaction, the e-shopping process, and the time spent shopping could be analyzed from the online behaviors. After completing the task of making the purchase decision, each participant filled out a post-experiment questionnaire.

Measurements. The measurements comprise the pre-experiment questionnaire, user-web interaction, online shopping process, and post-experiment questionnaire.

Pre-experiment questionnaire. Nine questions, completed before the start of the experiment, captured participants' background, online experience, and the features they anticipated when buying a laptop.

User-web interaction. This was analyzed by objective measures. A quantitative measure based on simple frequency counts was developed to describe the nature of the interaction between the user and the World Wide Web. The interactions were captured by observers at an intermediate level of detail that incorporates both behavioral and quantitative aspects of the interaction. During each experimental run, the observer watched the interactions in real time and used a specially designed form to record the source, the recipient, and the type of each interaction between the user and the web sites. After the experiment, the recorded interactions were double-checked against the video tapes.

Online shopping process. This is presented as a simple flowchart constructed from the participants' e-shopping behavior. The flowchart is intended to help researchers understand users' cognitive styles and habitual behavior in dealing with the current Web environment.

Post-experiment questionnaire. Fifteen open-ended questions gathered information such as what kind of laptop the participant wanted, where they would buy it, and why, and elicited their intentions, experiences, decision making, information load, and suggestions for the online shopping process.
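For illustration, the (source, recipient, type) coding scheme described above reduces to simple frequency counts. A minimal Python sketch follows; the event labels and log entries are hypothetical, not the study's actual codes:

from collections import Counter

# Hypothetical event log in the (source, recipient, type) scheme described
# above; the labels and entries are illustrative, not the study's actual codes.
events = [
    ("user", "web", "keyword_search"),
    ("web", "user", "results_page"),
    ("user", "web", "link_click"),
    ("user", "web", "keyword_search"),
]

# Simple frequency counts per interaction type, as tallied on the observer form.
counts = Counter(etype for _source, _recipient, etype in events)
for etype, n in counts.most_common():
    print(f"{etype}: {n}")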
4 Results

4.1 Pre-experiment Questionnaire

In this experiment, the participants' average experience with World Wide Web resources was 102.45 (SD 19.85) months, and their experience with online purchasing was 30.83 (SD 20.35) months. About 55% of the participants had previously bought clothes online, 43% accessories, and 40% books. Before searching for information online, the anticipated laptop features were as follows: 63% of the participants valued tech specs, 58% design, 55% usefulness, and 53% size and weight.

4.2 User-Web Interaction

The results are shown in Table 1. The participants visited a mean of 34.63 web pages and 9.13 (SD 8.05) web sites, using keyword, hierarchical, or other search functions a mean of 20.85 (SD 13.59) times
in 71.81 (SD 22.32) minutes. This demonstrates that the users visited many web sites and web pages before having sufficient information to make their decisions. The user-web interaction data were analyzed using interaction ratios. The search anticipation ratios for females (1.38), for males (1.67), and for all participants (1.5) are larger than 1.0, indicating that users employing the search functions were able to locate more relevant than non-relevant information; a ratio below 1 would mean that users employing the search functions were less likely to find relevant information. The information ratio is 1:0.75, meaning that text and image information are both important to users.

Table 1. Summary of results for user-web interaction
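The interaction ratios above are simple quotients of the frequency counts. A minimal Python sketch, assuming the search anticipation ratio is operationalized as relevant versus non-relevant information located through the search functions (the raw counts below are illustrative, not the study's data):

# Hypothetical tallies from the observation forms; the paper reports only the
# resulting ratios, so the raw counts below are illustrative.
search_outcomes = {
    "female": {"relevant": 11, "irrelevant": 8},   # ratio ~ 1.38
    "male":   {"relevant": 15, "irrelevant": 9},   # ratio ~ 1.67
}

def anticipation_ratio(counts):
    # Ratio > 1: the search functions located more relevant than
    # non-relevant information; ratio < 1 means the opposite.
    return counts["relevant"] / counts["irrelevant"]

for group, counts in search_outcomes.items():
    print(f"{group}: search anticipation ratio = {anticipation_ratio(counts):.2f}")

# Information ratio of text to image information encountered (paper: 1 : 0.75).
text_items, image_items = 100, 75
print(f"information ratio = 1 : {image_items / text_items:.2f}")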
4.3 Online Shopping Process

The online shopping process observed in the experiment is summarized as a flowchart (Figure 1). About 85% of participants used Yahoo.com as their web portal and used the keyword, hierarchical, or other search functions to obtain the desired information based on their preferences for a laptop computer, including tech specs, design, usefulness, size and weight, and/or price. They then narrowed down the choices of which to buy, at which point they needed comparative information on tech specs, price, consumers' ratings, and/or user reviews to aid their decision making. Here one can see that the consumer has to decide whether or not to make a purchase. If the consumer does not decide to purchase, they repeat the search process, across the mean of 9.13 (SD 8.05) web sites, until a suitable product is chosen for purchase. A few participants indicated that they would want to see the real product in an offline store before deciding where to make their purchase.

4.4 Post-experiment Questionnaire

The results revealed that about 70% of participants were satisfied with the online shopping process, owing to the large amount of information (40%), the ease of obtaining and understanding information (40%), and convenience (12%). The remaining 30% of participants were unsatisfied with the online shopping process because of the difficulty of obtaining more specific information (69%), insufficient information (15%), and information overload (15%). During the experiment, every
participant occasionally came across unnecessary information or visited the wrong web pages or web sites: 21% of users reached unrelated information, 29% of users using the hierarchical search did not get the right information because the classification model did not match their mental model, and 29% of users could not find the wanted information online with the resources available to them. Users would buy the laptop computer online for the following reasons: 53% valued a high seller reputation, 50% valued the delivery service, and 25% valued a better price. Keeping track of information that has been retrieved can be difficult. The two main methods are searching the browser's history and keeping multiple browser windows or tabs open; however, 78% of users still had difficulty relocating information that they had previously found. With so much information to process, users may face information overload: 95% of participants felt overloaded with information. After the experiment, the participants offered the following suggestions to improve the online shopping experience: 30% preferred more online security, 30% a well-designed interface, 28% real and relevant information, 18% concise information, 15% improved web site speed, and 13% product placement advertising to introduce new products. In addition, most of the users agreed that the ability to do a side-by-side comparison of products from different vendors on a single website would make their purchase decisions easier and quicker.
Fig. 1. Online shopping process flowchart: start at www.oit.edu.tw → search information (e.g., keyword, hierarchical, other functions) → learn more laptop product information from websites/web pages → decide possible candidates for purchase → compare information (e.g., tech specs, price, consumer ratings, user reviews) → select one to purchase? If no, return to searching; if yes, select a store to make the purchase.
5 Discussion

The following discussion topics follow from the results: information search, information display, and web site features.

• Information search: Users search for information according to product preferences or requirements drawn from their online experience and product knowledge (see Figure 1), via keyword or hierarchical search, which plays an important role as the first step in finding the right information. The way the search behaves has to match the user's mental model or cognitive style in order to avoid unnecessary information overload from visiting many irrelevant web pages; interaction performance then improves because fewer search operations and fewer web pages are needed to reach the right information.

• Information display: This also plays an important role in giving people the right knowledge about green products (see Figure 1). The availability of green information may shift their shopping decisions from traditional to green products. Given green information, 78% of those surveyed in the questionnaire would buy green products to protect the environment; however, 49.3% of respondents had no prior exposure to real green information. Green information from marketing (40%) and recommendations from others (31%) are very important in this case, and the availability of green information on websites would strengthen people's environmental values. The laptop information that online users request most is: tech specs, design, reviews (negative and positive), computer accessories, comparison information (price, consumers' ratings, etc.), new product information, and clear and definite information. Information should be displayed using text and images in a ratio of 1:0.75, with comparison information integrated into a single table for easier reading and comprehension.

• Web site features: These play an important role in attracting visitors, including (1) site reputation and the services provided; (2) real information via social networks that allow users to share their reviews; (3) an automatic record tool that records important information for the user, including past searches, and provides recommendations; and (4) a comparison tool that compares specific criteria, e.g., price or tech specs, for specific items.
6 Conclusion

Provided with more green information, consumers would be more willing to purchase green products. Consumers have grown accustomed to using the internet to meet their information needs. This study investigated consumers' online shopping behavior to identify users' needs and expectations of a web-based system, with the aim of improving users' ability to gather information during online shopping. For user-centered design of web-based systems, user e-shopping behavior was investigated, producing the online shopping process flowchart and several suggestions for improving e-shopping concerning information search, information display, and web site features. The results of this study will be applied to the design of future web sites that focus on supplying consumers with green product information.
Acknowledgments. The authors would like to express their gratitude to the National Science Council of Taiwan for funding under grant number NSC-97-2221-E-324-018-MY3.
References
1. Peterson, R.A., Merino, M.C.: Consumer information search behavior and the internet. Psychology & Marketing 20(2), 99–121 (2003)
2. Peterson, R.A., Balasubramanian, S., Bronnenberg, B.J.: Exploring the implications of the internet for consumer marketing. Journal of the Academy of Marketing Science 25, 329–346 (1997)
3. Engel, J.F., Blackwell, R.D., Miniard, P.W.: Consumer Behaviour, 8th edn. Dryden Press, Fort Worth (1995)
4. Ball-Rokeach, S.J., Rokeach, M., Grube, J.W.: The great American values test: influencing behaviour and belief through television. Free Press, New York (1984)
5. Hoffman, D.L., Novak, T.P.: Marketing in hypermedia computer-mediated environments: Conceptual foundations. Journal of Marketing 60, 50–68 (1996)
6. Garg-Janardan, C., Salvendy, G.: The contribution of cognitive engineering to the effective design and use of information systems. Inform. Services Use 6(5/6), 235–252 (1986)
7. Levy, D.M., Marshall, C.C.: Going digital: A look at assumptions underlying digital libraries. Commun. ACM, 77–84 (1995)
8. Wang, P., Hawk, W.B., Tenopir, C.: Users' interaction with World Wide Web resources: an exploratory study using a holistic approach. Information Processing and Management 36, 229–251 (2000)
Interaction Comparison among Media Internet Genre
Sang Hee Kweon, Eun Joung Cho, and Ae Jin Cho
The Department of Mass Communication and Journalism, 53 Myeongnyun-dong 3-ga, Jongno-gu, Seoul, 110-745, Korea
skweon@skku.edu, putyourhope@gmail.com, holymars@nate.com
Abstract. This research explores the interactivity dimensions of portal media (such as Yahoo, Naver, Daum, Paran, and Nate)1. It is designed to measure users' perception of interactivity in portal sites at three levels: 1) media, 2) contents, and 3) perception of HCI and CMC. The research also examines the relationships among these variables through SEM (structural equation modeling). Data from 587 respondents were collected and analyzed to test the hypotheses. The results show that media-side interactivity affected content-side interactivity, and content-side interactivity in turn affected users' perception of the portal as either HCI or CMC media.

Keywords: HCI, CMC, Interactivity, Communication, Community, Hypertext, Interface.
1 Introduction

The future of the media is changing from fixed forms to open forms. Nicholas Negroponte asserted that "media forms of the future won't be much like the ones in existence today." Portals are constantly evolving from an initial stage to multiple stages. Users' participation increases as they pull the services they want, while portal services increasingly push their content as media competition intensifies. Interaction functions are thus a crucial factor in portals, for both the user and the portal company. How, then, do users perceive the 'interactivity' dimension? First of all, the definition of interactivity varies from scholar to scholar. The basic interactivity dimensions are communication-based interaction, technical interaction, psychological liminal interaction, and mechanical transitive interaction.
2 The Concept and Measurement of 'Interactivity'

There are several different types of interactivity, from functional aspects to communication aspects. Most interactivity studies focus on a single dimension rather than multiple dimensions such as social interactivity, psychological interactivity, and the message, media, and user levels.
1 In Korea, there are two types of portals: foreign portals such as Yahoo and Google, and Korea-based portals such as Naver, Daum, Paran, and Nate.
Moreover, there is difficulty in the blurring of lines between communicator and audience, message and media, and genres. Digital media are converging, from technology convergence to media convergence to user convergence; at the same time, interactivity is converging, from technological interactivity to content interactivity. Therefore, this research defines the concept of interactivity along three dimensions: 1) the technological dimension (hypertext and interface), 2) the content or message dimension (personal communication, community work, news or information), and 3) the perception of media characteristics (HCI and CMC).

2.1 Media (Technical) Interactivity

There are two factors in a portal's technical interactivity: one is hypertext and the other is the interface. Hypertext most often refers to text on a computer that leads the user to other, related information on demand. Hypertext represents a relatively recent innovation in user interfaces, which overcomes some of the limitations of written text. Rather than remaining static like traditional text, hypertext makes possible a dynamic organization of information through links and connections. The interface (or human-machine interface) is the aggregate of means by which people (the users) interact with the system (a particular machine, device, computer program, or other complex tool). The user interface provides the means for the various interactivities. Deuze [13] argues that hypertextuality, interactivity, and multimediality determine the 'added value' of online journalism, which he names the 'fourth' kind of journalism, differing in its characteristics from traditional types of journalism.

Content aspect of interactivity. Many studies confirm that interactivity arises from content selection among various menus. Pavlik (1997) described the evolution of online content in three stages. The first stage involves 'repurposing print content' for the online edition. In stage two, content is rearranged with interactive features, such as hyperlinks, new interfaces, and search engines. In stage three, the news providers create original news content for their own media sites, designed specifically for the new medium. This type of content involves both new forms of storytelling and increased levels of interactivity, from simple to complex.

Communication. There are several types of communication interactivity in the portal, such as e-mail, instant messaging, and chatting, covering not only synchronous but also asynchronous communication. According to David Fortin (1997), interactivity is 'the degree to which a communication system can allow one or more end users to communicate alternatively as senders or receivers with one or many other users or communication devices, either in real time (as in video teleconferencing) or on a store-and-forward basis (as with electronic mail), or to seek and gain access to information on an on-demand basis where the content, timing and sequence of the communication is under control of the end user, as opposed to a broadcast basis.' Jens Jensen (1998) defined interactivity as "a measure of a medium's potential ability to let the user exert an amount of influence on the content
and/or form of the mediated communication", while Ha and James described interactivity as "the extent to which the communicator and the audience respond to each other's communication need."

News and information seeking interactivity. A portal's main functions are information provision, news services, and various content services. There are many ways to perceive this interactivity, including online journalism, searching, learning, and various purchase-oriented information seeking activities. These portal navigation activities constitute content-related interactivity. Previous research shows that different media usage determines message selection activities: Table 1 indicates that newspaper users select more political, international, and economic news, while portal news users interact more with soft news such as health or lifestyle, information or communication, sports, and entertainment.

Community and group activity interaction. Portals host various cyber community activities, such as Cyworld, MySpace, Second Life, and UCC communities. Users conduct many types of social interaction there and perceive corresponding patterns of interactivity. Stromer-Galley [44] argues that the term refers to two distinct phenomena: interactivity between people, and interactivity between people and computers or networks.
Table 1. Interactivity Classifications

Dimension | Type of Interaction | Content | Prospect in Digital Era
Technical interaction | Playfulness / functional | Remote control, channel flip, control screen, computer control panel, technical procedure | Traditional mass media; digital media evolve toward more interactivity, such as IPTV, DMB, and DTV
Communication with content | Information collection | Text selection, news, information seeking | Both mass media and digital media; portal media is the most
Communication with content | Reciprocal | Reply to the mass media contents | ARS, reply, text participation, two-way communication
Inter-personal communication | Producer-user, user-user communication | Interpersonal communication: e-mail, IM, SMS, blog | From face-to-face to chatting, IM, etc.
Inter-personal connectedness / communication network | Social association, community | Cyber-community | Digital cyber-community

Fig. 1. Perception of Interactivities

Fig. 2. The research model
HCI vs. CMC. The portal has two media-characteristic aspects: one is human-computer interaction and the other is computer-mediated communication. HCI is related to the computer-as-source model, whereas CMC is related to the computer-as-media model. CMC (computer-mediated communication) is communication interactivity, while HCI is information and news selection interactivity.
3 Research Questions

3.1 Research Question

The research questions are constructed to measure perception of interactivity in the portal sites.

[Research Question 1] Do the media characteristics (technological elements of interactivity), including hypertext and interface, positively affect users' perception of portal interactivity?

[Research Question 2] Are the characteristics of content or message interactivity positively related to users' perception of interactivity in the portal sites?

[Research Question 3] How does users' perception of interactivity in the portal define its media characteristics, in terms of the HCI or CMC aspect?

3.2 Research Method

Research method and subjects. This research is designed to gauge the level of interactivity in portal sites using questionnaires constructed from previous studies. The literature review yielded the research variables (media variables, contents variables, and perception variables). Using the survey method, the research measured users' perception of interactivity at the three levels. The survey subjects were mostly college students, because they use portal media on a daily basis; they therefore perceive the various portal sites' interactivities, from technical features to messages, and can report their recognition of portal interactivity.

Operational definition of variables. To measure the interactivities, the variables were given measurable definitions. Hypertext is webpage linkage; the interface is the contact point and its usability. The contents variables are communication, community, and information. Users' perception of interactivity is classified into two sides, HCI and CMC: HCI treats the portal as a news or information source model, whereas CMC treats the portal as a media model.

Questionnaires. The questionnaires cover several dimensions: 1) media usage time and years, 2) media factors, 3) contents factors, and 4) user perception factors. This study adapted previously used questionnaires to measure portal interactivity (McMillan & Hwang, 2002), [52]. To measure the dimensions of portal interactivity, the researchers collected data with the constructed questionnaires using the survey method. The total sample was 587 respondents, with the following demographics. Gender: female (50.5%) and male (49.1%). The majority were aged between 20 and 24 years old (46.1%); more than 25 and under 29 (00%); not applicable (0.3%). Concerning education, 46.1% of the respondents were university students or had graduated from
high school; 30.6% answered that they did not attend high school or had only graduated from middle school; and 23.4% had graduated from university. Regarding household income, 41% of the sample earns less than 40,000,000 won ($40,000), and 24.3% earns between 40,000,000 won ($40,000) and 50,000,000 won ($50,000).
4 Results

4.1 Reliability and Factor Analysis

After the reliability test, a confirmatory factor analysis was conducted based on the constructed concepts. The goodness-of-fit indices were evaluated conservatively, using the following standards: GFI (Goodness-of-Fit Index) ≥ 0.90, AGFI (Adjusted Goodness-of-Fit Index) ≥ 0.90, RMR (Root Mean Square Residual) ≤ 0.05, NFI (Normed Fit Index) ≥ 0.90, and p value (α = 0.05).
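For reference, two of these indices have compact standard definitions (our reading; the paper itself only states the cut-offs). With \(\chi^2_{\text{null}}\) and \(\chi^2_{\text{model}}\) the chi-square statistics of the null and fitted models, and \(s_{ij}\), \(\hat{\sigma}_{ij}\) the sample and model-implied covariances over \(p\) observed variables:

\[
\mathrm{NFI} = \frac{\chi^2_{\text{null}} - \chi^2_{\text{model}}}{\chi^2_{\text{null}}},
\qquad
\mathrm{RMR} = \sqrt{\frac{2\sum_{i \le j}\left(s_{ij} - \hat{\sigma}_{ij}\right)^2}{p(p+1)}}.
\]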
Table 3. Confirmatory factor analysis by variables: the media factors (Hypertext, Interface), contents factors (Communication, Community, Information), and perception factors (HCI, CMC), with their first-order items.
Table 4. Media Aspect Interactivity Factor Analysis

Item | Hypertext | Interface
6) Link connections immediately pop up the related information window. | 0.782 | 0.091
7) The user easily navigates from one page to another. | 0.780 | 0.133
8) The site has optimal information and usability. | 0.747 | 0.121
9) There are many selectable information menus. | 0.739 | 0.167
4) Linked information is highly related to the appropriate information. | 0.683 | 0.334
5) The site provides a visual window construction as map orientation. | 0.663 | 0.217
10) The site has a communication menu section with many bulletin boards and menus. | 0.312 | 0.177
2) The user knows where to go next. | 0.173 | 0.910
1) The site provides an interface showing where I am while using or navigating. | 0.134 | 0.884
3) Users can navigate to the pages or sites they want. | 0.302 | 0.781
Eigenvalues | 4.388 | 1.555
% of Variance | 43.88 | 15.55
Cumulative (%) | 43.88 | 59.433
Extraction Method: Principal Component Analysis; Rotation Method: Varimax
The factor analysis and reliability of the questionnaires were examined; the reliability scores are .85 for the hypertext and interface factors. Table 3 shows the confirmatory factor analysis; all variables reach statistical satisfaction on the indices from GFI to RMR. Table 4 shows the factor analysis of media-aspect interactivity. There are two factors, hypertext and interface, which together explain 59% of the media interactivities. Table 5 covers the content-aspect interactivities. There are three factors: communication, community, and information. These three factors explain 64.05% of the variance in the contents dimension.

Table 5. Factor analysis of contents aspects. Items: 3) writing comments on the article read; 4) reading comments; 5) writing articles / updating photos; 6) sending mail / chatting; 7) updating information; 11) making a community about the subject; 12) operating a community; 10) debating a specific subject; 9) removing information to a blog; 1) searching information; 2) reading articles; 8) downloading information; 13) joining a community. Extraction Method: Principal Component Analysis; Rotation Method: Varimax.
Table 6 is user’s perception of interactivity in the two sides: one is HCI and the other is CMC. HCI is 41.479 % and CMC is 21.013% of the user perception variable. Table 6. HCI and CMC Questionnaires HCI 3) It is technical sensitive ./HCI 0.809 5) It is activity. /HCI 0.682 2) It is non-personal ./HCI 0.649 4) It is humanity. /CMC 0.097 1) It is personal communication. /CMC 0.197 Eigenvalues 2.074 % of Variable (%) 41.479 Cumulative(%) 41.479 Extraction Method: Principal Component Analysis Rotation Method: Varimax
4.2 Correlation Analyses

After the factor analysis of the interactivity dimensions, a correlation analysis was conducted among the variables (hypertext, interface, communication, community, information, and the perception variables HCI and CMC). The correlation coefficients are shown in Table 7.

Table 7. Correlation among variables in the interactivity

 | Hypertext | Interface | Communication | Community | Information | HCI | CMC
Hypertext | 1 | | | | | |
Interface | -.002 | 1 | | | | |
Communication | .140(**) | -.039 | 1 | | | |
Community | .069 | .131(**) | -.007 | 1 | | |
Information | .378(**) | .251(**) | .005 | -.005 | 1 | |
HCI | .314(**) | .205(**) | .066 | .145(**) | .265(**) | 1 |
CMC | .018 | .051 | .194(**) | .276(**) | -.023 | .001 | 1
** Correlation significant at the 0.01 level.
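A correlation matrix of this form follows directly from the composite factor scores. A minimal Python sketch; the file and column names are assumptions:

import pandas as pd

# Composite scores per respondent; column names are assumptions.
df = pd.read_csv("portal_scores.csv")
cols = ["Hypertext", "Interface", "Communication",
        "Community", "Information", "HCI", "CMC"]

# Pearson correlation matrix corresponding to Table 7.
print(df[cols].corr().round(3))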
The results show that the hypertext factor correlates with the communication and information interactivities, whereas the interface factor correlates with the community and information factors. In addition, the communication variable correlates with the CMC perception, whereas the community factor correlates with both the HCI and CMC perception variables. The information factor correlates with the HCI perception of interactivity.

4.3 Structural Equation Model

The structural equation model (SEM) was empirically confirmed by testing the variables; Table 8 and Figure 3 present the results.

Table 8. The result of path scores by variables

H1: Hypertext → Communication
H2: Hypertext → Group/Community
H3: Hypertext → Information/News
H4: Interface → Communication
H5: Interface → Group/Community
H6: Interface → Information/News
H7: Communication → HCI
H8: Communication → CMC
H9: Group/Community → HCI
H10: Group/Community → CMC
H11: Information/News → HCI
H12: Information/News → CMC
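The twelve hypothesized paths form a straightforward path model. A sketch of how such a model could be specified and fitted with the semopy package; the variable and file names are assumptions, and composite observed scores stand in for the latent constructs:

import pandas as pd
import semopy  # pip install semopy

# Path-model version of H1-H12 using composite scores per construct.
desc = """
Communication ~ Hypertext + Interface
Community     ~ Hypertext + Interface
Information   ~ Hypertext + Interface
HCI ~ Communication + Community + Information
CMC ~ Communication + Community + Information
"""

data = pd.read_csv("portal_scores.csv")  # hypothetical file name
model = semopy.Model(desc)
model.fit(data)

print(model.inspect())           # path estimates (the "path scores" of Table 8)
print(semopy.calc_stats(model))  # fit indices such as GFI, AGFI, NFI, RMSEA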
5 Discussion

This research shows the dimensions of portal interactivity. There are three groups of factors in portal sites' interactivity: technical or media factors, contents factors, and user perception factors. Hypertext affects communication interactivity, while the interface factor is correlated with the information and community interactivities. The portal presents two distinct images with respect to interactivity, HCI and CMC. Web 2.0 is characterized by sharing and participation; HCI and CMC interactivities will co-evolve and co-exist as digital media continue to converge.

Acknowledgement. This research is supported by the 2009 SBS Foundation Inc.
References
1. Barker, C., Gronne, P.: Advertising on the WWW, p. 27. Master's Thesis, Copenhagen Business School (unpublished, 1996)
2. Baker, M.J., Churchill Jr., G.A.: The impact of physically attractive models on advertising evaluations. Journal of Marketing Research 14(4), 538–555 (1977)
3. Barnes, S.: Computer-mediated communication: Human-to-human communication across the Internet. Allyn & Bacon, Boston, MA (2003)
4. Barnes, S.: Online Connection: Internet interpersonal relationships (2001)
5. Bell, D.: An Introduction to Cyberculture. Routledge, NY (2001)
6. Bolter, J.: Writing Space: Computers, Hypertext, and the Remediation of Print. Lawrence Erlbaum Associates, Mahwah, NJ (2001)
7. Bolter, J., Grusin, R.: Remediation: Understanding new media. The MIT Press, Cambridge, MA (1999)
8. Bonime, A., Pohlmann, K.: Writing for New Media: The Essential Guide to Writing for Interactive Media, CD-ROMs, and the Web. John Wiley & Sons, Chichester (1998)
9. Bretz, R.: Media for interactive communications. Sage, Beverly Hills, CA (1983)
10. Miller, C.R., Shepherd, D.: Blogging as Social Action: A Genre Analysis of the Weblog (2004), http://hochan.net/archives/2004/04/0201:41AM.html
11. Coyle, J.R., Thorson, E.: The effects of progressive levels of interactivity and vividness in Web marketing sites. Journal of Advertising 30(3), 13–28 (2001)
12. Dahlgren, P.: Television and the public sphere. Sage Publications, London (1995)
13. Deuze, M.: Online journalism: Modelling the first generation of news media on the World Wide Web. Online Journalism Review 6(10) (2001), http://www.firstmonday.dk/issues/issue6_10/deuze/index.html
14. Deuze, M.: The Web and its journalisms: Considering the consequences of different types of newsmedia online. New Media & Society 5(2), 203–230 (2003)
15. Dillon, A., Gushrowski, B.: Genres and the Web: Is the Personal Home Page the First Unique Digital Genre? Journal of the American Society for Information Science 51(2), 202–205 (2000)
16. Fidler, R.: Mediamorphosis: Understanding new media. Pine Forge Press, Thousand Oaks, CA (1997)
17. Ghose, S., Dou, W.: Interactive functions and impacts on the appeal of Internet presence sites. Journal of Advertising Research 38(2), 29–43 (1998)
18. Gumbrecht, M.: Blogs as Protected Space (2004), http://www.blogpulse.com/papers/www2004gumbrecht.pdf
19. Hall, J.: Online journalism: A critical primer. Pluto Press, London (2001)
20. Kawamoto, K.: Digital journalism: Emerging media and the changing horizons of journalism. Rowman & Littlefield Publishers, NY (2003)
21. Kelleher, T., Miller, B.M.: Organizational blogs and the human voice: Relational strategies and relational outcomes. Journal of Computer-Mediated Communication 11(2) (2006), http://jcmc.indiana.edu/vol11/issue2/kelleher.html
22. Kilian, C.: Writing for the web: Writer's edition. Self-Counsel Press, Bellingham, WA (1999)
23. Kiousis, S.: Interactivity: A concept explication. New Media & Society 4(3) (2002)
24. Macias, W.: A preliminary structural equation model of comprehension and persuasion of interactive advertising brand Web sites. Journal of Interactive Advertising 3 (2003), http://www.jiad.org/
25. Manovich, L.: The language of new media. The MIT Press, Cambridge, MA (2002)
26. McMillan, S.J., Hwang, J.S.: Measure of perceived interactivity: an exploration of the role of direction of communication, user control, and time in shaping perception of interactivity. Journal of Advertising 31(3), 29–42 (2000)
27. Miller, C.H.: Digital storytelling: a creator's guide to interactive entertainment. Focal Press, Boston (2004)
28. Miller, C.: Genre as social action. Quarterly Journal of Speech 70, 151–167 (1984)
29. Miller, C., Shepherd, D.: Blogging as Social Action: A Genre Analysis of the Weblog (2003), http://blog.lib.umn.edu/blogosphere/blogging_as_social_action.html
30. Murray, J.H.: Hamlet on the holodeck: The future of narrative in cyberspace. MIT Press, Cambridge, MA (1997)
31. Newhagen, J.E., Bucy, E.P.: Routes to media access. In: Bucy, E.P., Newhagen, J.E. (eds.) Media access: Social and psychological dimensions of news technology use, pp. 3–23. Lawrence Erlbaum Associates, Mahwah, NJ (2004)
32. Newhagen, J.E., Cordes, J.W., Levy, M.R.: Nightly@nbc.com: Audience scope and the perception of interactivity in viewer mail on the internet. Journal of Communication 45(3) (1995)
33. Newhagen, J., Reeves, B.: Negative video as structure: Emotion, attention, capacity, and memory. Journal of Broadcasting and Electronic Media 40, 460–477 (1996)
34. Pavlik, J.V.: Journalism and new media. Columbia University Press, NY (2001)
35. Poor, N.: Mechanisms of an online public sphere: The website Slashdot. Journal of Computer-Mediated Communication (2005), http://jcme.indiana.edu/vol10/issue2/poor.html
36. Pryor, L.: The third wave of online journalism. Online Journalism Review (2002), http://www.ojr.org/ojr/future/1019174689.php (April 1, 2003)
37. Rafaeli, S.: Interactivity: From new media to communication. In: Hawkins, R.P., Wiemann, J.M., Pingree, S. (eds.) Advancing communication science: Merging mass and interpersonal processes, pp. 110–134. Sage, Newbury Park, CA (1988)
38. Rheingold, H.: The Virtual Community: Homesteading on the electronic frontier. MIT Press, Cambridge, MA (2000)
39. Rheingold, H.: The smart mobs: The next social revolution (2002)
40. Samsel, J., Wimberley, D.: Writing for the interactive media: The Complete Guide. Allworth Press, NY (1998)
41. Sohn, D., Lee, B.: Dimensions of interactivity: Differential effects of social and psychological factors. Journal of Computer-Mediated Communication 10(3), article 6 (2005), http://jcmc.indiana.edu/vol10/issue3/sohn.html
42. Stansberry, D.: Labyrinths: The art of interactive writing and design. Wadsworth Publishing, Belmont, CA (1998)
43. Steuer, J.: Defining virtual reality: Dimensions determining telepresence. Journal of Communication 42(3), 73–93 (1992)
44. Stromer-Galley, J.: Interactivity-as-product and interactivity-as-process. The Information Society 20, 391–394 (2004)
45. Sundar, S., Kalyanaraman, S., Brown, J.: Explicating website interactivity. Communication Research 30(1) (February 2003)
46. Turkle, S.: Life on the screen: Identity in the age of the Internet (1997)
47. Wallace, P.: The Psychology of the Internet (1999)
48. Walther, J.B.: Interpersonal effects in computer-mediated interaction: A relational perspective. Communication Research 19(1), 52–90 (1992)
49. Williams, F., Rice, R.E., Rogers, E.M.: Research methods and the new media. Free Press, New York (1988)
50. Wolf, M.: Genre and the video game. In: Wolf, M. (ed.) The Medium of the Video Game, pp. 113–134. University of Texas Press, Austin (2001)
51. Wood, F., Smith, M.: Online communication: Linking technology, identity, & culture. Lawrence Erlbaum Associates, Mahwah, NJ (2005)
52. Wu, H.D., Bechtel, A.: Web site use and news topic and type. Journalism and Mass Communication Quarterly 79(1), 73–86 (2002)
53. Yates, J., Orlikowski, W.J.: Genres of organizational communication: A structurational approach to studying communication and media. Academy of Management Review 17(2), 299–326 (1992)
Comparing the Usability of the Icons and Functions between IE6.0 and IE7.0
Chiuhsiang Joe Lin, Min-Chih Hsieh, Hui-Chi Yu, Ping-Jung Tsai, and Wei-Jung Shiang
Department of Industrial Engineering, Chung Yuan Christian University, 200, Chung Pei Rd., Chung Li, Taiwan 32023, R.O.C
{hsiang,g9674019,s921909,g9674021,wjs001}@cycu.edu.tw
Abstract. Microsoft presented its newest web browser, Internet Explorer 7 (IE7), in 2007. The purpose of this study was to compare the icon and function designs of IE 7.0 and IE 6.0 with respect to their effect on operating performance. We designed two experiments around a program constructed in Builder C++ 6.0: participants were given tasks, and task completion time was recorded as the measure of operating performance. The results show that the differences in icon design and functions between IE 7.0 and IE 6.0 do affect operating performance.
Keywords: Interface Design, Usability, Browser.
1 Introduction

With the rapid development of the Internet, its usage has been increasing ever since the World Wide Web became popular. According to data published by comScore Media Metrix and collected by FIND (Foreseeing Innovative New Digiservices), about 750 million people used the Internet globally in January 2007, a growth rate of 10% compared to the previous year [1]. Among the twenty-three million people in Taiwan, over sixty percent use the Internet [1]. The Internet has therefore become part of daily life. Internet users navigate pages with a browser and interact with pages, documents, images, and other information. Several browsers are currently available, including Internet Explorer (IE), Firefox, Mozilla, and Opera; among them, Internet Explorer (IE) has the highest usage share, as shown in Table 1.

Table 1. Market share of Internet browsers [8]
Microsoft published a new browser (Internet Explorer 7) whose user interface differs from IE6 in many respects. Among these, the buttons are divided into several groups and arranged in different places (Fig. 1), a marked departure from IE6. In addition, at the same display resolution, the buttons in IE7 are noticeably smaller, and IE7 adds the new functions of tabbed browsing and Quick Tabs (Fig. 2). Therefore, the purpose of this research is, first, to examine how icon design and location influence users when they surf the net with IE6 and IE7 and, second, to inspect how tabbed browsing influences users when switching between sites.
Fig. 1. Tool bar in Internet Explorer 7
Fig. 2. Quick tabs (left) and Tabbed browsing (right) in IE7
2 Literature Review

Psychologists consider that people do not think only in simple words, but sometimes recall memories through images or spatial locations [3]. In human-computer interfaces, the meaning that icons convey is more extensive than what words convey. Wickens [6] and Weidenbeck [7] hold that icons are easier to comprehend fully and easier to memorize than words, which is why icons are applied so extensively. Among currently developed application programs, functional icons have become basic elements; however, the icons on the toolbars of most application programs may be too small and their spacing too narrow. Lindberg and Näsänen [3] found that the size and spacing of icons have a significant influence on user performance. They pointed out that the preferred spacing between icons (interface elements) is 1/2 to 1 icon (interface element) width, and that icon width ought to exceed 0.7 degrees of visual angle. In other words, efficiency improves when the icon width is about 0.5 cm at a viewing distance of 40 cm, or about 0.9 cm at a viewing distance of 70 cm.
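These icon widths follow from elementary visual-angle geometry. As a check, with viewing distance \(d\) and visual angle \(\theta = 0.7^\circ\):

\[
w = 2d\tan\!\left(\frac{\theta}{2}\right), \qquad
w(40\,\text{cm}) = 80\tan(0.35^\circ) \approx 0.49\,\text{cm}, \qquad
w(70\,\text{cm}) = 140\tan(0.35^\circ) \approx 0.86\,\text{cm},
\]

matching the approximate values of 0.5 cm and 0.9 cm quoted above.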
The ultimate purpose of icon design is to allow the user to exercise more direct control, to reduce the user's memory burden, and to reduce the complexity and errors of operation. To bridge the gap between designers and users, Norman [4] proposes four design principles: visibility, a good mental model, good mapping, and feedback. Norman hopes that user-centered design following these four principles can reduce designers' blind spots. The eight golden rules of interface design proposed by Shneiderman and Plaisant [5] contain concepts similar to Norman's principles, but put more emphasis on the design of interactive systems. They are: 1. Strive for consistency. 2. Cater to universal usability. 3. Offer informative feedback. 4. Design dialogs to yield closure. 5. Prevent errors. 6. Permit easy reversal of actions. 7. Support an internal locus of control. 8. Reduce short-term memory load.

Icon design involves human psychology and physiology, and icons have to express their meaning correctly. Horton [2] gives the following principles for icon design:

1. Understandable: The icon should be easy to comprehend and its content directly connected to its meaning. When designing an icon, it is better to use visual symbols that are familiar and easily recognized by users, to reduce the burden of learning and memorizing.
2. Importance: Icons of the same character should be grouped in the same block.
3. Distinct: Frames and shadows should set icons apart, and an icon should differ clearly from similar icons so that users do not confuse them.
4. Memorable: Icon usage should be consistent throughout the documents, and icons should be attractive.
5. Size: The size of an icon affects users' recognition and operation. An icon that is too big occupies too much space; one that is too small is hard to notice and demands extra concentration to click.
6. Attractive: Designers should pay attention to visual balance and to the coordination of size, color, and interface.

This study contains two experiments. The first experiment draws on Horton's principles to evaluate how icon design and location influence users when they surf the net with IE6 and IE7. The second experiment inspects how tabbed browsing influences users when switching sites.
3 Method

3.1 Experiment 1

Ten subjects participated, 7 males and 3 females, aged 24 to 35, all right-handed and free of eye or hand disorders. All had more than one year of Internet experience; 7 had used IE7 for over one year, while the rest still used IE6. The hypothesis was that, because the buttons of IE6 are bigger and centrally grouped, performance with IE6 would be better than with IE7. The major measure is the time spent operating the buttons of the different interfaces. The independent variable is the browser interface (two levels), covering five buttons: previous/next, stop, refresh, my
favorites, and home. The dependent variable is the time the subjects needed to finish the tasks.

Experiment Environment and Procedures. A personal computer with a 19-inch LCD monitor and an interface written in Builder C++ 6.0 were prepared for the study. Before the tests, the subjects had thirty minutes to practice on the experimental interface. The experiment adopted a within-subject design: each subject completed ten trials, five on each experimental interface simulating either IE6 or IE7. To avoid fatigue, subjects took a 5-minute break between trials until the experiment finished. When the subject pressed the "Enter" key, timing began and the page turned to the IE interface, instructing the subject to carry out the task. In every trial there was an instruction for the task, such as "press previous button, please". When the subject clicked on the target, the elapsed time was recorded.

3.2 Experiment 2

The same subjects participated as in Experiment 1; 7 had used IE7 for over one year and the rest still used IE6. The hypothesis was that switching between pages using the Quick Tabs function in IE7 is faster than in IE6. The independent variable is the presentation mode: the Quick Tabs function (Fig. 3) versus traditional windows (Fig. 4). The dependent variable is the time the subjects needed to finish the task.
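Both experiments rely on the same instruction-to-click timing logic. The original program was written in Builder C++ 6.0; the following Python/tkinter sketch only mirrors the idea, and the widget names and layout are invented for illustration:

import time
import tkinter as tk

class TrialTimer:
    def __init__(self, root, instruction, target_name):
        self.start = None
        root.title(instruction)
        # Pressing Enter reveals the simulated toolbar and starts the clock.
        root.bind("<Return>", self.begin)
        self.button = tk.Button(root, text=target_name,
                                command=self.finish, state=tk.DISABLED)
        self.button.pack(padx=40, pady=40)

    def begin(self, _event):
        self.start = time.perf_counter()     # clock starts on Enter
        self.button.config(state=tk.NORMAL)  # target becomes clickable

    def finish(self):
        elapsed = time.perf_counter() - self.start
        print(f"task time: {elapsed:.3f} s")  # recorded as operating performance

root = tk.Tk()
TrialTimer(root, "press previous button, please", "previous")
root.mainloop()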
Fig. 3. Quick Tabs

Experiment Environment and Procedures. The study prepared two computers of the same specification, each with a 19-inch LCD monitor at a resolution of 1280*1024; one ran IE7 and the other IE6. Nine different pages were opened at the same time. Once the subjects were familiar with the experimental interface, the experiment started. Each subject performed two trials. In each trial, the subject was randomly assigned an interface and, upon instruction, clicked on the target page; the time between receiving the instruction and selecting the target was recorded. After finishing one trial, there was a ten-minute break before the next. Before starting the traditional-interface
trial, all the pages were minimized to the lower task bar and the mouse cursor was placed at the center of the monitor; before the new index-tabs trial started, all the pages were likewise minimized to the upper tab row and the subjects were asked to open the Quick Tabs function.
Fig. 4. Traditional window types
4 Results

4.1 Result of Experiment 1

An ANOVA on the data from Experiment 1 shows that the interface has a significant influence on operating performance (F=8.49, p=0.004), with shorter operating times in IE6 than in IE7 (Fig. 5). The function key also has a significant influence on operating performance (F=5.06, p=0.001): the time to click "Page Up"/"Page Down" is shorter than for the other function keys, and the time to click "Homepage" is longer than for the other function keys (Fig. 6). Trial order has a significant influence on operating performance (F=13.98, p=0.000); subjects spent more time on the first trial than on later trials (Fig. 7). There is a significant interaction between interface and function key: the difference between IE6 and IE7 is greatest for the "Homepage" and "My Favorites" keys (Fig. 8).
Fig. 5. Mean time for IE6 and IE7 in experiment 1
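The reported F tests could be reproduced with a repeated-measures ANOVA, consistent with the within-subject design. A sketch using statsmodels; the file and column names are assumptions, as the raw data are not published:

import pandas as pd
from statsmodels.stats.anova import AnovaRM  # pip install statsmodels

# Long-format data: one row per subject x interface x icon trial,
# with completion time in seconds. Column names are assumptions.
df = pd.read_csv("experiment1.csv")  # columns: subject, interface, icon, time

# Within-subject (repeated-measures) ANOVA; each subject must contribute
# one observation per interface x icon cell (aggregate over order if needed).
res = AnovaRM(df, depvar="time", subject="subject",
              within=["interface", "icon"]).fit()
print(res)  # F and p values comparable to F=8.49, p=0.004 for interface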
4.2 Result of Experiment 2

An ANOVA on the data from Experiment 2 shows that the interface has a significant influence on operating performance (F=25.65, p=0.001), with shorter operating times in IE7 than in IE6 (Fig. 9).
Fig. 6. Mean time for different icons
Fig. 7. Interaction plot for time between interface and order
Fig. 8. Interaction plot for time between interface and icon
Fig. 9. Mean time for IE6 and IE7 in experiment 2
5 Discussion

5.1 Experiment 1

Total time. The time subjects needed to finish the tasks in IE7 is clearly longer than in IE6. The reason might be the distribution of the keys: IE6 adopts a centralized layout, making it easier for subjects to find the key they need. However, the time difference is not significant for every key (Fig. 8), so the results are discussed separately below.

"Page Up"/"Page Down". Although the color of this key differs between the two browsers and the key is smaller in IE7, the times do not differ significantly. We speculate that both interfaces use bright colors, and that the position and arrow shape are almost identical, which increases the memorability and understandability of the icon. This may also explain why this key performed best among the five keys.

"Stop" and "Refresh". There is no difference between the browsers for the "Stop" and "Refresh" keys. We speculate that the color and the icon are identical in IE6 and IE7; although the keys are smaller in IE7, subjects can locate them by their bright colors. In addition, these two keys stand side by side in both interfaces, which increases the importance and memorability of the message.

"My Favorites". Search times differ significantly between the two interfaces. We speculate that the size difference is largest for this key, so it takes subjects more time to find it.

"Homepage". Search times differ significantly between the two interfaces. We speculate that in IE7 the icon's color is similar to the background color and the icon is smaller, so subjects cannot find this key at first glance. The size appears to violate the attractiveness principle, making subjects spend more time searching in IE7.
5.2 Order and Interface

Analyzing trial order against interface time, we find that interface time decreases with trial order; in other words, there are learning effects. However, even with these learning effects, operating time in IE7 remains longer than in IE6. Interestingly, the four subjects who were previously accustomed to IE7 still finished the tasks more slowly in IE7 than in IE6. Subjects therefore performed better with IE6 than with IE7.

5.3 Experiment 2

This experiment focused on the new function in IE7, and we find that the search time for switching pages with Quick Tabs is clearly shorter, because Quick Tabs presents all open pages in one window without requiring each window to be clicked individually. Quick Tabs therefore genuinely helps browsing.
6 Conclusions

New products should be designed to be more convenient and faster for users. Visibility is a very important design principle: the controls to be operated should be obvious, and the system should provide users with visible information (Norman, 2000). In this research we find that some icon designs in IE7 do not follow the principle of visibility, making it hard for users to browse pages quickly and conveniently; performance was worse than with the old browser for most of the keys tested. However, the Quick Tabs function in IE7 genuinely improves page-browsing performance. This research finds that the size, color, and placement of icons really do influence user performance, and designers should take these factors into account. Future user-interface designers should be aware of these findings and apply them when designing new products, so as to enhance users' efficiency on the Internet.
Acknowledgement. This study is financially supported by a project from the National Science Council of Taiwan under contract No. NSC-97-2629-E-033-001.
References
1. FIND: Global network popular rate grows 10% in 2006 (2007), http://www.find.org.tw/find/home.aspx?page=news&id=4726
2. Horton, W.: The icon book: visual symbols for computer systems and documentation. John Wiley & Sons, Chichester (1994)
3. Lindberg, T., Näsänen, R.: The effect of icon spacing and size on the speed of icon processing in the human visual system. Displays 24, 111–120 (2003)
4. Norman, D.A.: The psychology of everyday things. Yuan-Liou, TW (2000)
5. Shneiderman, B., Plaisant, C.: Designing the user interface: strategies for effective human-computer interface, 4th edn. Addison Wesley, Reading (2005)
6. Wickens, C.D., Hollands, J.G.: Engineering Psychology and Human Performance, 3rd edn. Prentice-Hall, Englewood Cliffs (1992)
7. Weidenbeck, S.: The use of icons and labels in an end user application program: an empirical study of learning and retention. Behavior & Information Technology 18(2), 68–82 (1999)
8. WIKIPEDIA: Usage share of web browsers (2006), http://zh.wikipedia.org/w/index.php?title=%E7%B6%B2%E9%A0%81%E7%80%8F%E8%A6%BD%E5%99%A8%E7%9A%84%E4%BD%BF%E7%94%A8%E5%88%86%E4%BD%88&variant=zh-hant
Goods-Finding and Orientation in the Elderly on 3D Virtual Store Interface: The Impact of Classification and Landmarks
Cheng-Li Liu1, Shiaw-Tsyr Uang2, and Chen-Hao Chang3
1 Department of Industrial Management, Vanung University, Taoyuan, Taiwan
2 Department of Industrial Engineering and Management, Minghsin University of Science & Technology, Hsinchu, Taiwan
3 Graduate School of Business and Management, Vanung University, Taoyuan County, Taiwan
johnny@vnu.edu.tw

Abstract. The internet 3D virtual store has received wide attention from researchers and practitioners because it is one of the killer applications that let customers feel they are in a real shopping environment, potentially increasing satisfaction. Though numerous studies have addressed various issues of the internet store, research issues relating to the spatial cognition of the elderly when immersed in a 3D virtual store still await further empirical investigation. The objective of this study was to examine how elderly users acquire spatial cognition in an on-screen virtual store. Specifically, the impact of the presence and absence of goods-classification on the acquisition of route and survey knowledge was examined. Since landmarks are associated with both route and survey knowledge, we also examined the impact of different types of landmarks under both the presence and the absence of goods-classification. The experimental results indicated that goods-classification contributed more to constructing route knowledge when present than when absent, and that the duration of goods-finding was then shorter. However, the survey-knowledge scores with goods-classification present were not significantly larger than with it absent. In addition, route-knowledge scores were largest and goods-finding duration shorter when goods-classification was combined with landmarks of the alphanumeric + 2D picture + 3D object type; the same held in the absence of goods-classification. Therefore, when goods-classification is absent, landmarks can serve as redundant codes for goods-finding in a 3D virtual store.

Keywords: 3D virtual store, Goods-finding, Goods-classification, Landmarks, Route knowledge, Survey knowledge.
things on the Internet. It lets us buy what we want, when we want, at our convenience, and helps us imagine ourselves buying, owning, and enjoying positive outcomes from the goods available out there on the web [6]. Shopping has become a way of identifying oneself in today's culture by what we purchase and how we use our purchases. Online shopping has been quite popular since its arrival on the Internet. Although the percentage of older adults (i.e., the silver tsunami) using the web is smaller than the percentage of younger individuals, surveys indicate that this may not be the case for long. The World Health Organization estimates that by the year 2020, 24% of Europeans, 17% of Asians, and 23% of North Americans will be over the age of 60 [32]. By 2020, the world will have more than 1 billion people aged 60 and over. The older population is growing rapidly worldwide and is becoming an increasingly important demographic to understand. According to Coyne and Nielsen [4], there are an estimated 4.2 million Internet users over the age of 65 in the United States, and this number will continue to grow internationally at a rate that reflects the overall population trends discussed above. However, most Internet stores use text, two-dimensional (2D) pictures, voice and cartoons to display goods; these displays lack an intuitive sense of the goods, and the systems do not support interaction with them [8]. This detracts from customers' sense of a real shopping experience and, moreover, diminishes their desire to shop. These problems can be addressed with the technology of virtual environments (VEs). In Web-based virtual environments, scenes and physical images can be compressed and transferred over networks with limited bandwidth, and 3D scenes of goods can be built on the web using recent techniques. This technology allows the simulation of 3D VEs on a computer: humans can experience those environments through active exploration [11]. A 3D virtual store provides a computer-synthesized world in which users can interact with goods and perform various activities. Although a 3D virtual store improves the sense of reality and interaction with goods, some issues remain to be addressed. Generally, in large-scale VEs the user's viewpoint cannot encompass the entire environment [26]. In such VEs, navigation is a fundamental activity, and successful use of the VE requires that the user be able to navigate easily and efficiently from one location to another [5]. However, previous research has shown that users of VEs are often disoriented, feel lost in hyperspace, and therefore have extreme difficulty completing navigational tasks [3][7]. The same problems arise in a 3D virtual store. Although the 3D virtual store is an on-screen, small-scale virtual environment, the user has an egocentric viewpoint within the environment, and it visually affords the experience of movement, rotation, and changes in viewing elevation [29]. A conventional grocery store is often divided into several areas or rooms, each built around a particular shopping theme [27]; in a 3D virtual store, the layout is likewise designed to help the customer browse more easily. Sometimes the destination is not immediately visible. This can also be the case in the typical on-screen virtual environment, which may include, for example, walls that make up internal rooms and corridors, visually obscuring the destination [21].
Therefore, the human spatial abilities that influence the acquisition and use of such environments need to be explored. In the 3D virtual store, those abilities consist primarily of orientation and goods-finding. Additionally, previous studies on cognitive aging have found that certain aspects of human information-processing
abilities are negatively correlated with age [20]. Specifically, Park summarized four basic mechanisms accounting for age-related decline in cognitive function: processing speed, working memory, sensory function and inhibition. Lin found that older people were more likely to become disoriented in hypertext perusal and also failed to browse documents as broadly as young adults could [16]. Nor did older people manage to retain browsed information as accurately as their young counterparts, even when assisted with multimedia presentations. In view of this trend, there is an increasing need to investigate and better understand the abilities of orientation and goods-finding in the elderly, particularly when interacting with a 3D virtual store. In a virtual environment, the ability to orient influences the efficiency and effectiveness of wayfinding. Wayfinding can refer to a rather narrow concern: how well people are able to find their way to a particular destination without delay or undue anxiety [23]. It means knowing where you are in a building or an environment, knowing where your desired location is, and knowing how to get there from your present location; in architecture, the term refers to the user experience of orientation and path choice within the built environment. Travelers find their way using landmark knowledge, route knowledge, and survey knowledge [18]. The first two types are more specific knowledge representing sensory experience [14]. Landmark knowledge records the visual features of landmarks, including their shape, size, texture, and so forth [10]. For a structure to be a landmark, it must have high imageability; that is, it must be distinctive and memorable [17]. Most often, landmark knowledge is acquired first when encountering and learning a new, unfamiliar environment. The recognition of landmarks then becomes part of the construction of route knowledge, where landmarks are the points that make up routes. Later, landmarks become the objects and elements of survey knowledge, helping to construct the layout and relational configuration of the elements in the environment [2][30]. Survey knowledge provides a map-like, bird's-eye view of a region and contains spatial information including the locations, orientations, and sizes of regional features [9]. Each type of knowledge helps the traveler construct a cognitive map of a region and thereafter find their way using that map [1][22]. In the study of orientation, the notion of a "cognitive map" has received much research attention [13][19]. At its most general, a cognitive map is a mental construct that we use to understand and know the environment [12]. The term assumes that people store information about their environment which they use to make spatial decisions [13]. Because of the importance of landmark knowledge for constructing a cognitive map, much research has been devoted to the impact of landmarks on orientation. Several researchers have studied the value of 2D landmarks (i.e., textual and 2D images) for wayfinding. Witmer et al. used verbal directions and photographs to study route learning [31]. Waller et al. applied cardboard numerals, images of stuffed animals and arrows to study real-world task training and found that long exposure fostered good spatial knowledge across several VEs [28]. Additionally, several studies have examined the effects of 3D landmarks on wayfinding.
Elvins found that 3D thumbnails make better guidebooks for 3D worlds than do text and 2D thumbnail images [9]. Parush and Berman discussed the impact of navigation aids and 3D landmarks on the acquisition of route and survey knowledge and found that the combined impact of the navigation aids used during learning and the presence of 3D landmarks was primarily evident in the orientation task [21].
Although each of these studies recognized the value of landmark knowledge for wayfinding, none studied the value of combining 2D and 3D landmarks in familiarizing travelers with a virtual environment. In addition, the environment of a 3D virtual store is a small geographical area, and we are interested in the efficiency and effectiveness of goods-finding rather than wayfinding. The showrooms in the store can be considered blocks on a map, and the cabinets in the showrooms the buildings in those blocks. Landmarks should therefore also be important for users constructing a "cognitive map" for goods-finding in a 3D virtual store, and several types of landmarks could be applied, such as 3D objects, 2D images and alphanumeric icons. In addition to landmarks, goods-classification should be another cue for goods-finding. Goods-classification divides goods into classes by nature or trademark and allows people to seek a product within the appropriate class. Whether the combined effect of goods-classification and different types of landmarks (i.e., alphanumeric, 2D image and 3D object) is significant for goods-finding by the elderly within the 3D virtual store is a primary question of our study. Therefore, the objective of this study was to examine how elderly users acquire spatial cognition in an on-screen virtual store. Specifically, we examined the impact of different types of landmarks on the acquisition of route and survey knowledge, together with their combined effect with the presence or absence of goods-classification. If goods-classification is associated more with the acquisition of spatial knowledge, we expected to observe its greatest impact during goods-finding. Since landmarks are associated with both route and survey knowledge, we expected to observe the impact of different types of landmarks under both the presence and absence of goods-classification.
2 Method
2.1 Participants
Thirty-two people (average age 69.5 years) were selected to participate in the experiment. They were paid a nominal NTD 500 as compensation for their time. All participants were fully informed and signed a consent form. Researchers have found that repeated exposure to the same virtual environment with a separation of less than seven days can significantly affect levels of cybersickness, inducing disorientation and nausea [15][24][25]. Therefore, the participants had not been exposed to the experimental VE in the previous two weeks.
2.2 Apparatus and the VE
The VE was constructed using virtual environment development software (MAYA and Virtool) and presented on a 19" TFT-LCD display. The scene was designed as a retail store containing four showrooms. Two conditions were designed: in one, the store was divided into four themed showrooms (stationeries, hand tools, cleaning articles and toiletries), as shown in Figure 1; in the other, the store was simply divided into four showrooms without classification. The landmarks in the 3D environment were highly visible and salient, with a unique
shape and volume in contrast to the goods. The landmarks were typical home and office objects such as a painting, a metaphorical picture, a flowerpot and others.
2.3 Experimental Design and Procedures
The study used a 2 (goods-classification: absence or presence) × 8 (type of landmark: none, alphanumeric, 2D, 3D, alphanumeric + 2D, alphanumeric + 3D, 2D + 3D, and alphanumeric + 2D + 3D) between-subjects, full-factorial design with 16 treatment conditions. Each participant was randomly assigned to one of the eight landmark conditions under either the presence or the absence of goods-classification to perform the goods-finding task, so two participants were assigned to each of the sixteen conditions; a sketch of this assignment logic is shown below.
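The following is an illustrative sketch of the 2 × 8 assignment logic, not the authors' procedure; the participant IDs and condition labels are hypothetical:

```python
import itertools
import random

# Illustrative sketch: 16 between-subjects cells (2 classification levels x
# 8 landmark types), two participants per cell, 32 participants in total.
landmark_types = [
    "none", "alphanumeric", "2D", "3D",
    "alphanumeric+2D", "alphanumeric+3D", "2D+3D", "alphanumeric+2D+3D",
]
classification = ["absent", "present"]

conditions = list(itertools.product(classification, landmark_types))  # 16 cells
participants = list(range(1, 33))  # hypothetical participant IDs 1..32
random.shuffle(participants)

assignment = {cond: [participants.pop(), participants.pop()] for cond in conditions}
for cond, pair in sorted(assignment.items()):
    print(cond, "->", pair)
```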
Fig. 1. The 3D virtual environment of the retail store: (a) a top view of the experimental 3D store, with showrooms for hand tools, cleaning articles, stationeries and toiletries around the entrance; (b) a view of the stationery room; (c) the three types of landmarks (alphanumeric, 2D and 3D) presented in the hand tools room
During the exposure period, participants were asked to search for and confirm goods in the store. There were eight target goods, two in each showroom, for participants to find. On finding a target, the participant moved the cursor over the object and double-clicked the left mouse button to register a "hit." The experiment had four stages. In each stage, the participant was randomly assigned two target goods to find and, once both had been found, was asked to recall their locations and complete an oral route-knowledge questionnaire. After all eight target goods had been found, the participant completed a survey-knowledge questionnaire.
2.4 Measurements
Spatial knowledge and performance measurements were recorded for all trials. The measurements included the following:
1. Measurement of route knowledge: after finding the two goods in a stage and returning to the entrance, participants were asked in an oral questionnaire to describe the position of one of the two target objects. They received 9 points for identifying the correct showroom, 6 points for the adjacent showroom, and 3 points for a showroom two away; 7 points for the correct cabinet, 5 points for the adjacent cabinet, and 3 points for a cabinet two away; and 5 points for the correct shelf, 4 points for the adjacent shelf, and 3 points for a shelf two away.
2. Measurement of survey knowledge: after finding all eight goods, participants were asked to mark the position of all eight goods on a map. Scores were calculated as for route knowledge.
3. Goods-finding duration: the time from the beginning of the trial until the Entrance key was pressed, indicating the end of the trial.
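To make the scoring rule concrete, here is a minimal sketch of our reading of it (not the authors' code). The paper does not state how answers more than two positions away were scored, so the table covers only the cases described:

```python
# 'distance' is how far the answer was from the true location:
# 0 = correct, 1 = adjacent ("next door"), 2 = two positions away.
SCORE_TABLE = {
    "showroom": {0: 9, 1: 6, 2: 3},
    "cabinet":  {0: 7, 1: 5, 2: 3},
    "shelf":    {0: 5, 1: 4, 2: 3},
}

def route_knowledge_score(distances):
    """distances: e.g. {'showroom': 0, 'cabinet': 1, 'shelf': 0}."""
    return sum(SCORE_TABLE[level][d] for level, d in distances.items())

# Correct showroom, adjacent cabinet, correct shelf -> 9 + 5 + 5 = 19
print(route_knowledge_score({"showroom": 0, "cabinet": 1, "shelf": 0}))
```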
3 Results and Discussion
This experiment was designed to investigate the impact of the presence or absence of goods-classification on the acquisition of route and survey knowledge, and the effects of different types of landmarks under both the presence and absence of goods-classification. Score analysis for route knowledge revealed that the presence of goods-classification was significantly better than its absence (F(1,16) = 25.20, P < 0.001). However, the F value for survey knowledge was not statistically significant (F(1,16) = 2.99, P = 0.103). Goods-finding time in the presence of goods-classification was also significantly shorter than in its absence (F(1,16) = 30.98, P < 0.001). In addition, route-knowledge scores with landmarks were significantly larger in the presence of goods-classification than in its absence (t(26) = 4.69, P < 0.001), and goods-finding time was significantly shorter (t(26) = 4.85, P < 0.001). These results show that goods-classification was positive and important for the acquisition of route knowledge and for goods-finding. The more route knowledge a participant had, the shorter the goods-finding time. As route knowledge was constructed, survey knowledge was also built up, but more slowly. Goods-classification and landmarks were of great help to spatial knowledge at the beginning, especially route knowledge. After the learning phase, survey knowledge formed clearly even when goods-classification and landmarks were absent; accordingly, goods-finding time in the fourth stage was shorter than in the first stage. Because survey knowledge was measured when the experiment finished, its mean scores did not differ significantly across conditions. Regarding the effects of landmarks, the different types of landmarks did not differ significantly in the acquisition of route knowledge (F(7,16) = 1.74, P = 0.169) or survey knowledge (F(7,16) = 0.61, P = 0.741). However, comparing the presence and absence of landmarks, route-knowledge scores were significantly larger with landmarks than without, both when goods-classification was present (t(26) = 4.78, P < 0.001) and when it was absent (t(26) = 2.59,
P = 0.008), and goods-finding time was also significantly shorter for participants who navigated with landmarks than for those without, in both the presence (t(26) = 22.14, P < 0.001) and absence (t(26) = 13.65, P < 0.001) of goods-classification. Survey-knowledge scores differed significantly between the presence and absence of landmarks when goods-classification was present (t(26) = 2.27, P = 0.016), but not when it was absent (t(26) = 1.19, P = 0.123). Landmarks were thus also important for goods-finding in the 3D virtual store, whatever combination of types was used. This finding agrees with many previous studies on the impact of landmarks on spatial cognition, and also reflects the added advantage of having landmarks in the 3D virtual store environment, particularly when goods are classified. According to these results, goods-classification is associated more with the acquisition of spatial knowledge. Finally, an additional analysis was performed to evaluate which type of landmark had the greatest impact on goods-finding when goods-classification was present. Figure 2 displays the mean scores of route and survey knowledge and the duration times for the eight types of landmarks under the presence and absence of goods-classification. Route-knowledge scores were highest and goods-finding time shortest when goods-classification was combined with landmarks of the type alphanumeric + 2D picture + 3D object; the same held in the absence of goods-classification. Taken together, the findings show that goods-classification was very important for goods-finding when participants navigated the 3D virtual store, and landmarks can be regarded as redundant codes. The more information landmarks provide for goods-finding in the 3D virtual store, the better the participant's acquisition of route knowledge. Additionally, if goods-finding durations are short, survey knowledge is not easily built up and has little impact on goods-finding.
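As a hedged illustration of this style of analysis, a 2 × 8 between-subjects ANOVA could be run as follows; the DataFrame df and its column names are hypothetical stand-ins for the study's data, not the authors' code:

```python
# Sketch of a two-way between-subjects ANOVA, assuming a tidy DataFrame 'df'
# with one row per participant and columns: classification (present/absent),
# landmark (one of the eight types), route_score, survey_score, duration.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df: pd.DataFrame, dv: str):
    model = ols(f"{dv} ~ C(classification) * C(landmark)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

# e.g. print(two_way_anova(df, "route_score"))
```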
Fig. 2. Mean scores of route and survey knowledge and goods-finding duration (min.) for the eight types of landmarks, in the presence (left panel) and absence (right panel) of goods-classification
4 Conclusion
1. Analysis of the goods-finding task with goods-classification in the 3D virtual store indicates that classification had a significant impact on the acquisition of spatial knowledge and on goods-finding. When goods are classified into different showrooms according to their nature, the showrooms can be regarded as a regular graphic representation of the entire geographic area, including the layout of elements (i.e., showcases) and the spatial relationships among them. Therefore, a clear cognitive map of the environment, including position and direction information, is easily constructed.
2. Landmarks are also important for goods-finding in the 3D virtual store, whatever combination of types is used. This finding also reflects the added advantage of having landmarks in the 3D virtual store, particularly when goods are classified. Moreover, if goods-classification is absent in such an environment, one has additional information from landmarks to process; such information enables the determination of one's current position and provides a basis for goods-finding.
3. For correct goods-finding responses, we found that whether or not goods are classified, the landmark type alphanumeric + 2D picture + 3D object has a good effect on acquired spatial cognition.
4. Landmarks are not the only resource for building spatial knowledge in a 3D virtual store; goods-classification is another, and may be more important than landmarks.
Acknowledgement The authors would like to thank the National Science Council of the Republic of China for financially supporting this work under Contract No. 97-2221-E-238-013.
References 1. Appleyard, D.A.: Planning a Pluralistic City: Conflicting Realities in Ciudad Guayana. MIT Press, Cambridge (1976) 2. Chen, C.: Information Visualization and Virtual Environments. Springer, London (1999) 3. Conroy, R.: Spatial Navigation in Immersive Virtual Environments. Thesis, University College London (2001) 4. Coyne, K., Nielsen, J.: Web Usability for Senior Citizens: 46 Design Guidelines Based on Usability Studies with People Age 65 and Older. Nielsen Norman Group Report (2002) 5. Darken, R., Peterson, B.: Spatial Orientation, Wayfinding, and Representation. In: Stanney, K. (ed.) Handbook of Virtual Environments: Design, Implementation, and Applications. Lawrence Erlbaum Associates, Mahwah (2002) 6. Davis, S.G.: Culture Works: The Political Economy of Culture. University of Minnesota Press, Minneapolis (2001) 7. van Dijk, B., op den Akker, R., Nijholt, A., Zwiers, J.: Navigation Assistance in Virtual Worlds. Informing Science Journal 6, 115–125 (2003)
8. Ding, J., Yu, L., Wang, Y., Pan, Z.: EasyHouse-I: A Virtual House Presentation System Based on Internet. In: 11th International Conference on Human-Computer Interaction, Las Vegas (2005) 9. Elvins, T.T., Nadeau, D.R., Schul, R., Kirsh, D.: Worldlets: 3-D Thumbnails for Wayfinding in Large Virtual Worlds. Presence 10, 565–582 (2001) 10. Goldin, S.E., Thorndyke, P.W.: Simulating Navigation for Spatial Knowledge Acquisition. Human Factors 24, 457–471 (1982) 11. Jansen-Osmann, P.: Using Desktop Virtual Environments to Investigate the Role of Landmarks. Computers in Human Behavior 18, 427–436 (2002) 12. Kaplan, S.: Cognitive Maps in Perception and Thought. In: Downs, R.M., Stea, D. (eds.) Image and Environment. Aldine, Chicago (1973) 13. Kitchin, R.M.: Cognitive Maps: What are They and Why Study Them? Journal of Environmental Psychology 14, 1–19 (1994) 14. Kitchin, R., Blades, M.: The Cognition of Geographic Space. I.B. Tauris Publishers (2002) 15. Lathan, R.: Tutorial: a Brief Introduction to Simulation Sickness and Motion Programming. Real Time Graphics 9, 3–5 (2001) 16. Lin, D.-Y.M.: Evaluating Older Adults' Retention in Hypertext Perusal: Impacts of Presentation Media as a Function of Text Topology. Computers in Human Behavior 20, 491–503 (2003) 17. Lynch, K.: The Image of the City. MIT Press, Cambridge (1960) 18. Montello, D.: A New Framework for Understanding the Acquisition of Spatial Knowledge in Large-Scale Environments. In: Egenhofer, M., Golledge, R. (eds.) Spatial and Temporal Reasoning in Geographic Information Systems. Spatial Information Systems, pp. 143–154. Oxford University Press, New York (1998) 19. Omer, I., Goldblatt, R.: The Implications of Inter-Visibility Between Landmarks on Wayfinding Performance: an Investigation Using a Virtual Urban Environment. Computers, Environment and Urban Systems 31, 520–534 (2007) 20. Park, D.C.: The Basic Mechanisms Accounting for Age-related Decline in Cognitive Function. In: Park, D.C., Schwarz, N. (eds.) Cognitive Aging: A Primer, pp. 3–21. Psychology Press, Philadelphia (2000) 21. Parush, A., Berman, D.: Navigation and Orientation in 3D User Interfaces: the Impact of Navigation Aids and Landmarks. International Journal of Human-Computer Studies 61, 375–395 (2004) 22. Passini, R.: Wayfinding in Architecture, 2nd edn. Van Nostrand Reinhold, New York (1992) 23. Peponis, J., Zimring, C., Choi, Y.K.: Finding the Building in Wayfinding. Environment and Behavior 22, 555–590 (1990) 24. Regan, E.C., Price, K.R.: The Frequency and Severity of Side-Effects of Immersion Virtual Reality. Aviat. Space Environ. Med. 65, 527–530 (1994) 25. Stanney, K.M.: Handbook of Virtual Environments. Erlbaum, New York (2002) 26. Vinson, N.: Design Guidelines for Landmarks to Support Navigation in Virtual Environments. In: ACM Conference on Human Factors in Computing Systems, pp. 278–285, Pittsburgh, Pennsylvania (1999) 27. Vrechopoulos, A.P., O'Keefe, R.M., Doukidis, G.I., Siomkos, G.J.: Virtual Store Layout: an Experimental Comparison in the Context of Grocery Retail. Journal of Retailing 80, 13–22 (2004) 28. Waller, D., Hunt, E., Knapp, D.: The Transfer of Spatial Knowledge in VE Training. Presence: Teleoperators and Virtual Environments 7, 129–143 (1998)
29. Wickens, C.D.: Frames of Reference for Navigation. In: Gopher, D., Koriat, A. (eds.) Attention and Performance 16, pp. 130–144. Academic Press, Orlando (1999) 30. Wickens, C.D., Hollands, J.G.: Engineering Psychology and Human Performance, 3rd edn. Prentice-Hall, Englewood Cliffs (1999) 31. Witmer, B.G., Bailey, J.H., Knerr, B.W., Parsons, K.C.: Virtual Spaces and Real World Places: Transfer of Route Knowledge. Int. J. Human-Computer Studies 45, 413–428 (1996) 32. World Health Organization, Press Release WHO/69, http://www.who.ch/press/1997/pr97-69.htm
Effects of Gender Difference on Emergency Operation Interface Design in Semiconductor Industry
Hunszu Liu
Department of Industrial Engineering and Management, Minghsin University of Science and Technology, No. 1, Xinxing Rd., Xinfeng, Hsinchu 30401, Taiwan, R.O.C.
hliu@must.edu.tw
Abstract. This research investigates the effects of gender difference on emergency operation interface design by studying monitoring operations performed at an emergency response center. An experiment was designed to test the performance differences between fifteen male and fifteen female college engineering students. Signal detection time, incident processing time, number of errors, and duration of the experiment were the dependent variables measuring the participants' performance. Statistical analysis indicates that no significant differences can be found between males' and females' performance except in the number of errors: female participants made more errors than male participants. A training program is suggested to help female workers become familiar with the emergency operations. The research results provide evidence for adjusting the current disaster-prevention personnel recruitment policy and suggestions for further improvement of emergency operation interface design in the semiconductor industry. Keywords: User interface design, emergency management, human performance, gender differences.
of various highly toxic chemicals in tightly enclosed buildings or areas requires continuous surveillance of the factory activities and facilities to ensure safe operations. The facility management and control system (FMCS), widely installed in the emergency response centers (ERCs) of Taiwan's semiconductor plants, is designed to serve this purpose. The FMCS monitors the status of the power supply, gas supply, water processing, clean-room air control, accident control, chemical control, process cooling water, air-conditioning, gas detection, very-early smoke detection alarm and other subsystems through the integration of communication and information technologies. Each subsystem, automatic or semi-automatic, collects system operation data through sensors, video cameras, alarms, and communication devices located around the shop floors. The typical task performed by operators in the ERC is to monitor the information display devices, such as computer screens, TV screens, telephones, intercoms, and broadcasting devices, and to perform information-processing tasks. Whenever an abnormal signal appears, operators are required to detect it and initiate appropriate actions in time, based on the standard operating procedures. It is generally believed by local industry that male workers are more capable of performing these tasks than female workers. This current approach may put emergency response operations at risk of poor human performance and may not use company resources effectively. This research investigates the effects of gender difference on emergency operation interface design by studying monitoring operations performed at an emergency response center.
2 Background Information
The concept of establishing an ERC emerged after several severe accidents, which have cost the Taiwan semiconductor industry more than 20 billion in losses since 1996. The inherently risky manufacturing characteristics of the semiconductor industry, such as the use of various highly toxic chemicals in a continuous production process and long working hours (usually 12 hours per shift), require continuous surveillance of factory activities to ensure safe production. In Taiwan, the responsibility for monitoring semiconductor plant activities often falls upon working teams in the Emergency Response Center (ERC), through the integration of data from automatic sensor devices, video cameras and communication channels. The duty supervisor examines all these data and gives appropriate instructions to the relevant personnel to take corrective actions. Computer programs assist the data-integration process, and some decision-support systems are installed to reduce the mental workload of the monitors. As a result, various types of data are presented on several computer screens, which require experienced workers to interpret and transform them into useful information. Interface design is thus one of the crucial factors in ERC supervisors' performance. Studies of gender differences in safety present mixed results. Crowe (1995) argued that although earlier results show females to be more safety-conscious than males, gender is not a good predictor of safe practices [1]. Mardis and Pratt (2003), Mayhew and Quinlan (2002), and Breslin et al. (2007) found no
gender differences in injury rate [2][3][4]. On the other hand, Islam and Mannering (2006) concluded that there are statistically significant differences between male and female drivers [5]. Monarrez-Espino (2006) presented a similar result: men's fatality rate was five times higher than that of women for single-car crashes [6]. These studies show that the common impressions of gender differences in safety held by industry may not be correct; further study of this issue is needed.
3 Experiment
Historical FMCS data from a local semiconductor company show that the gas detection system records the most frequent events. Therefore, the gas detector alarm system was chosen to test gender differences in interface design. The structure of the gas detection system is depicted in Fig. 1. The major components of this system are the gas detector, a programmable logic controller, a supervisory control and data acquisition (SCADA) server, and the FMCS display screen. Gas detectors are installed around the shop floor. When a gas detector is activated, the alarm sounds and an alarm signal appears on the FMCS display screen (Fig. 2). The ERC operator is required to click the icon on the screen to identify the location of the detector and inform the relevant site operator to take necessary precautions through a communication system, such as a phone line, radio system or broadcasting devices (Fig. 3). The FMCS is an online monitoring system that must be fully functional twenty-four hours a day, so it is not practical to use it directly as the experimental facility. Instead, a computer program that simulates the gas detector alarm informing procedure was used as the testing tool to investigate the performance of participants (Fig. 4). If an incident message appears on screen, the participant must perform the informing procedure immediately (Fig. 5). Fifteen male and fifteen female students with the potential to work as supervisors voluntarily participated in this experiment. All have engineering backgrounds and are interested in safety and health careers. Participants were randomly mixed and trained until familiar with the gas alarm informing procedure. Once confident enough to take the test, each participant ran the testing program at three different incident rates. Performance measures included signal detection time, incident processing time, number of errors, and duration of the experiment.
Fig. 1. The structure of gas detector alarm system
The environment was arranged according to the layout of the ERC (Fig. 6). The FMCS is installed on Displays No. 1 and No. 2, which are the sources of incident information. Each participant was required to monitor these two screens during the experiment. Cellular phones were used as communication devices to inform the relevant personnel for further instructions.
Fig. 2. Display screen of the FMCS display screen
Fig. 3. Display screen of the gas detector alarm system
Fig. 4. Simulation of FMCS
Fig. 5. Display screen of informing procedure
Fig. 6. Layout of experiment
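Schematically, the simulated alarm-informing task can be thought of as the following loop (a hypothetical sketch; the function, parameters and timings are invented for illustration and are not taken from the study's program):

```python
# Hypothetical sketch: incidents appear at a chosen rate, and the program
# records signal-detection time and incident-processing time for each event.
import random
import time

def run_session(n_incidents=10, mean_interval_s=30.0):
    log = []
    for i in range(1, n_incidents + 1):
        time.sleep(random.expovariate(1.0 / mean_interval_s))  # wait for incident
        shown = time.monotonic()
        input(f"ALARM {i} shown - press Enter when detected ")
        detected = time.monotonic()
        input("Perform the informing procedure, then press Enter ")
        done = time.monotonic()
        log.append({"detect_s": detected - shown, "process_s": done - detected})
    return log

# Errors (e.g. wrong location or gas type reported) would be scored separately
# from the daily activity log.
```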
Four hypothesis tests were conducted to explore gender differences in interface design.
Hypothesis 1: Female students take longer to detect the signal than male students.
Hypothesis 2: Female students take longer to process the incident.
Hypothesis 3: Female students make more errors than male students.
Hypothesis 4: Female students take more time to finish the experiment than male students.
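As an illustration, Hypothesis 3 could be tested with a one-sided two-sample t-test along the following lines (the error counts are dummy data, not the study's measurements; SciPy 1.6 or later is assumed for the alternative argument):

```python
from scipy import stats

# Illustrative dummy data: number of errors per participant, 15 per group.
errors_female = [3, 2, 4, 3, 2, 3, 4, 2, 3, 5, 2, 3, 4, 3, 2]
errors_male   = [2, 1, 3, 2, 0, 1, 2, 1, 3, 2, 1, 0, 2, 1, 2]

# One-sided test of Hypothesis 3: females make more errors than males.
t, p = stats.ttest_ind(errors_female, errors_male, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```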
4 Results and Discussions
The interface design of emergency operations may greatly affect the effectiveness and efficiency of ERC operations and system safety. This research investigates the effects of gender difference on emergency operation interface design by studying monitoring operations performed at an emergency response center. In this study, the gas detection subsystem is classified as the most active subsystem and the one requiring the greatest attention from workers. A simulation program and working environment were developed in the lab to mock up the scenario of the gas detection subsystem. An experiment was designed to test the performance differences between fifteen male and fifteen female engineering college students. The participants were required to identify the time, node, tag-name, description, value and status shown at the lower part of the screen when the alarm signal appeared. Based on this information, participants had to locate the corresponding area by maneuvering the FMCS and
obtain the identification of the machine and gas type, and then record the incident in the daily activity log. Signal detection time, incident processing time, number of errors, and duration of the experiment were the dependent variables measuring the participants' performance. Statistical analysis indicates that no significant differences can be found between males' and females' performance except in the number of errors: female participants made more errors than male participants.
5 Conclusions
In the past, female workers could easily apply for safety- and health-related jobs; indeed, many safety and health personnel in Taiwanese manufacturing are female. The establishment of ERCs has limited the job opportunities of female safety and health practitioners. This study shows that females made more errors than males. Further discussions with female participants indicate that lack of familiarity with the system is the main reason for this problem. Feasible solutions include interface design improvements and more training for female workers. Capable ERC supervisors are hard to find, and most new recruits need to be trained not only in the unique FMCS operations but also in other safety and health operations. With modest improvements to the interface design, female safety and health personnel can perform ERC tasks as well as males do. The research results provide evidence for assessing the appropriateness of the current disaster-prevention personnel recruitment policy and suggestions for further improvement of emergency operation interface design in the semiconductor industry.
Acknowledgements This research is funded by the National Science Council in Taiwan (NSC 97-2221-E-159-011).
References 1. Crowe, J.W.: Safety Values and Safe Practices Among College Students. Journal of Safety Research 26(3), 187–195 (1995) 2. Mardis, A.L., Pratt, S.G.: Nonfatal injuries to young workers in the retail trades and services industries in 1998. Journal of Occupational and Environmental Medicine 43, 316–323 (2003) 3. Mayhew, C., Quinlan, M.: Fordism in the fast food industry: Pervasive management control and occupational health and safety risks for young temporary workers. Sociology of Health and Illness 24, 261–284 (2002) 4. Breslin, F.C., Polzer, J., MacEachen, E., Morrongiello, B., Shannon, H.: Workplace injury or part of the job?: Towards a gendered understanding of injuries and complaints among young workers. Social Science & Medicine 64, 782–793 (2007) 5. Islam, S., Mannering, F.: Driver aging and its effect on male and female single-vehicle accident injuries: Some additional evidence. Journal of Safety Research 37, 267–276 (2006) 6. Monarrez-Espino, J., Hasselberg, M., Laflamme, L.: First year as a licensed car driver: Gender differences in car crash experience. Safety Science 44, 75–85 (2006)
Evaluating a Personal Communication Tool: Sidebar Malena Mesarina, Jhilmil Jain, Craig Sayers, Tyler Close, and John Recker HP Labs, Palo Alto firstName.lastName@hp.com
Abstract. By more closely integrating email with the web, we aim to bring organization to email and more collaboration to the web. To this end we developed Sidebar, a web-browser plug-in that displays email messages linking to the currently displayed URL. We conducted longitudinal studies of two versions of Sidebar to observe its usage and determine whether it improves communication productivity. We found that providing an email summary in Sidebar raised awareness of email collaborations, increased serendipitous discovery of information, and led to higher reported communication productivity. This paper summarizes Sidebar's operation, describes the user studies, and presents conclusions. Keywords: Personal communication, browser plug-in, longitudinal user study, interviews, diary study, surveys, usability evaluation, conversation visualization, information visualization, email visualization, conversational thumbnail, email, related links, related web-pages.
email editor to display a pre-populated message whose To and CC fields are filled with all those people who have previously been involved in email discussions about the relevant web page and where the body of the email includes the URL and title for the web page.
Fig. 1. Sidebar at work for the W3C Web Security Context Working Group. The plugin on the left-hand side summarizes email that links to the currently displayed document. It is sorted by date and URL fragment identifier.
While the basic operation is simple, Sidebar leverages both the existing email infrastructure for content storage/delivery and the existing web infrastructure for navigation. As a result it serves several roles: 1. It allows serendipitous email discovery. Users may just browse the web as usual and Sidebar will discover emails that refer to the visited web page. One common means of navigating the web is by clicking on a link inside an email message and in this case the Sidebar conveniently shows not only that email message, but also any other messages the user may have received about the same page.
2. It provides organization for email discourse. When starting a new conversation, just include a link to a related web page. For example, if discussing a trip you might link to an appropriate map. Now, whenever the user revisits that page he or she will see all the related discussions even though those emails may have had different recipients, subjects and creation dates. Think of the web as providing a navigational structure on which you can hang conversations. Note that end-users also have the option to generate pages themselves (using a wiki for example) just to serve as handles for later conversations and that in corporate environments domain experts can mine existing databases to automatically generate a set of web pages which can then serve a navigational role. 3. It collates private and public information by showing your personal email alongside relevant web pages. For example, when visiting a wiki page you can choose to publish content in a relatively public way by editing the wiki page or in a relatively private way by authoring an email to a more limited set of individuals. Sidebar shows both communication paths side-by-side and allows email to serve as a replacement for access-controlled collaborative systems. This is particularly valuable in corporate environments where security and document retention policies for email are already well-defined and understood. 1.1 Visualization Thumbnails While showing summary text of every relevant message is adequate for short conversations it does not scale well. This is particularly true for our application since the email conversations are peripheral to the browsing task and so are given only limited display area. Accordingly, we explored the design of small visual images to represent email conversations. An example of these conversational thumbnails is shown in Figure 2.
Fig. 2. A visual thumbnail is shown in the sidebar at left. A closeup of the visualization is shown on the right. Along the bottom are conversational participants, the arcs show message paths, and the small planet-like objects are links to related web pages.
The thumbnail depicts conversational participants, the flow of messages from sender to receiver, and other URL links found in the email conversations that refer to the current URL. Size and relative positions were used to visually relate the importance of these objects to the user.
People are represented by icons whose proximity to the current user (the blue person icon) reveals how important that person is to the user in conversations about the current web page. More recent messages that are either directly to (not CC'd or BCC'd) or from the current user are considered of higher importance. The color assigned to each person is a function of their email address. The flow of messages between participants is depicted by the curved links between the dots on the white circle. The white circle is a symbolic representation of the current conversation, where the user is represented by the blue dot at the center and the dots at the border are the other people involved in the conversations about the web page. Each curved link represents a message exchange whose color depends on the sender. Note that some links do not pass through the center (as in the figure above); these indicate messages on which the current user was CC'd. Related web sites are depicted by the grey circles, which we call "planets," orbiting the current conversation. The size and vertical location of the planets depend on their relevance to the current user: was he or she the sender or recipient, how frequently was the link mentioned in email, and how long ago did the communication happen. Further details of the visualization are available in [9].
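As a hedged guess at how such cues might be combined, the sketch below computes a planet's size from sender role, mention frequency and recency; the weights and decay constant are invented for illustration, and the paper's actual weighted function is multi-dimensional and not specified here:

```python
import math

def planet_size(sent_by_me: bool, mentions: int, age_days: float,
                base=10.0, cap=40.0):
    # Links the user sends carry more weight than those received (per Sec. 3.10).
    role_w = 2.0 if sent_by_me else 1.0
    # Roughly monthly exponential decay of recency (invented constant).
    recency = math.exp(-age_days / 30.0)
    raw = base * role_w * (1.0 + math.log1p(mentions)) * (0.5 + recency)
    return min(cap, raw)

print(planet_size(True, mentions=5, age_days=3))    # recent, frequently sent
print(planet_size(False, mentions=1, age_days=90))  # old, rarely mentioned
```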
2 The Study
We conducted two studies, one for each version of Sidebar. The first study, on the "text" summary version, was designed to determine whether Sidebar improved collaborative communication in job-related tasks. The second study, on the "visual" version, was conducted to determine whether the visualization thumbnails added to Sidebar's usability.
2.1 First Study
We conducted the first study in early 2008. The participants were selected from the marketing communications staff ("communications") and administrative assistants ("admins") at HP Labs. Sending URLs of web sites as references for work-related content was a typical activity of both groups. Five of the participants (all female) were admins and the rest (4 female, 1 male) were from communications, with ages in the 20-60 range. We asked the participants to install and use Sidebar for three weeks. We used the diary-study methodology to capture participant experiences during the entire three-week period. Additionally, we conducted weekly semi-structured interviews and surveys with all participants to study how usage behavior and perceptions changed over time. We also encouraged participants to report any particularly good or bad experience via email (the link was provided on the Sidebar interface) as soon as possible after the relevant interaction.
2.2 Second Study
For the second study we selected five new participants with different job functions: research manager, scientist, writer, admin and research intern. Two were female, three male, and they ranged in age from 30 to 45. After installing the version of Sidebar
that included the visualization thumbnail, we conducted a two-week study. We did not explain the visualization, but instead asked them to explore it on their own. We interviewed them after 20 minutes and then again after one week and asked them to explain the various facets of the visualization.
3 Findings
3.1 Purpose of Using URLs during Communication
One of the goals of the study was to uncover the primary reasons why people send URLs to themselves and to other people. We identified five reasons: a) transferring data from one machine to another (laptop to desktop and vice versa) or one physical location to another (work to home); b) sharing events (an upcoming conference) or knowledge (articles, papers); c) decision making (which component or device should I buy? recommendations for books); d) archiving, since this is more permanent than bookmarking and is accessible from any machine, any browser, and any location; and e) reminding (taxes and forms to fill out, conference registration, etc.).
3.2 Adding Comments from Sidebar
Though Sidebar provided a link "Start a new email thread", we observed that while this was the favorite feature of one admin, many still preferred a manual cut-and-paste mechanism to send emails. It could be that these users were torn between habit and convenience, or they simply preferred to use the same mechanism as needed to insert other content into a message. When the user clicks the "Start a new email thread" hyperlink, Sidebar automatically adds all the people engaged in the various conversation threads as recipients of the generated email. Somewhat surprisingly, most users did not appreciate this convenience and were wary of accidentally sending email to unintended recipients.
3.3 Wiki+Sidebar = Topic Organization
One of our participants from the communications department is responsible for organizing customer visits to HP. After using Sidebar for a week, he began using the company's wiki tool along with Sidebar to organize each visit. He created a new wiki page for each visit, where he recorded organizational details and the agenda, and then included a link to the relevant agenda page within each message. In this manner, he could use Sidebar to organize his emails relevant to each meeting. Interestingly, in contrast to the concerns of most other participants, he found the pre-population of email addresses of people involved in a thread useful.
3.4 Effects of Combining Email and Browsing
Most of our target users use Microsoft Outlook for their email communications and Internet Explorer/Firefox for browsing. Keeping this in mind, we designed Sidebar such that all email-related tasks are removed from it in order to reduce cognitive load. But we observed that once users became comfortable with using Sidebar,
they wanted to be able to access and search all their email from Sidebar itself. When surveyed, almost 80% of the participants wanted a tool that could integrate email and web page browsing.
3.5 Communication Productivity
Each week we asked users if they felt that Sidebar had increased their communication productivity. Here we observed an interesting phenomenon: participants felt that Sidebar had resulted in increased productivity especially when they did not remember the URL. Note that in this scenario, participants typically had to first search inside Outlook using keywords to locate any one email that contained the (unknown) URL in question. They would then click on the URL link in that email to display the page in the browser and use Sidebar to extract other related emails. Thus it appears that users saw Sidebar's ability to display all the other messages as valuable. Participants' web browsing sessions consisted of viewing not only web pages that they were collaborating on, but also web pages that they were not. Thus, from a communications point of view, only a subset of the web pages viewed in the browsing session were relevant to the participants. Sidebar, however, shows relevant emails for any web page that the participant may have sent email about. This feature revealed something interesting. When viewing known URLs in the browser, the participants tended not to interact with the emails since they were already aware of the content of the conversations about the URL. However, when the URL was unknown, the participants always interacted with the emails displayed by Sidebar since they were specifically using Sidebar to extract related emails. Thus most participants had the perception that Sidebar was more valuable when the URL was not known. During this study we observed a number of occasions when participants desired other web pages related to the web page being viewed, or needed help to locate emails for an unknown URL. We conclude that when displaying communication related to a web page, the email communication channel becomes intertwined with browsing, and an integrated interface that combines the relevant functionalities of both becomes essential.
3.6 Multiple Communication Channels
People generally use multiple communication channels that typically consist of more than one email account, instant messaging, SMS, Twitter, etc. When they share URLs they also tend to use more than one of these channels. While email was a predominant channel at work, we observed that participants used Jabber (an IM client) or SMS to send short snippets of information. Typically, Jabber was used when information was not of much importance, and SMS was used mostly for urgent or time-critical issues. When users were viewing a URL, if they had used a channel other than their work email for sending the URL or communication related to it, this information would not appear on the Sidebar interface (since at the time of development Sidebar only indexed emails from Outlook), thus causing confusion. This was an interesting finding and one of the key problems of integrating web and communication channels.
3.7 Personal Productivity Trends
We observed that some users were quite surprised to discover both how many of their incoming email messages already contained URLs and how many messages containing links they were sending. Included with Sidebar is a web application for extracting reports about the Sidebar index. These reports show how long the user has been talking about a topic, what days of the week the topic is normally discussed and what the hottest topics of conversation are. Sidebar trends track the popularity of topics in the user's social sphere and help determine communication productivity around a topic.
3.8 A Personalized Index into Long Web Documents
When Sidebar is indexing the user's email it recognizes those messages which refer to particular sections within a larger document (using fragment identifiers [7]) and arranges the email messages using section headings extracted from the web document. Users can then navigate the larger document by clicking on sub-headings in the Sidebar, or navigate among messages by clicking on links to sections of the document. One of the co-authors of this paper utilized this feature extensively during his role as editor of a W3C working group document. As working group members sent emails with comments and links to the corresponding sections that had a URL fragment, Sidebar would group and display the emails referring to the appropriate section next to it. This is an example where the integration of email content with web documents makes editorial work more efficient; however, our user groups did not appear to notice this feature, and it is unlikely to be appreciated until the use of fragment identifiers becomes more pervasive.
3.9 Privacy
Initial observations (conducted before the longitudinal study) indicated that some users were concerned about the location of the email index created by Sidebar. We specifically addressed this by making clear to participants that the index was being created on their own local machine. During the weekly interviews we asked users if the email indexes were created as per their privacy expectations, and all 11 participants reported that they were. We were also concerned about inadvertent disclosure caused by having Sidebar open while browsing the web in a group setting. None of the users reported any discomfort, and several mentioned that it was very easy to toggle between opening and closing Sidebar in the browser.
3.10 Visualization Thumbnail
All participants understood that the people displayed in the thumbnail represented those involved in the email conversations about the currently displayed web page. Although the relationship between the position of a person and the importance of his/her emails was not apparent to the participants, all mentioned that seeing the number of people involved in a conversation suggested how important a web page was (in a collaborative context).
Everyone figured out that they were at the center of the circle, but the abstraction of messages as lines was not apparent to either the writer or the admin. Interestingly, in both cases they had tried hovering over and clicking on the message lines and people icons, so if we had supported an appropriate action in each case the meaning might have been more readily apparent. All users noted that the planets represented links related to the current web page, and thought that the size of a planet depended on the number of times the URL was mentioned in an email (the latter is not exactly correct, since we had actually computed the size based on a weighted function so that, for example, links I send carry more weight than those I receive). No users correctly interpreted the planet positions. Even though several noticed that pages closer to the current page were more to the right and that similar web pages appeared close together, they were confused when dissimilar pages also appeared close together at the outer edges.
4 Related Work
Several tools, such as Remembrance Agent [6], Margin Notes [5] and VistaBar [4], have been developed for indexing email, files and other online documents based on web page content. One of the main differences between Sidebar and these tools is the relevancy of the matched documents. Sidebar uses the web page URL to search for messages embedding the same URL, and is thus able to find messages that are, most of the time, about the web page. In contrast, [6], [5], and [4] use keyword matching between the content of a page and other documents. Anchored Conversations [2] is a tool for adding comments to documents; it is a plug-in for Microsoft Word that allows users to attach multiple chat windows to sections for real-time discussions. In the case of Sidebar, email messages with a URL fragment point to the corresponding sections under that fragment on a web page. There are several academic and commercial tools for searching and sorting email; two prominent solutions are Google Desktop Search [3] and Xobni [8]. Although one feature of Sidebar is that email messages are organized based on shared embedded URLs, Sidebar is more an integration tool between email communication and web content than a mail organizer or search tool, and in fact coexists well with those search tools.
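To illustrate the contrast with keyword matching, here is a rough sketch of URL-keyed indexing and lookup (our illustration, not HP's implementation), including the fragment-identifier grouping discussed in Section 3.8:

```python
# Messages are indexed by every URL they embed; visiting a page then looks up
# both the exact section (URL fragment) and the page as a whole.
import re
from collections import defaultdict
from urllib.parse import urldefrag

URL_RE = re.compile(r"https?://[^\s<>\"']+")

index = defaultdict(list)  # (page URL, fragment) -> list of message ids

def index_message(msg_id, body):
    for url in URL_RE.findall(body):
        page, fragment = urldefrag(url)
        index[(page, fragment)].append(msg_id)

def messages_for(url):
    page, fragment = urldefrag(url)
    exact = list(index.get((page, fragment), []))          # same section
    whole_page = [m for (p, _), ids in index.items() if p == page for m in ids]
    return exact, whole_page

index_message("msg1", "see http://example.com/spec#sec2 for details")
index_message("msg2", "comments on http://example.com/spec")
print(messages_for("http://example.com/spec#sec2"))  # (['msg1'], ['msg1', 'msg2'])
```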
5 Discussions and Conclusions
Sidebar provides a summary of user email which is relevant to a currently visited web resource. After conducting two user studies, we conclude that:
• Sidebar provides serendipitous discovery of emails when browsing. In this case, the email summaries served mostly as historical reminders, and the participants tended not to interact with them.
• Sidebar also aids in search. Once users had used existing email search tools to find one message containing a URL, they could select it to see all other relevant messages within the Sidebar.
• Some participants preferred manually creating email messages and copying URLs even though Sidebar could provide this service automatically. We note that users are very comfortable with existing email and web browsing tools, and that some users were nervous when address fields were automatically pre-populated.
• Hanging conversations off web pages is a powerful organizational tool, and we expect it to have the most benefit in organizations where there is already a suitable set of web addresses, or where users change their behavior to accommodate it. A particularly good usage pattern is to create a twiki page for each meeting and then include a link to it in all related emails.
• Participants expressed the desire to have a single unified experience combining web browsing with email and other communication channels.
• Although the document indexing capability of Sidebar using fragment identifiers was very useful for online editing, this does not seem to be a prevalent activity among users.
• In terms of privacy, although at first the participants raised some privacy concerns related to the usage of their email archives, Sidebar’s approach of storing personal email on the local machine matched user privacy expectations.
• Displaying people and related links in the “visual” version of Sidebar were both immediately appreciated by all our test users. Interestingly, while we had computed multi-dimensional weighted functions to determine position and size, our test users tended to assume a much simpler and more democratic relationship: the size of an object is proportional to the number of messages.
• The visualization has room for improvement, especially in the messaging display, which caused the most difficulty for our users. Improved interactivity should help.

In today’s workplace, a large percentage of communication is conducted using email, leaving behind large but mostly unorganized email archives. Today’s web provides a wealth of organizational structures, as well as technology for bringing other organizational structures onto the web. By more closely integrating email with the web, Sidebar brings organization to email and more collaboration to the web.
Acknowledgements Thanks to Venugopal Srinivasmurthy and Badrinath Ramamurthy for their assistance earlier in this research project, to Vlad Bolshakov who helped create the Outlook plugin, and to all those developers who contributed to the open source libraries on which our solution depends. Thanks also to our colleagues at HP for their advice and encouragement, especially our anonymous test users.
References 1. Close, T., Recker, J., Sayers, C., Badrinath, R.: Sidebar: Ad-hoc, yet organized Personal Collaboration. HP Labs Technical Report HPL-2008-17 (2008), http://www.hpl.hp.com/techreports/2008/HPL-2008-17.html 2. Churchill, E.F., Trevor, J., Bly, S.A., Nelson, L., Cubranic, D.: Anchored Conversations: Chatting in the Context of a Document. In: Proc. CHI 2000, pp. 454–461. ACM Press, New York (2000)
3. Google Desktop, http://desktop.google.com/features.htm 4. Marais, H., Bharat, K.: Supporting Cooperative and Personal Surfing with a Desktop Assistant. In: Proc. of UIST 1997, pp. 129–138. ACM Press, New York (1997) 5. Rhodes, B.: Margin notes: Building a contextually aware associative memory. In: Proc. IUI 2000, pp. 219–224. ACM Press, New York (2000) 6. Rhodes, B., Starner, T.: Remembrance agent: a continuously running automated information retrieval system. In: Proc. PAAM 1996, pp. 487–495. The Practical Application Company Ltd., Blackpool (1996) 7. URL Fragment Identifiers, http://www.w3.org/Addressing/URL/4_2_Fragments.html 8. Xobni, http://www.xobni.com/ 9. Sayers, C., Mesarina, M., Jain, J., Recker, J.: Visualizing Email Conversations and Related Web Resources, HP Labs Technical Report HPL-2008-138, Hewlett Packard (October 2008)
“You’ve Got IMs!” How People Manage Concurrent Instant Messages
Shailendra Rao1, Judy Chen2, Robin Jeffries3, and Richard Boardman3
1 Stanford University
2 University of California, Irvine
3 Google
shailo@stanford.edu, judychen@ics.uci.edu, {jeffries,rickb}@google.com
Abstract. Instant Messaging (IM) clients allow users to conduct multiple simultaneous conversations, which we term “concurrent IMs.” In this study we investigate how adults manage concurrent IMs both in the workplace and within the context of a goal-directed, time-bounded recreational task. We discuss differences in behavior between engaging in a single IM conversation and engaging in concurrent IMs. We document the errors that arise as a consequence of concurrent IMs and identify four main strategies users employ to manage them: controlling the pace of conversations, limiting the number of simultaneous conversations, window management, and using tabbed IM windows. Finally, we explore the pros and cons of these strategies and examine design tradeoffs to enable effective space and attention management while minimizing disruption to the user. Keywords: Instant messaging, concurrent IMs, multitasking, informal communication, notifications, tabs.
while using IM frequently occurs in the workplace and observed users participating in multiple simultaneous one-to-one IM conversations, which we term concurrent IMs. Although previous work has studied the management of multitasking – that is, engagement in multiple activities [8] – the phenomenon of engaging in concurrent IM conversations introduces several meta-issues on top of the already complex nature of multitasking. Although studies such as [5] and [6] have recognized the occurrence of concurrent IM conversations, we examine this behavior in depth and explore the issues that arise as a consequence of engaging in multiple IM conversations simultaneously. How does one decide on the degree of attention to give to a particular conversation? What are the challenges people encounter when managing multiple simultaneous conversations? What strategies are useful in dealing with those challenges? What design tradeoffs follow from engaging in concurrent IMs while multitasking?
2 Method
Because users' goals and context influence how they communicate over IM, we chose two real-world settings to observe and learn how people manage and negotiate multiple conversations. The first was an investigation of IM usage in the workplace, where users at a technology company communicate with co-workers as well as outside friends and family. In the second phase we explored IM usage in an online Fantasy Football draft, where participants chat with multiple people during a goal-directed, time-bounded task. In total, the collected observational data consisted of 29 hours (14 hours of IM in the workplace and 15 hours of Fantasy Football drafts). In addition, we performed 25 hours of interviews (20 with the workers and 5 with the Fantasy Football managers).

2.1 Phase I: IM in the Workplace
In the first phase of the study, we investigated the IM usage of 20 employees at a large technology company. All of the participants were experienced IM users (1+ years of usage). The participants in our workplace sample included a receptionist, an administrative assistant, an internal communications specialist, customer support representatives, software engineers, facilities coordinators, and several interns. We conducted a one-hour, semi-structured interview with each participant to understand their typical IM usage. We asked participants about their recent experiences with concurrent IMs and group chats, as well as how they prioritized conversations. After the initial interview, we asked the participants to provide us with an hour-long screen-capture of their IM activity. In a follow-up interview, we reviewed the screen-capture with each participant and asked them to provide us with context, informing us of the tasks they were working on while using IM, with whom they were chatting, and whether or not their conversations were related to the other tasks in which they were engaged. We compensated the participants for the initial interview with reward packages valued at $50. Participants received an additional $25 after they submitted a screen-capture of their IM activity and participated in a follow-up interview. We received screen-captures from 14 participants.
We anticipated that participants would have privacy concerns about participating in a study in which the content of their IM conversations would be visible. This is an especially sensitive issue in the workplace, where participants may feel self-conscious about holding non-work-related conversations, which could cause them to depart from their normal IM behavior. We were also aware that we could potentially observe a participant for hours and not see any IM activity. As an alternative to observing the participants live, we asked them to use screen-capture software to record their activity. We wanted to reassure participants that the content of their IM conversations was not the focus of our study, so the screen-captures were deliberately of low quality, enabling us to see screen activity but not read any specific text on the screen. Participants were also given full control over the timing and content of their submissions; they decided when and what they wanted to capture.

2.2 Phase II: Fantasy Football Draft and IM
The second phase of our study examined IM use in Fantasy Football, in which participants play the role of a manager of a National Football League (NFL) team. Near the start of the season, managers conduct a draft in which they forecast which NFL players will have the best statistics during the season and select players accordingly. After the draft, team managers earn points based on their players’ performance in weekly NFL games [3]. An online Fantasy Football draft user interface typically supports a group chat and a timer. The group chat is seen by all league members in their draft window and is usually used for draft-related discussion and banter. The timer ensures that all managers have no more than a specified amount of time to select a player. In addition to the timed task of drafting a roster and using the group chat, managers can engage in IM conversations, phone calls, face-to-face conversations, and email outside the draft interface. Draft participants can contact or be contacted by people in their league about content in the group chat, such as picks, trades, advice, and jokes, via a private backchannel. They can also be contacted by people outside their league about things that are not related to the draft. We chose to study an online Fantasy Football draft because it represents a setting in which both group chats and concurrent IMs can occur while people are multitasking. The draft is a unique environment because of its fast pace and massive inflow and outflow of communication, making it an interesting arena in which to study concurrent IM usage in the context of a goal-directed, time-bounded recreational task.

We observed seven managers from six different Fantasy Football leagues do their drafts and interviewed them afterwards. We recorded the Fantasy Football managers’ computer screens with screen-capture software. After the draft, we interviewed each participant, following the same protocol we used in the first phase of the study. We reviewed the screen-capture with the participants and asked them to comment on their behavior, communication, and task management strategies. Because we were studying people in a recreational setting, some of the limitations that restricted us in the workplace setting did not apply. Live observation was appropriate because the content of Fantasy Football participants’ conversations was less likely to be sensitive. The Fantasy Football managers were compensated with reward packages worth $50.
3 Results and Discussion
Table 1 summarizes the IM usage data from the participants in both phases of the study. We begin our discussion by examining how our participants across both phases behaved differently when managing an individual IM as opposed to concurrent IMs. Then, we document some of the common errors people made when engaged in concurrent IMs. We then identify four key strategies that people utilized to manage concurrent IMs, which emerged from our interviews and observations. Finally, we discuss key design tradeoffs that follow from our findings about concurrent IMs.
Table 1. A summary of the participants’ IM usage for the Phase I participants (left) and the Phase II participants (right)

Phase I (workplace) participants P1–P20, with gender (P1–P13: F, F, M, F, M, F, F, F, M, M, F, M, F) and IM status, in order: always on; always on; on when available; always on; always on; always on; always on; always on; always on; always on; always on; always on; always on; always on, invisible when unavailable; often on; always on; always on; always on; on when available; always on.

Phase II (Fantasy Football) participants, with gender and reported comfortable number of concurrent IMs:

Participant   Gender   Reported Comfortable # Concurrent IMs
F1            M        5
F2            M        3
F3            M        9
F4            M        4
F5            M        0
F6            M        4
F7            M        3
3.1 Comparing Behavior between Individual and Concurrent IMs
Both the workplace IM and Fantasy Football participants reported engaging in different behavior when participating in concurrent IMs compared to chatting in a single conversation.

Shorter Responses. Six workplace participants and two Fantasy Football participants reported that they gave shorter responses to their conversation partners when they were chatting with multiple people. One of the workplace participants said that he did
not mind giving terser responses, since he felt people generally expect short and abrupt conversations over IM.

Less Attention per Concurrent IM. Five workplace participants reported that they paid less attention to each conversation when there were concurrent IMs. With concurrent IMs, attention is divided across conversations. Such division is potentially unequal, depending on a user’s context, their relationship to their chatting partners, and the content of each dialogue. Splitting attention across concurrent IMs is not easy: four technology workers noted that they have a hard time keeping track of multiple IM conversations. As we expected, several participants recalled memory lapses where they had forgotten what had been said in certain conversations.

Multitasking Stress. Three workplace participants and one Fantasy Football manager explicitly noted that handling concurrent IMs can be stressful. One possible explanation for this stress is that distributing attention across multiple conversations increases cognitive load. IM system notifications, specifically blinking windows, can also make it difficult for a user to focus on a particular IM conversation, let alone deal with other tasks and applications.

3.2 Errors with Concurrent IMs
Participants reported making the following errors with greater frequency when managing concurrent IMs as opposed to a single IM conversation: sending a message to the wrong person, forgetting about chat partners, accidentally closing an IM window, and sending a message in a language the partner did not understand.

Sending a Message to the Wrong Person. The most frequent error participants reported was sending a message to someone other than the intended recipient. Six of the tech workers and one of the Fantasy Football managers recalled making this error. We also observed one Fantasy Football manager commit this error during their draft. Several participants reported being worried about making this mistake whenever they engage in concurrent IMs. As Grinter and Palen pointed out, the consequences of making this mistake can vary drastically in severity [5]. Mistakenly sending one casual chat line to the wrong friend may be inconsequential. However, sending a message to the wrong person, particularly in the workplace, can have serious consequences. One of our participants from the tech company reported that she once accidentally told her boss to “hold on a freaking second” while she was chatting with several people simultaneously. Luckily, her boss was understanding when she later explained her mistake.

Forgetting about Chat Partners. Three work IM users and one Fantasy Football participant recalled instances where they forgot about chat partners when they had concurrent IMs. This is expected when a user is dividing their attention unevenly across IMs. It may arise as a consequence of chat windows being obscured by other windows or of overlooked IM notifications.

Accidentally Closing Windows. We interviewed three workplace IM users who told us that they have accidentally closed IM windows by clicking the “x” on the window when they meant to click the minimize button. One Fantasy Football manager also
made this error during his draft. On many clients, concurrent IMs require multiple windows, which in turn can lead to window management errors. Recovering from this error can be as trivial as reopening the chat window or as difficult as reconstructing the topic and the conversation from scratch. This problem is exacerbated when IM users use clients that do not support conversation logging and history, or when they have not enabled this feature.

Confusing Languages. Two of our work IM participants regularly spoke to people over IM in different languages and scripts. One of these participants reported sending something in the wrong language over IM. Unlike the error of sending a message to the wrong person, this mistake was due not to addressing the wrong conversation window, but to missing a critical pragmatic cue.

3.3 Strategies for Managing Concurrent IMs
Our interviews and observations uncovered several key strategies participants used to manage concurrent IMs and deal with the aforementioned challenges.

Controlling the Number of Conversations. One of the ways people manage concurrent IMs is by reducing them to a number they feel comfortable managing. Grinter and Palen reported that some IM users felt that they had a personal threshold beyond which they were unable to monitor their conversations sufficiently [5]. This threshold depends on the individual and varies according to the nature of the conversations, the chat partners involved, and the deadlines of the other tasks they are managing. Common ways of controlling the number of conversations we observed included adjusting online status and visibility (e.g., available, idle, away) and using different screen names or IM services to divide up different groups of contacts and tasks. One workplace IM participant and one Fantasy Football manager reported that they frequently quit their IM programs when they reach their upper limit of concurrent IMs. Signing off or exiting the IM program altogether essentially reduces the number of IM conversations to zero. Another strategy for controlling the number of conversations was to merge them by creating a group chat. Interestingly, merging IMs into a group chat did not necessarily decrease the number of conversations. Some participants kept their individual IMs active to maintain private backchannels; this was common among the Fantasy Football managers. In this case, IM users are not decreasing the number of conversations, but rather establishing a shared space so that messages do not need to be repeated across individual conversations. With a group chat and private backchannels, chat participants are controlling both the amount and type of content in the concurrent IMs.

Controlling the Pace. IM is inherently asynchronous, since users can decide if and when they will respond to a message. While recent work has been done on predicting whether or not a user is likely to respond to an incoming message within a certain period of time [1], we explored responsiveness with respect to the way it was used to control the pace of IM conversations. Monitoring one’s response speed and attending to conversational cues about one’s conversation partner are some of the ways to achieve this. We often observed participants intentionally ignoring their chat partners while they were participating in a conversation with someone else or engaged in
another activity altogether. Participants often found themselves synchronizing their pace with their chat partners, which reduced the number of overlapping message transmissions and interleaved conversation threads.

IM Window Management. We observed two approaches to IM window management: grouping and closing. Participants typically kept all of their IM conversation windows in a specific area of the screen, leaving the rest of the screen available for other computer-related tasks. With respect to closing, there were three strategies. Aggressive closers closed a conversation window or tab before a conversation was over, typically after each message interchange. Moderate closers tended to close IM windows or tabs after a conversation ended. Non-closers usually left all IM windows and tabs open indefinitely, or until they quit the IM client or turned off their computer. We observed participants employing a combination of different strategies depending on their context and their chat partners.

Using Tabbed IM Windows. Many participants employed tabbed IM window features to manage concurrent IMs. From our observations, it appeared that the participants who used tabbed IM windows were likely to maintain more IM conversations than the participants who did not. The main advantage of tabbed IM windows is that they save screen real estate. Instead of several chat windows populating a user’s computer screen, there is a single window dedicated to IM conversations (see Figure 1). Furthermore, the participant needs to engage in less window management. Tabbed IM windows were reported to be less disruptive, since new conversations pop up as unfocused tabs rather than as new windows. They also do not capture keyboard input focus, reducing the likelihood of a participant unintentionally typing in the wrong IM window and sending a message to an unintended contact. However, tabbed IM windows have weaker visual cues than non-tabbed IMs. When a minimized IM window blinks, non-tab users can tell whom the message is from, since only one chat partner is associated with the window. With tabbed concurrent IMs, however, users cannot tell whom a new incoming IM message is from based only on the blinking notification. This is of particular concern for IM users who prioritize conversations based on person or content, since there is no way to differentiate conversations through window-level notification schemes.
Fig. 1. A tabbed IM window
Other Strategies for Managing Concurrent IMs. Nearly half (9) of the workplace IM participants reported that they prioritize concurrent IMs based on either their chatting partner or the content of the conversation. This supports the intuition that some people pick particular conversations to pay attention to when handling more than a single IM. It can be a conscious decision in which concurrent IM users impose meaning on their chat windows, rather than letting the window cues and placement always dictate their attention and response strategies.

It has been previously documented that a specialized language filled with abbreviations, acronyms, and contractions has evolved with text messaging, including IM and email [2, 4]. Three of our participants reported that they find themselves using abbreviations, shorthand, and acronyms more often when engaging in concurrent conversations than in an individual conversation, as a way of responding to their chat partners more efficiently. Some examples of such abbreviations are “busy ttyl” for “I’m busy, I’ll talk to you later”, “brb” for “be right back”, and one participant’s shared convention of “222” for “In a Meeting”. The text equivalent of “uh huh” and emoticons were employed so that users could quickly let their chatting partners know that they were paying attention.

Two workplace participants and one Fantasy Football manager noted that they would avoid asking in-depth questions of their conversation partners when holding multiple simultaneous conversations. Their justification was that an in-depth question could result in a long, engaging conversation requiring increased attention and greater cognitive effort. Six workplace IM users and two Fantasy Football participants also reported that they often had to repeat themselves during concurrent IM conversations. One strategy to expedite constructing these repetitive messages was to copy and paste text between different IM conversations.
4 Design Tradeoffs
This study uncovered two sets of key tradeoffs with concurrent IMs while multitasking: managing multiple windows versus managing tabs, and notifications versus disruptions.

4.1 Managing Multiple Windows versus Tabs
Tabs are one attempt to deal with the window management issues that arise with concurrent IMs. This approach brings a set of tradeoffs. Compared to separate non-tabbed windows, tabs require a different set of physical actions. Unlike non-tabbed windows, with tabs there is only one window to move and rearrange regardless of how many conversations are being managed. This can potentially make things easier for users handling concurrent IMs. But tabs can also demand more physical action and effort than non-tabbed windows. With tabs, if a window with concurrent IMs is minimized to the task bar, only the focused conversation’s title will be visible. Adding to an ongoing conversation with someone other than the partner in the focused tab means opening the tabbed window and then selecting the appropriate tab. This is an increase in effort compared to selecting a single non-tabbed window.
Working with tabbed IMs can affect how users impose meaning on their conversation windows. Tabs make it easy to focus on one conversation at a time, which helps people who privilege one partner’s chat over other ongoing IM conversations. Users can leave a conversation in the foreground of the tab set if they are awaiting a message that they deem important. Conversely, users can leave a conversation in an unfocused tab if they are trying to hide that content from people passing by (i.e., co-workers). Separate IM windows can help users who want to spread their attention across multiple conversations simultaneously. We observed one technology worker who purposely placed two conversation windows horizontally side by side. They reported that the conversations had equal importance to them, and this arrangement helped them attend to both equally. Tabs make only one conversation visible at a time, so this mechanism for assigning importance would not be possible.

4.2 Managing Alerts versus Disruptions
Our study has begun to uncover the tradeoffs between the alerts that notifications provide and the disruptions they may impose. This tradeoff depends on both a user's situation and their personal alert preferences. Designing an effective notification system is challenging and rests on many subtleties. The different types of notifications each tell us something different about alerts in IM. The visual cues of color, pop-ups, and blinking windows differ in intensity and effectiveness. Color, the weakest of the visual cues, does not attract attention as much as the other two. None of our users turned color off, suggesting that this notification did not by itself place excessive demands on attention in the context of IM. Pop-up notifications are stronger cues than color because the human visual system is sensitive to motion. All of our participants turned off the pop-up notifications for their contacts' status. There could be three reasons for this: 1) the contact status information is not useful, 2) the pop-up action itself is distracting, or 3) the utility of the information does not warrant such a strong notification cue. Contact status updates do not seem to need notifications, since over 80% (22/27) of our users made a conscious effort to keep their contact lists visible at all times. This suggests that notifications relying on motion need to be carefully considered and selected. Window blinking, the strongest visual notification, is not without its tradeoffs. It is attention grabbing and hard to ignore because of its constant motion. In some cases this alerts users appropriately, but in others it becomes a disruption rather than an alert. This suggests that window-blinking IM notifications need to be rethought. Our participants preferred to have sound turned off. Sound may not be as disruptive as other cues to the user, but unlike the visual cues it has the potential to disturb co-located people who are focused on their own tasks.
5 Conclusion
Although IM has been the focus of many studies, concurrent IM conversations have not yet been widely explored. Given the fragmented nature inherent in IM use, understanding the differences between one-to-one IM and concurrent IM use would enable designers to support effective space and attention management while minimizing disruption. Although still in its exploratory stages, this study has uncovered
that concurrent IM use is highly situated, requiring the user to constantly make decisions regarding attention, window, and conversation management. The study has also allowed us to gain a better understanding of the behavioral differences between one-to-one and concurrent IM, highlighting some of the challenges users face when engaged in concurrent conversations.
References 1. Avrahami, D., Hudson, S.E.: Responsiveness in instant messaging: Predictive Models Supporting Inter-Personal Communication. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 731–740. ACM Press, New York (2006) 2. Baron, N.S.: See You Online: Gender Issues in College Student Use of Instant Messaging. J. Language and Social Psychology 23(4), 397–423 (2004) 3. Fantasy Football, http://en.wikipedia.org/wiki/Fantasy_football_American 4. Grinter, R.E., Eldridge, M.: Y do tngrs luv 2 txt msg? In: Seventh European Conference on Computer Supported Cooperative Work, pp. 219–238. Kluwer Academic Publishers, Netherlands (2001) 5. Grinter, R.E., Palen, L.: Instant Messaging in Teen Life. In: ACM Conference on Computer Supported Cooperative Work, pp. 21–30. ACM Press, New York (2002) 6. Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D.J., Kamm, C.: The Character, Functions, and Styles of Instant Messaging in the Workplace. In: ACM Conference on Computer Supported Cooperative Work, pp. 11–20. ACM Press, New York (2002) 7. Kraut, R.E., Fish, R.S., Root, R.W., Chalfonte, B.L.: Informal Communication in Organizations: Form, Function, and Technology. In: People’s Reactions to Technology (Claremont Symposium on Applied Social Psychology), pp. 145–199. Sage, Thousand Oaks (1990) 8. Mark, G., Gonzalez, V.M., Harris, J.: No Task Left Behind? Examining the Nature of Fragmented Work. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 321– 330. ACM Press, New York (2005) 9. Nardi, B.A., Whittaker, S., Bradner, E.: Interaction and Outeraction: Instant Messaging in Action. In: ACM Conference on Computer Supported Cooperative Work, pp. 79–88. ACM Press, New York (2000)
Investigating Children Preferences of a User Interface Design Jamaliah Taslim, Wan Adilah Wan Adnan, and Noor Azyanti Abu Bakar Faculty of Information Technology & Quantitative Sciences, Universiti Teknologi MARA Malaysia jamaliah@tmsk.uitm.edu.my
Abstract. Though there have been many studies of user interface design preferences, only a few have considered children’s preferences. This paper presents an investigation into children’s preferences regarding user interface design. The objective of studying this area is to investigate the differences in children’s preferences for the elements of a user interface design. An experiment was conducted regarding five elements of user interface design: font type, font size, font color, background color and interface type. Findings show that there are significant differences in children’s preferences for interface type, font type and background color. Further analysis was conducted, and the results indicate that there is a significant difference between gender groups for background color, interface type and font color. This study provides empirical evidence on the importance of considering children in interface design. Keywords: Children, User interface design, Preference, Color, Interface type.
1 Introduction
Currently, almost all applications designed for children are developed by adults who do not consider children’s skills and preferences. As a result, the applications may not be easily learned and used by children [1]. Besides that, the majority of the tools available are for expert users and are not suitable for novice users, particularly for children who have very limited knowledge of computers. In addition, the importance of individual differences, such as gender, has been emphasized in the human-computer interaction literature regarding user interface design. However, there is still a lack of empirical studies that examine gender differences among children in their preferences for user interface design. Further research is required to strengthen the empirical evidence on gender differences among children in user interface design issues [2]. The purpose of this paper is to investigate the differences in children’s preferences for the elements of a user interface design. An experimental study was conducted for this purpose.
for children do not consider their skills and preferences [1]. Interface developers should not design with the expectation that the child is able to understand interaction with an extremely complex machine. The principle of user-centered design practices is that there is no design that fits all; design should be driven by knowledge of the target users. There is a growing amount of attention given to children as a special user group [3]. Many authors have discussed interface design, and it is common to place an emphasis on the user in the discussion. Shneiderman [4] argued that any design should be based upon an understanding of its intended users. Among the important user characteristics that should be considered are age group and gender. Shneiderman also noted that it is very common in practice to find that children are not considered in user interface design guidelines. In fact, the involvement of children themselves in the design process is very unusual. Therefore, interface designers and developers should take responsibility for seeking good-quality product designs which will contribute positively towards children’s development and health [5].

Children’s interactions with technologies differ depending on their age level, reflecting their changing interests, characters, humors and contexts. According to Acuff and Reiher [6], between the ages of 8 and 12 children are in the rule/role stage. In this age group, interests gradually shift from fantasy to reality. Children become more interested in competition and prefer to play in pairs and groups. A sense of logic and reasoning and simple abstractions start developing. This is a stage of shifting from the main influence of parents and schools to a bigger influence from peers. From the age of 8 to 12, children start to understand more abstract terms and longer and more complex sentences. They develop the ability to analyse critically what they read. Children at the ages of 9 and 10 are still not very good at planning their stories and start telling the story straight away. Handheld computing devices and laptops are examples of current products targeting this age group. The design of these devices is more adult-like, for example using less bright colors than those designed for younger children. These new products often provide more complex interfaces, such as having several functions represented in one button and a variety of menu structures available to explore. The functionality gives children more freedom in performing their tasks.

Children, like adults, often use technology to perform their tasks. Markopoulos and Bekker [7] argue that interface design research needs to be extended to specifically address the needs of children. They have pointed out two major issues in the context of designing for children: age-specific interaction styles (e.g., how to structure menus, font, interface type, color, etc.) and the involvement of children in the design process. According to them, research on the former is very sparse. One of the studies related to children is by Inkpen [8], which reports that children aged 9 to 13 preferred point-and-click over drag-and-drop. In addition, Read et al. [9] discussed the different text input techniques suitable for children. This research is rather limited compared to the corresponding research for adults, and research on user interface elements such as font type, font size, color and interface type is still lacking.

Standard user-centered design approaches need to be adapted when considering the specific needs of children. Current design guideline compilations still focus mostly
on adult users. Gilutz and Nielsen [10] took the initiative to compile guidelines for web sites for children.
3 Method
An experimental study was conducted with 40 primary schoolchildren of Sekolah Kebangsaan Seksyen 6, Shah Alam, Malaysia. They were randomly divided into groups of five children. Their ages ranged from 10 to 12 years. A briefing on the purpose of the experiment and the instructions was given to each group before they started the experiment. Each participant was given a maximum of 15 minutes to complete the task. Five user interface elements were tested in this study, namely font type, font size, font color, background color and interface type. For font type, 4 conditions were tested: Arial, Comic Sans MS, Courier New and Times New Roman. For font size, 2 conditions were tested: 12 and 14. For background color, 5 colors were chosen for the experiment, namely green, blue, purple, red, and yellow. The interface types were categorized as simple and complex. The participants were asked to select the most preferred choice for each of these interface elements.
4 Results
Results from the Mann-Whitney test for analyzing gender differences have shown that there were significant differences between boys and girls in their preferences for background color (p = 0.001) and interface complexity (p = 0.036). In addition, there was marginal significance in their preferences for font type but no significant difference for font size. From the cross-tabulation analysis, it was found that the majority of girls preferred purple whereas the boys preferred blue for the background color. For the interface type, all girls chose the simple interface type, whereas 20% of the boys preferred the complex interface type. For font type, the majority of the boys chose Arial as the most preferred and Comic Sans MS as the least preferred. In contrast, the majority of the girls chose Comic Sans MS as the most preferred and Times New Roman as the least preferred. Further analysis was conducted to examine age differences among the children using the Kruskal-Wallis test. The results show that there were marginal age differences among children for background color (p = 0.063) and interface type (p = 0.073).
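For readers who wish to run comparable analyses, the sketch below shows how a Mann-Whitney U test for a gender difference and a Kruskal-Wallis test across age groups might be computed with SciPy. The preference ratings are hypothetical; the study's raw responses are not reproduced here.

from scipy import stats

# Hypothetical 1-5 preference ratings for one interface element,
# split by gender (not the study's actual data).
boys = [4, 3, 5, 4, 2, 4, 3, 5, 4, 3]
girls = [2, 1, 3, 2, 2, 1, 3, 2, 1, 2]

u, p = stats.mannwhitneyu(boys, girls, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.4f}")

# Kruskal-Wallis across three hypothetical age groups (10, 11, 12).
age10 = [3, 4, 2, 3, 4]
age11 = [4, 4, 3, 5, 4]
age12 = [2, 3, 2, 2, 3]
h, p = stats.kruskal(age10, age11, age12)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")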
5 Conclusions
Interface design guidelines are not hard to find, but typically they are meant for adults rather than young users. This study examined children’s preferences in interface design. Five interface design elements were tested. Results showed that there are significant differences in children’s preferences for interface type and background color. In addition, the results also highlight the importance of considering the effects
of gender-based differences in user interface design for children. From these findings it is concluded that specific interface design guidelines are required for children, rather than simply relying upon general design guidelines, and that it is necessary to involve these users in the design process in order to formulate those guidelines.
References 1. Hutchinson, H.B., Bederson, B.B.: Interface for Children’s Searching and Browsing (2005) 2. Zaman, Geerts: Gender differences in children creative game play. Young People & New Technology, UK Northampton (2005) 3. Bruckman, A., Bandlow, A.: HCI for Kids. In: Jacko, J., Sears, A. (eds.) Human-Computer Interaction Handbook, pp. 428–440. Lawrence Erlbaum, Hillsdale, NJ (2003) 4. Shneiderman, B., Plaisant, C.: Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd edn. Addison-Wesley, Reading, MA (1998) 5. Wartella, E., O’Keefe, B., Scantin, R.: Children and Interactive Media. A Compendium of Current Research and Directions for the Future, Markle Foundation (2000) 6. Acuff, D.S., Reiher, R.H.: What Kids Buy and Why. The Psychology of Marketing to Kids. The Free Press, New York (1997) 7. Markopoulos, P., Bekker, M.: Interaction Design and Children. Interacting with Computers 15(2), 141–149 (2003) 8. Inkpen, K.M.: Drag-and-drop versus point-and-click: mouse interaction styles for children. ACM Transactions on Computer-Human Interaction 8(1), 1–33 (2001) 9. Read, J.C., MacFarlane, S.J., Casey, C.: Proceedings BCS HCI 2001, Lille, France, pp. 559–573. Springer, London (2001) 10. Gilutz, S., Nielsen, J.: Usability of Websites for Children: 70 Design Guidelines. Nielsen Norman Group (2002), http://www.NNgroup.com/report/kids
Usability Evaluation of Graphic Design for Ilmu’s Interface Tengku Siti Meriam Tengku Wook1 and Siti Salwa Salim2 1 Faculty of Information Science and Technology National University of Malaysia, 43600 Bangi Selangor 2 Faculty of Computer Science and Information Technology University of Malaya, 50603 Kuala Lumpur tsm@ftsm.ukm.my, salwa@um.edu.my
Abstract. Graphic design is fundamental to Ilmu’s interface (i.e., a WebOPAC for children) and is the focus of this study. A usability evaluation was carried out for the new prototype of Ilmu’s interface, which gives emphasis to the components of graphic design. Questionnaire and observation methods were used to accumulate the usability data. The usability of Ilmu's new interface is shown to be significantly better through t-tests and statistical testing using chi-square (χ2). Keywords: Usability, graphic design and children’s interface.
2 Hypotheses
The objective of carrying out the usability evaluation is to determine whether there are significant differences and effects between the Ilmu_1 and Ilmu_2 designs. The following five hypotheses serve as the basis for conducting a usability evaluation of Ilmu’s interface:
H1. There is a significant difference between the use of space in the Ilmu_1 and Ilmu_2 designs.
H2. There is a significant difference in the content arrangement between the Ilmu_1 and Ilmu_2 designs.
H3. There is a significant difference in the functional accessory between the Ilmu_1 and Ilmu_2 designs.
H4. There is a significant difference in the color coordination arrangement between the Ilmu_1 and Ilmu_2 designs.
H5. There is an excellent level of acceptance by Malaysian students of the new Ilmu_2 design.
Fig. 1. Relationship of independent variables between Ilmu_1 and Ilmu_2
The main aim of hypotheses 1–4 (H1–H4) is to demonstrate any significant difference in usability score between Ilmu_1 and Ilmu_2. This is tested using a t-test (paired sample test). Figure 1 shows the independent variables (Ilmu_1 and Ilmu_2), the components of graphic design (use of space, content arrangement, functional accessory, and color coordination – the main focus of the Ilmu_2 design), and the usability factors for each component of graphic design. The usability factors used in this research are effectiveness, accessibility, easy to learn, and enjoyable. The aim of hypothesis 5 (H5) is to observe students’ perspectives on Ilmu_2’s graphic design (use of space, content arrangement, functional accessory, and color coordination) in relation to the usability factors, hence drawing a conclusion about the level of acceptance by students of Ilmu_2’s interface. This is tested using chi-square (χ2).
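As a rough illustration of how H1–H4 and H5 might be tested, the sketch below runs a paired-sample t-test and a chi-square goodness-of-fit test with SciPy on hypothetical data; the study's actual scores and counts are not reproduced here.

from scipy import stats

# Hypothetical paired usability scores (one pair per student) for one
# graphic design component, rated for Ilmu_1 and then for Ilmu_2.
ilmu_1 = [2.8, 3.1, 2.9, 3.0, 2.7, 3.2, 2.9, 3.0]
ilmu_2 = [4.4, 4.5, 4.3, 4.6, 4.4, 4.5, 4.2, 4.6]

# H1-H4: paired-sample t-test for a difference between the designs.
t, p = stats.ttest_rel(ilmu_2, ilmu_1)
print(f"t = {t:.3f}, p = {p:.4f}")

# H5: chi-square goodness-of-fit on observed acceptance counts
# (excellent / acceptable / unacceptable) against a uniform expectation.
observed = [24, 5, 1]  # hypothetical counts for 30 students
chi2, p = stats.chisquare(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")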
3 Usability Evaluation Methods
Two usability evaluation techniques were applied, which involved students providing feedback via a questionnaire and observing students' interaction with the Ilmu_2 interface. One hundred students participated in the questionnaire exercise while twenty students were involved in the observation process.

3.1 Questionnaire
The survey required the students to answer the questions using a 1–5 Likert scale range. A t-test (paired sample t-test) was applied to monitor any significant difference in usability score between the Ilmu_1 and Ilmu_2 designs.

3.2 Observation
The observation required the researcher to observe children's behavior and their understanding of and ability to search and browse books using Ilmu_2. To ensure that data was collected consistently from the students during the observation, a checklist was used to record the findings, concentrating on the four components of graphic design. The data was gathered quantitatively according to the measurement criteria categorized by Dumas and Redish [6]:

Excellent – The Ilmu_2 interface is effective, practical and easy to learn for searching for the bibliographic information.
Acceptable – Students are satisfied with the searching.
Unacceptable – The Ilmu_2 interface is not ‘OK’; students are having difficulties using the Ilmu_2 interface to search for the bibliographic information.
4 The Results of Usability Evaluation
Figure 2 shows the range of mean scores for the components of graphic design between Ilmu_1 and Ilmu_2. The usability score of the Ilmu_2 interface shows an increase of 1.56 points for use of space, 1.58 points for content arrangement, 1.61 points for functional accessory and 1.12 points for color coordination.

Fig. 2. Mean scores of Ilmu_1 and Ilmu_2 for the components of graphic design (use of space: 2.91 vs. 4.47; content arrangement: 2.83 vs. 4.41; functional accessory: 2.76 vs. 4.37; color coordination: 3.27 vs. 4.39)
4.1 Results of H1
As shown in Figure 2, there is a significant difference between the use of space in Ilmu_1 and Ilmu_2 (t = 39.546, p < .05). Table 1 shows the mean scores and percentages of usability factors for the use of space component. Ilmu_2 has recorded a positive increase in the easy to learn and enjoyable factors.

Table 1. Use of Space Component

Usability Factors   Mean Score   Percentage
Effectiveness       4.43         33.05%
Easy to learn       4.87         33.47%
Enjoyable           4.87         33.47%
A walkthrough technique played a major role in the improvement of the use of space in Ilmu_2. Through its implementation, students are allowed to move the mouse (cursor) to the right or left during their 360° environment exploration. Students are free to explore and carry out daily activities on the screen without any assistance from teachers or their elders.

4.2 Results of H2
There is also a significant difference in content arrangement between Ilmu_1 and Ilmu_2 (t = 37.954, p < .05). The strength of content arrangement in Ilmu_2 lies in the application of a tree-maps technique. Text and graphic types of information are displayed hierarchically and in a structured manner, which enhances the usability of Ilmu_2. The location of objects such as menus, instructions, buttons, lines and images was aligned horizontally with the movement of the mouse (to the left and right) during exploration. A comic-strip technique was implemented in arranging the sub-subject folders in a cabinet.

Table 2. Content Arrangement Component

Usability Factors   Mean Score   Percentage
Effectiveness       4.462        33.73%
Easy to learn       4.378        33.09%
Enjoyable           4.39         33.18%
4.3 Results of H3
As shown in Figure 2, functional accessory has the most significant difference between Ilmu_1 and Ilmu_2 (t = 39.304, p < .05). This component in Ilmu_2 lies in the deployment of a label function, an animation function, terminology and the caterpillar character that acts as an assistant. Students were satisfied, and it was easy for them to use Ilmu_2 on their own. A clear and concise set of instructions on the menu using bigger fonts provided easy access.

Table 3. Functional Accessory Component

Usability Factors   Mean Score   Percentage
Effectiveness       4.365        33.32%
Easy to learn       4.39         33.51%
Enjoyable           4.345        33.17%
4.4 Results of H4
Color coordination has the least significant difference between Ilmu_1 and Ilmu_2 (t = 24.485, p < .05). Ilmu_2 uses a combination of light and cheerful colors. Appropriate selection of colors adds to the students’ enjoyment, as they feel happy and comfortable while they search and surf.

Table 4. Color Coordination Component

Usability Factors   Mean Score   Percentage
Effectiveness       4.47         33.91%
Easy to learn       4.383        33.25%
Enjoyable           4.33         32.84%
4.5 Results of H5
Results obtained from the statistical evaluation using chi-square (χ2) show scattered data for the excellent and acceptable parameters to be χ2 (10, N = 30) = 240.8, p < 0.05. Table 5 shows excellent feedback from students at a level of 83.93%, and none rejected Ilmu_2.

Table 5. Students’ acceptability towards Ilmu_2

Adaptability    Mean Score   Percentage
Excellent       23.5         83.93%
Acceptable      4.5          16.07%
Unacceptable    0            0%
5 Conclusion
Graphic design is a vital element in creating a children’s WebOPAC. The usability of Ilmu_2 is shown to be significantly better through t-tests and statistical testing using chi-square (χ2). Table 6 compares the graphic design techniques applied in Ilmu_1 and Ilmu_2.
Table 6. Comparison of the application of graphic design techniques (searching techniques: keyword, subject and location)

Use of space
  Keyword – Ilmu_1: exact match; Boolean operation. Ilmu_2: exact match.
  Subject – Ilmu_1: image or text hyperlink. Ilmu_2: image or text hyperlink.
  Location – Ilmu_1: –. Ilmu_2: pan/zoom.

Content arrangement
  Keyword/Subject – Ilmu_1: non-hierarchical. Ilmu_2: hierarchical (tree-maps).
  Location – Ilmu_1: non-hierarchical. Ilmu_2: hierarchical (comic strip).

Functional accessory
  Keyword – Ilmu_1: use of label, icon and button. Ilmu_2: magnification glass (lens); use of label, icon and button.
  Subject – Ilmu_1: use of label, icon and button. Ilmu_2: use of label, icon, button and image; worm character (interface agent).
  Location – Ilmu_1: –. Ilmu_2: use of label, icon, button and image; caterpillar character (interface agent).
References 1. Meriam, T.S., Wook, T., Salim, S.S.: User Testing of Children’s WebOPAC: A Malaysian Experience. In: The Seventh Asia-Pacific Conference on Computer Human Interaction, Taiwan (2006) 2. Hutchinson, H.B.: Children’s interface design for hierarchical search and browse. ACM SIGCAPH Newsletter, College Park, pp. 11–12 (2003) 3. Christoffel, M., Schmitt, B.: Accessing libraries as easy as a game: Visual Interface to Digital Libraries, pp. 25–38. Springer, Berlin (2002) 4. Murch, G.M.: Physiological principles for the effective use of color. In: IEEE Computer Graphics and Applications, pp. 49–54. IEEE Computer Society Press, Los Alamitos (1984) 5. Oosterholt, R., Kusano, M., Vries, G.: Interaction design and human factors support in the development of a personal communicator for children. In: Computer Human Interaction, pp. 450–457. ACM, Vancouver (1996) 6. Dumas, J.S., Redish, J.C.: Creating Task Scenario. A Practical Guide to Usability Testing. Intellect, USA (1999)
Are We Trapped by Majority Influences in Electronic Word-of-Mouth? Yu Tong and Yinqing Zhong Department of Information Systems, School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543 {tongyu,zhongyin}@comp.nus.edu.sg
Abstract. As an effective online mechanism for generating large-scale electronic Word-of-Mouth (EWOM), online feedback systems (OFS) offer a variety of system design cues to facilitate consumers’ decision making. However, such cues may lead consumers to make inferences based on an overall picture of the majority opinion without scrutinizing the content of reviews. This study draws on theories of majority/minority influence and dual-process theories to explore the influences of OFS design cues on consumers’ learning outcomes (i.e., awareness of product/service, confidence in judgment, intention to search for additional information and intention to conform to the majority). Numerical and power majority influences are examined through two design cues: review clustering format (list-clustering vs. pair-clustering) and source credibility (available vs. unavailable). Keywords: Word-of-mouth, online feedback system, majority influence, system design.
thus have led to the questions: Can a consumer’s product evaluation be influenced by cues embedded in OFS designs rather than by review content? If yes, through what system designs? While extant research has focused on the impacts of reviews on trust building and sales revenue (e.g., [4, 22]), relatively little effort has been devoted to examining the relationships between system designs, consumer learning and product evaluation. We seek to address this gap in the literature by drawing on theories of majority/minority influence and dual-process theories to explore the influences of two majority designs in OFSs (i.e., review clustering format and availability of source credibility) on consumers’ learning outcomes. This approach is appropriate, as there is evidence from dual-process theories that an individual may be influenced by heuristic cues without scrutinizing the content of the information [31].
2 Theoretical Foundation
2.1 Majority and Minority Influences
Prior studies on majority/minority influence hold inconclusive views on how individuals change their behaviors/attitudes based on the majority’s or minority’s view. Conversion theory [29], which forms the foundation of many early studies on this issue, posits that individuals undergo different forms of cognitive and motivational processes depending on the source of influence (i.e., majority or minority). When the influence is received from the majority, individuals choose to conform to the majority’s attitudes publicly without carefully examining the arguments. On the other hand, when the influence is received from the minority, individuals pay more attention to interpreting the information from the minority viewpoint. Consequently, they may convert their attitudes privately [15, 29], which induces more enduring changes [5].

In contrast to conversion theory, other studies propose that majority influence can also exert a significant impact on sustained attitude changes (e.g., [21, 25, 32]). Mackie’s [26] objective consensus approach contends that it is the majority that elicits greater message processing, as people tend to accept the majority viewpoint as objective reality. Consequently, people do not engage in a great deal of cognition over the minority message. Building on conversion theory and the objective consensus approach, Erb et al. [15] examine the effects of recipients’ prior attitudes on message scrutiny in the context of minority and majority influence. Specifically, a majority message is processed more extensively than a minority message when recipients hold a moderate prior attitude. When recipients hold an opposing prior attitude, however, the minority message is processed more extensively than the majority message.

Notwithstanding the contention in the majority/minority literature, previous studies commonly highlight the positive effects of processing message content in decision making. A minority view, even when wrong, stimulates reappraisal and a consideration of more alternatives [30]. Previous research contends that dual-process theories of information processing offer insight into why and how people process messages from the majority and/or minority group (e.g., [21]).
2.2 Dual-Process Theories of Information Processing

The dual-process theories of information processing encompass a family of theories that examine the roles played by both message content and contextual factors. Both the elaboration likelihood model (ELM) [31] and the heuristic-systematic model (HSM) [14] of persuasion identify two information processing modes that people use when assessing the validity of a message. Systematic processing, also known as the central route, refers to a process in which individuals carefully scrutinize all relevant information and try to incorporate it into what they already know. Heuristic processing, also known as the peripheral route, refers to a process in which individuals use cues or heuristics, such as "credibility implies correctness", to assess content validity [11]. When systematic processing is used, the attitude change tends to be relatively enduring and is likely to affect subsequent behaviors. When heuristic processing is used, on the other hand, individuals tend to be influenced by factors other than the message itself; as a consequence, any change in attitude is likely to be temporary and has less effect on how they actually act. Elaboration likelihood refers to the degree of systematic processing an individual uses in processing a message [23]. Previous research shows that elaboration likelihood is affected by three factors: the relevance of the topic to the person, enjoyment of thinking, and the diversity of arguments [23]. Specifically, the more important the topic, the more likely the individual is to scrutinize and think about the message. The more an individual enjoys mulling over arguments, the more likely he or she is to use systematic processing. When confronted with diverse opinions, with relevance and enjoyment of thinking held equal, individuals tend to process information systematically, as making peripheral judgments is not easy [1]. Dual-process theories have commonly been adopted in prior research to explain how group opinion, operating as a simple cue, leads individual members to rely on heuristics in their thinking, particularly when they are moderately motivated and unfamiliar with a given topic [3]. Individuals tend to reject the minority message easily using the heuristic "a lack of consensus implies a lack of validity"; instead of scrutinizing the arguments, they make quick judgments based on majority/minority cues [23].
3 Research Model and Hypotheses

Fig. 1 depicts the research model. Our main proposition is that consumers' learning outcomes are contingent on majority cues embedded in two OFS designs (i.e., review clustering format and source credibility). In line with previous literature [33], majority is defined along two dimensions: (1) number of members (i.e., the majority group is numerically greater than the minority group); and (2) relative power (i.e., the majority group is relatively more powerful than the minority group). The former is a common cue used when people rely on the sheer number of arguments to determine whether to accept a perspective [31]. Similarly, the credibility of the source can be used as a cue [27]. When source credibility is high, the message is more likely to be scrutinized and accepted.
Fig. 1. Research Model
3.1 Dependent Variable: Consumer Learning

Consumer learning plays a critical role in shaping consumers' consumption behaviors and manifests itself in three dimensions: cognitive, affective, and conative [20, 24, 34]. The cognitive dimension measures the ability of a medium, such as an OFS design, to attract attention and ultimately transfer product/service information to consumers' memory. In this study, awareness of product/service is examined as a dependent variable indicating the extent of product information consumers learn from the OFS [2]. This variable reflects consumers' efforts in generating awareness, establishing product/service knowledge, increasing comprehension of the brand name, and understanding the information presented [2, 8, 18]. The affective dimension measures attitudes either established or created by a medium, such as an OFS [28]. In this study, however, the 'absolute' value of product attitudes is not of interest, as it is contingent on the perspective of the reviews. Instead, we examine consumers' confidence in judgment, which is the degree to which consumers feel confident in their evaluation of the product [7]. Lastly, the conative dimension involves some type of behavioral intention, such as searching for additional information [17]. As with the affective dimension, purchase intention is outside this study's scope because the content of product reviews could be either positive or negative. Instead, intention to search for additional information and intention to conform to the majority are chosen as dependent variables in this dimension.

3.2 Review Clustering Format: List-Clustering vs. Pair-Clustering

Clustering is increasingly adopted by OFSs to organize reviews pertaining to a product/service. Through an empirical comparison between the standard list and the clustered list of a search engine interface, Zamir and Etzioni [35] reported that substantial differences exist in using these two formats. In this study, influences from the review
clustering format (i.e., numerical majority/minority) are manipulated into two types: list-clustering vs. pair-clustering. List-clustering refers to a format in which positive and negative reviews are grouped into two separate, adjacent lists. Deducing from the HSM, we posit that the list-clustering format makes the cue of numerical majority salient to consumers, as consumers can easily form an impression of reviewers' overall appraisal of the product/service. Because individuals tend to rely on the sheer number of arguments to determine whether to accept a perspective [31], instead of scrutinizing the content of reviews they are likely to follow the numerical majority's perspective using the heuristic "a lack of consensus implies a lack of validity". According to the HSM, any changes resulting from the peripheral route tend not to be lasting. To diminish these majority effects on consumers' information processing, we propose a specific review clustering format, namely pair-clustering. This format pairs up positive and negative reviews and displays them in an alternating manner within one list. In contrast to the list-clustering format, consumers need to go through all reviews (positive and negative) to glean the best information, as positive and negative reviews are displayed alternately. When the heuristic cue is not salient (i.e., it is difficult to judge the number of reviews on each side), consumers have to make their judgment by systematically scrutinizing the messages in the OFS. Thus, we hypothesize:

H1: The pair-clustering format will lead to higher awareness of product/service than the list-clustering format.
H2: The pair-clustering format will lead to higher confidence in judgment than the list-clustering format.
H3: The pair-clustering format will lead to higher intention to search for additional information than the list-clustering format.
H4: The pair-clustering format will lead to lower intention to conform to the majority than the list-clustering format.

3.3 Availability of Source Credibility

Source credibility, which refers to a communicator's ability to affect receivers' information acceptance in communication, can be considered a form of power within a group [13]. When an individual possesses higher credibility, his or her statements are perceived as more persuasive, even though such statements may represent the perspective of a numerically small group. In line with this categorization, a majority group is one that is considered more credible compared with a minority group. Research shows that source credibility is determined by two elements, source expertise and source bias [9]. Source expertise is defined as "the perceived competence of the source providing the information", and source bias as "the possible bias/incentives that may be reflected in the source's information" [9, p. 6]. Essentially, a source is considered more credible when it possesses greater expertise and has less incentive to bias the information [10]. In the context of an OFS, a consumer who provides a review of a product/service is less prone to bias, since he or she is unlikely to have an ulterior motive (e.g., to sell the product) [16]. Therefore, the level of expertise substantially determines the perceived credibility of the reviewer.
Due to the anonymous nature of the online environment, it is relatively difficult to evaluate a reviewer's expertise directly. In practice, many OFSs provide features to identify "expert" reviewers in a certain product category. For example, at Epinions.com, reviewers who consistently provide high-quality reviews can obtain a status such as category leader or top reviewer. When such reviewers write reviews about a certain product, their status is highlighted at the top of the reviews. As the calculation of status is based on cumulative and objective criteria (in terms of the quantity and quality of reviews), reviewers who gain such status from an OFS are perceived to be highly credible. Source credibility can be used as a cue or heuristic. When an individual is moderately involved in a focal topic, he or she may rely on a simple heuristic (i.e., "experts' statements can be trusted") to evaluate content validity [1]. In the context of an OFS, when consumers observe a top reviewer's product review, they may treat this source credibility as a cue and engage in heuristic processing. Under such circumstances, consumers' attitude changes are based on what the expert says rather than on scrutiny of the content of other reviews; it is therefore unlikely that they will undergo an enduring attitude change. However, when there is no indication of a reviewer's status, the heuristic cue of source credibility is not salient. In such circumstances, consumers who need to evaluate the product have to engage in systematic processing by scrutinizing the content of all relevant reviews. Thus, we hypothesize:

H5: Availability of a credible source will lead to lower awareness of product/service than the condition in which a credible source is unavailable.
H6: Availability of a credible source will lead to lower confidence in judgment than the condition in which a credible source is unavailable.
H7: Availability of a credible source will lead to lower intention to search for additional information than the condition in which a credible source is unavailable.
H8: Availability of a credible source will lead to higher intention to conform to the majority than the condition in which a credible source is unavailable.
4 Research Method

The research model will be tested in a laboratory experiment with three between-subjects factors: review clustering format (list-clustering vs. pair-clustering), source credibility (available vs. unavailable), and the attitude of the numerical majority (positive vs. negative). When a credible source is available, we counterbalance the alignment of the numerical majority and the credible source (aligned or not). Thus, we employ a completely counterbalanced, full factorial design, which yields 12 combinations in total; the short enumeration at the end of this section makes this count explicit. An air-con maintenance service company is chosen for the experiment for two main reasons. First, an intangible service is an ideal context for studying OFS effects, as it is hard for consumers to form a comprehensive judgment before consumption. Second, air-con maintenance service allows us to control for moderate initial motivation before subjects (university students) read the reviews.

4.1 Independent Variables

The review clustering format is manipulated on the product reviews page. The number and content of the reviews displayed in the two formats are identical. In the list-clustering
condition, all positive reviews will be placed in the left list and all negative reviews in the right list. In the pair-clustering condition, five pairs of alternating reviews (positive + negative) are displayed first, followed by the remaining five reviews with the majority attitude. Each review consists of an overall rating, representing the attitude toward the service, and a written comment. While only the first two lines of each review are displayed, subjects can read the full review in a popup window by clicking the "read full review" link at the end of the review. Two levels of source credibility are studied: available and unavailable. Subjects in the available-credible-source groups will notice a top-reviewer status placed on the first review on either the numerical majority or the minority side. Subjects in the unavailable-source-credibility groups will see all reviews without the top-reviewer status.

4.2 Dependent Variables

The measurements for the dependent variables are adapted from previously validated scales. Awareness of product/service is operationalized as the extent of message recall, measured using thought-listing techniques [5]. Subjects are asked to list all of the thoughts they had after reading the reviews, classifying each according to attributes provided by the researchers (e.g., price, membership). Two researchers will independently rate each subject's thoughts on a 1-to-10 scale, and any disagreement will be resolved. Confidence in judgment is measured using the attitude confidence scale from Berger and Mitchell [7], adapted to our context. It is measured with two 7-point scales anchored by the statements "not at all certain/very certain" and "not at all confident/completely confident". Both intention to search for additional information and intention to conform to the majority are measured using four 7-point scales anchored by the statements ("unlikely/likely", "improbable/probable", "uncertain/certain", "definitely not/definitely") adapted from Bearden et al. [6] to determine the likelihood that subjects intend to perform the two behaviors.

4.3 Pre-test

The literature suggests that argument quality can influence an individual's information processing [5]. A pre-test is conducted with 20 subjects drawn from the same pool as the main experiment. Forty reviews of the air-con maintenance service company (30 positive and 10 negative) are presented. All reviews contain the same number of sentences, and all sentences in a given review express the same attitude (positive or negative). To control for argument quality, subjects are asked to rate the perceived persuasiveness of each review. Ten positive reviews and five negative reviews judged to be at a similar level of persuasiveness are selected for the main experiment. Subjects are also asked whether these two numbers can correctly represent the numerical majority and minority groups. An OFS is developed specifically for this study. It is installed on a server in the same local area network as the laboratory computers to ensure consistently high network speed for all subjects. Subjects are asked to comment on the design of the OFS in various respects, such as speed and layout. Feedback from the subjects is used to fine-tune the design of the OFS. In addition, we check subjects' prior attitudes toward the air-con service company to ensure that they have moderate motivation before reading the reviews.
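To make the two clustering manipulations of Section 4.1 concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation; the review texts are placeholders, and the 10-positive/5-negative split follows the pre-test selection described above.

# Illustrative sketch of the two review clustering formats (not the
# authors' code). Review texts are placeholders; the 10 positive and
# 5 negative reviews follow the pre-test selection described above.

positive = ["positive review %d" % i for i in range(1, 11)]  # numerical majority
negative = ["negative review %d" % i for i in range(1, 6)]   # numerical minority

def list_clustering(pos, neg):
    # Two separate, adjacent lists: the size of the majority is
    # immediately visible, making the numerical-majority cue salient.
    return {"left_list": pos, "right_list": neg}

def pair_clustering(pos, neg):
    # One list: five alternating (positive, negative) pairs first,
    # followed by the remaining five majority-attitude reviews.
    interleaved = []
    for p, n in zip(pos, neg):
        interleaved.extend([p, n])
    interleaved.extend(pos[len(neg):])
    return {"single_list": interleaved}

print(list_clustering(positive, negative)["left_list"][0])
print(pair_clustering(positive, negative)["single_list"][:4])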
4.4 Participants and Experimental Procedures

One hundred sixty students from a large university are recruited and randomly assigned to the 12 groups. Subjects are told to evaluate a new air-con service company based on information from an OFS specialized in this industry. Before they look at the reviews, a brief introduction to the OFS and the procedure used to compute top reviewers is given to establish a common frame of reference. Next, subjects are instructed to browse the OFS and locate information on a specific service in order to become familiar with the system layout and its various features. Subjects then evaluate the experimental service at their own pace and may raise questions at any time. Afterwards, subjects complete a post-experiment questionnaire, which includes manipulation checks, measurements of the dependent variables, and demographic information. Lastly, subjects are debriefed and given a token for their participation.
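As a reading aid, the 12 experimental cells can be enumerated as follows. This sketch is ours, not the authors'; it simply makes the counting explicit: when a credible source is available, its alignment with the numerical majority is counterbalanced, yielding three source conditions in total.

# Enumerating the 12 experimental cells: 2 clustering formats x
# 2 majority attitudes x 3 source conditions (unavailable, or available
# and aligned/misaligned with the numerical majority) = 12.
from itertools import product

formats = ("list-clustering", "pair-clustering")
majority_attitudes = ("positive", "negative")
source_conditions = ("unavailable",
                     "available, aligned with majority",
                     "available, aligned with minority")

cells = list(product(formats, majority_attitudes, source_conditions))
assert len(cells) == 12
for i, cell in enumerate(cells, 1):
    print(i, cell)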
5 Concluding Remarks

This study aims to advance theoretical understanding in the area of EWOM and majority influence. First, this paper constitutes one of the first studies in the EWOM literature to investigate the effects of possible system designs on consumer learning outcomes. Second, it extends the literature by examining influences from both positive and negative reviews. Third, it advances the majority influence literature by considering a broader conceptualization of majority that includes both numerical and power majority. The study also offers practical implications for OFS designers. First, as consumers may be subject to influence from the review clustering format, OFS designers should select a format that helps consumers evaluate products appropriately; for example, they can give consumers the option to select their preferred format, or choose a format automatically according to consumers' indicated preferences in the OFS or their log history. Second, as source credibility can significantly influence consumers' evaluation processes, practitioners should scrutinize the algorithms used to compute top reviewers.
References

1. Areni, C.S., Ferrell, M.E., Wilcox, J.B.: The Persuasive Impact of Reported Group Opinions on Individuals Low vs. High in Need for Cognition: Rationalization vs. Biased Elaboration? Psych. and Marketing 17, 855–875 (2000)
2. Ariely, D.: Controlling the Information Flow: Effects on Consumers' Decision Making and Preferences. J. Consumer Res. 27, 233–249 (2000)
3. Axsom, D., Yates, S., Chaiken, S.: Audience Response as a Heuristic Cue in Persuasion. J. Pers. Soc. Psych. 53, 30–40 (1987)
4. Ba, S., Pavlou, P.A.: Evidence of the Effect of Trust Building Technology in Electronic Markets: Price Premiums and Buyer Behavior. MIS Quart. 26, 243–268 (2002)
5. Baker, S.M., Petty, R.E.: Majority and Minority Influence: Source-Position Imbalance as a Determinant of Message Scrutiny. J. Pers. Soc. Psych. 67, 5–19 (1994)
6. Bearden, W.O., Lichtenstein, D.R., Teel, J.E.: Comparison Price, Coupon, and Brand Effects on Consumer Reactions to Retail Newspaper Advertisements. J. Retailing 60, 11–34 (1984)
7. Berger, I.E., Mitchell, A.A.: The Effect of Advertising on Attitude Accessibility, Attitude Confidence, and the Attitude-Behavior Relationship. J. Consumer Res. 16, 269–279 (1989)
8. Braun, K.A.: Postexperience Advertising Effects on Consumer Memory. J. Consumer Res. 25, 319–334 (1999)
9. Brown, J., Broderick, A.J., Lee, N.: Word of Mouth Communication within Online Communities: Conceptualizing the Online Social Network. J. Interactive Marketing 21, 2–20 (2007)
10. Buda, R., Zhang, Y.: Consumer Product Evaluation: The Interactive Effect of Message Framing, Presentation Order, and Source Credibility. Internat. J. Management 9, 229–242 (2000)
11. Chaiken, S., Liberman, A., Eagly, A.H.: Heuristic and Systematic Information Processing within and beyond the Persuasion Context. In: Uleman, J.S., Bargh, J.A. (eds.) Unintended Thought, pp. 212–252. Guilford, New York (1989)
12. Dellarocas, C.: The Digitization of Word of Mouth: Promise and Challenges of Online Feedback Mechanisms. Management Sci. 49, 1407–1424 (2003)
13. Dholakia, R.R., Sternthal, B.: Highly Credible Sources: Persuasive Facilitators or Persuasive Liabilities? J. Consumer Res. 3, 223–232 (1977)
14. Eagly, A.H., Chaiken, S.: The Psychology of Attitudes. Harcourt, Brace, & Janovich, Orlando (1993)
15. Erb, H.-P., Bohner, G., Rank, S., Einwiller, S.: Processing Minority and Majority Communications: The Role of Conflict with Prior Attitudes. Pers. and Soc. Psych. Bull. 28, 1172–1182 (2002)
16. Grewal, R., Cline, T.W., Davies, A.: Early-Entrant Advantage, Word-of-Mouth Communication, Brand Similarity, and the Consumer Decision-Making Process. J. Consumer Psych. 13, 187–197 (2003)
17. Hoch, S.J., Ha, Y.W.: Consumer Learning: Advertising and the Ambiguity of Product Experience. J. Consumer Res. 13, 221–233 (1986)
18. Hoffman, D.L., Novak, T.P.: Marketing in Hypermedia Computer-Based Environments: Conceptual Foundations. J. Marketing 60, 50–68 (1996)
19. Hsieh-Yee, I.: Research on Web Search Behavior. Library and Inform. Sci. Res. 23, 167–185 (2001)
20. Hutchinson, J.W., Alba, J.W.: Ignoring Irrelevant Information: Situational Determinants of Consumer Learning. J. Consumer Res. 18, 325–345 (1991)
21. Kerr, N.L.: When is a Minority a Minority? Active versus Passive Minority Advocacy and Social Influence. European J. Soc. Psych. 32, 471–483 (2002)
22. Lee, J., Park, D.H., Han, I.: The Effect of Negative Online Consumer Reviews on Product Attitude: An Information Processing View. Electronic Commerce Res. and Applications 7, 341–352 (2007)
23. Littlejohn, S.W.: Theories of Human Communication, 7th edn. Wadsworth Publishing Company, California (1997)
24. Lutz, R.J.: Changing Brand Attitudes through Modification of Cognitive Structure. J. Consumer Res. 1, 49–59 (1975)
25. Maass, A., Clark III, R.D.: Hidden Impact of Minorities: Fifteen Years of Minority Influence Research. Psych. Bull. 95, 428–450 (1984)
26. Mackie, D.M.: Systematic and Nonsystematic Processing of Majority and Minority Persuasive Communications. J. Pers. Soc. Psych. 53, 41–52 (1987)
27. Maheswaran, D., Chaiken, S.: Promoting Systematic Processing in Low Motivation Settings: The Effect of Incongruent Information on Processing and Judgment. J. Pers. Soc. Psych. 61, 13–25 (1991)
28. Mehta, A.: Advertising Attitudes and Advertising Effectiveness. J. Advertising Res. 40, 62–72 (2000)
29. Moscovici, S.: Toward a Theory of Conversion Behavior. Advances in Experimental Soc. Psych. 13, 209–239 (1980)
30. Nemeth, C.J.: Differential Contributions of Majority and Minority Influence. Psych. Rev. 93, 23–32 (1986)
31. Petty, R.E., Cacioppo, J.T.: Communication and Persuasion: Central and Peripheral Routes to Attitude Change. Springer, New York (1986)
32. Tanford, S., Penrod, S.: Social Influence Model: A Formal Integration of Research on Majority and Minority Influence Processes. Psych. Bull. 95, 189–225 (1984)
33. Worchel, S., Grossman, M., Coutant, D.: Minority Influence in the Group Context: How Group Factors Affect when the Minority will be Influential. In: Moscovici, S., Mucchi-Faina, A., Maass, A. (eds.) Minority Influence, pp. 97–114. Nelson-Hall, Chicago (1994)
34. Wright, P., Rip, P.D.: Product Class Advertising Effects on First Time Buyers' Decision Strategies. J. Consumer Res. 7, 708–718 (1980)
35. Zamir, O., Etzioni, O.: Grouper: A Dynamic Clustering Interface to Web Search Results. In: Proc. of the Eighth WWW Conference, Toronto, Canada, pp. 1361–1374 (1999)
Leveraging a User Research Framework to Guide Research Investments: Windows Vista Case Study Gayna Williams Principal Program Manager, Developer Division, Microsoft*
Abstract. During the development of Windows Vista we had the opportunity to invest in new methods to understand user behavior. We leveraged standard usability methods to work on feature areas during development; however, we had to invent and adapt new approaches to measure holistic experiences. In this area user research methods are evolving, due to the integration of technologies and changes in the definition of a successful experience. While considering the methods that suited our needs, a user research framework was created. This helped us manage investments in research activities. The framework is organized along two dimensions: perspective and time. Perspective refers to the breadth of the experience being considered: 'narrow' defines a focus on an individual feature area or small product area, and 'broad' defines a focus on an integrated experience. Time can indicate either a product cycle or real time. For the product cycle, most of the research effort goes into evaluating the designs of features and experiences, aiming to predict user behavior for a particular release of a product, whereas real time is our research investment in understanding how products are used in the wild without our intervention. Each quadrant of the two-dimensional framework highlights different research methods and purposes. It is important to realize that the value of the framework comes from the integration of findings, which provides a rich holistic picture of our users to ultimately guide product decisions. This paper describes some of the methods that were evolved and created during the development of Windows Vista and their relationship to the user research framework. The methods described include user experience scorecarding, measurement of desirability, and the impact of the consumer adoption program. These methods continue to be used today in the development of Windows 7.
1 Introduction

One challenge in working on an operating system is that it contributes to a computer experience in more than one significant way. It provides stand-alone experiences and it contributes substantially to extended experiences. When developing Windows Vista, the user research team had the challenge of considering how to provide deep insight in particular areas to impact product creation, while also playing a critical role in understanding the quality of the holistic experience. Many parts contribute to the ecosystem that users of Windows experience. Our role was to understand this holistically and to drive that understanding into product development.
Before diving into the research framework, I'd like to provide some context. The Windows organization is large, including over 5000 people. Windows Vista was not the only product produced by this organization, as its core components contribute to other products (e.g., Windows Server) and service packs (e.g., Windows XP SP2). The user experience team is a centralized organization consisting of user research, design, and user assistance. During the Windows Vista development cycle these three groups stepped up their accountability to raise the importance of the product experience. For design, this meant demonstrating how design could lead product definition, and engaging continuously from product inception through to marketing messaging and branding. For user assistance, it meant a change in focus from being a team that documents help toward the goal of becoming a continuous publishing group with a data-driven content strategy. And for user research, we stepped up to consider how to drive accountability for user experience across teams in a holistic way, which is what this paper describes further.

I was the user research director of the research team. The team was approximately 24 people in size: 14 user researchers who did much of the iterative work with product teams and also owned particular experiences (some researchers also owned particular projects or research methods that benefited the whole team), two anthropologists, one project manager, one product planner (a role focused on identifying opportunities through working with internal and external partners), two data analysts, and a small development team (4 people) for building tools and managing the instrumentation projects.

The mission for the team was, "Deliver outstanding Windows client and partner experiences that build upon a deep understanding of people". It is important to understand the deliberate decision to use the word 'people' in the statement. So often in usability we focus on "the user," defining the user as the person actually using the system, in contrast to the customer, the person responsible for purchasing the system. However, we realized that we needed to understand many people within the ecosystem in order to deliver the right experience. We also realized that succeeding in delivering on what people perceive to be the Windows experience required assisting partners to understand how to make their part of the experience better. For example, most people experience Windows when they purchase a new computer. The original equipment manufacturer (OEM) is responsible for part of the experience, and when a user sees the desktop for the first time it is a joint responsibility of Windows and the OEM.

One option when we started to work on Windows Vista was to map user researchers to particular teams within the Windows Client Organization and then manage their workload in strict alignment with those teams. However, because we were organized as a central team, we had the opportunity to set our own priorities and focus areas. Our position gave us a unique opportunity to have a perspective across the whole experience. We decided to leverage this position to drive product development from the perspective of holistic customer understanding. Achieving this perspective required us to approach our work differently and invest resources differently. The necessity of getting this rich view shaped the user research framework.
2 User Research Framework

There are many tried and tested research methods for improving usability during a product cycle. It is encouraging that today more user researchers succeed in implementing these methods throughout a cycle, rather than being brought in at the end to validate decisions or create recommendations when time is too short to respond to them. More recently, challenges have arisen as product teams seek to know more about the emotional experiences of users with their products. Research methods have been evolving to accommodate this need, but when we were working on Windows Vista (2002-2006) these methods were less established than they are now; at best, the methods available then were for evaluating finished products, not products in development. How were we to get a sense of the overall emotional experience of Windows Vista three years ahead of the product release?

Another challenge was that the Windows engineering team is very large (i.e., a few thousand people). We knew most individual tasks with Windows Vista would require people to use UI elements produced by several teams, who might be in more than one division. We decided to step up to the challenge by creating a list of tasks to serve as a common reference point for much of our research work. How we created and leveraged these tasks is discussed later in the paper.

So to tackle the research work for Windows Vista we needed to invest in narrowly focused but deep usability activities that aligned with the product teams, and we had to work in lock-step with their schedules. However, to do our jobs well and deliver an outstanding holistic experience, we knew we had to invest in broader, expansive research that tackled some of the challenging new wave of requirements targeting 'experience'. These two types of investment are represented in the first row of the user research framework in Figure 1, which focuses on user research that aims to predict user behavior with the product when it is complete. The first, narrow perspective is the investment in mapping research activities to the requirements of product engineering. The second perspective is broad by comparison, extending across engineering insofar as an experience is encountered throughout the product, or across feature areas as it spans possible feature boundaries. Although the second is a critical investment if user research is to deliver 'experiences', it had less history to guide us in successfully integrating it into the product development cycle.

To achieve a holistic product understanding, we also went outside the product design cycle to study people currently using Windows. We realized that the current use of the product was influenced by how the previous version had been created, and this knowledge provided tremendous value. We invested in instrumentation, survey techniques, and field work throughout the development cycle, and as a result were always able to learn from current user behaviors as one input to informed decision-making. We were careful not to influence the current users we were learning from by revealing information obtained from other users exposed to prototypes or other information that we were using to help assess the future behavior of Windows Vista users.
Fig. 1. User Research Framework
The final perspective used to ground our understanding of people and their behaviors was obtained through an investment in collecting ethnographies. In this area we removed the restrictions of considering our product, and even our company's technologies, from the research brief, and focused on audiences and situations considered to be of future interest. This work provided a rich context for understanding the world in which our finished products would be situated (or not!). The latter two perspectives are key elements in the lower row of the framework. This row is referred to as Real Time because, for the most part, we are not influencing behavior when collecting observations by trialing software or scenarios with users. The lower left cell represents a narrow product perspective, meaning we define the audience we engage with by the product we are interested in, whereas the lower right cell is a life perspective, as we do our best to observe situations and audiences without making a priori decisions as to which products we wish to see used. The framework allowed us to consider how to invest resources in tool and method development, and how to invest our research time. Below, I go through the framework in greater detail.
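One compact way to read the framework is as a mapping from (time, perspective) quadrants to characteristic methods. The sketch below is our paraphrase of the examples discussed in Section 3, not an artifact the team produced.

# The four quadrants of the user research framework, with the example
# methods described in Section 3. Illustrative paraphrase only.
framework = {
    ("product cycle", "narrow"): ["iterative usability testing",
                                  "heuristic evaluation",
                                  "paper prototyping"],
    ("product cycle", "broad"): ["cross-product task list",
                                 "user experience scorecard",
                                 "desirability toolkit"],
    ("real time", "narrow"): ["customer feedback panel",
                              "Send-a-smile",
                              "consumer adoption panel"],
    ("real time", "broad"): ["exploratory ethnographies",
                             "customer engagement site visits"],
}

for (time_dim, perspective), methods in framework.items():
    print("%s / %s: %s" % (time_dim, perspective, ", ".join(methods)))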
3 Applying Windows Vista User Research to the Framework

3.1 Product Cycle – Narrow: Features/Product Area

This is the part of the framework that I feel is best understood, through well-established methods such as iterative usability testing, heuristic evaluations, and paper prototyping. As a team we were significantly invested in this work, which maps most closely to how the engineering teams work; when user researchers (URs) are well integrated with the teams they work with, it is easier for the research to have an impact. The URs were assigned to work with a particular themed area (e.g., Photos & Video, or Storage), which usually mapped to a particular product team (and sometimes
to more than one, as the elements of a themed area were distributed across teams). However, URs were aware that while they worked in detail with the teams on their areas, they were also accountable for driving the broader holistic goals of the product through broader user tasks. This work then fed into the larger experience work. Occasionally a UR might be in conflict with the team with which they worked most closely in order to drive a change that would benefit a high-level task; one of the challenging aspects of being a user researcher is maintaining a trusted relationship with a team while driving a user issue. Product teams who previously may have had a dedicated user researcher for their work had to adjust to the UR driving a broader charter.

3.2 Product Cycle – Broad: User Experiences

Cross-Product Experiences
An operating system supports and enables many different tasks for many different audiences (home users, enterprise, IT specialists). We needed to prioritize which areas of Windows Vista we cared about most. This required setting up criteria to evaluate different tasks. The criteria we considered included task frequency, known task difficulty (based on our previous understanding of customer challenges), and newly enabled tasks. We were able to leverage previous research work from field studies, the lab, and ethnographies to help identify these tasks. We also had to define what a task was and how it differed from a feature. A task is defined in user language; to complete the task the user may use several features. For example, for a user to download 50 photos and send her favorite to her friend in email involves multiple features provided by several different teams (devices, photo download, file management, email setup, email send/receive, add attachment, receive an attachment). Although we were responsible for creating the list of tasks, we also needed buy-in from the individual teams that these were indeed tasks they wanted to address with their features. We had to work with the development teams to create success criteria acceptable to both development and research (e.g., 80% of participants should complete the task successfully), and we had to incorporate some leeway in the success criteria to allow for emotional evaluation of experience and customer site-visit feedback. We established a list of over 160 tasks that we tracked during the development of Windows Vista. This list of tasks provided significant benefit to the development team throughout the development cycle. Frequently a group such as the performance test team would ask for the top scenarios that Windows Vista was targeting, and our list was defined in sufficient detail to be an actionable starting point for responding to such requests. The list of tasks also provided a critical starting point for driving accountability into engineering through the creation of a User Experience Scorecard (Fig. 2).

Fig. 2. Scorecard example

We iterated several times on creating a scoring system that teams would respond to and
that we felt reflected the experience we were on track to ship. This included allowing heuristic assessment of plans and specifications to be incorporated during the early stages of development, and evaluation as the milestones progressed in the development cycle. We used a three-color rating system (red, yellow, green). Because we had detailed task success measures, we used these as the primary basis for assigning the color rating. However, if an additional data source provided insight suggesting a serious user problem, we took that into account in the rating; mostly this would prevent a task that was completed successfully in a lab situation from being rated green if field insights suggested challenges. We made sure that anything less than green was accompanied by actionable bugs to be addressed. At the end of most ship cycles are quality gates that must be met for a product to be released. Typically quality gates are test quality measurements, such as reliability, performance, and security. The rigorous procedures of our scorecarding method enabled us to adapt our scorecard to become part of the quality gate process. From our list of tasks we defined a subset considered 'ship-stoppers': critical tasks for which a failure to meet the criteria would lead to the bugs and issues being examined at a more detailed and senior review level to ensure that things were fixed. The User Experience Scorecard and task list were used to drive many product changes, but identifying and eliminating task seams was a major benefit of the method. There were challenges for the user researchers in driving the issues. Most of the URs worked closely with particular feature teams, but not with all the teams that might contribute to a particular experience, so staying up to date on relevant feature plans required additional effort. This was one of our bets in terms of allocating resources: we decided that the benefit to user experience of investing the time to track experiences across the product, mapped to user tasks, would be greater than additional individual depth in particular niche areas. It was better to make the effort required to work across experiences than to leave users to work across siloed experiences after the product shipped. With this investment we uncovered many seams that might not otherwise have been addressed in the product.

Emotional Connection
We were very much aware that an emotional experience is inextricably tied to satisfaction with a product, especially in the consumer market. At the time of working on Windows Vista we found methods that had been trialed to evaluate desirability, but the challenge was how to use these methods during the development phase and how to make the insights actionable. Benedek and Miner [1], members of the research team, created a desirability toolkit to help us evaluate these experiences. The tool is very simple, but it provided insightful data that the URs and the designers could collaboratively turn into impactful action. After interacting with a product or prototype, a user is asked to select, from a list of words, those words that they associate with the experience. The UR then discusses with the user why they selected particular words; the most important part of the assessment is the user's explanations. We used this tool during lab usability tests, benchmarks, and in the field (with an automated version of the tool).
Miner and Benedek were responsible for mining the themes across the studies and assessing how a particular lab study (or situation) may have influenced the selection of words and explanations. This was another example of how results from the deep product work were used to inform the broader experience of the product.
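Returning to the scorecard described earlier, here is a minimal sketch of its rating logic. The 80% completion criterion is the example given in the text; the lower threshold and the exact field-insight override rule are our assumptions for illustration, since the real criteria were negotiated per task with the development teams.

# Hypothetical sketch of the three-color scorecard rating. The 80%
# success criterion is the example given in the text; the 50% threshold
# and the field-insight override rule are assumptions for illustration.

def rate_task(completion_rate, field_concerns):
    if completion_rate >= 0.80:
        # Lab success can be demoted if field insights suggest problems.
        return "yellow" if field_concerns else "green"
    if completion_rate >= 0.50:
        return "yellow"
    return "red"  # anything below green must carry actionable bugs

print(rate_task(0.85, field_concerns=False))  # green
print(rate_task(0.85, field_concerns=True))   # yellow
print(rate_task(0.40, field_concerns=False))  # red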
Productivity
Early in the development of Windows Vista we were asked what we could do to demonstrate improved productivity with the use of Windows Vista. As we unpacked what productivity meant in the context of Windows Vista use, we realized that it would be a difficult concept to measure for enterprise workers. After exploring the topic further with field representatives who work with our enterprise customers, we learned that they were less interested in demonstrating improved productivity than in knowing how we would assist people in climbing the learning curve as they deployed the new operating system. This insight led to a different approach to understanding how the enterprise learning experience should unfold. The feedback told us that we didn't need to build everything into the product to remove a seam; in this case, a companion experience could solve the problem. We developed an Enterprise Learning Framework (ELF) [2]. Working with enterprise users, we reviewed what should be included in the ELF. It included a timeline (the week before deployment, the day of, the day after, etc.) and the topics that would be relevant to which users at each time. The topics then hooked up to the help system. In working through the topics we leveraged the insights URs had gained from working deep with feature teams to determine what would be useful to mention, or areas in which users might have difficulties. We provided guidance to User Assistance about content to cover, something that might not otherwise have been included. To accompany the website, a whitepaper was produced by Nowicki [3], which leveraged her learning from the research and the creation of the framework. A triumph of the framework was in responding to enterprise customers' requests that it include both Office information and Windows Vista information, since they roll out desktops (Office and Windows), not individual pieces. So again, the investment in tackling productivity as a cross-product experience paid off, rather than asking teams to think about productivity individually.

3.3 Real Time – Narrow: Product

Customer Feedback Panel
We wanted to know a lot about users' behavior with Windows XP. To understand how a very large group of users were using Windows XP, we invested in creating the Windows Customer Feedback Panel [5]. Windows XP itself is not instrumented, so we built a research platform that allowed us to upload data collection tools to PCs over the Internet, which would then collect data from those machines on a regular basis. We recruited participants who were willing to allow us to gather instrumented data from their computers and associate it with other data sets related to them, enabling us to ask follow-up questions. The advantage of leveraging a panel of known users is that we could profile characteristics of usage that applied to particular user groups. We could also survey this set of users as needs arose. Because of the flexibility of the research platform we could adjust the data we were collecting; when new questions came up, we could adjust the data collection tools to provide answers. As with all research, it was important to consider sample bias. Although we were gathering data from more than 10K users, we knew they were slightly more technical than average and were installing the data collection tools on home machines more often than work machines. This research platform allowed us to gather data we had not previously been able to get, and it was extremely good at gathering hardware,
configuration, file arrangements, and installed-application data. Understanding these dimensions of usage aided teams, such as the application compatibility team and the performance team, that we would not have been able to help using our regular user research.

Send-a-smile
By this point we understood what was happening on panelists' PCs; we also wanted to capture spontaneous emotional moments that arise during use. We created a tool called 'Send-a-smile' as part of the customer feedback toolset (Fig. 3). A green smiley and a red frowning face were situated in the system tray (the icons near the clock). When a user had a good moment she could click on the green face, or after a bad moment click on the red face. These would pop up a window with a text field for entering a comment and a screenshot of what was visible on the desktop. Comments and images were returned to us through the feedback tool. It was a very engaging tool to use, but as with all verbose feedback tools it was challenging to review all the feedback and turn it into actionable suggestions or bugs to be entered into the bug database [6, January 2007]. We used Send-a-smile to gather feedback on the use of Windows XP and Vista, but it was product agnostic and was also leveraged by other teams at Microsoft.

Fig. 3. Send-a-smile

Customer Adoption Panel
Windows has extensive beta programs, but most people who participate in them, especially in operating system betas, tend to be relatively technically minded. We knew it was important to include less technical home users in the beta programs to get a rounded view of bugs and feedback on experiences. The research team owned the consumer adoption program for Windows Vista and had participants from throughout the US and overseas [6, January 2007]. The research program, called "Living with Windows Vista", was an opportunity to obtain all the usual bug feedback required from betas while also leveraging our research toolkit to evaluate additional dimensions of experience and use. This panel was relatively small (approximately 30 families) but we had deep engagement with them. The panel was invaluable: not only did it generate unique bugs, but we also used our observations to change features, and several default settings, based on problems encountered.

3.4 Real Time – Broad: Life Studies

Exploratory Ethnographies
The real time–broad cell covers work that is essentially about understanding people without intervention, or with as little intervention as possible. Two anthropologists were on the research team. They were tasked with exploratory work. Their research areas were broad and not necessarily tied to technology; they could consider areas that might benefit from the introduction of technology. This set of work included research
in different geographical locations to understand emerging markets, the digital divide, the relationship between baby-boomers and their parents, the dawn-to-dusk lives of small businesses, and other topics [6, 2005]. Each of the projects was uniquely designed; for example, some were single-day shadowing of participants, others were longitudinal over the course of a year. The challenge with this type of work was to allow sufficient freedom in the research to truly enable discovery about people's lives. The second challenge was how to share the insights from this work with the engineering and product marketing teams. One strength of the work was in creating team-member empathy for people and situations. This led to devising creative ways to communicate the findings, including photo-story narrations at the espresso coffee stand [4], posters in the buildings, and engagement through events related to the populations studied. Not every observation leads to feature improvement, but the work provides the rich perspective on people's lives and contexts that enables team members to see how our products, or potential products, might fit into those lives.

Customer Engagements
Getting product teams involved in site visits is an activity that has been promoted for many years. We invested time in programs that weren't research but were designed to drive empathy with customers. When team members are empathetic with their users, they are more receptive to recommendations from user research. We created programs entitled 'Know-a-Knowledge-Worker' and 'Get-to-Know-an-IT-Pro'. Senior team members and executives were assigned a participant and provided with sufficient guidance to conduct a site visit; they then spent time with a targeted customer to understand what they did in their day-to-day life at work, traveling to interact with them in their work context. The participants were not recruited based on their use of a particular technology, but based on what they did at work. We kept the reporting requirements from the visits to a minimum, as at the end of the day the benefit was having more than 100 people on the team who had experienced what their customers would be doing. It was clear that the visits made an impression, as references to them would come up in discussions during development.
4 Summary

Although I have mapped the research that took place for Windows Vista onto the User Research Framework, it is important to realize that the quadrants didn't act in isolation. It was the rich, integrated insights gained from working in all these ways that provided us with a holistic view of our customers. The framework also provided a way of describing the size of the investment in each quadrant. Teams get anxious when they can't clearly see a connection between research and specific feature impact. Even with this framework, the majority of resources are invested in narrow product work, which is the most obvious opportunity to impact the product; however, we know from our experience that paying attention to the other quadrants has valuable impact on the experience in less obvious ways. Many of the programs and tools established during the Windows Vista development cycle have continued to be used and enhanced by the
Windows 7 team, by other user research teams at Microsoft, and even to assist in the marketing of Windows Vista.
References

1. Benedek, J., Miner, T.: Measuring Desirability: New Methods for Evaluating Desirability in a Usability Lab Setting. In: Usability Professionals' Association 2002 Conference Proceedings (2002), http://www.microsoft.com/usability/UEPostings/DesirabilityToolkit.doc
2. Enterprise Learning Framework, http://www.microsoft.com/technet/desktopdeployment/bdd/elf/welcome.aspx
3. Nowicki, J.: Learning Windows Vista in the Work Place. Microsoft white paper (2006), http://download.microsoft.com/download/d/9/b/d9b9587e-427c-439f-b90c-69a1e643de4c/windows_vista_workforce.doc
4. Steele, N., Lovejoy, T.: Engaging our Audiences through Photostory. Visual Anthropology Review 20(1) (2004)
5. Windows Customer Feedback, http://wfp.microsoft.com/Welcome.aspx
6. Microsoft Presspass: Large-Scale Research Project Aims to Make Windows Vista Useful, Fun for All (January 2007), http://www.microsoft.com/presspass/features/2007/jan07/01-29LivingWithVista.mspx; Making Technology Conform to People's Lives, Interview with Anthropologist Tracey Lovejoy (April 2005), http://www.microsoft.com/presspass/features/2005/apr05/04-04Ethnographer.mspx; Developers Tap Real-Life Families to Find Out What Consumers Really Want from Windows Vista (January 2007), http://www.microsoft.com/presspass/features/2007/jan07/01-29VistaDevelopers.mspx
A Usability Evaluation of Public Icon Interface Sungyoung Yoon, Jonghoon Seo, Joonyoung Yoon, Seungchul Shin, and Tack-Don Han Dept. of Computer Science, Yonsei University, 134, Seodaemun-Gu, Seoul, 120-749, Republic of Korea {freesiz,jonghoon.seo,jyyoon}@msl.yonsei.ac.kr, seungchul.d.shin@samsung.com, hantack@kurene.yonsei.ac.kr
Abstract. Existing image code interfaces need an additional visual marker and an explanation of the service. To overcome these limitations, there has been research on using a public icon as an anchor. The public icon is human-readable and does not need an additional visual marker or explanation. In this paper, we carry out a usability evaluation of the public icon interface with a high-fidelity prototype, in comparison to existing image codes. In addition, we analyze user preferences from the results. From the analysis, we found that the public icon interface is better suited to use in public, because it is familiar to people, does not need additional materials or much cognitive load, and is in good harmony with current environments.

Keywords: Public icon, pictogram, color-based image code, image code, barcode.
2 Analysis

In this section, our goal is to analyze technical issues regarding data type and coverage. Existing barcodes, such as the 1D barcode [1], QR code [2], ColorCode [4], and the pictorial image code [9], put information in the code itself, in the form of an index or full text, and offer various types of service to the user after being decoded. Each image code has its own data type and data size [10]. QR code can store full text of up to 300 characters [11], [12]. ColorCode and the pictorial image code can store an index that can distinguish up to 17 billion items [13]. The data size of these image codes can be increased by attaching extra rows or columns without exceeding technological limits. Therefore, these various types of image codes have enough data coverage as anchors to provide a specific service to the user. On the other hand, the public icon only indicates what it means after being recognized [5]. Additionally, the data capacity of the public icon cannot be increased, in contrast to the other image codes. Therefore, the public icon interface essentially has to reference location information in order to provide an appropriate service to the user. It is reasonable to supplement this lack of stored information in this way, as no two pictograms indicate the same meaning at the same location. However, this method has a potential for error, because some places, such as a bus stop, have two pictograms with the same meaning within the GPS error range. We can remove this potential for error by using context from the user's own mobile device, such as a mobile phone or a PDA. There are many kinds of context in the user's own device: a user profile, schedule, text or multimedia messages, and phone records are good examples. Almost all cases of two bus-stop pictograms within a GPS error range simply indicate opposite directions. Considering the example we already implemented [5] and other context-aware applications [14], it is not difficult to deduce the appropriate direction from the context in the device. Though this method reduces service performance, the service can be expanded for a variety of purposes and becomes convenient to use. Table 1 summarizes this analysis.
Item                       | Data Type | Data Size                                         | Expansion | Weakness
QR code                    | Full-text | Varies by size (about 100 to 300 characters)      | O         | Blurring due to out of focus
ColorCode / Pictorial Code | Index     | Varies by size (17 billion patterns for 5x5 code) | O         | Color variation by illumination
Public Icon                | Index     | Just indicates what it means                      | X         | Shortcoming of data storage, low performance

Fig. 1. Examples of various image codes: (a) barcode, (b) QR code (2D), (c) ColorCode, (d) pictorial image code
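To make the disambiguation step described above concrete, the following is a minimal sketch of how a recognized public icon could be resolved to a single service using GPS plus device context. All names in it (the service records, the destination_direction context field, and the 30 m error radius) are hypothetical illustrations, not part of the implementation reported in [5].

```python
# Hypothetical sketch: resolving a recognized public icon to one service.
# The flat service list, the context fields, and the 30 m GPS error radius
# are illustrative assumptions, not the system described in the paper.

from math import hypot

GPS_ERROR_RADIUS_M = 30.0  # assumed GPS error range

def resolve_service(icon_type, position, services, context):
    """Pick the service of type `icon_type` nearest to `position`.

    services: list of dicts with 'icon', 'pos' (x, y in metres) and
              'direction' (e.g. 'northbound'), standing in for the real
              location-indexed service database.
    context:  device context such as {'destination_direction': 'northbound'},
              drawn from the user's schedule, messages, or call records.
    """
    nearby = [s for s in services
              if s['icon'] == icon_type
              and hypot(s['pos'][0] - position[0],
                        s['pos'][1] - position[1]) <= GPS_ERROR_RADIUS_M]
    if len(nearby) == 1:
        return nearby[0]
    # Ambiguous case, e.g. two bus-stop pictograms for opposite directions
    # within the GPS error range: fall back to context from the device.
    hint = context.get('destination_direction')
    for s in nearby:
        if s.get('direction') == hint:
            return s
    return nearby[0] if nearby else None
```

The context lookup is consulted only when GPS alone is ambiguous, which matches the observation above that the context fallback costs performance and should be a last resort.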
3 High-Fidelity Prototyping and Experimental Evaluation

To evaluate usability, we improved the high-fidelity prototype used in our previous study [5], and then carried out a usability evaluation with eighteen undergraduate students.

3.1 High-Fidelity Prototyping

We evaluated usability with eighteen participants using the high-fidelity prototype. We used a UMPC on which the public icon decoding program implemented in our previous study [5] was installed. The UMPC (SONY VAIO VGN-UX57LNS) has an Intel Core 2 Solo processor (1.2 GHz), 1 GB of DDR2 SDRAM, and a 1.3-megapixel CCD on the back side [15]. For the recognition algorithm we used AForge.NET 1.5.0.0, a C# framework designed for developers and researchers in the fields of computer vision and artificial intelligence [16]. Figure 2 depicts the overall system flow, including the recognition algorithm.
[Figure 2: the recognition algorithm (capture icon, binarization, noise filtering, code-area extraction, rotation, recovery from geometric distortion, down-sampling, with up to three recognition attempts), the framework (user profile, context such as location and time, and user-defined parameters feeding a predefined query maker), and the web service (HTTP request, service daemon, integrity check, loading of service data, HTTP response).]
Fig. 2. Flowchart of the public icon interface system: (top left) recognition algorithm, (top right) framework, (bottom) web service
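As a reading aid for the flowchart, here is a minimal restatement of the recognition loop in Python. Every stage function is a hypothetical placeholder for the corresponding AForge.NET-based image operation in the prototype, and the three-attempt limit mirrors the 'PID < 3?' test in the figure; this is a sketch of the control flow only, not the authors' code.

```python
# Hypothetical restatement of the Fig. 2 recognition loop. The stage
# callables passed in stand for the AForge.NET-based image operations
# used in the actual prototype; none of them are real library calls.

MAX_ATTEMPTS = 3  # corresponds to the "PID < 3?" test in Fig. 2

def recognize_public_icon(capture, binarize, filter_noise, extract_code_area,
                          correct_rotation, recover_distortion, downsample,
                          query_database, http_request):
    for _ in range(MAX_ATTEMPTS):
        image = filter_noise(binarize(capture()))   # preprocess the frame
        region = extract_code_area(image)
        if region is None:
            continue  # no code area found: recapture and try again
        region = recover_distortion(correct_rotation(region))
        code = downsample(region)
        result = query_database(code)  # predefined query maker + database
        if result is not None:
            return http_request(result)  # hand off to the web service
    return None  # recognition failed after three attempts
```

A failed capture or an unmatched database query simply triggers another pass rather than an error, which is how the flowchart's 'No' branches behave.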
In addition, samples of the public icon (Fig. 3), attached to real-size pictogram boards, were provided. We chose three types of sample in order to compare the public icon interface with the other image code interfaces. The first type is a plain public icon. The second is a ColorCode placed on the public icon; in this case, the public icon does not work as a visual marker, but the ColorCode does. We chose ColorCode instead of QR code because ColorCode harmonizes with diverse materials [13]. The last type is a pictorial image code. Fig. 3 shows an example of each of the three types.

3.2 Experimental Evaluation

The eighteen participants in the experiment were undergraduate students not specialized in computer engineering, none of whom had previous experience with an image code interface. All of them owned camera phones. Each was allowed five minutes to experiment. Once a user launches the public icon decoding program (Fig. 4), a real-time preview image is shown. If an icon or an image code appears in the preview image, the user can see the contents related to it by pressing the 'recognition' button. We assumed that samples bearing the same public icon provide the same service.
Fig. 3. Samples of the public icon: (a) Public icon (b) ColorCode on public icon (c) pictorial image code
Fig. 4. Public icon decoding program: (left-top) preview image, (left-bottom) current location, ‘image recognition’ button and user information, (right) service contents
Table 2. Results of the survey (Bad/Low: 0-33.3%, Average/Medium: 33.4-66.6%, Good/High: 66.7-100%)

Item                        | Public Icon | ColorCode + Public Icon | Pictorial Code
Harmony with environments   | Good        | Average                 | Average
Familiarity                 | Good        | Average                 | Bad
Physical effort             | Medium      | Medium                  | Low
Cognitive load              | Medium      | Medium                  | Good
Understanding what it means | Good        | Good                    | Average
Speed performance           | Bad         | Average                 | Average
After the experiment, we surveyed the users' responses with a questionnaire, permitting multiple answers. According to the survey, users felt that the public icon interface is in harmony with its environment and familiar to them. They also replied that the physical effort and cognitive load required to use the public icon interface are no worse than for the others, and that understanding what it means is very easy. However, they reported that the speed performance of the public icon interface is worse than that of the others. Table 2 shows the results.
4 Discussion

Physical Restrictions, Speed Performance. The public icon interface has a drawback in service provision related to the limitation of data size, because the public icon cannot store any information itself, whereas all of the existing image codes can. To compensate for this shortcoming, the public icon interface needs a GPS sensor. This seems reasonable, because in most cases there are no two identical public icons with different purposes at the same location; moreover, it is encouraging that GPS phones have become popular. There are possible exceptions, however, a bus stop being a good example. As mentioned above, we can remove these exceptional cases by using context from the user's own mobile device, though we should bear in mind the drop in performance that this method causes.

Harmony, Cognitive Load, Understanding. The public icon is already familiar to everyone, because most people have seen it for a long time; therefore, people feel that the public icon is in harmony with the environment of its facilities. However, this very familiarity makes its presence harder to notice compared with the other image codes, whose unique patterns or colors tend to attract attention. On the other hand, understanding what the public icon means is easier than for the pictorial image code: even people who have never decoded an image code with a camera phone already use the public icon as a signpost. From this we can infer a weak trade-off between cognitive load and understanding what the icon means.

User Interface. The public icon interface decodes a camera image only once, when the user presses the 'recognition' button. This method lowers the hit rate compared with a decoding method that repeatedly decodes every preview image in
the background. However, an important characteristic of public icons is that several of them are installed together at one location; the current decoding method is therefore appropriate for avoiding decoding a public icon from which the user does not want a service. Additionally, most public icons in the real world (as opposed to those on maps) are mounted in high positions and are large. This makes it possible to recognize a public icon from a distance and to share the service with more people at a time. On the other hand, a user in close proximity has the inconvenience of looking up at a high-mounted public icon through the phone.

For the Public Purpose. As noted several times in this paper, the public icon interface has many merits in comparison with the other image codes, in spite of a few demerits: it is human-readable, familiar, in harmony with the environment, and needs no additional explanation. Considering these advantages, the public icon interface is outstanding for public-oriented context-awareness services. It makes sense that everyone could obtain a public service simply by pointing a camera phone at a public icon after downloading a simple service application; this can also save time and trouble for governments, facilities, and society. However, some users may struggle to predict what service will be provided, and a variety of services may be mapped to a single public icon. In such cases, sufficient advertisement of the service by the facilities in charge would be a good answer.
5 Conclusion and Future Work

In this paper, we have performed a usability evaluation of the public icon interface, which uses a public icon as a visual marker. From the results of the evaluation, we identified disadvantages of the public icon interface, such as the difficulty of confirming which public icon will provide a service and of predicting what service will be provided before decoding. The necessity of GPS is also a drawback, although GPS phones are expected to become popular soon because of their usefulness [17]. In addition, the participants pointed out the lower speed performance compared with the other image codes. On the other hand, the public icon interface needs no additional materials or explanation on facilities to show its meaning or existence; the public icon is already in harmony with the environment; and the public icon is much bigger than the other image codes, so a user can decode it from much farther away. Therefore, numerous users can access the service conveniently, even in a crowded place. From this research, we found that the public icon interface is suitable for public purposes, and we identified what should be improved to use the public icon as a service pointer instead of the image codes. To improve the public icon interface, we plan to research an invisible image code to attach to public icons and a recognition algorithm for multiple public icons.
Acknowledgement This work was performed in the research project 'Mobile Computing Based Context-Awareness Service Framework' (11052), supported by the Seoul R&BD Program.
References 1. Barcodes, ISO Standards (September 2007), http://www.iso.org/iso/en/ CombinedQueryResult.CombinedQueryResult?queryString=bar+code/ 2. Info Plant Conducts Survey on QR Code. DigInfo (September 21, 2005), http://www.diginfo.tv/archives/2005/09/21/ info_plant_conducts_survey_on_2.html (September 2008) 3. Pavlidis, T., Swartz, J., Wang, Y.P.: Information Encoding with Two-Dimensional Bar Codes. IEEE Computer 25(6), 18–28 (1992) 4. ColorCode. ColorZip Media Inc. (September 2008), http://www.colozip.com 5. Kim, D., Shin, S., Yoon, S., Han, T.: Public Icon Communication Service System: A Human Readable Tag Interface for Context-awareness Service. In: The 10th International Conference on Ubiquitous Computing Poster 6. Rohs, M., Roduner, C.: Camera Phones with Pen Input as Annotation Devices. In: Proceedings of the Workshop PERMID (2005) 7. Roduner, C., Rohs, M.: Practical issues in physical sign recognition with mobile devices. In: Strang, T., Cahill, V., Quigley, A. (eds.) Pervasive 2006 Workshop Proceedings (Workshop on Pervasive Mobile Interaction Devices, PERMID 2006), Dublin, Ireland, May 2006, pp. 297–304 (2006) 8. ISO 7001:1990 Public information symbols, ISO Standards, http://www.iso.org/iso/iso_catalogue/catalogue_tc/ catalogue_detail.htm?csnumber=13565 (September 2008) 9. Cheong, C., et al.: Pictorial Image Code: A Color Vision-based Automatic Identification Interface for Mobile Computing Environments. In: Proceedings of the Eighth IEEE Workshop on Mobile Computing Systems and Applications (WMCSA 2007) (February 2007) 10. General EAN.UCC Specification v.6.0 (November 2006), http://www.ean.se/GSV6.0/HTML_Files/ Document_Library/Bar_Code/00103/index.html 11. International Standard ISO/IEC 18004, Information technology: Automatic identification and data capture techniques - Bar code symbology - QR Code, 1st edn., ISO/IEC (2000) 12. Info Plant Conducts Survey on QR Code, DigInfo, http://www.diginfo.tv/archives/2005/09/21/ info_plant_conducts_survey_on_2.html (October 2006) 13. Cheong, C., Kim, D.-C., Han, T.-D.: Usability Evaluation of Designed Image Code Interface for Mobile Computing Environment. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 241–251. Springer, Heidelberg (2007) 14. Chen, G., Lotz, D.: Survey of Context-Aware Mobile Computing Research, Dartmouth Computer Science Technical Report TR2000-381 15. VAIO Online Korea UX57LN/S, http://vaio-online.sony.co.kr/CS/handler/vaio/kr/ VAIOPageViewStart?PageName=notebook/enjoy/ UX57LNSN.icm&ProductID=UX57LNSN 16. AForge.NET Framework, http://www.aforgenet.com/framework/ 17. Kaasinen, E.: User Needs for Location-aware Mobile Services. Personal and Ubiquitous Computing 7(1), 70–79 (2003)
Little Design Up-Front: A Design Science Approach to Integrating Usability into Agile Requirements Engineering Sisira Adikari, Craig McDonald, and John Campbell Faculty of Information Sciences and Engineering, University of Canberra ACT 2601 Australia {Sisira.Adikari,Craig.McDonald,John.Campbell}@canberra.edu.au
Abstract. In recent years, Design Science has gained wide recognition and acceptance as a formal research method in many disciplines, including information systems, yet Design Science research in Human-Computer Interaction remains scarce. HCI is a discipline primarily focused on design, evaluation, and implementation, where design plays a role both as a process and as an artefact. In this paper, we present a design science approach using "Little Design Up Front" to integrate the User-Centred Design perspective into Agile Requirements Engineering. We also present the results of two agile projects that validate the proposition that incorporating a UCD perspective into Agile Software Development improves the design quality of software systems. Keywords: Design Science, Agile Requirements Engineering, Usability.
issues into finished products. As a result, end-user experience and satisfaction are directly affected. In this paper, we present a design science approach using "Little Design Up Front" to integrate the User-Centred Design (UCD) perspective into Agile Requirements Engineering. We also present the results of two agile software projects that validate the proposition that incorporating a UCD perspective into ASD improves the design quality of software systems.
2 Design Science

Design Science is a problem-solving paradigm that aims at creating and evaluating innovative artifacts addressing important and relevant organizational problems [7]. According to March and Smith, there are two fundamental design science processes, 'build' and 'evaluate', and four types of products: 'constructs', 'models', 'methods' and 'instantiations'. A construct forms the vocabulary of a domain, a model is a set of propositions expressing relationships among constructs, a method is a set of steps used to perform a task, and an instantiation is the realization of an artifact in its environment [8].

2.1 Design Science Research for Information Systems

In recent years, design science has gained wide recognition and acceptance as a formal research method in many disciplines, including Information Systems (IS). The Design Science paradigm has its roots in engineering and the sciences of the artificial [9]. Simon distinguished natural science from design science in that the former is concerned with how things are and the latter with how things ought to be [9]. Behavioral science research originates in natural science and aims at developing and justifying theories that explain or predict the organizational and human phenomena surrounding the analysis, design, implementation, management, and use of information systems. Design Science Research (DSR), on the other hand, aims at creating innovations that define ideas, practices, technical capabilities, and products through the analysis, design, implementation, management, and use of information systems [7], [8]. Since creating design solution artifacts for an important problem in Human-Computer Interaction (HCI) is a combined effort of the behavioral science and design science paradigms, the two research paradigms complement each other: behavioral science attempts to "understand" the problem, while design science attempts to "solve" it. According to Iivari [10], design science contrasts with natural-behavioral science research: the latter aims at finding empirical regularities, whilst design science aims at building artifacts. Hevner et al. [7] presented an IS research framework that combines both paradigms for understanding, executing, and evaluating IS research (see Figure 1). In the IS research framework, the Environment defines the scope of the problem domain, which includes organizations, technology, and people. IS Research is the research effort conducted by applying behavioral science, through the use of theories that explain or justify the business problem, and design science, to address the building and evaluation of artifacts designed to meet the identified business need. The Knowledge Base encompasses all the theoretical foundations, including the research methodologies and the kernel theories.
Fig. 1. Information Systems Research Framework [7]
In a recent paper, Hevner [11] further elaborated the IS research framework in terms of three inherent DSR cycles, to enhance the understanding of high-quality DSR in IS. Hevner pointed out that these three research cycles must be present and clearly identifiable in a DSR project. The research cycles within the IS research framework are shown in Figure 2.

Fig. 2. Design Science Research Cycles [11]

According to Hevner, the relevance cycle connects the contextual environment of the research project with the design science activities. Its main focus is to capture the problem to be addressed, or the requirements for the research, and to provide design solution artifacts to the environment for study and evaluation in the application domain. The rigor cycle connects the design science activities with the knowledge base that informs the research project; that is, it ensures innovation by providing existing knowledge to the research. The knowledge base consists of foundations, existing experience and expertise, and existing artifacts and processes. The main focus of the rigor cycle is to provide applicable knowledge for design science activities
and to feed back the updated knowledge to enrich the knowledge base. The internal design cycle iterates between the core activities of building and evaluating the design artifacts and processes of the research. The main focus of the design cycle is to create, evaluate and refine design artifacts until a satisfactory design is achieved. For this research project, we have deployed the information systems research framework together with the DSR cycles (Figures 1 and 2 above).
3 Agile Requirements Engineering and Practice

The main distinction between Agile Requirements Engineering (RE) and traditional RE is that the former welcomes rapidly changing requirements, even late in the software development process, whereas the latter gathers and specifies requirements up front, prior to software development. The dynamic nature of most organizations makes continuously changing requirements normal, so it is difficult to gather and specify complete, stable and accurate requirements up front. Rapid changes in competitive threats, stakeholder preferences, development technology, and time-to-market pressures make pre-specified requirements inappropriate [12]. A recent empirical case study [13] of ten software development organizations identified seven key agile RE practices: face-to-face communication over written specifications, iterative requirements engineering, requirements prioritization, managing requirements change through constant planning, prototyping, test-driven development, and the use of review meetings and acceptance tests. These practices are in line with agile principles [14] such as: satisfy the customer through early and continuous delivery of valuable software; welcome changing requirements even late in development; deliver working software frequently; business and developers work collaboratively throughout the project; build projects around motivated individuals; face-to-face conversation as the most efficient and effective method of communication; working software as the primary measure of progress; promote sustainable development; continuous attention to technical excellence and good design; simplicity; self-organizing teams; and regular reflection to become more effective.
4 User-Centred Design Integration with Software Engineering

In the HCI literature, many user-centric methods and techniques have been proposed to assist the production of usable, useful, and desirable software products [15], [16], [17]. Software product development nevertheless still follows a software development process in which functionality is the main priority. According to the literature, SE and HCI are largely two distinct communities. For the IEEE [18], SE is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, whereas HCI is a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use in a social context, and with the study of the major phenomena surrounding them [19]. Importantly, HCI is by no means considered a central topic in SE, and usability is treated as one of many non-functional requirements and quality attributes [20].
As recently reported in the literature, there is growing interest in incorporating a user-centric perspective into SE practice, so that usability awareness becomes widespread and software products become more user-centred and usable [21], [22]. This integrated approach is known as Human-Centred or User-Centred Software Engineering. Seffah et al. [23] discussed some of the most relevant HCI and SE integration frameworks and highlighted their strengths and weaknesses, as well as their level of objectivity in integrating HCI methods and principles into different software engineering methods. The frameworks they summarized were found to be useful for usability and software specialists interested in the development of methodologies and standards, who have researched or developed specific user-centered design techniques, or who have worked with software development methodologies. Generally, these frameworks provided insights into how to integrate user-centered best practices and user experiences with software engineering methodologies [20]. Discussing the importance of user modeling and usability modeling for user-centred software requirements, Adikari et al. [4] presented a framework for integrating the ISO 13407 process model into a typical software development life cycle. The particular emphasis of the framework was its potential for defining requirements that are more user-centred and task-oriented, with a shorter turnaround time.
5 Little Design Up-Front

Traditional RE stresses that requirements elicitation and specification must be complete up front, prior to software development. Similarly, UCD assumes that contextual research and design take place at the start of the project, providing detailed design information for subsequent development and evaluation. In agile environments, this assumption does not hold. Rather than defining requirements up front, agile software processes follow an evolutionary approach, defining requirements during the course of analysis; this is known as Just-In-Time (JIT) requirements analysis. As far as UCD is concerned, at least a little contextual information must be available to support the creation of design artifacts before proceeding further; a pure JIT design approach is therefore quite difficult and not appropriate for creating UCD-focused artifacts in agile environments. As a practical solution, we propose Little Design Up Front (LDUF): an approach that provides only the UCD information needed to support analysis and design in agile iterations. The objective is to provide just enough LDUF information to support the popular agile JIT analysis and design, so that the UCD perspective can be considered without overloading existing agile practices. The LDUF is drawn from design solutions created in a DSR setting using environmental requirements and applicable knowledge from the knowledge base, as shown in Figure 3. Figure 3 is similar to Figure 2, except that the relevance cycle of Figure 2 has been replaced with Requirements (an input from the environment to the DSR) and Solutions (an output from the DSR to the environment); these changes are in line with Figure 1, where Requirements and Solutions are represented by Business Needs and Application in the Appropriate Environment, respectively. Moreover, the emphasis on Create Little Design Artifacts is shown within the DSR.
Fig. 3. Design Science Research Cycles with LDUF
6 Research Design

This research consisted of two agile projects. The first project was conducted as a baseline against which to compare the project incorporating user-centred design. It was a typical agile project with three iterations, and its research design is shown in Figure 4.
Fig. 4. Research design – Agile project 1
There were three defined roles in project 1: Product Owner, Agile Coach, and Agile Team. The product owner provided abstract-level requirements for both projects and participated in tasks related to product backlog analysis. The agile coach provided direction to the project and was responsible for removing any process impediments. The agile team made the decisions necessary to achieve the goals of each iteration and carried out the software development. The second agile project was directed by a different agile coach, and two user-centred designers worked with a new agile team on the design analysis, providing the LDUF. The research design of the second agile project is shown in Figure 5.

6.1 Research Process

There were different agile teams and agile coaches for projects 1 and 2, and there was no other cross-over of resources except the product owner, who provided the business requirements of an accommodation management system for both projects. The product owner was part of each team and was available in all iterations for requirements verification and validation. Project 1 ran first, with three iterations. The first iteration focused on requirements analysis and setting up the product
backlog. The agile team worked under the guidance and direction of the agile coach to produce working software. At the end of the first iteration, the agile team formally presented the first version of the working software to the product owner for assessment. In consultation and agreement with the product owner, the product backlog was then updated and the second iteration was planned. The second and third iterations were conducted in the same way as the first, under similar agile settings and principles. At the end of the third iteration, the product owner formally assessed the final product delivered by the first project (P1) and signed it off. The second project ran in a similar fashion, except that two user-centred designers were engaged consistently with the team to put forward LDUF for design analysis. They worked very closely with the agile team and the product owner to create and assess paper-based artifacts in support of analysis, verification and validation. At the end of the third iteration, the product owner and the user-centred designers formally assessed the final product delivered by the second project (P2) and signed it off.

Fig. 5. Research design – Agile project 2
7 Product Evaluation

Products P1 and P2 were subjected to one-on-one usability evaluations with 16 participants, randomly drawn from a large pool of users. The evaluation ran
in three stages. First, product P1 was evaluated with the first 8 participants (group U1). Second, product P2 was evaluated with the second 8 participants (group U2), followed by the first 8 participants (U1). Third, product P1 was evaluated with the second 8 participants (U2). We followed this approach to minimize any learning-effect bias in the assessments. We used a number of scenarios to guide each participant through the product and the assigned user tasks. After the evaluation, each participant was given a pack containing the Product Reaction Cards (PRC) [23] and the System Usability Scale (SUS) [24] questionnaire. Participants were asked to refer to the PRC and tick all the words that best described their user experience with the product, then to prioritize the five words they thought most descriptive of the product, and finally to explain why they chose those five words. We used the Product Reaction Cards to help participants think deeply about their interaction experience. Finally, each participant was asked to fill out the SUS questionnaire.
8 Results

A repeated-measures Analysis of Variance (ANOVA) was conducted on the data for each question of the SUS questionnaire for both products. The aim was to determine whether there was a significant difference in agreement between the user groups in relation to their interaction with products P1 and P2. Table 1 shows the mean response values for each product, the statistical significance levels, the difference between mean values, and the percentage change in mean values.

Table 1. Analysed results: Product P1 and P2
According to Table 1, for each question there is a positive difference in user agreement in favour of Product P2. Importantly, the differences for Q3, Q4, and Q7 are statistically significant (taking p < 0.05 as the significance level), indicating that Product P2 is easy to use (Q3) and easy to learn (Q7), and that Product P1 requires additional support to be usable (Q4). Table 2 shows the SUS percentage for P1 and P2 reported by each participant. The mean for P1 is 47.31 and for P2 is 52.95, a difference of 5.64; the relative SUS difference between P1 and P2 is 11.92%. Accordingly, product P2 was found to offer better usability than product P1.
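For reference, SUS scores are computed from the ten questionnaire items using the standard rule from [24]. The sketch below shows that scoring rule together with a paired comparison of per-participant scores for the two products; with only two conditions, a paired t-test is equivalent to the per-question repeated-measures ANOVA used above. This is an illustrative sketch, not the analysis script used in the study.

```python
# Standard SUS scoring (Brooke [24]): odd-numbered items contribute
# (rating - 1), even-numbered items contribute (5 - rating); the sum of
# the ten contributions is multiplied by 2.5 to give a 0-100 score.

from scipy import stats

def sus_score(ratings):
    """ratings: the ten Likert responses (1..5) in SUS item order."""
    assert len(ratings) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)  # index 0 is item 1 (odd)
                for i, r in enumerate(ratings))
    return 2.5 * total

def compare_products(sus_p1, sus_p2):
    """Paired comparison of per-participant SUS scores for P1 and P2."""
    return stats.ttest_rel(sus_p1, sus_p2)
```

Because every participant eventually used both products, pairing the scores per participant is the appropriate error structure, which is why a repeated-measures design was used rather than an independent-groups test.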
Table 2. SUS values for Product P1 and P2
9 Conclusion

This paper presented the results of two agile projects that validate the proposition that incorporating a User-Centred Design perspective into Agile Software Development improves the design quality of software systems. A design science approach using "Little Design Up Front" was used to integrate the User-Centred Design perspective into the development process. The results show that users find products developed with this approach easier to learn and easier to use, and that such products require less support to be usable.
Acknowledgements We would like to thank Andrew Boyd, Donna Spencer, Dulan De Silva, Evan Laybourne, Narayanan Srinivasan, Rowan Bunning and Sandun Kodithuwakku for their advice and support in this research project.
References 1. Jokela, T.: Guiding Designers to the World of Usability: Determining Usability Requirements through Team Work. In: Human-Centred Software Engineering – Integrating Usability in the Software Development Lifecycle, pp. 61–78 (2004) 2. Sommerville, I.: Software Engineering, p. 122. Pearson Addison-Wesley, England (2004) 3. Bevan, N.: Design for Usability. In: Proceedings of HCI International, pp. 762–767 (1999) 4. Adikari, S., McDonald, C., Lynch, N.: Design Science-Oriented Usability Modelling for Software Requirements. In: Proceedings of HCI International, pp. 373–382 (2007) 5. Kane, D.: Finding a Place for Discount Usability Engineering in Agile Development: Throwing Down the Gauntlet. In: Proceedings of the Agile Development Conference, pp. 40–46 (2003)
6. Düchting, M., Zimmermann, D., Karsten, N.L.: Incorporating User Centered Requirement Engineering into Agile Software Development. In: Proceedings of HCI International, pp. 58–67 (2007) 7. Hevner, A., March, S.T., Park, J., Ram, S.: Design Science Research in Information Systems. MIS Quarterly 28(1), 75–105 (2004) 8. March, S.T., Smith, G.F.: Design and Natural Science Research on Information Technology. Decision Support Systems 15(4), 251–266 (1995) 9. Simon, H.: The Sciences of the Artificial. MIT Press, Cambridge (1996) 10. Iivari, J.: A Paradigmatic Analysis of Information Systems as a Design Science. Scandinavian Journal of Information Systems 19(2), 39–64 (2007) 11. Hevner, A.: A Three Cycle View of Design Science Research. Scandinavian Journal of Information Systems 19(2), 87–92 (2007) 12. Merisalo-Rantanen, H., Tuunanen, T., Rossi, M.: Is Extreme Programming Just Old Wine in New Bottles: A Comparison of Two Cases. Journal of Database Management 16(4), 41–61 (2005) 13. Cao, L., Ramesh, B.: Agile Requirements Engineering Practices: An Empirical Study. IEEE Software 25(1), 60–67 (2008) 14. Manifesto for Agile Software Development, http://agilemanifesto.org/ 15. Nielsen, J.: Usability Engineering. Academic Press, San Diego (1993) 16. Mayhew, D.J.: The Usability Engineering Lifecycle. Morgan Kaufmann, San Francisco (1999) 17. Constantine, L.L., Lockwood, L.A.D.: Software for Use: A Practical Guide to the Models and Methods of Usage-Centered Design. Addison-Wesley, Boston (1999) 18. IEEE: IEEE Std 610.12-1990. IEEE Standard Glossary of Software Engineering Terminology. IEEE, New York (1990) 19. ACM SIGCHI: Curriculum for Human-Computer Interaction. ACM Press, New York (1992) 20. Seffah, A., Desmarais, M.C., Metzker, E.: HCI, Usability and Software Engineering Integration: Present and Future. In: Human-Centered Software Engineering - Integrating Usability in the Software Development Lifecycle, vol. 8. Springer, Heidelberg (2005) 21. Zimmermann, D., Grötzbach, L.: A Requirement Engineering Approach to User Centered Design. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4550, pp. 360–369. Springer, Heidelberg (2007) 22. Seffah, A., Gulliksen, J., Desmarais, M.D. (eds.): Human-Centered Software Engineering - Integrating Usability in the Development Process. Springer, Heidelberg (2005) 23. Benedek, J., Miner, T.: Measuring Desirability: New Methods for Evaluating Desirability in a Usability Lab Setting. In: Proceedings of UPA, Orlando, Florida (2002) 24. Brooke, J.: SUS: A Quick and Dirty Usability Scale. In: Jordan, P.W., McClelland, I.L., Thomas, B. (eds.) Usability Evaluation in Industry, pp. 189–194. Taylor and Francis, London (1996)
Aesthetics in Human-Computer Interaction: Views and Reviews Salah Uddin Ahmed1, Abdullah Al Mahmud2, and Kristin Bergaust3 1 Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway 2 Department of Industrial Design, Eindhoven University of Technology, The Netherlands 3 Faculty of Art, Design and Drama, Oslo University College, Norway salah@idi.ntnu.no, a.al-mahmud@tue.nl, kristin@anart.no
Abstract. There has been growing interest in aesthetics in Human-Computer Interaction (HCI) in recent years. In this article we present a literature review investigating where and how aesthetics has been addressed by HCI researchers. Our objective is to identify the sectors of HCI in which aesthetics has a role to play. Aesthetics can be the common interest that involves both art and technology in HCI research, so that each discipline benefits from the other through mutual interaction. Keywords: Aesthetics, interaction, usability, art and technology.
aesthetics has been used by researchers, in which contexts it is used, and with what meaning. We would like to categorize the field of human-computer interaction into clusters of thematic areas in order to understand what the field looks like when viewed from the standpoint of aesthetics. The objective is to understand the meaning of aesthetics, remove confusion, and understand the uses, applications and possibilities of aesthetics in the context of human-computer interaction. The use of aesthetics in computing, especially in human-computer interaction, has attracted much attention recently. The call for papers of CHI 2008 (Conference on Human Factors in Computing Systems), with its slogan "Art, Science, Balance", shows how well the influence of aesthetics, and of art in general, has lately been recognized by the research community. But as the area is still very new, and as HCI researchers are not experts on art and aesthetics, there is room for misunderstanding. An investigation of our current understanding, and a snapshot of the field viewed from the perspective of aesthetics, will therefore help researchers to better understand the position and role of aesthetics in human-computer interaction. Moreover, aesthetics, as a link between technology and art, can help future researchers open up more possibilities for collaboration between art and human-computer interaction. The rest of the paper is organized as follows. Section 1.1 presents our research background and Section 1.2 describes the research method. Section 2 gives definitions of aesthetics and its importance in HCI. Section 3 provides the results of the review. Section 4 concludes the paper with a discussion and our viewpoints on the results obtained from the review.

1.1 Research Background

The research was conducted as part of the SArt project [1] within the Software Engineering group at the Department of Computer Science of the Norwegian University of Science and Technology. Our ultimate objective is to propose, assess, and improve methods, models, and tools for software development in an art context while facilitating collaboration with artists. As part of the research in SArt, we performed a literature review to conceptualize the intersection of software and art [2]. From that review we discovered that the intersection involves people from diverse backgrounds and interests: art critics, software developers, educators and so on. This is also visible in the software-dependent art projects in which SArt group members have participated. From our experience of working with artists, we have seen that technologists and artists hold different viewpoints: they differ in how they work as well as in how they make meaning of different concepts. In a field involving both technologists and artists, many issues have to be considered, such as having a common language, good collaboration, the coexistence of artistic and technical processes, and the evaluation of products from both technical and aesthetic viewpoints. This work is in line with our investigation of how to improve the collaboration between artists and technologists by working together and learning from each other. As part of that goal, we look here into the field of human-computer interaction to see how aesthetics, a concept from the arts, has
been applied by technologists, and to map the common ground of artists and technologists where aesthetics is the bridge between them.

1.2 Research Method

We conducted a systematic review of the literature published in recent conference proceedings and journals that are easily accessible from the ACM, IEEE and Springer digital libraries. We began the review following Kitchenham's principles [3]. At first we selected papers by reading titles and abstracts; later we modified the process, as we could not find many papers from titles and abstracts alone. We searched each entire article with a keyword search and, when a match was found, read the abstract and the part of the article containing the keyword. We discarded articles that merely mentioned the word without any significant relationship between aesthetics and the main work of the paper; otherwise we read the full article and added it to our final selection. A total of 67 papers were selected from the following top-level conference proceedings, limited to the years indicated:
• Conference on Human Factors in Computing Systems, CHI – years 1997 to 2008
• Conference on Designing Pleasurable Products and Interfaces, DPPI – years 2003, 2005, 2007
• DIS conferences – with no time limit
• TOCHI – with no time limit
• NordiCHI and HCII – with no time limit
The keywords we used were (a) aesthetics and (b) aesthetic. We chose both 'aesthetics' and 'aesthetic' so as not to miss mentions of words such as 'aesthetically'.
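The screening step just described can be summarized programmatically. The sketch below is a simplified illustration of the keyword-based filtering logic; the article records and the is_substantive check are hypothetical stand-ins for the manual title, abstract and full-text reading that was actually performed.

```python
# Simplified illustration of the screening procedure of Section 1.2.
# The record fields and the relevance check stand in for the manual
# reading done by the authors; they are not part of the actual study.

KEYWORDS = ("aesthetics", "aesthetic")  # "aesthetic" also matches derived
                                        # forms such as "aesthetically"

def is_substantive(article):
    # Placeholder for the manual judgement that aesthetics is
    # significantly related to the paper's main contribution.
    return True

def screen(articles):
    """articles: iterable of dicts with 'title', 'abstract', 'fulltext'."""
    selected = []
    for a in articles:
        text = " ".join((a["title"], a["abstract"], a["fulltext"])).lower()
        if not any(k in text for k in KEYWORDS):
            continue  # the keyword never appears: discard immediately
        if is_substantive(a):
            selected.append(a)
    return selected
```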
2 Aesthetics and HCI

New conferences and workshops exploring the consequences of the integration of computing into everyday life are emerging, and new terms are entering the HCI vocabulary: emotion, pleasure, experience, expression, and indeed aesthetics. Aesthetics is increasingly viewed as a key issue with respect to interactive technology. But aesthetics is a general term, often used in association with other terms; here we present what it actually means and how it is defined. The Oxford English Dictionary gives the following definitions of aesthetics: (i) the science that treats the conditions of sensuous perception; and (ii) the philosophy or theory of taste, or of the perception of the beautiful in nature and art [4]. In the preface to the Encyclopedia of Aesthetics, one of the most comprehensive references on this topic, Kelly states [5], "Ask contemporary aestheticians what they do, however, and they are likely to respond that aesthetics is the philosophical analysis of the beliefs, concepts, and theories implicit in the creation, experience, interpretation, or critique of art." The Encyclopædia Britannica defines aesthetics as the philosophical study of the qualities that make something an object of aesthetic interest, and of the nature of aesthetic
value and judgment [6]. To define its subject matter more precisely is, however, very difficult; Britannica notes that such self-definition could be said to have been the major task of modern aesthetics.

2.1 Why Aesthetics

Human-computer interaction started in the computing field and is now extending its scope in many directions by drawing in behavioral science, psychology, sociology and so on. Nevertheless, it has been alleged that human-technology interaction has focused almost exclusively on goal-driven behavior in work settings [7]. As computing expands its domain from the workplace to pervasive and domestic environments, interest in aesthetics in design is increasing within HCI. Gaver and Martin suggest that technology should address non-instrumental user needs, such as surprise, diversion, or intimacy [8]. Jordan proposed a hierarchy of such needs and claimed that, along with the functionality and usability of a system, different aspects of pleasure are important for enhancing the user's interaction with it [9].

2.2 Types of Aesthetics

In the context of HCI, two types of aesthetics are distinguished in [10]: classical aesthetics and expressive aesthetics. Classical aesthetics refers to traditional notions emphasizing orderly and clear design; expressive aesthetics refers to a design's creativity and originality. Studies show that classical aesthetics is perceived more uniformly by users, whereas perceptions of expressive aesthetics can vary with framing effects or with different cultural and contextual stimuli [7]. Several quality dimensions are mentioned in the literature: ergonomic, hedonic, instrumental and non-instrumental. Ergonomic quality comprises quality dimensions related to traditional usability, i.e., efficiency and effectiveness [11]. Hedonic quality comprises quality dimensions with no obvious relation to the task the user wants to accomplish with the system, such as originality, innovativeness, and beauty. Elsewhere, instrumental and non-instrumental quality are used in relation to the perception of user experience; these are essentially the same as ergonomic and hedonic quality [7].
3 Aesthetics in Human-Computer Interaction

Aesthetics comes into play at many stages and in many ways in HCI. In this section we present the different areas and contexts in which aesthetics is mentioned and addressed in the reviewed literature. From the review, we identified the following key areas in which aesthetics is used in HCI: artifact design; system design; attractiveness and look and feel of the user interface (UI); interaction with a system; usability and user experience; and research methods for HCI. The following table lists the articles according to the themes or areas of HCI they address.
Table 1. Thematic subject areas where aesthetics has been addressed in HCI literature

Themes                                 | What is addressed
Artifacts Design                       | Design of artifacts and gadgets, evaluation of artifacts, environment-related design of artifacts (ubiquitous computing)
System Design                          | Software applications, tools, artistic software, games [20], [21], [22]
Attractiveness and Look and Feel of UI | User interface of an application, mobile phone, web sites etc.
Interaction with a system              | Interactive art installations, museum guide, interactive learning system, ATM machines etc.
Usability and User Experience          | Users' feelings, emotion, usability
HCI Research methods                   |
3.1 Artifact Design

With the recent shift from a narrow focus on work to a broader view of interaction, industrial designers, communication designers, and newly minted interaction designers have all begun to play more important roles in the invention and development of new artifacts meant to address a broad set of problems and opportunities [14]. The design of future information appliances may benefit from considering the aesthetic aspects of the gadgets [8]. Digital technologies employed in everyday settings can combine concepts from both work and play, and such devices often act not merely as emulators or information sources but create a new form of appreciation, both conceptual and aesthetic [16]. Often these devices are created by artists and are displayed as interactive or digital art in art galleries.

Evaluation of Artifacts. Aesthetics is an issue not only in the design of these diverse computing devices and artifacts but also in their evaluation. After running a survey on heuristic evaluation of the match between the design of ambient displays and their environments, the authors of [35] assert, "The display should be pleasing when it is placed in the intended setting."

Ubiquitous Computing. Areas such as ubiquitous computing, augmented reality, and physical computing have made it evident that the personal computer is just one of many possible ways to design how humans interact with computers [12]. The design of these devices should be done carefully, so that it considers the contextual qualities of the environment, such as aesthetics, emotions and aspirations, whether the devices are placed indoors in homes or museums or outdoors in public places [15], [36], [34]. For example, one should notice an ambient display because of a change in the data it is presenting, not because its design clashes with its environment [35].
3.2 System Design

By system design, we mean the context of creating new tools or software applications. Hedonic quality plays a substantial role in forming users' judgments of appeal, and it should be explicitly taken into account when designing a software system [11]. In [17], the authors present a technique for creating a new kind of tool for 3D drawing. In creative settings where innovation and novelty are sought, artists and technologists work together in close collaboration.

3.3 Attractiveness and Look and Feel of UI

Aesthetics has been addressed by many articles concerning the attractiveness and look and feel of websites. There is already a transition towards aesthetically pleasing interfaces, and it will continue as more importance is placed on the aesthetics of user interfaces and as proper tools become available to interface designers for creating them [20]. A theoretical framework for assessing the attractiveness of web sites is introduced in [22], whereas aesthetics is considered a criterion for rating web sites in [23]. Aesthetic factors beyond usefulness and traditional usability are increasingly recognized as contributing to the overall success of a product or system [24], [25].

3.4 Interacting with a System

Aesthetics and interaction are interwoven concepts rather than separate entities [26]. In the aesthetics of interaction, the emphasis shifts from an aesthetically controlled appearance to an aesthetically controlled interaction, of which appearance is a part. The aesthetics of interaction moves the focus from ease of use to enjoyment of the experience [27].

Mixed Reality and Virtual Reality. The design of mixed reality or virtual reality devices is driven by many contextual requirements, of which aesthetics is an important part [36]. Artistic association is also important in virtual reality systems: "The curtain rain was chosen for its aesthetic qualities, both in terms of its striking visual image and sound, its asymmetric transparency, and not least, due to the artistic association of projecting a virtual desert into a curtain of water" [19].

Interactive Art. Interactive art is a new kind of art that is highly dependent on technology and user interaction. Such artworks often illustrate interdisciplinary collaboration between research, design, craft and art, and involve interaction with the user in new or innovative ways, as in the case of computational composites [12] or the computational textile kit in [29]. In [28], the authors present a method developed to support the design of innovative interactive technology.

3.5 Usability and User Experience

The use of aesthetics is not always warmly accepted by HCI researchers. In fact, as mentioned in [37], it is often seen by many professionals as inversely proportional to ease of use, i.e., usability. There has been a continuing debate on the conflicting impacts of usability and aesthetics in HCI [38]. Later, however, many researchers worked on
the positive impact of aesthetics. Empirical evidence now shows correlations between the perceived aesthetic quality of a system's user interface and overall user satisfaction [31], [32], leading to claims that aesthetic design can be a more important influence on users' preference than traditional usability [25]. Usability is important, but good aesthetic design can compensate for some usability deficits. Indeed, usability and user experience are related to the appraisal of a system, which depends on both instrumental and non-instrumental qualities [7].

3.6 HCI Research Methods

HCI has emerged as a design-oriented field of research, directed largely towards the innovation, design, and construction of new kinds of information and interaction technology. Three accounts of design theory have been named: the conservative account, the romantic account, and the pragmatic account, of which the pragmatic account is the one that considers issues such as creativity, craft, and aesthetics [33]. HCI researchers have adopted approaches based on the traditions of artist-designers; thus new methods have been developed in HCI, such as cultural probes, whose purpose is to inspire the creation of appropriate, pleasurable, even provocative designs [39].
4 Discussion and Conclusion

From the review we have seen where and how aesthetics has been used in the context of HCI. The outcome gives us a picture of the relationship between aesthetics and HCI. The consideration of aesthetics is visible in many sectors of HCI, from artifact design to research methods for collecting user data or evaluating artifacts. What we see from the review is that the most common use of aesthetics in HCI refers to visual aesthetics or expressive aesthetics. The conflict with usability also arises in the case of expressive aesthetics. We believe the contradiction arose because researchers referred only to visual or expressive aesthetics and compared its effects with usability. But aesthetics as a philosophy is a wide concept, not just the visual or static beauty of interfaces: it refers to the feelings associated with the use of, and interaction with, a system. Seen this way, the aesthetics of interaction does not conflict with usability; rather, usability is part of the aesthetics of interaction. High expressive aesthetics combined with low usability can affect the user's emotion negatively, and thus the aesthetics of interaction. A proper aesthetics of interaction should therefore define where and how expressive aesthetics is included, and how far it acts in alignment with usability and the overall user experience, affecting the user's emotion positively. In this paper, we have brought together different contexts of the use of aesthetics in HCI to present the different meanings and views attached to aesthetics in HCI. As technology advances, giving us more options for expressive and innovative interaction with computers, aesthetics in HCI will receive even more attention. A proper understanding of its meaning, value and impact will help future researchers to be more conscious of the role of aesthetics and possibly eliminate the confusion around it. Aesthetics can in that way bring together
artists, designers and technologists with creativity, inspiration and engagement in a collaborative and multidisciplinary milieu inside human computer interaction.
References 1. SArt Project, at Norwegian University of Science and Technology, http://prosjekt.idi.ntnu.no/sart/ 2. Ahmed, S.U., Jaccheri, L., Trifonova, A., Sindre, G.: Conceptual framework for the intersection of software and art. In: Braman, J., Vincenti, G., Trajkovski, G. (eds.) Handbook of Research on Computational Arts and Creative Informatics, Information Science Reference (2009) 3. Kitchenham, B.: Procedures for Performing Systematic Reviews. Keele University Technical Report TR/SE-0401 and NICTA Technical Report 0400011T.1 (2004) 4. Oxford English Dictionary, http://www.oed.com/ 5. Kelly, M. (ed.): Preface to Encyclopedia of Aesthetics, vol. 1. Oxford University Press, New York (1998) 6. Encyclopedia Britannica, http://www.britannica.com 7. Mahlke, S., Thüring, M.: Studying antecedents of emotional experiences in interactive contexts. In: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, San Jose, California, USA (2007) 8. Gaver, B., Martin, H.: Alternatives: exploring information appliances through conceptual design proposals. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 209–216. ACM, The Hague, The Netherlands (2000) 9. Jordan, P.W.: Designing pleasurable products. Taylor & Francis, London (2000) 10. Hartmann, J., Angeli, A.D., Sutcliffe, A.: Framing the user experience: information biases on website quality judgement. In: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pp. 855–864. ACM, Florence, Italy (2008) 11. Hassenzahl, M., Platz, A., Burmester, M., Lehner, K.: Hedonic and ergonomic quality aspects determine a software’s appeal. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 201–208. ACM, The Hague, The Netherlands (2000) 12. Vallgårda, A., Redström, J.: Computational composites. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 513–522. ACM, San Jose, California, USA (2007) 13. Lehn, D.: Engaging constable: revealing art with new technology. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 1485–1494. ACM, San Jose, California, USA (2007) 14. Zimmerman, J., Forlizzi, J., Evenson, S.: Research through design as a method for interaction design research in HCI. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 493–502. ACM Press, San Jose, California, USA (2007) 15. Tolmie, P., Pycock, J., Diggins, T., MacLean, A., Karsenty, A.: Unremarkable computing. In: Proceedings of the SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pp. 399–406. ACM Press, Minneapolis, Minnesota, USA (2002) 16. Gaver, W., Boucher, A., Law, A., Pennington, S., Bowers, J., Beaver, J., Humble, J., Kerridge, T., Villar, N., Wilkie, A.: Threshold devices: looking out from the home. In: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pp. 1429–1438. ACM, Florence, Italy (2008)
17. Schkolne, S., Pruett, M., Schröder, P.: Surface drawing: creating organic 3D shapes with the hand and tangible tools. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 261–268. ACM, Seattle, Washington, United States (2001) 18. Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., Cohen, M.: Gaze-based interaction for semi-automatic photo cropping. In: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 771–780. ACM, Montréal, Québec, Canada (2006) 19. Koleva, B., Taylor, I., Benford, S., Fraser, M., Greenhalgh, C., Schnädelbach, H., Lehn, D.v., Heath, C., Row-Farr, J., Adams, M.: Orchestrating a mixed reality performance. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 38– 45. ACM, Seattle, Washington, United States (2001) 20. Grossman, T., Kong, N., Balakrishnan, R.: Modeling pointing at targets of arbitrary shapes. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 463–472. ACM, San Jose, California, USA (2007) 21. Consolvo, S., McDonald, D.W., Toscos, T., Chen, M.Y., Froehlich, J., Harrison, B., Klasnja, P., LaMarca, A., LeGrand, L., Libby, R., Smith, I., Landay, J.A.: Activity sensing in the wild: a field trial of ubifit garden. In: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pp. 1797–1806. ACM, Florence, Italy (2008) 22. Hartmann, J., Sutcliffe, A., Angeli, A.D.: Investigating attractiveness in web user interfaces. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 387–396. ACM, San Jose, California, USA (2007) 23. Ivory, M.Y., Hearst, M.A.: Statistical profiles of highly-rated web sites. In: Proceedings of the SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pp. 367–374. ACM, Minneapolis, Minnesota, USA (2002) 24. Green, W.S., Jordan, P.W.: Pleasure With Products: Beyond Usability. Taylor and Francis, New York (2002) 25. Norman, D.A.: Emotional Design: Why We Love (Or Hate) Everyday Things Basic Books (2003) 26. Djajadiningrat, W., Gaver, W., Fres, J.W.: Interaction Relabelling and Extreme Characters: Methods for Exploring Aesthetic Interactions. In: Proceedings of the conference on Designing interactive systems: processes, practices, methods, and techniques (2000) 27. Isbister, K., Höök, K., Sharp, M., Laaksolahti, J.: The sensual evaluation instrument: developing an affective evaluation tool. In: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 1163–1172. ACM, Montréal, Québec, Canada (2006) 28. Ljungblad, S., Holmquist, L.E.: Transfer scenarios: grounding innovation with marginal practices. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 737–746. ACM, San Jose, California, USA (2007) 29. Buechley, L., Eisenberg, M., Catchen, J., Crockett, A.: LilyPad Arduino: using computational textiles to investigate engagement, aesthetics, and diversity in computer science education. In: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pp. 423–432. ACM, Florence, Italy (2008) 30. Tohidi, M., Buxton, W., Baecker, R., Sellen, A.: Getting the right design and the design right. In: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 1243–1252. ACM, Montréal, Québec, Canada (2006) 31. Tractinsky, N., Katz, A.S., Ikar, D.: What is beautiful is usable. J. 
Interacting with Computers 13, 127–145 (2000) 32. Lindegaard, G., Dudek, C.: What is this evasive beast we call user satisfaction? J. Interacting with Computers 15, 429–452 (2003)
568
S.U. Ahmed, A. Al Mahmud, and K. Bergaust
33. Fallman, D.: Design-oriented human-computer interaction. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 225–232. ACM, Ft. Lauderdale, Florida, USA (2003) 34. Ballegaard, S.A., Hansen, T.R., Kyng, M.: Healthcare in everyday life: designing healthcare services for daily life. In: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pp. 1807–1816. ACM, Florence, Italy (2008) 35. Mankoff, J., Dey, A.K., Hsieh, G., Kientz, J., Lederer, S., Ames, M.: Heuristic evaluation of ambient displays. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 169–176. ACM, Ft. Lauderdale, Florida, USA (2003) 36. Schnädelbach, H., Koleva, B., Flintham, M., Fraser, M., Izadi, S., Chandler, P., Foster, M., Benford, S., Greenhalgh, C., Rodden, T.: The augurscope: a mixed reality interface for outdoors. In: Proceedings of the SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pp. 9–16. ACM, Minneapolis, Minnesota, USA (2002) 37. Karvonen, K.: The beauty of simplicity. In: Proceedings on the 2000 conference on Universal Usability, ACM, Arlington, Virginia, United States (2000) 38. Norman, D.A.: The Design of Everyday Things. MIT Press, London (1998) 39. Wolf, T.V., Rode, J.A., Sussman, J., Kellogg, W.A.: Dispelling “design” as the black art of CHI. In: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 521–530. ACM, Montréal, Québec, Canada (2006)
Providing an Efficient Way to Make Desktop Icons Visible
Toshiya Akasaka and Yusaku Okada
Department of Administration Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, 223-8522 Japan
{to48_a,okada}@ae.keio.ac.jp
Abstract. Desktop icons allow users to access files and programs quickly. Some users struggle to adapt their window management strategy to secure the visibility of desktop icons. In this paper, we propose an approach that provides users with an efficient way to make desktop icons visible in order to reduce the workload of window management. The approach was developed based on careful consideration of the context in which we aim to help users. The experimental results showed that our approach made the process of making desktop icons visible faster. However, it was not confirmed that the workload of window management was reduced.
Keywords: Desktop icons, Display space management, Desktop environment, Window management.
and avoid automated window functions like maximize, imposing an additional burden on window management. Furthermore, reserving an area for desktop icons would leave a smaller area than the entire screen for placing open windows. In short, the idea of keeping desktop icons always visible does help users access icons quickly, but it also prevents them from concentrating on window management and, eventually, on their primary tasks. The goal of our research is to investigate an efficient way to access desktop icons. Our approach is to develop a method that provides an efficient way to make desktop icons visible. With such a method, users would not only be able to access icons more quickly, but would also be free to place windows over desktop icons, reducing the workload of window management. In this paper, we describe our approach and the experiments conducted to test its effectiveness. First, we describe in detail the context in which we aim to assist users. Then, our approach is described along with some comparisons with other approaches. Finally, we present the results of the experiments testing the effectiveness of our approach.
2 Context
The way to interact with desktop icons differs from user to user. The same user may also manage icons differently in different environments. An excellent way to help a certain user access icons in a particular setting may turn out to be useless once it is brought into use for a different user or in a different environment. We therefore need to confine our approach to a certain context. In this chapter, we describe the context in which we aim to help users make desktop icons visible.
2.1 Icon Management Strategy
Before describing the context, we must define the icon management strategy (IMS). An IMS is a strategy for accessing and placing desktop icons. The following four factors constitute an IMS:
1. How to make desktop icons visible. There are several ways to make desktop icons visible. Moving or iconifying windows is a common one. Special keyboard sequences such as <Windows + D> in Microsoft Windows also make icons visible. Alternatively, some users arrange windows carefully so that windows never cover important icons in the first place.
2. How to select a desktop icon. Usually desktop icons are selected by clicking the mouse, but the keyboard can also be used to select icons if the desktop window has the input focus. Voice instruction and finger touch are also options, as voice recognition and touch panels have gradually become robust.
3. Placements and sizes of desktop icons. Users exhibit various ways of placing desktop icons. The most usual placement is to have icons lined up on either side of the screen. However, some users like to group important icons and place them in an arbitrary area. The size of icons also
shows diversity: mostly, users can choose a favorite size from the three options of normal, small, and large.
4. Types and number of desktop icons. Virtually any file or program can be a desktop icon. Some users only place commonly used programs and temporary files that are useful for their primary tasks. Sometimes, on the other hand, desktop icons are chosen from an aesthetic viewpoint to decorate the desktop. The number of icons also varies greatly, ranging from a few to 30 or more.
Note that the first two factors determine how to access desktop icons, and that the latter two determine how to place them. Although decision making on these four factors is greatly dependent on users, user-independent elements can also affect decisions. For example, as for how to choose desktop icons' placements and sizes, some users like grouping important icons at an arbitrary place. However, they may alternatively have icons lined up on either side of the screen when they use a system with a small monitor or a low screen resolution; with little screen real estate (pixels), desktop icons have a high chance of being covered by windows unless lined up on either side of the screen. Here, the display configuration affects decision making on how to choose desktop icons' placements. Likewise, decision making on the other three factors can be affected not only by users but also by user-independent elements. Figure 1 shows the user-independent elements affecting decision making on the four factors of the icon management strategy and the relations of which element affects which factors. As shown in the figure, the window manager, input device, display configuration, and mode of using the computer are the user-independent elements affecting decision making; hereafter they are called environment elements. The details of the four environment elements are as follows:
1. Window Manager. A window manager is a module of system software providing users with services to control windows. It is responsible for providing users with desktop icons as well as docks, taskbars, program launchers, wallpaper, etc.
2. Input Device. This element relates to what types of input devices are available. A keyboard and mouse is a typical set of input devices, but as mentioned earlier there are other devices as well, such as voice input devices and touch panels.
3. Display Configuration. The display configuration consists of the size and number of monitors as well as the screen resolution.
4. Mode of Using Computer. This element relates to the purpose and authority of a user using the computer. People who use computers at work are likely to pursue high productivity, which is not always the case with people using computers at home. Home users have the right to modify the desktop as they want, while office users on shared personal computers may face some restrictions.
Fig. 1. Each environment element affects several factors of the Icon Management Strategy (IMS). This diagram shows which element contributes to the form of which factor.
These environment elements directly or indirectly affect the decisions made about the four factors of IMS. We already mentioned one example of how the display configuration affects one of the factors. The other elements also affect decision making on the factors. For example, some window managers do not have any special keyboard sequence to make desktop icons visible, affecting decision making on how to make desktop icons visible. Voice instruction, one way to select a desktop icon, is not viable unless a voice input device is available. Users on shared personal computers at work may not be able to create desktop icons as they want, restricting decisions made on the types and number of desktop icons. Likewise, the environment elements affect decision making on the four factors in many other ways, as indicated by the arrows in Figure 1.
2.2 Context of Our Approach
Having defined the icon management strategy, we are now in a position to describe the context in which we aim to help users, that is, what types of icon management strategies we assume. As stated in the introduction, our approach is to provide a new way to make desktop icons visible. This offers a new option for one of the four factors of IMS, namely, "How to make desktop icons visible." Obviously, the three other factors need to be kept consistent in order to evaluate the effect of our approach. However, as we have seen above, decisions on these factors are affected by the environment elements as well. Therefore, we need to specify these elements as well as the three factors. We first specified the environment elements so that the resulting elements would represent a typical office environment. Note that Windows XP (or below), which is
Environment Elements: Window Manager: Windows XP (or below). Input Device: keyboard and mouse. Display Configuration: 14''-24'' single monitor, 1024x768-1600x1200 pixels. Mode of Using Computer: using a personal computer at work.
Factors of IMS: How to select a desktop icon: mouse click. Placements and sizes of desktop icons: lined up on either side, normal size. Types and numbers of desktop icons: programs and temporary files, 10-20.
Fig. 2. The context in which we aim to assist users consists of the environment elements and the three factors of IMS. Arrows between the elements and factors are based on those shown in Figure 1.
actually an operating system, is designated as the window manager. We use the term Windows XP to indicate its window manager, which is tightly integrated with the operating system. Next, we specified the three factors of IMS so that the resulting strategy would be plausible under the specified environment elements. Figure 2 shows the specified elements and factors, which hereafter we collectively call the context.
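As a compact illustration, this context could be recorded as a simple configuration structure; the encoding and all key names below are ours, not part of the paper.

CONTEXT = {
    "environment": {
        "window_manager": "Windows XP (or below)",
        "input_devices": ["keyboard", "mouse"],
        "display": {"monitors": 1, "size_inches": (14, 24),
                    "pixels": ("1024x768", "1600x1200")},
        "mode_of_use": "personal computer at work",
    },
    "ims_factors": {
        "icon_selection": "mouse click",
        "icon_placement": "lined up on either side of the screen",
        "icon_size": "normal",
        "icon_types": ["programs", "temporary files"],
        "icon_count": (10, 20),
    },
}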
3 Approach
The context defined in the previous chapter provided a basis on which we investigated a way to make desktop icons visible without hindering users' window management. To achieve this, we developed an application called the Icon Space Saver (ISS). We decided to develop an application rather than modify the underlying window manager because Windows XP, chosen as part of the context, could not be modified. In this chapter, we describe the ISS and compare it with other possible approaches to helping users access desktop icons.
3.1 Icon Space Saver
The Icon Space Saver (ISS) allows users to designate an area around important desktop icons as the Icon Space (IS). Figure 3 shows how the ISS makes desktop icons visible. When launched, the ISS displays a semi-transparent vertical border on the desktop. The border can be moved only horizontally, by a grab-and-drag operation. The area between the border and the left edge of the screen is the IS. Windows staying within or straddling the border of the IS are automatically tossed out of the IS when the user moves the mouse cursor into any visible part of the IS.
Fig. 3. The Icon Space Saver (ISS) moves windows out of the Icon Space (IS) and brings them back to their original positions automatically; users have only to move the mouse cursor into and out of the IS. The ISS also prevents a maximized window from covering the IS.
After the user clicks an icon and moves the cursor out of the IS, all the tossed windows are brought back to their original positions. With this mechanism, users can make desktop icons visible just by moving the mouse cursor to the IS, requiring minimal additional mouse movement. They do not have to perform any window operations; covering windows move out of the IS and then return to their original positions automatically. The ISS also allows users to use the automated maximize function without covering desktop icons. When the maximize button of a window is pressed, the ISS catches the event and changes the behavior of the window so that it grows as large as possible without covering the IS. This offers users a simple one-click operation to make windows large while maintaining the visibility of desktop icons. (A sketch of this event logic is given at the end of this chapter.)
3.2 Comparisons with Other Approaches
The ISS is not the only possible option to assist users with making desktop icons visible. There are many other existing and potential solutions. In this section, we compare the ISS with some other approaches, and discuss the strengths and possible problems of the ISS.
Special Keyboard Sequence. Windows XP, part of the context of our approach, offers the special keyboard sequence <Windows + D> to hide all windows and make
desktop icons visible. With this function, users do not have to perform any window operation to make icons visible, and it makes all desktop icons visible, which is not always the case with the ISS. Two problems can be pointed out with this function. The first problem, discovered by Hutchings [1], is that it is hard to remember to use the operation, probably because it is not intuitive to type on the keyboard to make the desktop visible. The second problem is that the hidden windows are not restored automatically. This is problematic when the user wants a newly created window to be visible together with the existing windows. The ISS is free from these problems: it is intuitive to move the mouse cursor to the IS to access a desktop icon, and the ISS can restore the window layout automatically.
Quick Launch Tray. Windows XP offers a special interaction place called the quick launch tray, holding very small icons for program launchers. Among those icons is a special icon to make the desktop visible. Located on the taskbar, the quick launch tray is free from being covered by windows, so the icon for making the desktop visible has the advantage of being always accessible. Like <Windows + D>, the icon also has the advantage of making all desktop icons visible. These two advantages cannot be found in the ISS. However, the required operation for this function lacks intuitiveness, just as <Windows + D> does. It also requires the user to move the mouse cursor all the way to the bottom of the screen. The very small size of the icon is also a problem: the well-known Fitts's law [2] (movement time grows roughly as a + b log2(1 + D/W), where D is the distance to the target and W is its width) says that the smaller the target, the longer it takes to place the mouse cursor on it. The ISS, on the other hand, requires minimal additional mouse movement and does not need the cursor to be placed at a particular position.
Machine Learning Scheme. As machine learning algorithms have proven practical, a new approach to developing and improving user interfaces has emerged which incorporates machine learning to adapt the user interface to each individual user. For example, there have been efforts to make the UNIX command-line shell predict the user's next command based on the record of commands issued in the past [3]. A similar approach could potentially be used to assist users with accessing desktop icons: it may be possible to predict the exact file that the user wants to access next and bring the icon for that file to the top, near the current position of the mouse cursor. However, we avoided this approach. One reason is that stable access patterns to desktop icons are unlikely to exist. For commands in the UNIX shell, where programming is one of the most common tasks, some patterns are likely to exist; the context of our approach, in contrast, assumes that users use desktop icons for general purposes, which makes pattern extraction difficult. Our approach, the ISS, may require users to search for the icon they want among many icons, but in return it makes important icons visible without any misprediction.
Compared with other approaches, the ISS thus has some advantages and possible problems. The comparison with the machine learning scheme especially clarifies an important characteristic of the ISS, namely its priority on simplicity and stability over sophisticated technology.
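The following is a minimal sketch, in Python, of the ISS event logic described in Section 3.1. The window-manager facade (wm) and all of its methods are hypothetical stand-ins for the Win32 calls a real implementation would use; only the tossing/restoring behavior itself follows the description above.

class IconSpaceSaver:
    def __init__(self, wm, border_x):
        self.wm = wm                # hypothetical window-manager facade
        self.border_x = border_x    # right edge of the Icon Space (IS)
        self.saved = {}             # window handle -> original left position

    def on_mouse_move(self, x, y):
        if x < self.border_x:       # cursor entered a visible part of the IS
            # Toss every window overlapping the IS out of it, remembering
            # the original position so it can be restored later.
            for w in self.wm.list_windows():
                left = self.wm.window_left(w)
                if left < self.border_x and w not in self.saved:
                    self.saved[w] = left
                    self.wm.move_window(w, new_left=self.border_x)
        elif self.saved:            # cursor left the IS: restore the layout
            for w, left in self.saved.items():
                self.wm.move_window(w, new_left=left)
            self.saved.clear()

    def on_maximize(self, w):
        # Grow the window as large as possible without covering the IS.
        screen_w, screen_h = self.wm.screen_size()
        self.wm.set_geometry(w, left=self.border_x, top=0,
                             width=screen_w - self.border_x, height=screen_h)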
4 Experiment
The aim of the ISS is to provide users with an easy way to make desktop icons visible and, by doing so, to reduce the burden of window management and increase overall productivity. In this chapter, we describe the experiments that we conducted to test whether the ISS really achieves this aim. 14 persons took part in the experiments as subjects; all of them usually used computers in contexts similar to that of our approach. The display of the computer used in the experiments was a single 15-inch monitor with a resolution of 1024 x 768; this configuration was also chosen in line with the context of our approach.
4.1 Efficiency to Make Desktop Icons Visible
The ISS allows users to make desktop icons visible without any window operations and is obviously efficient in terms of workload. However, it needed to be confirmed that the ISS can also make the process of making desktop icons visible faster than window operations. Therefore, we first conducted experiments to examine how much the ISS could reduce the time it takes to make desktop icons visible.
Fig. 4. Without the ISS, task completion times get longer as the number of windows initially covering the desktop icons increases, while the times are stable over the three conditions when using the ISS
The experiment task was to click the desktop icon specified by voice instruction. A task began with a voice instruction and lasted until the subject clicked the specified icon. 17 desktop icons, all representing text files, were lined up on the right edge of the screen. The area around the desktop icons was initially covered by windows, so subjects first needed to move windows away from the area before clicking the specified icon. To move windows without using the ISS, they had to either iconify or grab-and-drag windows; it was up to each subject which of the two operations to use. When using the ISS, on the other hand, subjects were requested to use the ISS's function without performing any window operation. Subjects performed the task under the
three conditions, which differed in the number of windows that initially covered the area around the desktop icons; the number varied from 1 to 3. For each condition, a subject performed the task five times, and each time the task completion time was measured. Figure 4 shows the completion times averaged over all the subjects (14 subjects x 5 trials = 70 trials), with the error bars indicating the 95% confidence intervals. When subjects performed the task without using the ISS (w/o ISS), the task completion time got longer as more windows covered the area around the desktop icons. This rise in time is inevitable, as subjects needed to perform window operations on each window. In contrast, when subjects used the ISS (w/ ISS), the task completion times were stable over the three conditions, with relatively small deviations. This resulted in a performance gain for the conditions of 2 and 3 windows, and the differences were statistically significant at the 0.01 level (paired t-test). However, the ISS did not show any improvement for the condition of one window. In short, with only one window covering the desktop icons the ISS gave the subjects the same level of speed as usual window operations, but the ISS maintained that level even when several windows covered the desktop icons, effectively bringing a performance gain. From these results, we can conclude that the ISS can at least maintain the same level of speed as window operations in making desktop icons visible, and that it raises that level when several windows cover the desktop icons.
4.2 Productivity of Primary Task
Having confirmed that the ISS could make the process of making desktop icons visible faster, we then conducted experiments to examine whether the ISS could improve the overall productivity of a typical task for office workers. The experiment task was to compile a spreadsheet document. Figure 5 shows a typical screenshot of the desktop during the task. In the spreadsheet (lower right in Figure 5) there was a matrix with rows representing persons and columns representing items of information. The task was to fill the matrix by gathering information spread over many files placed on the desktop. The persons in the rows of the matrix were divided into four groups. Pieces of information in files other than the spreadsheet were described by group (upper part of Figure 5). Mappings between persons and groups were not shown in the spreadsheet, but in a separate text file (lower left). This situation caused subjects to keep the text file always visible, usually at the bottom right of the screen, as that place kept the mapping information from being occluded. Consequently, the desktop icons that subjects needed to access to gather information were frequently covered by the text file's window. In addition, subjects needed to manage three windows, as shown in Figure 5, or at least two (the spreadsheet and the text file). This created a situation in which subjects faced a dilemma between maintaining the visibility of desktop icons and managing windows. The 14 subjects were divided into two groups of 7. One group performed the task on a set of files with the ISS, and then did the task on a similar set of files without the ISS. The other group also performed the task twice, but in the reversed order: first without the ISS and then with it. This design was meant to counterbalance the learning effect. For each trial, the task completion time was measured.
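As a side note, the paired significance test used for these comparisons could be reproduced with a few lines of Python; scipy is assumed, and the argument arrays stand for the measured per-trial completion times, which are not reproduced here.

from scipy import stats

def compare_conditions(times_without_iss, times_with_iss, alpha=0.01):
    # Paired t-test over per-trial completion times of the two conditions.
    t, p = stats.ttest_rel(times_without_iss, times_with_iss)
    return p < alpha, t, p   # (significant at alpha?, t statistic, p value)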
Fig. 5. In order to fill the matrix of the spreadsheet, subjects needed to access several desktop icons while keeping the spreadsheet and text file visible
Fig. 6. The ISS did not make a statistically significant difference, although the sampled data showed an improvement of about 5% on average as well as a reduction in deviation (mean completion times: 791 s without the ISS, 751 s with the ISS)
The results are shown in Figure 6. It was not confirmed that the ISS made a statistically significant difference, although the sampled data showed a performance gain of about 5%. In addition, the subjects showed a smaller deviation when using the ISS. As mentioned in chapter 3, the ISS can make desktop icons visible without any window operation. This might have contributed to more stable window management, which in turn led to the smaller deviation.
5 Conclusion
In this study, we developed the Icon Space Saver (ISS), which aims to provide users with an efficient way to make desktop icons visible, thereby reducing the workload of window management. The experimental results showed that the ISS does make the process of making desktop icons visible faster. However, it was not confirmed that the ISS could reduce the workload of window management and raise the overall productivity of primary tasks. Confirming this requires further investigation, which will be the focus of a future study.
References
1. Hutchings, D.R., Stasko, J.: Revisiting Display Space Management: Understanding Current Practice to Inform Next-generation Design. In: Proceedings of Graphics Interface 2004, pp. 127–134 (2004)
2. Fitts, P.M.: The Information Capacity of the Human Motor System in Controlling the Amplitude of Movement. Journal of Experimental Psychology 47(6), 381–391 (1954)
3. Davison, B.D., Hirsh, H.: Experiments in UNIX Command Prediction. Technical Report ML-TR-41, Department of Computer Science, Rutgers University (1997)
An Integration of Task and Use-Case Meta-models
Rémi Bastide
IRIT – Université de Toulouse, ISIS – CUFR J.F. Champollion, Castres, France
Remi.Bastide@irit.fr
Abstract. Although task modeling is a recommended practice in the Human-Computer Interaction community, its acceptance in the Software Engineering community is slow. One likely reason for this is the weak integration between task models and other models commonly used in Software Engineering, notably the set of models promoted by the mainstream UML method. To overcome this problem, we propose to integrate the CTT model of user tasks into the UML, at the meta-model level. CTT task models are used to provide an unambiguous model of the behavior of UML use-cases. By so doing, we also bring the benefit of hierarchical decomposition of use-cases ("extend" and "include" relationships) to CTT. In our approach, CTT tasks also explicitly operate on a UML domain model, by using OCL expressions over a UML object model to express the pre- and post-conditions of tasks.
CTT models [8]) offer several advantages over the latter two, notably due to the richness of the temporal operators available. This is increasingly important, since modern user interfaces (direct-manipulation, multi-modal...) depart from the old-fashioned conversational, question-answer style, and are almost impossible to model with sequence diagrams. It is also a routine practice to develop an analysis model of the business objects of the system under design (the so-called "domain model") early in the development process, in order to precisely identify the business objects, their structure and their mutual relationships. This domain modeling is performed using UML class diagrams, leaving out premature implementation-related considerations. The main point of this paper is to promote CTT task models as the behavioral language for use cases. To this end, we first introduce our view of the expected design process. We then show how the metamodel of CTT can be tightly integrated into the UML metamodel of use-case diagrams, so that the notions of extend and include relationships become meaningful for CTT task models.
2 Design Process
For the sake of efficiency, formal modeling work has to be guided by strong methodological, process-oriented guidelines. A design process defines in which order the various artifacts have to be produced during the software lifecycle, the expected contents of these artifacts, and what information is needed as input and produced as output of the various modeling and design activities. The work presented here deals mainly with the initial phases of the process, namely requirements engineering and preliminary design.
• The goal of the requirements engineering phase is to form a consensus between the stakeholders (mainly the customer and the analysis team) regarding what the problem actually is, and what has to be developed in order to solve it. The main outcome of this phase is a common understanding between the customer and the development team of the problem domain: no work on the solution domain should be performed at this phase.
• Work on the solution domain begins at the preliminary design phase: this is where the first decisions on software architecture are made, and where the best practices of interaction design (in particular iterative prototyping with increasing fidelity level) should be used.
Of course, we do not recommend a strict separation between these two phases: it is quite common that work performed at the preliminary design phase uncovers new, unforeseen insights on the requirements, and that some iteration has to be performed between the two. Although iteration is frequent between these phases, it should always remain clear to the various actors whether they are working on the problem domain (i.e. the requirements) or on the solution domain (i.e. the design). Our claim is that task modeling is especially useful during the requirements engineering phase, and that it nicely complements the domain models and use-case
models that are developed during this phase. At this stage, class models are used to provide an analysis-level model of the domain (they formalize the vocabulary of the business domain), while use-cases and use-case diagrams are used to provide a user-oriented view of the system functionality. The natural language scenarios that are associated with use-cases are essential in easing the construction of a common understanding of the problem between the stakeholders, since they are written in the vocabulary of the business and can be understood and validated by the customer. Our view that task models are essentially a requirements analysis tool contradicts several authors, who recommend using task models at the design phase, for instance to drive the generation of dialogue [6] or abstract interface models. In our approach, requirement task models necessarily remain at a rather abstract level, since at this stage the user interface is not (and should not be) yet under design. It follows that requirement task models should not mention any user-interface specifics: rather, the task models will drive the user-centered design of the UI in the subsequent phases, where the user interface specialist will strive to design an interface that is best suited to the user task, while taking into account the limitations inherent to the target platform of the interactive system. We do not believe that (except maybe in very stereotyped situations, such as business form-filling applications) a satisfactory user interface can be automatically generated from a task model. Rather, in our view, the task model can be used as a test case for the user interface that will be designed using user-centered techniques such as incremental low-fidelity prototyping. To allow for the smooth integration of task models in the software design lifecycle, we propose to integrate task models and use-cases at the meta-model level [5, 14], thus opening the way for efficient use of Model-Driven Engineering (MDE) techniques such as model weaving and model transformation. The process we advocate is inspired by the "essential use-cases" work proposed by Constantine and Lockwood [4] and the work in [13]. In particular, since use-cases are meant to be an input to interaction design, they should be devoid of any specific reference to the user interface; otherwise they would be a premature commitment to a user interface design, before this design has been presented to and validated by users through low-fidelity prototyping. We propose that CTT task models should serve as the behavioral language for use-cases. In this usage of task modeling, task models are meant to provide an abstract view of the user's activity, exploring their goals as well as the temporal and causal structure of their interactions with the system. Task models are thus the formal counterpart of the natural language, narrative descriptions of scenarios that are routinely associated with use-cases, and that are still quite useful: natural language scenarios are ideal to communicate and form consensus with the customer, and can be developed and validated with the customer during brainstorming sessions. Task models, on the other hand, are useful to communicate with the design team, since they convey a precise semantics of the dynamics of human-computer interaction that has to be supported by the software to be produced. Fig. 1 illustrates our view of the early stages of the design process, highlighting the strong bonds between use-cases, the domain model and task models that are the main outcomes of the requirements analysis phase.
Fig. 1. First stages of the design process
3 Related Work
The need to bridge the gap between the current practices of Software Engineering (centered on UML diagrams) and user-centered design (including task analysis and modeling) has been stressed by numerous authors, and a remarkable variety of solutions to this problem has been proposed. The very father of the CTT notation [12] has identified the main trends of work in this field:
• Representing CTT by an existing notation: Nobrega et al. [10], for instance, provide semantics of the temporal operators of CTT in terms of UML activity diagrams. Nunes et al. [9] use the extension mechanisms provided by the UML (profiles, stereotypes) to represent the concepts of CTT in a UML framework.
• Developing automatic converters from UML to task models [6] (and back, we should add). It can be contended that, in the HCI literature, one can find proposals for generators from any kind of model to any other kind.
• Building a new UML for interactive systems "which can be obtained by explicitly inserting CTT in the set of available notations" [10]. This is the trend we follow in this paper, by integrating a metamodel of CTT inside the metamodel of UML itself.
Although we share the goals expressed in [10], our technical proposal is quite different from the one presented there.
− In the first place, we work formally at the metamodel level, whereas only a rough sketch of a solution was provided in [10]. We believe that the explicit use of metamodels brings several fundamental advantages, including the opportunity to use existing MDE tools such as model transformation languages or model weavers to extend the potential use of models. We have demonstrated this advantage in previous work [1], by showing how the notion of human errors can be integrated in task diagrams through the use of error patterns and automatic model transformations.
− Furthermore, it appears that our proposal is almost an "inside-out" reversal of the approach in [10]: the authors proposed a path to transform a use-case diagram (also called a use-case map) into a CTT task model, which could be further refined. On the contrary, we propose to use CTT as a language to specify the behavior of use-cases.
4 Alignment with the UML Use-Case Metamodel
The metamodel of UML use-cases is given in Fig. 2. This is actually the metamodel of use-case maps (diagrams that show the relationships between the use cases for a system), since UML is non-prescriptive as to what a use case actually is, i.e. as to what the behavioral description of a use-case should be.
Fig. 2. The UML use-case metamodel (from [11])
There has been some picky debate amongst specialists over this very metamodel [15]; several of its flaws have been pointed out, and better alternative metamodels have been proposed. Although we mainly agree with these criticisms, we have chosen to stick with the "official" metamodel, since our goal is to be as close as possible to the standard. It should also be noticed that the ill-defined notion of a use-case specialization relationship, formerly available in the UML, has been removed in the current version of the standard. Starting from this "official" metamodel of UML use-cases, we want to cleanly integrate a metamodel of CTT, in order to express that a CTT task model is used to describe the behavior of a use-case, and to show that "include" and "extend" relationships can be expressed over a CTT model. The metamodel of CTT illustrated in Fig. 3 improves on the one we previously published in [1] in several ways:
− Our earlier metamodel used Ecore [14] as the metamodeling language. The one presented here uses UML class diagrams for the same purpose, which allows us to cleanly express its relationships with other elements in the UML metamodel. For instance, it expresses that the notion of Actor in CTT is identical to the same notion in UML use cases. In turn, this enriches CTT with the features available for UML actors (for instance, one can design a specialization hierarchy of actors with increasing responsibilities).
− It explicitly aligns CTT with UML use cases, bringing their structuring features ("include" and "extend" relationships) to CTT.
Fig. 3. A metamodel of CTT integrated in the metamodel of UML (the TaskAllocation enumeration comprises User, Application, Interaction and Abstract)
In Fig. 3, the classes with a white background are imported from the UML metamodel, and should be related to the identical ones in Fig. 2. The classes with a filled background are specific to CTT. Basically, a CTT task model (CttTask) is a tree of nodes (CttNode) which can be related by transitions (CttTransition) that feature one of the CTT temporal operators (Operator). The use-case metamodel of Fig. 2 states that a use-case can have several "extend" and "include" relationships (* cardinality). The cardinalities chosen in our metamodel of CTT in Fig. 3 should be carefully considered:
• Include relationship: a CttNode has 0..1 include relationships, meaning that any CTT node can optionally include another CttTask (which in turn is a tree of CttNodes). This models a classical hierarchical decomposition, which makes it easy to reuse a task model in another one, by simply including it at the proper node. It is natural to allow for a maximum of one inclusion, since otherwise the temporal combination of the included CttTasks would be left undefined.
• Extend relationship: an extend relationship is ternary, relating a base to an extension through one extensionPoint. In our metamodel, a CttNode has an optional extensionPoint, meaning that it can optionally be extended. However, a CttNode can have several extensions, discriminated by condition: BooleanExpression in metaclass Extend (cf. Fig. 2).
It is noteworthy that the metamodel in Fig. 3 conveys the same information as the initial use-case metamodel, and more. For instance, the set of Include relationships for a given use-case (which are actually the relationships appearing on use-case maps) can be computed by exploiting the Include and Extend relationships of Fig. 3 recursively, using the hierarchical composition relationship between CttNodes. The metamodel in Fig. 3 also relates to the domain model, albeit implicitly: the preConditions and postConditions elements in CttNode are meant to be Boolean expressions expressed in OCL (Object Constraint Language) operating on a domain model defined by a UML class diagram. As OCL itself is not part of the UML metamodel, but defined in a separate, language-oriented specification, the relationship between task and domain models is not apparent in Fig. 3, but is nonetheless fundamental.
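To summarize the structure just discussed, the CTT metamodel of Fig. 3 could be rendered as plain classes, as sketched below. The rendering is ours (the paper defines the metamodel as a UML class diagram, not as code), and OCL conditions are held as strings precisely because OCL sits outside the UML metamodel proper.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class TaskAllocation(Enum):           # the enumeration shown in Fig. 3
    USER = "User"
    APPLICATION = "Application"
    INTERACTION = "Interaction"
    ABSTRACT = "Abstract"

class Operator(Enum):                 # CTT temporal operators (a subset)
    ENABLING = ">>"
    CHOICE = "[]"
    CONCURRENCY = "|||"

@dataclass
class Extend:                         # ternary: base --extensionPoint--> extension
    condition: str                    # BooleanExpression, as in metaclass Extend
    extension: "CttTask"

@dataclass
class CttNode:
    name: str
    allocation: TaskAllocation
    pre_condition: Optional[str] = None    # OCL over the UML domain model
    post_condition: Optional[str] = None
    include: Optional["CttTask"] = None    # 0..1: hierarchical reuse of a task model
    extensions: List[Extend] = field(default_factory=list)  # 0..*, one point each
    children: List["CttNode"] = field(default_factory=list)

@dataclass
class CttTransition:
    source: CttNode
    target: CttNode
    operator: Operator

@dataclass
class CttTask:                        # a task model: a tree of nodes plus transitions
    root: CttNode
    transitions: List[CttTransition] = field(default_factory=list)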
5 Conclusion
We have presented our view of a design process where task and use-case modeling are tightly integrated during the requirements engineering phase. CTT task models are used to provide an unambiguous description of use-case behavior, complementing natural language scenarios. An integration of CTT into the UML metamodel has also been presented, which opens the way to automatic processing of requirement models to be used in subsequent phases of design and implementation, for instance for test sequence generation.
References
1. Bastide, R., Basnyat, S.: Error Patterns: Systematic Investigation of Deviations in Task Models. In: Coninx, K., Luyten, K., Schneider, K.A. (eds.) TAMODIA 2006. LNCS, vol. 4385, pp. 109–121. Springer, Heidelberg (2007)
2. Cockburn, A.: Writing Effective Use Cases. Addison-Wesley Professional, Reading
3. Constantine, L., Campos, P.: CanonSketch and TaskSketch: innovative modeling tools for usage-centered design. In: OOPSLA 2005: Companion to the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 162–163. ACM, New York (2005)
4. Constantine, L.L., Lockwood, L.A.D.: Structure and Style in Use Cases for User Interface Design. Constantine & Lockwood, Ltd.
5. Limbourg, Q., Pribeanu, C., Vanderdonckt, J.: Towards Uniformed Task Models in a Model-Based Approach. In: Johnson, C. (ed.) DSV-IS 2001. LNCS, vol. 2220, pp. 164–182. Springer, Heidelberg (2001)
6. Luyten, K., Clerckx, T., Coninx, K., Vanderdonckt, J.: Derivation of a dialog model from a task model by activity chain extraction (2003)
7. Montero, F., López-Jaquero, V., Vanderdonckt, J., González, P., Lozano, M.D., Limbourg, Q.: Solving the mapping problem in user interface design by seamless integration in IdealXML. In: Gilroy, S.W., Harrison, M.D. (eds.) DSV-IS 2005. LNCS, vol. 3941, pp. 13–15. Springer, Heidelberg (2006)
8. Mori, G., Paternò, F., Santoro, C.: CTTE: Support for developing and analyzing task models for interactive system design. IEEE Trans. Software Eng. 28(8), 797–813 (2002)
9. Jardim Nunes, N., Falcão e Cunha, J.: Towards a UML profile for interaction design: The Wisdom approach. In: Evans, A., Kent, S., Selic, B. (eds.) UML 2000. LNCS, vol. 1939, pp. 101–116. Springer, Heidelberg (2000)
10. Nóbrega, L., Jardim Nunes, N., Coelho, H.: Mapping ConcurTaskTrees into UML 2.0. In: Gilroy, S.W., Harrison, M.D. (eds.) DSV-IS 2005. LNCS, vol. 3941, pp. 237–248. Springer, Heidelberg (2006)
11. Object Management Group: Unified Modeling Language (UML), version 2.0. Technical report, OMG (2005), http://www.omg.org/technology/documents/formal/uml.htm
12. Paternò, F.: Towards a UML for interactive systems. In: Nigay, L., Little, M.R. (eds.) EHCI 2001. LNCS, vol. 2254, pp. 7–18. Springer, Heidelberg (2001)
13. Rosson, M.B.: Integrating development of task and object models. Commun. ACM 42(1), 49–56 (1999)
14. Stahl, T., Völter, M.: Model-Driven Software Development. Wiley, Chichester (2006)
15. Williams, C., Kaplan, M., Klinger, T., Paradkar, A.M.: Toward engineered, useful use cases. Journal of Object Technology 4(6), 45–57 (2005)
Model-Based Specification and Validation of User Interface Requirements
Birgit Bomsdorf (1) and Daniel Sinnig (2)
(1) Department of Applied Computer Science, Fulda University of Applied Sciences, Germany
(2) Department of Computer Science and Software Engineering, Concordia University, Montreal, Quebec, Canada
birgit.bomsdorf@hs-fulda.de, d_sinnig@cs.concordia.ca
Abstract. Core functional requirements as captured in use case models are too high-level to be meaningful to user interface developers. In this paper we present how use case models can be systematically refined into detailed user interface requirements specifications, captured as task models. We argue that the transition from functional to UI-specific requirements is a semi-formal step which necessitates the experience, skills and domain knowledge of the requirements engineer. In order to facilitate the transition we sketch out an integrated development methodology for use case and task models. Since the engineer is also responsible for establishing conformity between use cases and task models, we also show how this validation can be supported by means of the WTM task model simulator.
Keywords: Requirements specification, use case model, task model, model simulation.
specifications, but does not take into account task model specifications. Sinnig et al. [5] have defined a common semantic model for use case and task models, and propose a formal, but static, refinement relation between the two artifacts. We firmly believe that the requirements engineer should not be exempted from deciding whether or not a task model faithfully refines the use case it is developed from. On the contrary, finding the answer often depends on domain knowledge and properties specific to a project. Often refinement validation cannot be automated but has to be carried out manually by the requirements engineers themselves. In such cases, simulation and animation have proven to be powerful tools, assisting the requirements engineer in assessing the validity and accuracy of development artifacts [11, 12, 13]. Based on the discussion above, the contributions of this paper are twofold: (1) We propose a systematic and integrated development process according to which UI requirements are derived as a logical progression from a functional requirements specification. (2) We demonstrate how our tool, the WTM Simulator [12], assists the requirements engineer in verifying whether a task model is a valid refinement of a given use case model. The remainder of this paper is organized as follows: In Section 2, we sketch out, from a generic point of view, the key characteristics of the development process we propose. Sections 3 and 4 define use case models and task models as means for capturing functional and UI requirements, respectively. In Section 5 we introduce the WebTaskModel (WTM) approach and present its application to verifying conformity between use case and task models. Finally, in Section 6 we conclude and provide an outlook on future avenues. Related work is discussed throughout the paper.
2 Systematic and Integrated Development Process
The basic idea of our current work on a systematic and integrated development process is depicted in Fig. 1. Use cases are used to capture the bare functional requirements of the system, which are afterwards refined into UI-specific requirements by means of a set of task models. Both use cases and task models belong to the family of scenario-based notations, and as such capture sets of usage scenarios of the system. In theory, both notations can be used to describe the same information. In practice, and in our approach, however, use case models capture requirements at a higher level of abstraction whereas task models are more detailed. Ideally, the functional requirements captured in use cases are independent of a particular user interface [7, 14], whereas the refined requirements captured in the task models take into account the specificities of a particular type of user interface and the characteristics of a detailed user role. For example, if the application supports multiple UIs (e.g., Web UI, GUI, mobile, etc.) and multiple user types (e.g., novice user and expert user), then the use case model is instantiated into several task models: one for each "type" of user interface and user. In modern software engineering, the development lifecycle is divided into a series of iterations. Within each iteration, a set of disciplines and associated activities are performed while the resulting artifacts are incrementally refined and perfected. The development of use case and task models is no exception to this rule. On the one hand, ongoing prioritization and filtering activities during the early stages of development will
gradually refine the requirements captured in the use case model. On the other hand, a task model is best developed in a top-down manner, where a coarse-grained task model is gradually refined into a more detailed or more restricted task model. In both cases, it is important to ensure that the refining model is a proper refinement of its respective base model (and all its predecessor models). Validation is an important step of a model-based approach so as to avoid ill-defined or misbehaving models impacting the final design.
Fig. 1. From Functional Requirements to UI Requirements
To illustrate the introduced development process, we use an example that is based on a scenario in which a new web-based Invoice Management System (IMS) is to be developed. It should feature (among others) the following functionalities: “Order Product”, “Cancel Order”, “View Orders”, and “Ship Order”. All the functionalities shall be accessible through a Web UI and should support two user types: New Customer and Registered Customer. As a first step, a functional requirements specification in the form of a use case model is developed, which is shown next.
3 Functional Requirements Specification: Use Cases
Use cases were introduced in the early 90s by Jacobson [15]. He defined a use case as a "specific way of using the system by using some part of the functionality." Modern popularizations of use case models are often attributed to Cockburn [14]. Use case modeling is making its way into mainstream practice as a key activity in the software development process (e.g. the Rational Unified Process [16]), and there is accumulating evidence of significant benefits to customers and developers [17]. A use case model captures the "complete" set of use cases for an application, where each use case specifies possible usage scenarios for a particular functionality offered by the system. Every use case starts with a header section containing various properties (e.g. primary actor, goal, goal level, etc.). The core part of a use case is its main success scenario, which indicates the most common way in which the primary actor can reach his/her goal by using the system. A use case is completed by specifying the use case extensions. These extensions define alternative scenarios which may or may not lead to the fulfillment of the use case goal. An example use case is given in Fig. 2. It captures the interactions for the "Order Product" functionality of the previously mentioned Invoice Management System (IMS). The main success scenario of the use case describes the situation in which the primary actor directly accomplishes his/her goal of ordering a product. The
extensions specify alternative scenarios which may (3a, 6a) or may not (7a) lead to the abandonment of the use case goal. In the next section we show how the "Order Product" use case is refined by UI-specific task models.
Use Case: Order Product
Primary Actor: Customer
Goal: Customer places an order for a specific product.
Level: User-goal
Main Success Scenario:
1. Primary actor browses the product inventory and selects a specific product for purchase.
2. Primary actor specifies the desired quantity.
3. System validates the availability of the product quantity and displays purchase summary.
4. Primary actor provides/validates payment and shipping information.
5. System prompts primary actor to accept the terms and conditions and to confirm the order.
6. Primary actor accepts and confirms.
7. System has the payment authorization unit carry out payment and finalizes order.
8. System confirms and invoices the order.
9. Use case ends successfully.
Extension Points:
3a. The desired product is not available:
3a1. System notifies primary actor that product in desired quantity is not available.
3a2. Use case ends unsuccessfully.
6a. The primary actor cancels the use case:
6a1. Use case ends unsuccessfully.
7a. The payment information is invalid:
7a1. System notifies customer that payment information provided is invalid.
7a2. Use case resumes at step 4.
Fig. 2. “Order Product” Use Case
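For later reference, the use case of Fig. 2 can also be written down as structured data rather than prose; the encoding below is ours and merely foreshadows the machine-readable form that Section 5 derives.

ORDER_PRODUCT = {
    "name": "Order Product",
    "primary_actor": "Customer",
    "goal": "Customer places an order for a specific product.",
    "level": "User-goal",
    "main_success_scenario": {
        "S1": "browse inventory and select a product",
        "S2": "specify the desired quantity",
        "S3": "validate availability and display purchase summary",
        "S4": "provide/validate payment and shipping information",
        "S5": "prompt to accept terms and confirm the order",
        "S6": "accept and confirm",
        "S7": "carry out payment and finalize order",
        "S8": "confirm and invoice the order",
    },
    "extensions": {
        "3a": "product unavailable: notify actor (3a1), end unsuccessfully (3a2)",
        "6a": "actor cancels: end unsuccessfully (6a1)",
        "7a": "payment invalid: notify actor (7a1), resume at step 4 (7a2)",
    },
}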
4 Refined UI Requirements Specification: Task Models
Task modeling is by now a well-understood technique supporting user-centered UI design. The resulting specification is the primary input to the UI design stage in most HCI development approaches. Since we use task models to refine the raw requirements specification given by use cases, several task specifications may be defined for a single use case, one for each type of user interface and/or user type. A task model describes how users will be able to achieve their goals by means of the future application. Furthermore, it also indicates how the system will support the involved (sub)tasks. Several approaches to defining such models exist (e.g., CTT [13], TaO Spec [18], MAD [19] and VTMB [11]). The WebTaskModel (WTM) used here is a further development of our previous work [11] to account more appropriately for characteristics of interactive web applications. The enhancements, however, are applicable to conventional interactive systems as well. In the following we are not going to point out web-specific details, but introduce only those extensions that are relevant to this paper. A more comprehensive overview of WTM can be found in [12, 20]. Fig. 3 shows a subset of a task model refining the "Order Product" use case described above. The task model was specifically developed for a Web UI and the user type New Customer. As usual, the task hierarchy shows the decomposition of a task into its subtasks, which can be of different task types. In the specification of refined UI
requirements we distinguish between cooperation tasks, denoting pieces of work that are performed by the user in conjunction with the application, user tasks, denoting the user parts of the cooperation performed without system intervention, and system tasks, defining pure system parts (each task type is represented by its own symbol in Fig. 3). Abstract tasks, similar to CTT [13] and MAD [19], are compound tasks whose subtasks belong to different task categories.
Fig. 3. “Order Product” Task Model for the role New Customer
The order of task execution is given by temporal relations. In the notation used in the figure, temporal relations are denoted by symbols: one operator defines a selection of subtasks, while >> denotes tasks that are to be performed strictly one after the other in the specified order. The partial task model shown in Fig. 3 specifies the task order product, which is decomposed into the subtasks search for product (according to step S1 of the use case), specify quantity (step S2), feedback (S3 and S3a1) and payment (steps S4 – S8). The task feedback is decomposed into the subtasks display summary, for which we define the precondition C1: product quantity available, and display prod. unavailable, for which we define the precondition NOT C1. Both conditions are derived from the use case extension 3a. Please note that the conditions are not shown in the diagram but were assigned by means of the task property window of the WTM editor (see [20]). The task display prod. unavailable is a so-called stop task. It denotes the premature termination of the scenario and is the task model counterpart to use case step S3a1. In addition to the task model for the role New Customer, a task model for a Registered Customer is compiled. It differs from the presented task model in terms of how the payment task is broken down. Instead of having to provide the shipping and payment information in each case, a registered customer has the option to alter shipping or payment data or to entirely skip the involved subtasks. As seen, different sub-roles lead to slightly different UI requirements. If different UI types were to be supported, the use case model would also be refined into device-specific task models.
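The task tree of Fig. 3 (role New Customer) could be sketched in the same spirit with a small ad hoc Task class; the encoding and field names are ours, since WTM itself is a graphical notation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    name: str
    pre: Optional[str] = None      # precondition, e.g. "C1"
    stop: bool = False             # stop task: premature scenario termination
    op: Optional[str] = None       # temporal relation to the next sibling
    children: List["Task"] = field(default_factory=list)

order_product = Task("order product", children=[
    Task("search for product", op=">>"),            # use case step S1
    Task("specify quantity", op=">>"),              # step S2
    Task("feedback", op=">>", children=[            # steps S3 / S3a1
        Task("display summary", pre="C1"),
        Task("display prod. unavailable", pre="NOT C1", stop=True),
    ]),
    Task("payment"),                                # steps S4-S8, not expanded here
])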
5 Tool Supported Validation
As mentioned above, use case models capture requirements at a higher level of abstraction, whereas task models are more detailed, taking into account the specificities of a particular type of user interface and the characteristics of a detailed user role. The question arises whether or not a task model faithfully refines the use case it is based on. The requirements engineer is not exempted from deciding this question, as finding the answer often depends on domain knowledge and project details. In the following we demonstrate how the tool WTM Simulator [12] can be used to check whether a task model is behaviorally equivalent to a given use case. Firstly, use cases are transformed into a formal (machine-readable) representation based on finite state machines. In the WTM approach, task models are represented by a set of task state machines, which are used within the final application as part of the UI controller [21]. Task state machines are also used to simulate task models within the development steps. In the work reported here, a formal correspondence between use case and task models is established to simulate their execution in conjunction. This will be presented by means of a concrete simulation example.
5.1 Mapping Use Cases to UC-FSM
At first, use cases are transformed into a finite state machine representation called UC-FSM. A UC-FSM is a labeled, directed, connected graph, where nodes denote states and edges represent state transitions. In a UC-FSM the execution of a step is denoted by a transition; the transition labels serve as references to the corresponding steps in the original use case description. We believe that UC-FSMs easily and intuitively capture the nature of use cases. As use cases are typically captured in purely narrative form, the derivation of the use case graph will be a manual activity. The composition of the use case graph from a given use case depends on the flow constructs implicitly or explicitly entailed in the use case. Examples of such flow constructs are: jumps (e.g. use case resumes at step X), sequencing information (e.g. the numbering of use case steps), or branches to use case extensions. Concrete details on the mapping process, as well as a slightly more elaborate formal model, can be found in [22].
Fig. 4. Use Case FSM for “Order Product” Use Case
Fig. 4 depicts the corresponding UC-FSM for the "Order Product" use case. As shown, all the steps of the use case are also present in the UC-FSM. Note that starting from the states {quant.selected}, {awaiting confirmation} and {confirmed}, two transitions
are defined, denoting the execution of steps in the main success scenario and, alternatively, the execution of steps defined in the corresponding extensions.

5.2 Task State Machine and UC-FSM Assignment

In WTM each task formally possesses a state machine describing a generic task life cycle (see Fig. 5). For each task the state machine can be extended to specify application-specific task behavior. The rules that are used for this purpose are of the form task.task-state.task-event → action, where task denotes the task whose behavior is extended, and task-state and task-event denote the state and corresponding trigger event upon which the action is to be performed. In the work presented in this paper, this "extension" technique is used to combine task state machines with the UC-FSM. The objective is to specify dependencies between task executions and use case steps.
State     | Meaning
initiated | if all preconditions are fulfilled the task can be started
skipped   | the task is omitted
running   | denotes the actual performance of the task and of its subtasks, if applicable
suspended | the task is interrupted
completed | marks a successful task execution

Fig. 5. Generic Task State Machine (states initiated, skipped, running, suspended, completed, and terminated, connected by the transitions Start, Skip, Restart, Suspend, Resume, End, and Abort)
In order to run a conformance simulation we extend the various task state machines such that they generate the trigger events needed to run the UC-FSM. The specification of the extension rules depends on which tasks are meant to be a refinement of which use case step. Due to the aforementioned difference in levels of abstraction, one use case step is often refined by several tasks. Table 1 (columns 1 and 2) depicts the refinement mapping between use case steps and tasks. Note that abort order product is added since S3a1 is a stop task. The mappings defined by the row of step S4 result from the differentiation of the role Customer in the task model. Column 3 of Table 1 depicts the state of the task state machine responsible for sending the corresponding use case event to the UC-FSM. Examples of rules resulting from Table 1 are:

display summary.completed.on_entry → send S3 to Use Case product order
display prod. unavailable.completed.on_entry → send S3a1 to Use Case product order
                                             → send abort to task order product
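Under the assumptions of the previous sketches, these two example rules could be wired up roughly as follows (again an illustration of the idea, not the WTM rule engine itself):

```python
# Sketch: extension rules task.task-state.task-event -> action, realized as
# callbacks fired when a task state machine enters a state.
class TaskSM:
    def __init__(self, name):
        self.name = name
        self.state = "initiated"
        self.rules = {}                            # {(state, event): [actions]}

    def add_rule(self, state, event, action):
        self.rules.setdefault((state, event), []).append(action)

    def enter(self, new_state):
        self.state = new_state
        for action in self.rules.get((new_state, "on_entry"), []):
            action()

display_summary = TaskSM("display summary")
display_unavailable = TaskSM("display prod. unavailable")
order_product_task = TaskSM("order product")

# display summary.completed.on_entry -> send S3 to use case product order
display_summary.add_rule("completed", "on_entry",
                         lambda: order_product_fsm.fire("S3"))
# display prod. unavailable.completed.on_entry -> send S3a1, then abort order product
display_unavailable.add_rule("completed", "on_entry",
                             lambda: order_product_fsm.fire("S3a1"))
display_unavailable.add_rule("completed", "on_entry",
                             lambda: order_product_task.enter("terminated"))
```

With the UC-FSM in the state quant.selected, display_summary.enter("completed") would move it to prod. available (scenario 1 in Fig. 6), while display_unavailable.enter("completed") would move it to prod. unavailable and terminate order product (scenario 2).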
Finally, we note that the table is created manually by the requirements engineer. In our experience, if the task model was specifically developed based on a given use case specification (as suggested in this paper), the corresponding refinement mappings are clearly defined and hence the construction of the table is a straightforward activity.
Table 1. Refinement Mapping between Use Case Steps and Tasks

Step | Task                                       | Task State
S1   | search for a product                       | completed
S2   | specify quantity                           | completed
S3   | display summary                            | completed
S3a1 | display prod. unavailable                  | completed / abort order product
S4   | New Customer: provide payment information  | completed
     | Registered Customer: alter data            | completed or skipped
S5   | prompt confirmation                        | completed
…    | …                                          | …
5.3 WTM Simulation Tool and Example
In [12] we presented a tool that supports the developer in validating task, role, and task-object models and their behavioral interrelations by means of model simulation. In the tool each task is represented by an icon showing static and dynamic information about the task (such as the task type, temporal relations, and the current state). A context menu attached to each task allows triggering one of the events that are defined by the generic task state machine and are currently valid. The WTM simulator provides the software engineer with different areas implementing several views on the models, e.g., showing the hierarchical task structure, listing all tasks that can be started or ended at the current point in time, respectively, and presenting task objects. Some examples are shown by the screenshots in Fig. 6. Here, the object area shows only USE CASE product order and its state changes resulting from task execution. Please note that modeling use cases as objects is only a workaround, since use case extensions are not yet implemented in the WTM simulator.

In the upper part of Fig. 6 the UC-FSM is in the state quant.selected. Since the condition C1 is fulfilled (see condition area), the task display summary can be performed at this point in time. After its completion the UC-FSM state changes to prod. available and provide shipping information is enabled (indicated by the arrow in Fig. 6). The second scenario shows the unsuccessful run in the case of NOT C1 (defined by C2): once the display task is executed, order product is terminated (thus the startable leaf task area is empty) and the UC-FSM switches to the state prod. unavailable.

During simulation the requirements engineer can check whether or not each task sequence allowed by the task model is a valid scenario according to the use case specification, and vice versa. Furthermore, the simulator also allows one to observe how the steps of a scenario under investigation affect task objects and domain objects, respectively. As in the case of the USE CASE object, the simulator tool represents them in the object area, showing their names, classes, and their manipulations in terms of state changes. Similarly, but not depicted in Fig. 6, a role area shows all defined roles, allowing the investigation of role changes resulting from task execution as well as the disabling and enabling of tasks caused by role changes. For example, the requirements engineer can check the validity of a user registration scenario (by which the role has to change from New Customer to Registered Customer) and its interplay with the use cases and task models, respectively, defined for each role.
Fig. 6. Simulating Task and Use Case Executions (upper part: scenario 1; lower part: scenario 2)
6 Conclusion

In this paper we presented our current work towards an integrated development methodology for the derivation of UI requirements from high-level functional requirements. The development approach reported here consists of two basic steps. First, a use case model is iteratively created to capture core application requirements. Next, the use case model is successively refined into a set of task models. While use cases capture "raw" functional requirements which are independent of a particular user interface, task models capture refined UI-specific requirements which take into account not only the specificities of a particular type of user interface but also the characteristics of a detailed user role. As a result, one use case is typically refined by several task models, one for each UI type or user role. The focus of this paper was on the systematic development of use case and task models. Our approach, however, also takes user roles and involved objects into account; their description has been omitted for the sake of conciseness.

The tool WTM Simulator was used to check conformity between a task model and a given use case model. In particular, we demonstrated how use cases can be translated into a state machine representation and formally combined with the task state machine approach of WTM, which in turn is used as input to the simulator. The results of the simulation guide and assist the developer in deciding whether the task model is a valid refinement of the underlying use case.

The research reported in this paper is the first offspring of a larger project, the goal of which is the establishment of a model-driven UI engineering framework encompassing all phases of the software lifecycle and the involved models. In our next working step we will elaborate the refinement of the functional requirements, e.g., by means of UML activity diagrams. We also aim to further extend the WTM Simulator such that it allows for direct input of structured textual use cases and (semi-)automatically generates refinement mappings between use case steps and tasks.
References

1. Kazman, R., Gunaratne, J., Jerome, B.: Why Can't Software Engineers and HCI Practitioners Work Together? In: Proc. of HCI Intern., Crete, Greece, pp. 504–508 (2003)
2. Ferre, X., Juristo, N., Windl, H., Constantine, L.: Usability basics for software developers. IEEE Software 18(1), 22–29 (2001)
3. Kazman, R., Bass, L., John, B.: Bridging the gaps between software engineering and human-computer interaction. In: Workshop at ICSE 2004, Scotland, UK (2004)
4. Sutcliffe, A.: Convergence or Competition between Software Engineering and Human Computer Interaction. In: Seffah, A., Desmarais, M.C., Metzger, M. (eds.) Human-Centered Software Engineering – Integrating Usability in the Software Development Lifecycle, pp. 71–83. Springer, Heidelberg (2005)
5. Sinnig, D., Chalin, P., Khendek, F.: Common Semantics for Use Cases and Task Models. In: Proc. of Integrated Formal Methods, Oxford, England, pp. 579–598 (2007)
6. Clemmensen, T., Norbjerg, J.: Separation in Theory – Coordination in Practice. In: Workshop Bridging the Gap between Software Engineering and HCI, Portland (2003)
7. Constantine, L.L., Lockwood, L.A.D.: Software for Use: A Practical Guide to the Models and Methods of Usage-Centered Design. Addison-Wesley, Reading (1999)
8. Constantine, L., Biddle, R., Noble, J.: Usage-Centered Design and Software Engineering: Models for Integration. In: Workshop Bridging the Gaps Between SE and HCI, Portland (2003)
9. Kujala, S.: Linking User Needs and Use Case-Driven Requirements Engineering. In: Human-Centered Software Engineering – Integrating Usability in the Development Process, pp. 113–125 (2005)
10. Paternò, F.: Towards a UML for interactive systems. In: Nigay, L., Little, M.R. (eds.) EHCI 2001. LNCS, vol. 2254, pp. 7–18. Springer, Heidelberg (2001)
11. Biere, M., Bomsdorf, B., Szwillus, G.: Specification and Simulation of Task Models with VTMB. In: Proc. of Computer-Human Interaction Conference, pp. 1–2 (1999)
12. Bomsdorf, B.: The WebTaskModel Approach to Web Process Modelling. In: Proc. of Task Models and Diagrams for User Interface Design, Toulouse, France, pp. 240–253 (2007)
13. Paternò, F.: Model-Based Design and Evaluation of Interactive Applications. Springer, Heidelberg (2000)
14. Cockburn, A.: Writing Effective Use Cases. Addison-Wesley, Boston (2001)
15. Jacobson, I.: Object-Oriented Software Engineering: A Use Case Driven Approach. ACM Press (Addison-Wesley), New York (1992)
16. Larman, C.: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development, 3rd edn. Prentice Hall PTR, Englewood Cliffs (2004)
17. Merrick, P., Barrow, P.: The Rationale for OO Associations in Use Case Modelling. Journal of Object Technology 4(9), 123–142 (2005)
18. Dittmar, A., Forbrig, P., Stoiber, S., Stary, C.: Tool Support for Task Modelling – A Constructive Exploration. In: Proc. of DSV-IS, Hamburg, Germany, pp. 59–76 (2004)
19. Sebillotte, S., Scapin, D.L.: From users' task knowledge to high-level interface specification. International Journal of Human-Computer Interaction 6, 1–15 (1994)
20. Bomsdorf, B.: Modelling Interactive Web Applications: From Usage Modelling towards Navigation Models. In: Proceedings of the 6th International Workshop on Web-Oriented Software Technologies – IWWOST 2007, Como, Italy, pp. 194–208 (2007)
21. Betermieux, S., Bomsdorf, B.: Finalizing dialog models at runtime. In: Baresi, L., Fraternali, P., Houben, G.-J. (eds.) ICWE 2007. LNCS, vol. 4607, pp. 137–151. Springer, Heidelberg (2007)
22. Sinnig, D., Chalin, P., Khendek, F.: LTS Semantics for Use Case Models. In: Proceedings of ACM SAC 2009, Honolulu, HI (to appear, 2009)
A Position Paper on 'Living Laboratories': Rethinking Ecological Designs and Experimentation in Human-Computer Interaction

Ed H. Chi

Palo Alto Research Center, Augmented Social Cognition Group, 3333 Coyote Hill Road, Palo Alto, CA 94304 USA
echi@parc.com
Abstract. HCI has long moved beyond the evaluation setting of a single user sitting in front of a single desktop computer, yet many of our fundamentally held viewpoints about evaluation continue to be ruled by outdated biases derived from this legacy. We need to engage with real users in 'Living Laboratories', in which researchers either adopt or create functioning systems that are used in real settings. These new experimental platforms will greatly enable researchers to conduct evaluations that span many users, places, times, locations, and social factors in ways that were unimaginable before.

Keywords: HCI, Evaluation, Ecological Design, Living Laboratories, Methodology, Web Services.
slightly disillusioned with artificial intelligence research, yet believed computers were great tools for modeling and understanding human cognition. The aim of augmenting human cognition has remained a core value for Human-Computer Interaction research. With this aim, during the formation of the field, the need to establish HCI as a science pushed us to adopt methods from psychology, both because it was convenient and because the methods fit the needs.

The HCI field's rise paralleled the rise of the notion of personal computing: the idea that each person would have one computer at her command. Systems were evolving from many users using a single system to a single user multitasking with her own desktop computer. The costs of these systems forced researchers to think about how users would most productively accomplish knowledge work. The metaphor of the desktop, files, windows, and graphical icons on bitmapped displays arrived naturally. The study of how users would respond to icons flashing on the screen, and how users would move a pointing device like the mouse [2] to move a file from one location to the next, paralleled the stimulus-and-response experiments that psychologists were already routinely conducting. Fitts' law [1, 2], models of human memory [7], and cognitive and behavioral modeling methods like GOMS [1] enabled HCI researchers and practitioners to model a single user interacting with a single computer.
2 Outdated Evaluative Assumptions

Of course, the world has changed. Trends in social computing as well as ubiquitous computing have pushed us to consider research methodologies that are very different from those of the past. In many cases, we can no longer assume:

Only a single display. Users will pay attention to only one display and one computer. Much of fundamental HCI research methodology assumes that the singular occupation of the user is the display in front of them. Of course, this is no longer true. Not only do many users already use multiple displays, they also use tiny displays on cell phones and iPods as well as peripheral displays. Matthews et al. studied the use of peripheral displays, focusing particularly on glanceability, for example. Traditional HCI and psychological experiments typically force users to attend to only one display at a time, often neglecting the purpose of peripheral display designs.

Only knowledge work. Users are performing the task as part of some knowledge work. The problem with this assumption is that non-information-oriented work, such as entertainment applications and social networking systems, is often done without explicit goals in mind. With the rise of Web 2.0 applications and systems, users are often on social systems to kill time, learn the current status of friends, and serendipitously discover what might capture their interests.

Isolated worker. Users are performing some task by themselves. Much of knowledge work turns out to be quite collaborative, perhaps more so than first imagined. The traditional view of HCI assumed the construction of a single report by a single individual that is needed by a hierarchically organized firm. Generally speaking, we have come to view such assumptions with contempt. Information work, especially work done by highly paid analysts, is highly collaborative. Only the highly automated tasks that are
routine and mundane are done in relative isolation. Information workers excel at exception handling, which often requires the collaboration of many departments in different parts of the organizational chart.

Stationary worker. User location is stationary, and the computing device is stationary. A mega-trend in information work is the speed and mobility with which work is done. Workers are geographically dispersed, making collaboration across geographical boundaries and time zones critical. As part of this trend, work is often done on the move, or in the air while disconnected. Moreover, situation awareness is often accomplished via email clients such as Blackberries and iPhones. Many estimates now suggest that more people already access the Internet on their mobile phones than on desktop computers. This has certainly been the trend in Japan, a bellwether of mobile information needs.

Task duration is short. Users are engaged with applications on time scales measured in seconds and minutes. While information work can be divided into many smaller chunks of subgoals that can be analyzed separately, we now realize that many user needs and work goals stretch over long periods of time. User interests in topics as diverse as news on the latest technological gadgets and snow reports for snowboarding need to be supported over periods of days, weeks, months and even years. User engagement with web applications is often measured over much longer periods of time than in more traditional psychological experiments, which were geared toward understanding hand-eye coordination in single desktop application performance. For example, Rowan and Mynatt studied peripheral family portraits in the digital home over a year-long period and discovered that behavior changed with the seasons [14].

The above discussion points to how, as a field, HCI researchers have slowly broken out of the mold in which we were constrained. Increasingly, evaluations are done in situations in which there are just too many uncontrolled conditions and variables. Artificially created environments such as in-lab studies are only capable of telling us about behaviors in constrained situations. In order to understand how users behave across varied times, places, contexts and other situations, we need to systematically re-evaluate our research methodologies.
3 Re-thinking Evaluations

Fundamentally, traditional HCI research is bursting at the seams in two different ways: (1) ubiquitous computing research is challenging the notion of personal computing in front of a desktop, looking at computation that is embedded in the environment as well as computation done with ever more powerful devices that can be taken along while mobile [3, 4]; (2) social computing research is simultaneously challenging the notion of computing systems designed for the individual, instead of for a group or community [6, 12]. Both trends have required re-thinking our evaluation methodologies. Traditional CSCW research has already drawn on qualitative methodologies from the social sciences, including field observations and interviews, diary studies, survey methods, as well as focus groups and direct participation. Ubicomp, on the other hand, has used
a mixture of methods, but has more readily examined actual deployments with real users in the field.

In either case, it may be time for us to fundamentally re-think how HCI researchers ought to perform evaluations, as well as the goals of those evaluations. Since HCI systems are increasingly designed not for a single person but for a whole group, we need research that augments not just human intelligence but also group intelligence and social intelligence. Indeed, a natural extension of research in augmenting human intellect is the development of technologies that augment social intelligence, led by research in the Social Web and Web 2.0 movements. Traditional CSCW research has already studied the needs of coordination for a group and, to some extent, a community of practice. Many researchers are now conducting research in a social context, in which factors are less easy to isolate and control in the lab. Some research in the past might have treated variations in social contexts as part of the noise of the overall experiment, but this is clearly unsatisfactory, since larger subject pools are then necessary to overcome the loss in the power of the experiment. Moreover, we now know that many social factors follow distributions that are not normally distributed, making the prediction of individual factors in greatly varying social situations difficult, if not impossible.

Since users now interact with computing systems in varied ubiquitous contexts, ecological validity is often much more important than studying factors in isolation. In ubicomp applications, for example, productivity measurements are often not the only metrics that are important. Adoption of mobile applications, for instance, is now often cited as evidence of the usefulness of an application. One might argue that if using an application results in no productivity increase, then the fact that there is adoption of the application is irrelevant. However, this view is short-sighted, because the opposite is also true: if there is a productivity increase from using the application but there is no adoption (perhaps due to ease-of-use issues, for example), then it is also unclear what benefit the application will ultimately bring. Obviously, the best situation is to have both productivity improvements and real adoption. However, research resource constraints often conspire against achieving both. Interestingly, academic research tends to focus on the former rather than the latter, increasing the perceived gulf between the academics' ivory tower and the trenches of the practitioners.

An example that illustrates this gulf is the set of studies around color copiers and printers. It has been circulated here at PARC that researchers had studied the need for color output from copiers and printers, and had concluded that there was either negligible or no productivity increase from using color. Cost-benefit analysis showed that black-and-white copiers were often just as good and more economical than color copiers in the majority of cases. While it is unclear whether the studies took into account that increased use of color in various media might drive future demand for and utility of color systems, what is clear now is that the adoption of color copiers and printers occurred independent of productivity studies.
If what matters in industry is the adoption of technology, while academic research remains focused on measurements of productivity, we will never bring the two communities together, and technology transfer will forever remain challenging.
4 Evaluations Using 'Living Laboratories'

The Augmented Social Cognition group has been a proponent of the idea of the 'Living Laboratory' within PARC. The idea is that, in order to bridge the gulf between academic models of science and practical research, we need to conduct research within living laboratories. Many of these living laboratories are real platforms and services that researchers build and maintain, and, just like Google Labs or beta software, they may remain somewhat unreliable and experimental, yet useful and real. The idea is to engage real users in ecologically valid situations, while gathering data and building models of social behavior.

Looking at two different dimensions in which HCI researchers could conduct evaluations, one dimension is whether the system is under the control of the researcher or not. Typically, computing scientists build systems and want them evaluated for effectiveness. The other dimension is whether the study is conducted in the laboratory or in the wild. These two dimensions interact to form four different ways of conducting evaluations:

(1) Building a system, and studying it in the laboratory. This is the most traditional approach in HCI research and the one that is typically favored by CHI conference paper reviewers. The problems with this approach are that it is extremely time-consuming and that its experiments are not always ecologically valid. As mentioned before, it is extremely difficult, if not impossible, to design laboratory experiments for many social and mobile applications that are ecologically valid.

(2) Not building a system (but adopting one), and still studying it in the laboratory. For example, this is possible by taking existing systems, such as Microsoft Word and iWork Pages, and comparing the features of these two systems.

(3) Adopting an existing system, and studying it in the wild. The advantage here is to study real applications that are being used in ecologically valid situations. The disadvantage is that findings are often not comparable, since factors are harder to isolate. On the other hand, real findings can be immediately applied to the live system. The impact of the research is real, since adoption issues are already removed. As illustrated below, we have studied Wikipedia usage in detail using this method.

(4) Building a system, releasing it, and studying it in the wild. A well-publicized use of this approach is Google's A/B testing. According to Google, A/B testing allowed them to finely tune the Search Engine Result Pages (SERPs). Some details about this kind of online A/B experiment have been documented [8]. For example, how many search results a page should contain was studied carefully by varying the number across a great number of users. Because the subject pool is large, Google can say with some certainty which design is better on their running system. A major disadvantage of this approach is the effort and resources it takes to build and study such systems. However, for economically interesting applications
such as Web search engines, the tight integration between system and usage actually shortens the time to innovate between product versions.

Of these variations, (3) and (4) are what we consider to be 'Living Laboratory' studies.
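As a rough illustration of the mechanics behind such online experiments (our own sketch, not Google's actual infrastructure), all that is strictly needed is a stable assignment of users to variants and a per-variant metric log:

```python
# Minimal A/B sketch: hash each user id to a stable bucket, vary one design
# parameter per bucket, and compare an engagement metric across buckets.
import hashlib
from collections import defaultdict

RESULTS_PER_PAGE = {"A": 10, "B": 20}          # the design parameter under test

def variant_for(user_id: str) -> str:
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "A" if digest % 2 == 0 else "B"

clicks = defaultdict(list)

def log_interaction(user_id: str, clicked_result: bool):
    clicks[variant_for(user_id)].append(1.0 if clicked_result else 0.0)

def click_through_rates():
    return {v: sum(xs) / len(xs) for v, xs in clicks.items() if xs}
```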
5 Examples of Living Laboratory Style Research

Here we illustrate how to conduct Living Laboratory studies with some examples.

GroupLens and MovieLens. First, an example of building a real system, releasing it, and studying it in the wild is the seminal work of the GroupLens [9] research group at the University of Minnesota. GroupLens was first created to deal with information overload, particularly the high volume of traffic in Usenet news. In this way, GroupLens was hoping to adopt an existing community and system, augment it with some technology, and study how the technology performed in the wild. The technology in question was collaborative filtering. The idea at the time was related to user profiling: users expressing interest in the same items must be somewhat similar and can form a virtual neighborhood; therefore, we can recommend to them items that their neighbors are interested in. The research group was somewhat successful in doing this, as enough users on Usenet news adopted the technology and provided feedback on the system. Later, the research group built a movie recommendation site on the Web that used similar collaborative filtering algorithms, called MovieLens [10]. The website retained a community of about 6,000 users that became an ecosystem in itself. Someone volunteered to keep the movie database up to date, and some participated in discussions about features the recommendation system should have. Later research on specific recommendation algorithms often split users into groups temporarily, where one group might receive one treatment while the other would receive another. The results were then compared to see how the two groups differed, including whether they evolved different group behaviors.
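The neighborhood idea can be stated in a few lines. The following toy sketch (ours, far simpler than GroupLens's actual algorithms) scores candidate items by how many of a user's nearest neighbors liked them:

```python
# Toy user-based collaborative filtering: neighbors share liked items;
# recommend what neighbors liked that the target user has not yet seen.
def recommend(user, ratings, k=3):
    """ratings: {user: set(liked_items)}"""
    overlap = lambda other: len(ratings[user] & ratings[other])
    neighbors = sorted((u for u in ratings if u != user),
                       key=overlap, reverse=True)[:k]
    scores = {}
    for n in neighbors:
        for item in ratings[n] - ratings[user]:
            scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

ratings = {"ann": {"alien", "heat"},
           "bob": {"alien", "heat", "brazil"},
           "cam": {"alien", "brazil"},
           "dee": {"heat", "speed"}}
print(recommend("ann", ratings))               # ['brazil', 'speed']
```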
Fig. 1. The MovieLens system is an academic project with a live community
Games with a Purpose (gwap.com). Luis von Ahn's work on ESP games has evolved into a highly intriguing site called Games with a Purpose (gwap.com). On this site, users can engage in mini-games that are fun in themselves, but the games also end up collecting data that is useful in some other way. One well-known example is the image labeler, in which two users (without other means of communication) must agree on the same keyword to receive points. The objective is to agree on labels for as many images as possible in a given timeframe. Here the aim is to engage real users in realistic contexts, in which the goal is to entertain the user while gathering behavioral data that tells us something about the images. One can then analyze word choices over many data points, collective action (including any attempts at cheating), as well as longitudinal issues like the number of repeat visits, the diversity of users, or the virality of the game. Engagement measures, such as stickiness, can be directly measured.
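A toy version of the matching rule (our sketch, not gwap.com's implementation) shows how agreement turns play into labels:

```python
# Sketch of ESP-game matching: two players label the same image independently;
# the first keyword entered by one player that the other has also entered wins.
def match(labels_a, labels_b):
    seen_a, seen_b = set(), set()
    for a, b in zip(labels_a, labels_b):
        if a == b or a in seen_b:
            return a                           # agreed label becomes image metadata
        if b in seen_a:
            return b
        seen_a.add(a)
        seen_b.add(b)
    return None                                # no agreement within the round

print(match(["dog", "grass", "frisbee"], ["park", "frisbee", "dog"]))  # frisbee
```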
Fig. 2. The Games With A Purpose (gwap.com) website engages real users with games, while having them accomplish some task that is useful for research
WikiScanner / WikiDashboard over Wikipedia. One realistic approach is to adopt an existing community and system, create mashup applications that augment the original system with some new capability, and study their effects. For example, wikis are collaborative systems in which virtually anyone can edit anything. Although wikis have become highly popular in many domains, their mutable nature often leads them to be distrusted as a reliable source of information. Virgil Griffith, for example, took openly available data from Wikipedia and enabled people to discover the possible identities of Wikipedia editors by cross-referencing IP addresses with institution names. Our own research on social transparency also took this approach. We downloaded a copy of all of the edits on Wikipedia and tabulated the editing statistics for all articles and all users. This enabled us to create a visualization of the editing patterns for
each article and each user [13]. WikiDashboard has received tens of thousands of visits from Wikipedia users. We also know that both systems were discussed extensively in the Wikipedia community.
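The underlying aggregation is simple; here is a sketch (ours) of the kind of per-article, per-editor tabulation such dashboards are built from:

```python
# Sketch: tabulating editing statistics from a dump of (article, editor)
# edit records, the raw aggregate behind a WikiDashboard-style visualization.
from collections import Counter

edits = [("Alan Turing", "ann"), ("Alan Turing", "bob"), ("HCI", "ann")]

edits_per_article = Counter(article for article, _ in edits)
edits_per_editor = Counter(editor for _, editor in edits)
edits_per_pair = Counter(edits)                # who edited which article how often

print(edits_per_article["Alan Turing"])        # 2
print(edits_per_pair[("Alan Turing", "ann")])  # 1
```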
Fig. 3. An example page from WikiDashboard [13] project, which inserts a visualization of the social dynamics and edit patterns for every Wikipedia page
6 Conclusion

HCI research has greatly benefited from borrowing evaluation methods that were fine-tuned in other fields, especially the behavioral sciences. Evaluation methods are inseparable from the kinds of science and models that can be built in a field. HCI has long moved beyond the evaluation of a single user sitting in front of a single desktop computer, yet many of our fundamentally held viewpoints about evaluation continue to be ruled by outdated biases derived from this legacy. In this position paper, we have argued that traditional views of human performance in systems have long focused only on productivity. It is time for us to break out of these long-held views and look at evaluations in more holistic ways.
Fig. 4. A way to think about the role of Living Laboratory prototypes in scientific research
One way to do this is to engage with real users in 'Living Laboratories', in which researchers either adopt or create real, useful systems that are used in ecologically valid real settings. This enables a tight loop between characterization of behavior, models of the users and system, prototyping, and experimentation. The new Social Web platform is enabling researchers to build systems with amazing speed, enabling the whole loop to be completed in much shorter amounts of time than in the past. Similar experimentation platforms for mobile computing are just becoming reachable, with the iPhone and Google's Android leading the charge. These platforms will greatly enable Living Laboratory researchers to conduct evaluations that span many users, places, times, locations, and social factors in ways that were unimaginable before.

Acknowledgments. We thank PARC's Augmented Social Cognition team and the HCIC workshop for many helpful discussions on this position paper.
References

1. Card, S., Moran, T.P., Newell, A.: The Psychology of Human-Computer Interaction. Lawrence Erlbaum Associates, Mahwah (1983)
2. Card, S.K., English, W.K., Burr, B.J.: Evaluation of mouse, rate-controlled isometric joystick, step keys, and text keys for text selection on a CRT. Ergonomics 21(8), 601–613 (1978)
3. Carter, S., Mankoff, J., Klemmer, S., Matthews, T.: Exiting the cleanroom: On ecological validity and ubiquitous computing. HCI Journal (2008)
4. Chi, E.H.: Introducing Wearable Force Sensors in Martial Arts. IEEE Pervasive Computing 4(3), 47–53 (2005)
5. Engelbart, D.C.: Augmenting Human Intellect: A Conceptual Framework. Summary Report AFOSR-3223 under Contract AF 49(638)-1024, SRI Project 3578 for Air Force Office of Scientific Research, Stanford Research Institute, Menlo Park, CA (1962)
6. Grudin, J.: Groupware and social dynamics: Eight challenges for developers. Communications of the ACM 37(1), 92–105 (1994)
7. Jones, W.P.: On the Applied Use of Human Memory Models: The Memory Extender Personal Filing System. International Journal of Man-Machine Studies 25(2), 191–228 (1986)
8. Kohavi, R., Longbotham, R.: Online Experiments: Lessons Learned. Computer 40(9), 103–105 (2007), doi:10.1109/MC.2007.328
9. Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., Riedl, J.: GroupLens: Applying Collaborative Filtering to Usenet News (special section: recommender systems). Communications of the ACM 40(3), 77–87 (1997)
10. Riedl, J., Konstan, J.: Word of Mouse: The Marketing Power of Collaborative Filtering. Warner Books, New York (2002)
11. Sears, A., Jacko, J.A.: The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications. CRC Press, Boca Raton (2008)
12. Shneiderman, B.: Science 2.0. Science 319(5868), 1349–1350 (2008)
13. Suh, B., Chi, E.H., Kittur, A., Pendleton, B.A.: Lifting the Veil: Improving Accountability and Social Transparency in Wikipedia with WikiDashboard. In: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2008), Florence, Italy, pp. 1037–1040. ACM Press, New York (2008)
14. Rowan, J., Mynatt, E.D.: Digital family portrait field trial: Support for aging in place. In: Proc. of CHI 2005 Conference on Human Factors in Computing Systems, pp. 521–530. ACM, New York (2005)
Embodied Interaction or Context-Aware Computing? An Integrated Approach to Design

Johan Eliasson, Teresa Cerratto Pargman, and Robert Ramberg

Department of Computer and Systems Sciences, Stockholm University/Royal Institute of Technology, SE-164 40 Stockholm, Sweden
{je,tessy,robban}@dsv.su.se
Abstract. This paper revisits the notion of context from an interaction design perspective. Since the emergence of the research fields of Computer-Supported Cooperative Work and Ubiquitous Computing, the notion of context has been discussed from different theoretical approaches and in different research traditions. One of these approaches is Embodied Interaction. This theoretical approach has in particular contributed to (i) challenging the view that user context can be meaningfully represented by a computer system, and (ii) discussing the notion of context as interaction, through the idea that users are always embodied in their interaction with computer systems. We believe that the particular view on users' context that the approach of Embodied Interaction suggests needs to be further elaborated in terms of design. As a contribution we suggest an integrated approach in which the interactional view of Embodied Interaction is interrelated with the representational view of context-aware computing.

Keywords: Embodied Interaction, Context-aware computing, Design, Representation, Context.
approach thereby portrays human agents as engaged in an interaction characterized by skilled and continuous coping. It thus describes an understanding of human-computer interaction that can be exemplified by a user finding herself able to handle all difficulties without losing focus in her activity even once. How she gets there, and how she manages to remain in this focused activity, is out of scope. But how well does this picture of skilled and engaged human-computer interaction guide design? And more specifically, how well does it guide the design of context-aware systems?

To understand these questions better we need to return to one of the targeted problems of Embodied Interaction: namely, the gap between the social conception of context and the technical one [2, 3]. Embodied Interaction is but the latest contribution to improving the understanding of this gap. It uses the philosophical tradition of phenomenology as a theoretical departure point for understanding interaction. This tradition has previously been presented in the HCI research community, since it seems to offer a way to take both social and technical views into account. Based on the "present-at-hand" mode of use, Winograd and Flores [4] discuss user activity in terms of "breakdown". Weiser [5] introduces the concept of transparency, in calm computing and ubiquitous computing research, based on the "ready-to-hand" mode of use. The approach of Embodied Interaction relies much on the idea of a well-practiced and smooth interaction with and through computers, and it de-emphasizes the developmental aspects of user activity. Without addressing these developmental aspects, it is difficult for designers to operationalize the approach of Embodied Interaction in their work. Without looking at how one becomes a skilled user in the interaction with a system, one opportunity for design is passed over. Advocating history of use, Chalmers [6] has argued that the ideal of transparency, which can also be found in Embodied Interaction, is an unachievable goal. Räsänen and Nyce [7] have pointed out that the approach of Embodied Interaction is reductionist in that it does not go beyond interaction and focuses too much on the here and now. We will go one step further and claim that Embodied Interaction does not take all modes of interaction present in the activity into account, and thereby misses out on how skill is acquired. In this respect we claim that the Embodied Interaction approach overlooks the interplay between learning and practice, between reflection and action, that characterizes any kind of human-computer interaction. This observation is particularly interesting for the design of context-aware computing systems because this field has a strong connection to Embodied Interaction.

This paper revisits the notion of context in the field of context-aware computing from an integrated design perspective on Embodied Interaction. We believe that the rich conceptualization of Embodied Interaction deserves to be further developed in terms of the design of context-aware computing systems. This leads to the following question: how do we design context-aware computing systems in the light of Embodied Interaction? In this paper we will not try to answer questions about proactivity in context-aware computing. Instead we follow Rogers [8] in that what we aim for is not proactive computing but proactive people.
2 The Notion of Context from an Embodied Interaction Perspective

Grounded in Merleau-Ponty's Phenomenology of Perception [9], Schutz's social phenomenology [10] and Heidegger's hermeneutic phenomenology [11], Dourish [1] suggests a theoretical approach to human-computer interaction which he coins Embodied Interaction. The Embodied Interaction approach views context not as information but as a relation; as human actors participate in the world, action does not occur in a particular context, but context is rather created and recreated in concert with interaction [12]. Because of this, context is not stable but a dynamic, constantly changing feature. What is to be regarded as context is thereby determined by the setting, the actors, and the interaction. According to the Embodied Interaction perspective, context is not some delineable aspect of a setting that can be encoded and represented [12]. Rather, context is something people do. In this way the context model in Embodied Interaction is an interactional model and not a representational model [12]. The view that context is what people do comes from the primacy of action in Embodied Interaction. An emphasis on action is shared with Situated Action [13], which is also one departure point for Embodied Interaction. Both approaches regard context and meaning as continually changing and only possible to recognize in how interaction unfolds. According to Embodied Interaction, the way we interact with a computer system is a sign of how we relate to the system. Meaning is also embodied, both in a physical and a wider sense. In this way our interaction is dependent on our physical, social and cultural body. The theoretical approach of Embodied Interaction argues against disembodied, objective and reflective use. What Embodied Interaction instead focuses on, inherited from embodiment [9] and being-in-the-world [11], is a moment of mindless interaction, a moment of skilled coping.

2.1 Challenges for Context Design from an Embodied Interaction Perspective

Dourish [1] suggests the following six principles as a backdrop for design (p. 162):

1. Computation is a medium
2. Meaning arises on multiple levels
3. Users, not designers, create and communicate meaning
4. Users, not designers, manage coupling
5. Embodied technologies participate in the world they represent
6. Embodied interaction turns action into meaning
When trying to design for context from the Embodied Interaction perspective, we are left with these broad design principles. Their breadth makes them difficult to operationalize, while the alternative of designing for context using objective representations is merely seen as positivist thinking, incompatible with the philosophy put forward by Embodied Interaction [12]. Take, for instance, the third and fourth design principles above: they directly address the role of designers, although they do so in a rather negative, excluding sense. Principles three and four state
what designers of these systems should not do. Thereby the role of the interaction designer seems to be marginalized to an enabling one. It is probably not meant that the ideal we should strive for is the ultimate and final system, allowing for every kind of appropriation and every kind of interaction. Dourish [12] notes that one and the same system should support evolution: "[...] our concern is not simply to support particular forms of practice, but to support the evolution of practice—the 'conversation with materials' [Dourish quoting Schön [14]] out of which emerges new forms of action and meaning." (p. 25). This seems like a contradictory claim, as the evolution of practice is only known in retrospect and in analysis. So how can this be used for claims about design? In a passage about place and space, Dourish [1] writes: "…place can't be designed, only designed for." (p. 91). If Embodied Interaction is about meta-design, then what are the remaining implications for design, and especially for the design of context?

Our interpretation of Embodied Interaction is that interaction designers should leave context and meaning as open to appropriation as possible. What designers ideally should strive for, then, is completely open systems. In these computer systems each user can interact with the most suitable content and structure. From this particular understanding of interacting with computers, the computer system has to be able to show every possible structure and the current state and configuration of the system [12]. From an Embodied Interaction perspective on human-computer interaction, we can design user interfaces, but not how they should work, as the creation of meaning should be left to users in their appropriation of the interfaces. Because we are not allowed to design how an interface should work, we also cannot explicitly support skill acquisition. In Embodied Interaction, skill acquisition is not an issue because it does not belong in the picture of skilled and engaged coping, and it thereby falls outside the scope of Embodied Interaction. As a result, acquiring skill becomes something magical, something designers need not attend to. The Embodied Interaction approach has an interactional model of context. But if the notion of representation is absent from the description of interaction with a system, how can designers design for this interaction? The concept of representations is key to the design of computer systems, and especially of context-aware computing.
3 The Notion of Context in the Field of Context-Aware Computing

In context-aware computing, the notion of representations of context is seen as a prerequisite for designing context-aware systems. The assumption is that it is possible to divide the context of a device (or a user) into smaller parts, that some of them are more or less objective and stable, and that it is thereby possible to meaningfully represent them in a computer system hosting the device. For example, Dey et al. [15] reason in terms of identifying and analyzing the constituent elements of context. In identifying and analyzing the constituent elements of context, ubiquitous computing research is bottom-up, starting with sensor data representing aspects of the physical environment [15]. One example is when sensor values such as GPS coordinates are used in navigational applications. Starting from sensor values,
context and meaning are then inferred up to the level of human interaction with the device. As described by Dey et al. [15]: "One hypothesis that a number of ubiquitous computing researchers share is that enabling devices and applications to automatically adapt to changes in their surrounding physical and electronic environments will lead to an enhancement of the user experience." (p. 100). One last step then is to use the model not only to adapt, but also to try to foresee what is going to take place next and let the application act proactively, guessing what users might soon need to have at hand. In this case the questions for system designers are how to adapt to context and how to act proactively in context. Obviously, it is a very hard problem to get all these abstractions, models and inferences right. It can certainly be questioned whether these systems will ever succeed outside very specific domains with very limited scope [8, 16].

3.1 Challenges for Design of Context-Aware Computing Systems

Context-aware computing has been blamed for making only small advances and for relying too much on systems engineering to solve problems originating in human interaction [3, 17]. It is also questionable whether we will see a major breakthrough in context-aware computing any time soon, as the problems of strong AI and proactive computing are still far from solved [8]. The problem for context-aware computing lies in the representational models that are built. In a representational model there are inherent questions about what is represented and how it is represented. The next question is how different representations are related. Computational representations use specific values, structures and interrelations. There is no vagueness involved; every possible value, structure and interrelation has to be decided in advance by the designer. The effect of these decisions is that the behavior of each model of context is also, at a basic level, determined in advance. Because of this, the user model and the system model will diverge as soon as the context-aware system is put into use. The context-aware computing solution to this divergence is either to add an exception to the model every time it diverges or to trust future AI advancements to solve all discrepancies.

In the field of context-aware computing, physical and digital representations of context are the building blocks of design. As opposed to human and social representations, designed representations are bounded in terms of structure and contents. In computer science, representations are the internal software components that together make up a computer program. These digital software components rely on physical hardware components, which in turn bound the representational power. A computer system is thus itself built on representations and therefore cannot be non-representational. But this still allows for non-representational use, with embodied physical or digital representations. This duality between non-representational use and designed, bounded representation is present in every interaction with something that is designed. The representations of context in context-aware computing are seen as objective because of their origin in sensor values. But this concept of objective context should not be interpreted as absolute. Even GPS coordinates, for instance, are only valid within their social frame, which in this case is a very wide frame.
Chalmers [18], in accordance with Ricoeur and Gadamer, writes: “‘Objectivity’ comes from distanciation: representation is fixed, dissociated from intention and only displays universally
shared references. […] objectivity is not absolute. Instead, we see degrees and forms of distanciation." (p. 213). Also on objectivity, Dourish [12] writes: "In contrast to the objective and quantitative nature of positivist theories, phenomenological theories are subjective and qualitative in orientation. By 'subjective' I mean that they regard social facts as having no objective reality beyond the ability of individuals and groups […]" (p. 21). The interpretation of this is not that everything is subjective in the sense that everyone has their own interpretation, different from everyone else's. If this were the case, we would not be able to relate to what others do; we would simply not be able to engage in any interaction without questioning every step of it. Instead we socially create meaning, which we use in interaction. That is, 'objective' and 'subjective' may not be so far apart. As the extreme of objective representations is never the case, and as it is impossible to design for the completely subjective, we need to find a point where we can agree. If "groups" in the previous quote is taken to be the people we design for, then we are essentially in agreement, and can meet halfway between objective and subjective.
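Before turning to the integrated approach, a caricature of the representational model described in Section 3 (entirely our own illustration) makes the designer's predicament concrete: every context value, and every adaptation, must be enumerated in advance:

```python
# Sketch of a bottom-up representational context model: raw sensor values are
# abstracted into discrete context states, each with a predesigned adaptation.
def infer_context(gps_fix, speed_mps):
    if gps_fix is None:
        return "indoors?"                      # no coverage: the model can only guess
    if speed_mps > 8:
        return "driving"
    if speed_mps > 1:
        return "walking"
    return "stationary"

ADAPTATIONS = {                                # fixed in advance by the designer
    "driving": "switch to turn-by-turn voice guidance",
    "walking": "show pedestrian map",
    "stationary": "show nearby points of interest",
    "indoors?": "warn about lack of coverage",
}

print(ADAPTATIONS[infer_context(gps_fix=None, speed_mps=0.0)])
# Any situation outside these categories requires a new rule: the user model
# and the system model diverge as soon as the system is put into use.
```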
4 Towards an Integrated Approach: Reintroducing the Concept of Representations to Embodied Interaction

At some level, computer technology is always designed. In fact, we both design human-computer interaction and design for human-computer interaction. One extreme is the socio-cultural approach. Relying on ethnographic methods, we start out by describing specific users as a basis for design and then design for context. In this view human action is in focus. Action is performed within context at the same time as context is interpreted and recreated. With this focus, context is never stable and therefore cannot be deliberately designed. The remaining option for a designer is to support user context formation by relying solely on user appropriation. Human action is subjective and situated, rendering each interaction different from the previous one. In Table 1 this corresponds to "action" as the mode of use; because of its subjective nature, system designers can only support interaction and design representations for context determined by users. The other extreme is the technology perspective, where we design representations of context and let users adapt to these representations. Context is modeled using objective and stable representations of sensor values. Users can then interact with this computer model, where use is objectifying and reflective. This mode of use, as seen in Table 1, is characterized by reflection on representations of context.

Combining results stemming from these two approaches is challenging, e.g. [12]: "Translating ideas between different intellectual domains can be both exceptionally valuable and unexpectedly difficult. One reason is that the ideas need to be understood within the intellectual frames that give them meaning, and we need to be sensitive to the problems of translation between these frames." (p. 20). Given fundamental ontological disagreements, it is questionable whether it can be done at all. On context in computer-supported cooperative work and ubiquitous computing, despite the seemingly contradictory approaches, there have been many attempts to bridge or at least narrow the gap between these two intellectual frames [3, 19]. An alternative to bridging the gap would be to acknowledge that both sides, computer
Table 1. Mode of use related to artefacts of design

Mode of use | Design artefacts
Action      | Representations for context
Reflection  | Representations of context
representations stemming from sensor data and analytical representations of context, are necessary. Instead of searching for one common ground for these views of context, we note that they are two sides of the same coin.

When learning a new system, much time and effort goes into figuring out how the system works instead of engaging in the activity itself. Once the system has been learned, it can be handled without reflection, with skilled and embodied interaction. But still there are instants when "an event 'leaps to the eye' because it is expected or is a deviation from that which one would expect" [20] (p. 294). Heidegger also noted this (here in the words of Dreyfus [21]): "…mental content arises whenever the situation requires deliberate attention." (p. 70). These points point us towards an answer in revisiting Heidegger's original view of hermeneutic phenomenology. His famous example of the hammer serves not only to show how the hammer is transparent in ready-to-hand use, but also how "breakdown" (when the head falls off and the hammer becomes present-at-hand) leads to acquiring skill (in avoiding this malfunction in the future). As Dreyfus [21] says when clarifying Heidegger, "…the occurent is necessary for explaining the functioning of the available…" (p. 121). Here Dreyfus uses the terminology "occurent" instead of present-at-hand and "available" instead of ready-to-hand.

Figure 1 shows how ready-to-hand action and present-at-hand reflection are interrelated. With this integrated view there is no necessity to choose between action and reflection, no necessity to choose between designing representations for context and designing representations of context. Instead the mode of use repeatedly shifts between action and reflection.¹ Take GPS positioning, for example. Most of the time the coordinates are correct and a user can interact with the navigational program without paying too much attention; the mode of use is here seen as "action". But there are certainly occasions when the mode of use shifts to reflection, for example when a breakdown in interpretation occurs because of a mismatch between the map position and the position in the real world. Another breakdown could occur when a user moves indoors and gets a message about lack of coverage. In both these examples, interaction is interrupted and the user may need to reflect upon what the problem is in order to find a solution (e.g., update GPS data or move outdoors) before interaction can be re-engaged. Objective representations for context are not only to be seen as harmful, constraining user context; they also form a structure to relate to in a hermeneutic interpretation. Instead of trying to give guidelines for one ultimate design, we need to acknowledge that a design, and thereby also the designer, is part of this hermeneutic development, and that continuous redesigns, done by both designer and user, are necessary for the system to stay relevant to a user.
¹ Since both modes of use can be found in the hermeneutic phenomenology of Heidegger, there might be no ontological disagreements in the end.
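The GPS example above can be caricatured in a few lines (our own sketch): interaction stays in the "action" mode until a breakdown forces a shift to "reflection", and repair re-engages action:

```python
# Sketch of the integrated view: the mode of use oscillates between embodied
# action and present-at-hand reflection, triggered by breakdown and repair.
def navigate(events):
    mode = "action"
    for event in events:
        if mode == "action" and event in ("position mismatch", "no coverage"):
            mode = "reflection"                # breakdown: the representation leaps to the eye
            print(f"breakdown ({event}): reflect on the representation of context")
        elif mode == "reflection" and event == "repaired":
            mode = "action"                    # e.g., GPS data updated, user moved outdoors
            print("re-engage embodied interaction")

navigate(["position ok", "no coverage", "repaired", "position ok"])
```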
Fig. 1. The two modes of use as interrelated
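To make the GPS example concrete, the shift between the two modes of Fig. 1 can be sketched in a few lines of code. This is an illustrative model only, not part of any system described in this paper; all names (Mode, NavigationSession, etc.) are hypothetical.

```python
from enum import Enum

class Mode(Enum):
    ACTION = "ready-to-hand"        # embodied, unreflective use
    REFLECTION = "present-at-hand"  # deliberate attention after breakdown

class NavigationSession:
    """Toy model of a GPS user whose mode of use shifts on breakdown."""

    def __init__(self):
        self.mode = Mode.ACTION

    def step(self, has_fix: bool, map_matches_world: bool) -> Mode:
        # A breakdown (no coverage, or a map/world mismatch) forces reflection.
        if not has_fix or not map_matches_world:
            self.mode = Mode.REFLECTION
        else:
            # Once the problem is resolved, interaction is re-engaged.
            self.mode = Mode.ACTION
        return self.mode

session = NavigationSession()
print(session.step(has_fix=True, map_matches_world=True))   # Mode.ACTION
print(session.step(has_fix=False, map_matches_world=True))  # Mode.REFLECTION (moved indoors)
print(session.step(has_fix=True, map_matches_world=True))   # Mode.ACTION (moved outdoors again)
```

The point of the sketch is that neither mode is an error state: the loop between them is the normal, hermeneutic pattern of use.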
5 Discussion

The Embodied Interaction perspective has both turned away from, and argued against, objective representations for context. Although the Embodied Interaction view of context contributes to a better understanding of human interaction with and through computers, at the same time it marginalizes objective representations for context without offering an alternative basis for design. Maybe it even marginalizes design as a whole. It is time to turn the perspective back again, to enable both design of context and design for context.

The alternative to designing systems completely open to appropriation is to use current descriptions of context as a basis for design. If we cannot use current descriptions, but instead need to leave more open for appropriation, then the role of the designer is marginalized accordingly. Computer systems always have room for interpretation and appropriation, but through careful design, appropriation and skill acquisition can be guided. Leaving more open to appropriation means constraining the choices that the designer has. A similar trend in design occurred when the concept of affordance became the one guideline overshadowing all others in HCI.

Given the hermeneutic phenomenology perspective, it poses no problem to reintroduce objective representations of context into the philosophy put forward by the Embodied Interaction approach. Action and reflection are just different modes of use, where present-at-hand reflection is an important complement to embodied ready-to-hand action; it is not one or the other. Users act in context by (hermeneutically) going back and forth from ready-to-hand embodied interaction to present-at-hand reflection and back again.

Our integrated approach undoubtedly has much in common with Winograd and Flores [4], who also treat "breakdown" as important, but there are differences.
They came to the conclusion of modeling computer use through a state machine representation of speech act theory, with labeled states and directed arcs. Our approach is to use present-at-hand categories, but not to build a general model that enforces some elaborate structure. Instead we only point to the interrelation between present-at-hand and ready-to-hand. This approach can be used either to build general systems with small descriptive power or specific systems with large descriptive power. But our main contribution is that the present-at-hand categories give us a way of talking about design, while still relating to ready-to-hand Embodied Interaction.

It is interesting to note what Dourish [1] writes about the states of ready-to-hand and present-at-hand. Dourish explicitly refers to these "states" when discussing coupling, using a computer system as an example: "If there were simply these two states […] However the truth is more complex. As we have seen, the tools through which we operate when interacting with a computer system are not simply physical objects, but software abstractions, too. There are very many of these abstract entities in operation at any given moment, and programs link them together in a variety of ways." (p. 139). This surely gives the impression of great complexity. Dourish ends the passage as follows: "The consequence, then, is that there are very many different levels of description that could be used to describe my activity at any given moment. Some, perhaps, are ready-to-hand and some present-at-hand at the same time […]" (p. 140).

But that some entities are ready-to-hand while others are present-at-hand is nothing new. On a conceptual level, even when Heidegger's hammer was ready-to-hand, some other part of the activity was present-at-hand. Computer systems do not change this. If we deliberately design these systems using present-at-hand categories, we might even bring Embodied Interaction one step forward.
References

1. Dourish, P.: Where the action is: the foundations of embodied interaction. MIT Press, Cambridge (2001)
2. Dourish, P.: Seeking a foundation for context-aware computing. Hum. Comput. Interact. 16, 229–241 (2001)
3. Barkhuus, L.: The context gap, an essential challenge to context-aware computing. Doctoral dissertation, IT University of Copenhagen, Copenhagen (2005)
4. Winograd, T., Flores, F.: Understanding computers and cognition: a new foundation for design. Ablex, Norwood (1986)
5. Weiser, M.: The Computer for the 21st Century. Scientific American 265, 95, 98–102, 104 (1991)
6. Chalmers, M.: A historical view of context. Computer Supported Cooperative Work: CSCW: An International Journal 13, 223–247 (2004)
7. Räsänen, M., Nyce, J.M.: A new role for anthropology?: Rewriting "context" and "analysis" in HCI research. In: ACM International Conference Proceeding Series, vol. 189, pp. 175–184 (2006)
8. Rogers, Y.: Moving on from Weiser's Vision of Calm Computing: Engaging UbiComp Experiences. In: Dourish, P., Friday, A. (eds.) UbiComp 2006. LNCS, vol. 4206, pp. 404–421. Springer, Heidelberg (2006)
9. Merleau-Ponty, M.: Phenomenology of perception. Routledge, London (2002/1962)
10. Schutz, A., Luckmann, T.: The structures of the life-world. Northwestern U.P., Evanston (1973)
11. Heidegger, M.: Being and time. Harper, New York (1962)
12. Dourish, P.: What we talk about when we talk about context. Personal Ubiquitous Comput. 8, 19–30 (2004)
13. Suchman, L.A.: Plans and situated actions: the problem of human-machine communication. Cambridge Univ. Press, Cambridge (1987)
14. Schön, D.: The reflective practitioner: how professionals think in action. Basic Books (1983)
15. Dey, A.K., Abowd, G.D., Salber, D.: A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human-Computer Interaction 16, 97–166 (2001)
16. Dreyfus, H.L.: What computers still can't do: a critique of artificial reason. MIT Press, Cambridge (1992)
17. Håkansson, M.: Playing with context: explicit and implicit interaction in mobile media applications. Doctoral dissertation, Department of Computer and Systems Sciences (together with KTH), Stockholm University, Kista (2009)
18. Chalmers, M.: Hermeneutics, information and representation. European Journal of Information Systems 13, 210–220 (2004)
19. Chen, Y., Atwood, M.E.: Context-centered design: Bridging the gap between understanding and designing. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4550, pp. 40–48. Springer, Heidelberg (2007)
20. Schmidt, K.: The problem with 'awareness': Introductory remarks on 'awareness in CSCW'. Computer Supported Cooperative Work: CSCW: An International Journal 11, 285–298 (2002)
21. Dreyfus, H.L.: Being-in-the-world: a commentary on Heidegger's Being and time, division I. MIT Press, Cambridge (1991)
Supporting Multidisciplinary Teams and Early Design Stages Using Storyboards

Mieke Haesen, Jan Meskens, Kris Luyten, and Karin Coninx

Hasselt University – tUL – IBBT, Expertise Centre for Digital Media, Wetenschapspark 2, B-3590 Diepenbeek, Belgium
{mieke.haesen,jan.meskens,kris.luyten,karin.coninx}@uhasselt.be
Abstract. Current tools for multidisciplinary teams in user-centered software engineering (UCSE) provide little support for the different approaches of the various disciplines in the project team. Although multidisciplinary teams are getting more and more involved in UCSE projects, an efficient approach to communicate clearly and to pass results of a user needs analysis to other team members without loss of information is still missing. Based on previous experiences, we propose storyboards as a key component in such tools. Storyboards contain sketched information of users, activities, devices and the context of a future application. The comprehensible and intuitive notation and accompanying tool support presented in this paper will enhance communication and efficiency within the multidisciplinary team during UCSE projects.
1 Introduction

When combining HCI techniques and software engineering principles in user-centered software engineering (UCSE), the biggest challenge is the communication within a multidisciplinary team including the end users. MuiCSer, a framework for Multidisciplinary user-Centered Software Engineering processes, focuses on the benefits of both disciplines and was introduced to investigate the features and shortcomings of current UCSE models and tools [1]. One missing link in most user-centered processes is a tool to progress from informal design artefacts (e.g. scenarios) toward more structured design artefacts (e.g. task models). Most tools and techniques require specific knowledge about specialized notations or models, thus excluding most team members from being involved. Furthermore, functional information may be missing in informal design artefacts, while structured design artefacts may not always contain all non-functional information. We propose the use of storyboards as a comprehensible artefact, related to features of graphical user interface design tools, to overcome these shortcomings. In summary, the main contributions of this paper are:
• a novel user-centered design approach that uses storyboards as a common language in a multidisciplinary team;
• tool support for creating and editing storyboards in order to bridge the gap between the early stages of the UCSE process and the user interface design. This tool supports the connection between storyboards and artefacts created later in the process.
2 Related Work

User-centered processes recommend combining non-functional as well as functional requirements by involving a multidisciplinary team [2]. The early design stages of user-centered design (UCD) include a user needs analysis and generally result in several artefacts, such as usability requirements [3], scenarios [4] and personas [5], describing the user needs. These artefacts are written in a narrative style and are usually created by interaction designers. Similar artefacts are used in software engineering and agile development [6] (e.g. essential use cases, scenarios, story cards, user stories). Although several disciplines provide notations to describe user needs, the notations are not always comprehensible for all members of a multidisciplinary team. Lindgaard et al. [7] address the difficulties in presenting user needs for requirements engineering.

Earlier studies describe the needs of interaction designers in a multidisciplinary team. Brown et al. [8] conducted an ethnographic study to investigate the collaboration between user interaction designers and developers. The study describes the benefits of stories and sketches in the early stages of user-centered approaches and emphasizes the power of combining both. Assembling stories and sketches is a powerful technique to reveal errors and to consider temporal and contextual information. A recent study by Myers et al. [9] reports that designers experience difficulties when designing the behavior of user interfaces. While prototyping the appearance of user interfaces is straightforward, designing and communicating the behavior is an ongoing process. Furthermore, the survey revealed that designers frequently use sketches and storyboards.

Currently little tool support is available for storyboarding in multidisciplinary teams. Demais [10] and IBM Rational Requirements Composer1 focus on storyboards in the design process of multimedia applications, while Denim [11] and Highlight [12] feature storyboards for web applications. All these tools were developed to describe the behavior of software or web applications and support a first walkthrough of the future system or website. The storyboards created using these tools contain mock-ups of UI designs and their relationships, and thus are designed after the requirements gathering of a future application. The ActivityDesigner [13] tool allows storyboarding at the early stages of design. In this tool, designers can extract activities from concrete scenarios, making it possible to include rich contextual information about everyday lives as scenes. Based on the scenes, higher-level structures and prototypes can be created. The tool we present in this paper also provides the possibility to build storyboards during the gathering of requirements in order to facilitate the creation of artefacts at later stages.
1 http://www.ibm.com/developerworks/rational/library/08/1118_zhuo, last visited 6 January 2009.
3 Storyboards

In a UCSE process, a report of a user needs analysis, scenarios and personas are presented to the entire team after conducting the first user studies. Structuring artefacts that are written in a narrative style is a complex though important process. All artefacts created at later stages need to be consistent with these first results. Unfortunately, little tool support is available for this first transition in UCSE processes. This implies that the entire team needs to verify consistency between the informal results of a user needs analysis and artefacts created later in the process. A good understanding within the multidisciplinary team at this point can be crucial for the resulting user experience. We investigate how storyboards can be used in UCSE processes by a multidisciplinary team.

3.1 Users, Activities, Devices, Context

The professional use of storyboards originates from the film industry and is being introduced in several disciplines such as advertising and product design [14]. In UCSE a storyboard can have several meanings. Storyboards can depict manual steps, users interacting with a product, screen mock-ups of a new work practice, or the link with the system behind the scenes [6]. The focus on visual information renders a storyboard highly comprehensible for any member of the team, independent of their background or role in the team [14, 15]. In the context of our research, we define storyboards as sketches of real-life situations, depicting users carrying out several activities by using devices in a certain context. An example of a simple storyboard is presented in the center of Fig. 1.

Since storyboards contain a lot of information about the future use of an application, they can be used to provide a link between a user needs analysis and requirements gathering, containing functional as well as non-functional requirements. Furthermore, the natural style of presenting the use of a future system makes this artefact very comprehensible for all team members, including end users. Since the scenes of a storyboard contain contextual information, they are suitable for the specification of context-aware applications. This contextual information has to be taken into account during the entire development process, and thus storyboards can contribute to the evaluation, verification and validation of several stages.

3.2 Bridging the Early Stages of UCSE Processes

The creation of storyboards happens at the early stages of a UCSE process, after the creation of scenarios and personas. An example storyboard and the interrelationship between a storyboard and other artefacts are presented in Fig. 1. A storyboard is built by splitting up the scenario into scenes and presenting the scenes as sketches depicting users interacting with the future system. Connecting the scenes of a storyboard structures the narrative information of the scenario. The understandability of storyboards increases the number of team members who can collaborate during this phase. Even end users can be involved in creating or evaluating storyboards.
Fig. 1. A storyboard and its interrelationship with other artefacts in the development process. Situations and devices in the scenes are extracted from scenarios, while the user information is extracted from the scenarios as well as the personas. The storyboard is used as input for the creation of task flow diagrams and the UI designs.
Once all scenes are added to the storyboard, personas and devices can be highlighted in each individual scene. This enriches the information contained in the storyboard and can be used to make the transition to other artefacts. Task flow diagrams, presenting user actions and processes to complete a task, can be produced based on the information in, and the connections between, the scenes of the storyboard. At a later stage of the development process, the storyboard can guide the UI design and development. By carefully considering the situation of each scene, designers and developers build an application corresponding to the context, requirements and constraints contained in the storyboard. Interaction designers can use a storyboard to verify that the UI designs take all requirements into account. A storyboard also contributes to the preparation of usability tests. Using storyboards in UCSE processes increases the visibility of the project. New team members, for instance, can explore the requirements of the project at a glance by looking at the storyboard.
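As an illustration of the artefact relationships just described, a storyboard can be modeled as annotated scenes plus directed connections, from which a first task flow can be read off. This is a minimal sketch with hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Scene:
    title: str
    description: str   # the scenario sequence this scene depicts
    sketch: str        # path to a scanned sketch or photo
    personas: List[str] = field(default_factory=list)
    devices: List[str] = field(default_factory=list)

@dataclass
class Storyboard:
    scenes: List[Scene] = field(default_factory=list)
    links: List[Tuple[int, int]] = field(default_factory=list)  # directed scene connections

    def add_scene(self, scene: Scene) -> int:
        self.scenes.append(scene)
        return len(self.scenes) - 1

    def connect(self, src: int, dst: int) -> None:
        self.links.append((src, dst))

    def task_flow(self) -> List[Tuple[str, str]]:
        # The connections between scenes structure the narrative scenario
        # and can seed a task flow diagram.
        return [(self.scenes[a].title, self.scenes[b].title) for a, b in self.links]

sb = Storyboard()
s0 = sb.add_scene(Scene("Look up schedule", "Ann checks the bus times", "scene0.png",
                        personas=["Ann"], devices=["PDA"]))
s1 = sb.add_scene(Scene("Buy ticket", "Ann buys a ticket at the kiosk", "scene1.png",
                        personas=["Ann"], devices=["kiosk"]))
sb.connect(s0, s1)
print(sb.task_flow())  # [('Look up schedule', 'Buy ticket')]
```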
4 Tool Support for Storyboards

As stated above, storyboards contribute to the development process of software applications in multidisciplinary teams.
When suitable tool support is available for all team members, storyboards become more powerful and the visibility and traceability of a project increase. A literature survey [1] showed that there is a need for tools that support UCSE processes in the early stages of design. Since storyboards are created during the requirements gathering, storyboarding tools can partly cover transitions between the early stages of UCSE. Furthermore, storyboards are a very suitable artefact to specify the use of context-aware applications, and thus we decided to integrate tool support for storyboards into the Gummy [16] GUI builder tool.

Gummy supports the graphical design of multi-device and context-aware user interfaces. To enable this, Gummy automatically adapts its workspace according to the considered target platform and thus allows designers to create user interfaces for a wide range of devices without having to change their work practices. The inclusion of storyboards during this design stage better describes the context of a user interface and provides a more convenient way to describe the intended context of use. This way, the storyboards provide guidelines for the design of the UI.

In the storyboarding extension of Gummy, a team member, e.g. an interaction designer, starts the creation of the storyboard by loading a scenario into the workspace. Next, a sequence in the scenario can be selected and a new scene can be created. The sequence of the scenario is automatically added to the scene as a description, while the interaction designer can load an image and add a title. The image of the scene can be a photo of the user observations or a scanned sketch, which encourages designers to sketch in a creative and informal way [11]. A screenshot of the storyboarding tool is shown in Fig. 2.
Fig. 2. A screenshot of the storyboarding extension in Gummy. Scenarios can be loaded on the left panel. For a selected sequence in the scenario a new scene can be created. A scene can contain sketches of users interacting with the future system, a title and a description.
For each scene in a storyboard, team members can add annotations and point out the personas and devices. When the specifications of a device (e.g. screen size) are included in the scenes, this information can be taken into account when the workspace for the UI design is loaded. The contextual information of the scenes (e.g. sketches presenting the environment or courses of communication) can be used as guidelines for the UI design without obstructing the creativity of UI designers. By extending existing tool support, the visibility and traceability of UCSE processes can be enhanced. The storyboard extension makes it possible to include the results of the first UCSE stages (user needs analysis) and helps to process and structure these narrative artefacts. Furthermore, a visualization of a scenario by scenes makes it possible to see the usability requirements at a glance, which improves the communication and efficiency in the project team.
5 Ongoing and Future Work

Storyboards are implicitly used in different ways by multidisciplinary teams. This partly explains the many interpretations of storyboards and reveals the challenges in developing a storyboarding tool for multidisciplinary teams. In ongoing work we are carrying out a survey considering the roles in a multidisciplinary team and the tools used by members of a team. Observations and interviews are organized to investigate the current practices of multidisciplinary teams in industry. Furthermore, storyboards as defined in this paper are being introduced in a multidisciplinary project team, and the storyboarding tool will be evaluated during several iterations. Based on the findings of these studies, we will fine-tune the features of storyboards and the relationships between storyboards and other artefacts. As the current version of the storyboarding tool is intended for individual use, the user studies may also reveal some expectations of teams regarding a distributed and collaborative version of the tool. Furthermore, contextual information and platform specifications can be extracted from the scenes in a storyboard to guide the design of UIs in the Gummy GUI builder.
6 Conclusion

In this paper we described how storyboards can contribute to UCSE. By sketching users interacting with a future application, pointing out devices and adding annotations in the early stages of a UCSE project, these storyboards contain functional and non-functional requirements. Storyboards can contain rich contextual information and are based on an intuitive notation providing more structure than narrative scenarios of use. We integrated tool support for the creation and use of storyboards in the Gummy multi-device GUI builder. Ongoing and future studies are being carried out to examine the approach of multidisciplinary teams in industry and to adapt the storyboarding tool according to current practices.
This new level of tool support can simplify the creation of artefacts at later stages of a development process and improve the communication within a multidisciplinary team. The comprehensibility of storyboards allows non-technical team members to be involved in the first activities of model-based UI development. Consequently, the loss of information after a user needs analysis will decrease, while the visibility and traceability of a project increase. Storyboards are a common language in multidisciplinary teams, which contributes to the user experience of the final user interface.

Acknowledgements. Part of the research at EDM is funded by EFRO (European Fund for Regional Development) and the Flemish Government. The MuiCSer process framework and the Gummy tool, including the storyboarding extension, are based on our experiences in the IWT project AMASS++ (IWT 060051).
References

1. Haesen, M., Coninx, K., Van den Bergh, J., Luyten, K.: MuiCSer: A Process Framework for Multi-Disciplinary User-Centered Software Engineering Processes. In: Proceedings of Human-Centred Software Engineering, September 2008, pp. 150–165 (2008)
2. International Standards Organization: ISO 13407. Human-Centred Design Process for Interactive Systems. Geneva, Switzerland (1999)
3. Redmond-Pyle, D., Moore, A.: Graphical User Interface Design and Evaluation. Prentice Hall, London (1995)
4. Carroll, J.M.: Making use: scenario-based design of human-computer interactions. MIT Press, Cambridge (2000)
5. Pruitt, J., Adlin, T.: The Persona Lifecycle: Keeping People in Mind Throughout Product Design. Morgan Kaufmann, San Francisco (2006)
6. Holtzblatt, K., Wendell, J.B., Wood, S.: Rapid Contextual Design: A How-to Guide to Key Techniques for User-Centered Design (Interactive Technologies). Morgan Kaufmann, San Francisco (December 2004)
7. Lindgaard, G., Dillon, R., Trbovich, P., White, R., Fernandes, G., Lundahl, S., Pinnamaneni, A.: User needs analysis and requirements engineering: Theory and practice. Interact. Comput. 18(1), 47–70 (2006)
8. Brown, J., Lindgaard, G., Biddle, R.: Stories, Sketches, and Lists: Developers and Interaction Designers Interacting Through Artefacts. In: Proceedings of Agile 2008, pp. 39–50 (2008)
9. Myers, B.A., Park, S.Y., Nakano, Y., Mueller, G., Ko, A.: How designers design and program interactive behaviors. In: VL/HCC, pp. 177–184 (2008)
10. Bailey, B.P., Konstan, J.A., Carlis, J.V.: Demais: designing multimedia applications with interactive storyboards. In: ACM Multimedia, pp. 241–250 (2001)
11. Newman, M.W., James, A.L.: Sitemaps, storyboards, and specifications: A sketch of web site design practice. In: DIS 2000 Designing Interactive Systems, pp. 263–274. ACM Press, New York (2000)
12. Nichols, J., Lau, T.: Mobilization by demonstration: using traces to re-author existing web sites. In: IUI 2008: Proceedings of the 13th International Conference on Intelligent User Interfaces, pp. 149–158. ACM Press, New York (2008)
13. Li, Y., Landay, J.A.: Activity-based prototyping of ubicomp applications for long-lived, everyday human activities. In: Proceedings of the Conference on Human Factors in Computing Systems, CHI 2008, pp. 1303–1312 (2008)
14. van der Lelie, C.: The value of storyboards in the product design process. Personal Ubiquitous Computing 10(2-3), 159–162 (2006)
15. Sova, R., Sova, D.H.: Storyboards: a dynamic storytelling tool. Technical report, Sova Consulting Group, Tec-Ed Inc. (2006)
16. Meskens, J., Vermeulen, J., Luyten, K., Coninx, K.: Gummy for multi-platform user interface designs: Shape me, multiply me, fix me, use me. In: Proceedings of the Working Conference on Advanced Visual Interfaces, AVI 2008, pp. 233–240. ACM Press, New York (2008)
Agent-Based Architecture for Interactive System Design: Current Approaches, Perspectives and Evaluation

Christophe Kolski1, Peter Forbrig2, Bertrand David3, Patrick Girard4, Chi Dung Tran1, and Houcine Ezzedine1

1 LAMIH – UMR 8530, University of Valenciennes and Hainaut-Cambrésis, Le Mont-Houy, F-59313 Valenciennes Cedex 9, France
firstname.name@univ-valenciennes.fr
2 University of Rostock, Computer Science Department, Albert-Einstein-Str. 21, D-18051 Rostock, Germany
Peter.Forbrig@informatik.uni-rostock.de
3 LIESP, Ecole Centrale de Lyon, 36 avenue Guy de Collongue, F-69134 Ecully Cedex, France
Bertrand.David@ec-lyon.fr
4 ENSMA / LISI, Teleport 2, 1 Avenue Clément Ader, B.P. 40109, F-86961 Futuroscope Chasseneuil Cedex, France
girard@ensma.fr
Abstract. This paper proposes a survey concerning agent-based architectures of interactive systems. The survey is focused on certain models and perspectives. General agent-based architectures are presented first. Then agent-based approaches dedicated to CSCW systems are reviewed. The appearance of web services requires new agent-based approaches; basic ideas are introduced. Agent-based interactive systems also necessitate new tools for their evaluation; an example of a representative evaluation tool is presented.

Keywords: Human-computer interaction, architecture model, agent-based systems, CSCW, design, evaluation.
Fig. 1. Global overview of available architecture models
2 From Seeheim Model to Agent-Based Architectures

Two main approaches to architecture models were first elaborated: global models and agent-based models. Global models define a precise structure based on a fixed number of components whose role and nature are precisely defined. The well-known Seeheim model is the first of them [1]. It recommends developing user interfaces as a separate module, connected to a functional core on which it must lean. The interface itself is organized in three parts: the Presentation (devoted to the management of inputs and outputs), the Controller (defined as a component that manages the sequence of interaction elements) and the Application Interface (which allows the translation between the interactive "world" and the functional core). The main interest of the Seeheim model is to give original definitions that establish good foundations for all work on architecture and tools in HCI. For example, the Arch model [2] proposes some modifications of the Seeheim model (including the functional core in the model, defining an additional component, defining the notion of a "slinky model"), but keeps the main definitions, notably for dialogue control.

Nevertheless, global models bring forward some drawbacks, mainly when trying to apply an object-oriented approach. While a current object-oriented interactive application may involve hundreds of cases, the global structure gives no help in defining elementary interaction classes. MVC (Model-View-Controller) [3], and then agent-based architecture models, such as PAC (Presentation-Abstraction-Control) [4], AMF (Multi-Agent Multi-Facets) [5] and AoMVC (Agent-oriented MVC) [6], were designed to solve this problem. They define elementary software bricks composed of several parts (in fixed number or not), and define the relations that must exist between bricks and parts. Some of them have been defined as design patterns. In so doing, global functions such as Dialogue Control or Presentation are split across the elementary agents, which helps to support iterative design. Some tools to help define applications with these models have been designed; see for instance [7]. However, like global models, agent-based architecture models suffer from problems. Choosing the right level of decomposition is hard for non-experienced developers. Moreover, strictly enforcing the rules of the model (for example, a PAC object only knows its father and its sons) may be difficult when implementation considerations are taken into account.
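To fix ideas, a schematic reading of an agent-based model such as PAC can be sketched as follows: each agent bundles Presentation, Abstraction and Control facets, and agents communicate only through their Control facets along the father/son hierarchy. The code is an illustrative assumption, not taken from any of the cited systems.

```python
class PACAgent:
    """Schematic PAC agent: Presentation, Abstraction and Control facets."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent       # a PAC agent only knows its father...
        self.children = []         # ...and its sons
        if parent is not None:
            parent.children.append(self)
        self.state = {}            # Abstraction facet (application data)

    # Presentation facet: handles raw input/output for this agent.
    def on_user_event(self, event):
        self.control(("from_presentation", event))

    # Control facet: mediates between facets and routes along the hierarchy.
    def control(self, message):
        kind, payload = message
        if kind == "from_presentation":
            self.state["last_event"] = payload  # update the Abstraction
            if self.parent is not None:
                self.parent.control(("from_child", (self.name, payload)))
        elif kind == "from_child":
            child, event = payload
            print(f"{self.name}: coordinating event {event!r} from {child}")

root = PACAgent("dialogue")
button = PACAgent("ok_button", parent=root)
button.on_user_event("click")  # dialogue: coordinating event 'click' from ok_button
```

The sketch also shows why decomposition is delicate in practice: every interactor becomes an agent, and all coordination must thread through the Control facets.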
Hybrid models, which are supposed to benefit the most from both approaches, then emerged. Mainly, these models lean on a global definition of the architecture based on the Arch model, and use an object-oriented approach to refine some of the main components, such as the Presentation or the Controller. For example, PAC-Amodeus [8] facilitates the design of multimodal applications. Another example is H4, a model initially defined for the Computer-Aided Design area; tools were created for various applications, to help the design of applications [9, 10], to help their validation [11], or both [12]. Other related research proposes architecture models concerning distributed and plastic UIs [13, 14].
3 Agent-Based Architectures: Approaches Dedicated to CSCW Systems

CSCW systems are not only interactive systems, but also and mainly multi-user distributed systems. For these reasons their architecture must answer new requirements. Three important characteristics are: (1) the taxonomy of collaborations, which can be related either to the crossing, in a matrix, of location (local or distant) and temporal view (synchronous or asynchronous), as suggested by [15], or to the nature of the cooperation (asynchronous cooperation, in-session cooperation, in-meeting cooperation and close cooperation [16]); (2) awareness, i.e. information about the activities done by other actors, needed in synchronous cooperation, which can be actor-oriented (their effective participation) or production-oriented, expressed by the WYSIWIS (What You See Is What I See) acronym with a strict or relaxed view of working data; (3) the nature of cooperation activities, which, as initially proposed by [17], relates to the support of three main kinds of activities, i.e. production, conversation/communication and coordination between participants.

From an architectural point of view, CSCW systems are clearly inspired by interactive system architectures, i.e. layered, agent and hybrid architectures are also used for CSCW systems. We can mention the Zipper [18] and Dewan [19] models for layered collaborative systems, based mainly on an adaptation of the Arch model to multi-user distributed situations. ALV and AMF-C [20] are representatives of agent-based systems; they generalize the PAC agent model to collaborative distributed situations. CoPAC, PAC* and Clover (all described in [21]) are typical examples of hybrid systems. In this last case, they reuse the Arch model and adapt it to multi-user and distributed situations. All these architecture models take into account synchronous collaboration, allowing real-time interaction between cooperating actors. Distant and local interactions are treated in the same way, as only mediated interactions are taken into account, i.e. direct local non-mediated interaction is not supported. Asynchronous collaboration is not addressed, mainly because in this case multi-user interaction, awareness and cooperative operations are not done by interaction. Awareness of shared artifacts (data) and participating actors is more or less supported, as are strict and relaxed WYSIWIS. Concerning cooperation activities (production, conversation and coordination), these are either fundamental elements (for PAC* and Clover) or naturally integrated (AMF-C). Hybrid architectures are either agent-based only in the Control part of the model (CoPAC and PAC*), or agent orientation can also be used in other parts of the model.

The recent evolution of cooperative systems is related to the mobility of the actors, evolving in an augmented real environment with pervasive behavior of the environment and related context-aware computing.
The concept of nomadism (networking, handheld devices, mobile communicating objects technology, localization, and permanent or non-permanent connectivity) extends CSCW and allows us to introduce the concept of "capillary" CSCW [16]. We use this term by analogy with the network of blood vessels. As its name implies, the purpose of capillary CSCW is "to extend the capacities provided by co-operative working tools in increasingly finer ramifications, from their use on fixed proprietary workstations to small handheld devices". Its main characteristics are: management of the collaboration and coordination of the mobile actors; coherence and validity of the information exchanged between handheld devices, which are connected only intermittently to the network and to the "group", with the aim of having information that is as synchronized as possible; heterogeneity of the communication protocols of the handheld devices; and the interface constraints and overall capacity of the handheld devices in terms of screen size, transmission speed, memory and autonomy, as well as the interaction devices.

In the recent evolution of the AMF-C model, its transformation from a fully agent-based system to a hybrid system and the integration of the IRVO perception of a new interaction paradigm (interaction with real and virtual objects) allow it to fully address the problems of capillary cooperative systems. In this new mobility context, adaptation to different interaction devices, environmental situations, software and hardware platforms and user preferences becomes the core problem. Adaptation techniques can be classified into four categories, ranging from the easiest to implement to the most powerful: translation techniques; markup language-based approaches; reverse- and re-engineering techniques; and model-based approaches. Designing and implementing interactive collaborative applications that are adaptable (manually) or adaptive (automatically) to the context of use requires consideration of the characteristics of the user and the interactive platform, as well as the constraints and capabilities of each environment. A state-of-the-art survey shows that, among the large majority of existing approaches to adaptation, the model-based approach seems to be the most powerful. Such an approach uses high-level, abstract representations that can be instantiated later on in the development lifecycle to meet specific usability requirements. However, these approaches need to combine apparently independent models such as concepts (e.g. UML), tasks (e.g. CTT), platforms (e.g. CC/PP) or user profiles. The relationships between these models need to be defined at the design step and refined at run-time in order to achieve overall usability. Our belief is that what we refer to as an interaction model is the right place to glue together all the models and usability attributes. This model must support both the design stage, linking the other models, and run-time. In addition, because Software Engineering and HCI have shown the importance of clearly separating the functional core from presentation components, our interaction model is supported by a well-structured architecture. In this new version of the AMF-C architectural model [22], we maintain the basic characteristics of the model, i.e.
the multi-faceted approach, which allows the creation of new facets to clarify the behavior and allow automation of the implementation process; a graphical formalism that expresses the control structure of multi-user interactions and the real-time adaptation of awareness characteristics; and a run-time model that allows dynamic control of interactions. We add the IRVO interaction formalism, allowing the expression of new augmented-reality interactions, and we structure the system with a hybrid approach that allows XML specifications, engine-based interpretation, and connection to real components of the functional core or the management of new interaction devices to be mixed (Fig. 2).
Fig. 2. Relations between Arch model (dashed lines), AMF-C and IRVO models
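The model-based adaptation approach described above can be pictured with a small sketch: an abstract task is bound to different concrete presentations depending on the platform profile, with the binding refined at run-time. The task and platform dictionaries below are toy stand-ins for CTT- and CC/PP-style models; none of this code comes from AMF-C or the cited systems.

```python
# Abstract task model entry (stand-in for a CTT task).
task = {"id": "select_destination", "type": "choice", "options": 150}

# Platform profiles (stand-in for CC/PP descriptions).
platforms = {
    "desktop": {"screen_width": 1280, "pointer": True},
    "pda":     {"screen_width": 240,  "pointer": True},
    "phone":   {"screen_width": 128,  "pointer": False},
}

def bind_presentation(task, platform):
    """Refine the abstract task into a concrete interactor for one platform."""
    if platform["screen_width"] >= 640:
        return "searchable list with map preview"
    if platform["pointer"]:
        return "scrollable list"
    return "voice/keypad menu"  # the most constrained platform

for name, profile in platforms.items():
    print(name, "->", bind_presentation(task, profile))
```

The interaction model plays the "glue" role here: it is the one place where the task, platform and user models meet, both at design time and at run-time.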
4 Web Services and Agent-Based Architectures Web services lead to new possibilities and problems concerning distributed system design. Fig. 3 suggests a complex industrial organization exploiting web services.
Fig. 3. Example of different actors communicating directly or not via web services [23]
The traditional web services provide functionalities based on a classical client/server architecture, but agent-based architectures offer new perspectives in this field. They utilize the autonomous and proactive behaviors of agents. Interesting new approaches appear in the literature. For instance, a technical framework for AWS (agent-based web services) is described in [24]; it supports the idea of capturing, modeling and implementing service functionalities with autonomous and dynamic interactions. Technically, agent-oriented software construction, knowledge representation and interaction mechanisms are integrated. Fig. 4 gives an impression of the framework. DAML-S (DARPA agent markup language for services) is a semantic markup language for describing web services and related ontologies; it has been superseded by OWL-S [25]. A discussion of dynamic web-service invocation by agents can be found in [26]. Their infrastructure is a hybrid peer-to-peer model. Agents are used to specify service providers and service customers. For this purpose JADE [27] (Java Agent Development Environment) is used; it is a framework developed as an open-source project.
A web service can be published as a JADE agent service, and agent services can be published as web service endpoints (see also [24]). Such propositions have to be considered carefully when examining agent-based architecture perspectives for service-oriented interactive systems.
Fig. 4. Integrated technical framework for agent-based web services [24]
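The idea of pairing agents and web services can be illustrated with a deliberately simple sketch. It does not use the real JADE API; the agent class, the endpoint URL and the message handling below are hypothetical stand-ins for the gateway facilities described above.

```python
import urllib.request

class ServiceAgent:
    """Toy autonomous agent that wraps a web service endpoint."""

    def __init__(self, name, endpoint):
        self.name = name
        self.endpoint = endpoint  # hypothetical URL of the wrapped service
        self.inbox = []

    def receive(self, message):
        self.inbox.append(message)

    def step(self):
        # Autonomous behaviour: the agent decides when and how to invoke
        # the service on behalf of the requests it has accumulated.
        while self.inbox:
            request = self.inbox.pop(0)
            try:
                url = self.endpoint + "?q=" + request
                with urllib.request.urlopen(url, timeout=5) as response:
                    print(self.name, "->", response.status)
            except OSError as exc:
                # Proactive recovery could go here (retry, find another provider).
                print(self.name, "invocation failed:", exc)

agent = ServiceAgent("customer-agent", "http://example.org/quote")  # placeholder endpoint
agent.receive("item-42")
agent.step()
```

The design point is the inversion of control: instead of a client calling a fixed endpoint, an autonomous agent mediates the invocation and can re-plan when the service is unavailable.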
5 Agent-Based Architectures: The Evaluation Problem

The evaluation of interactive systems aims at ensuring that users are capable of realizing their tasks. The evaluation methods and tools are numerous and of different types; they are generally based on two global criteria: utility and/or usability [28]. When the interactive system uses an agent-based architecture, new methodological and conceptual questions appear. For instance: how can such systems be evaluated? Is it necessary to combine several evaluation methods? Is it possible to be assisted by automated or semi-automated evaluation tools? How can such tools be connected to the agent-based systems? How can the agents' behaviors be linked with the analyzed situations? There are several further questions. We are particularly interested in automated or semi-automated tools.

An electronic informer (EI) is a software tool that automatically captures the interactions between the user and the UI in real situations, in a discreet and transparent way, so that the user does not feel hampered by the tool. The captured data are objective and can be scientifically analyzed by the evaluators. For a review of EIs, we refer to [29]. Several tools are available, but very few of them take into account the specificities of agent-based interactive systems in their evaluation approaches [30, 31, 32, 33]. The architecture of a tool dedicated to such systems is shown in Fig. 5. This kind of EI aims at capturing not only interactions between the user and interface agents in terms of occurring UI events, like other EIs, but also interactions between the agents themselves in terms of interactions between services. It also aims to go further than other EIs in assisting evaluators in interpreting the analysis results of the captured data, in order to evaluate three aspects of an agent-based interactive system: the user interface (UI), some non-functional properties (such as response time, reliability, complexity, etc.), and
Fig. 5. Example of a tool for evaluating agent-based interactive systems [33]
the properties of the users operating the system (ability, habits, preferences, the progress of a certain user, etc.).

Seven independent modules compose this tool. Module 1 is responsible for capturing the events that occur in all agents of the system; it then saves them into a database that will be analyzed by the other modules. The connection between this EI and the evaluated agent-based system is based on the association of each type of agent (interface agents, controller agents, application agents) with a corresponding informer. The evaluation can be realized remotely: module 1 and the evaluated system can run on the same machine, or on two different machines on the network. After capturing data, this EI enables the evaluator to determine the tasks that the user has realized (module 2). Some synthetic calculations and statistics can be produced from the captured data, such as the number and frequency of occurring events, the average response time of service interactions, the time taken to realize a task, the number of successful or failed tasks, etc., for any chosen agent or for all agents in any chosen period of time. These analysis results are shown to the evaluator using tables or graphs (module 3). The tool also enables the generation of Petri nets (PNs), and the evaluator can
compare PNs (modules 4 and 5). A generated PN describes the user's actions in terms of UI events (that have occurred on interface agents) and the system's actions in terms of the executed services of agents in order to realize a certain task. Generated PNs are called observed PNs or real PNs. The evaluator can compare the real PNs of a certain user realizing a certain task with the theoretical PNs predicted by the designers for the same task, or compare the real PNs of different users realizing the same task. Exploiting the formal aspect of the PNs, such comparisons are very useful for evaluators to detect problems of the interface, the system or the users, such as: bad or useless user actions, non-optimal ways chosen by users to realize tasks, failed service interactions, and properties of users (habits, evaluation and comparison of the abilities of different users, supervision of the progress of the abilities of a certain user, etc.). The analysis results of module 3 and the generation and comparison of PNs can all be interpreted with the indications of module 6 (which enables the association with an open list of determined criteria) to help evaluators critique the system and propose useful suggestions to the designers for improvements. This tool is representative of a new generation of tools dedicated to agent-based interactive systems. A lot of research is still necessary in this domain (adaptation to different application fields and architecture models, help in real time, etc.).
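As an illustration of the capture side of such an electronic informer (modules 1 and 3), the sketch below associates one informer with each agent and computes simple event statistics over a shared log. All names are illustrative assumptions; the tool described in [33] is considerably richer, notably in its Petri net generation and comparison.

```python
import time
from collections import defaultdict

LOG = []  # shared database of captured events (module 1)

class Informer:
    """One informer is associated with each agent of the evaluated system."""

    def __init__(self, agent_name, agent_type):
        self.agent_name, self.agent_type = agent_name, agent_type

    def capture(self, event, detail=None):
        # Transparent capture: the observed agent is neither slowed nor altered.
        LOG.append({"t": time.time(), "agent": self.agent_name,
                    "type": self.agent_type, "event": event, "detail": detail})

def event_frequencies(log):
    """Module 3: synthetic statistics over the captured data."""
    freq = defaultdict(int)
    for record in log:
        freq[(record["agent"], record["event"])] += 1
    return dict(freq)

ui = Informer("map_view", "interface")
ctl = Informer("router", "controller")
ui.capture("button_click", "zoom_in")
ctl.capture("service_call", "recompute_route")
ui.capture("button_click", "zoom_in")
print(event_frequencies(LOG))
# {('map_view', 'button_click'): 2, ('router', 'service_call'): 1}
```

The per-agent-type association mirrors the tool's design: interface, controller and application agents each get their own informer, so UI events and inter-agent service calls end up in the same analyzable log.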
6 Conclusion and Perspectives

Since the eighties, many models and approaches have been proposed in the literature concerning so-called distributed or agent-based architectures of interactive systems. For lack of space, it was only possible to propose a brief overview of this domain, covering (1) general agent-based architecture models, (2) models dedicated to CSCW systems, (3) interactive systems based on web services, and (4) the evaluation of interactive systems using agent-based architectures. Many research and development perspectives can now be envisaged.

Currently, general agent-based architecture models are mainly used at the conceptual level. They allow good design of applications, minimizing dependencies and improving the maintainability of applications. They now need to be used more widely at the implementation level. Their inclusion in integrated development environments, such as Eclipse, might be the next step to allow tools to be developed. Help for software design, simulation, and evaluation are the main topics to be addressed.

Capillary cooperative systems need substantial context adaptation. These mechanisms are more easily elaborated in hybrid architectures using agents in several layers. The autonomous behavior and independence of agent-based systems constitute an important advantage. Much current research concerns context-aware interactive systems; different types or generations of adapted agent-based architecture models have to be progressively proposed. Agent-based systems might help to dynamically compose web services; in this way they can support the dynamic adaptation of workflow systems. Many research problems also have to be studied and solved regarding the evaluation of agent-based interactive systems.
Acknowledgements. The present research work has been supported by CISIT, the Nord-Pas-de-Calais Region, and the European Community (FEDER). The authors gratefully acknowledge the support of these institutions.
References

1. Pfaff, G.E.: User interface management system. Springer, Heidelberg (1985)
2. Bass, L., Little, R., Pellegrino, R., Reed, S.: The Arch Model: Seeheim revisited. In: Proceedings of the User Interface Developers Workshop, Seeheim (1991)
3. Goldberg, A.: Smalltalk-80, the interactive programming environment. Addison-Wesley, Reading (1983)
4. Coutaz, J.: PAC, an Object-Oriented Model for Dialog Design. In: Bullinger, H.-J., Shackel, B. (eds.) Proc. Interact 1987, 2nd IFIP International Conference on Human-Computer Interaction, Stuttgart, Germany, September 1-4, 1987, pp. 431–436 (1987)
5. Ouadou, K.: AMF: Un modèle d'architecture multi-agents multi-facettes pour Interfaces Homme-Machine et les outils associés (in French). PhD Thesis, ECL, Lyon (1994)
6. Goschnick, S., Sterling, L.: Shadowboard: an Agent-oriented Model-View-Controller (AoMVC) architecture for a digital self. In: Proc. Int. Workshop on Agent Technologies over Internet Applications (ATIA 2001), Tamkang University, Taipei, Taiwan (2001)
7. Jambon, F.: From Formal Specifications to Secure Implementations. In: Kolski, C., Vanderdonckt, J. (eds.) Computer-Aided Design of User Interfaces (CADUI 2002), pp. 43–54. Kluwer Academics, Dordrecht (2002)
8. Nigay, L.: Conception et modélisation logicielles des systèmes interactifs: application aux interfaces multimodales (in French). PhD Thesis, Joseph Fourier Univ., Grenoble (1994)
9. Texier, G., Guittet, L., Girard, P.: The Dialog Toolset: a new way to create the dialog component. In: Stephanidis, C. (ed.) Universal Access in HCI, pp. 200–204. Lawrence Erlbaum Associates, Mahwah (2001)
10. Depaulis, F., Maiano, S., Texier, G.: DTS-Edit: an Interactive Development Environment for Structured Dialog Applications. In: Kolski, C., Vanderdonckt, J. (eds.) Computer-Aided Design of User Interfaces (CADUI 2002), pp. 75–82. Kluwer Academics, Dordrecht (2002)
11. Francis, J., Girard, P., Boisdron, Y.: Dialogue Validation from Task Analysis. In: Duke, D.J., Puerta, A. (eds.) Eurographics Workshop on Design, Specification, and Verification of Interactive Systems (DSV-IS 1999), Braga, Portugal, pp. 205–224. Springer, Heidelberg (1999)
12. Baron, M., Girard, P.: SUIDT: Safe User Interface Design Tool. In: International Conference on Intelligent User Interfaces / Computer-Aided Design of User Interfaces (IUI-CADUI 2004), Madeira, Portugal, pp. 350–351. ACM Press, New York (2004)
13. Balme, L., Demeure, A., Barralon, N., Coutaz, J., Calvary, G.: CAMELEON-RT: A Software Architecture Reference Model for Distributed, Migratable, and Plastic User Interfaces. In: Markopoulos, P., Eggen, B., Aarts, E., Crowley, J.L. (eds.) EUSAI 2004. LNCS, vol. 3295, pp. 291–302. Springer, Heidelberg (2004)
14. Calvary, G., Daassi, O., Coutaz, J., Demeure, A.: Des widgets aux comets pour la plasticité des systèmes interactifs. Revue d'Interaction Homme-Machine 6(1), 33–53 (2005)
15. Ellis, C.A., Gibbs, S.J., Rein, G.L.: Groupware: some issues and experiences. Communications of the ACM 34(1), 39–58 (1991)
16. David, B., Chalon, R., Vaisman, G., Delotte, O.: Capillary CSCW. In: Stephanidis, C., Jacko, J. (eds.) Human-Computer Interaction Theory and Practice, LEA, pp. 879–883 (2003)
17. Ellis, C.A., Wainer, J.: A Conceptual Model of Groupware. In: Proceedings of CSCW 1994, pp. 79–88. ACM Press, New York (1994)
18. Patterson, J.F.: A taxonomy of architectures for synchronous groupware applications. In: Workshop on Software Architectures for Cooperative Systems, CSCW 1994. ACM SIGOIS Bulletin, Special Issue: Papers of the CSCW 1994 Workshops, vol. 15(3) (April 1995)
19. Dewan, P., Choudhary, R.: Coupling the User Interfaces of a Multiuser Program. ACM Transactions on Computer-Human Interaction 2(1), 1–39 (1995)
20. Tarpin-Bernard, F.: Architectures logicielles pour le travail coopératif (in French). PhD Thesis, Ecole Centrale de Lyon, France (1997)
21. Laurillau, Y.: Conception et réalisation logicielles pour les collecticiels centrées sur l'activité de groupe: le modèle et la plate-forme Clover (in French). PhD Thesis, Joseph Fourier University, Grenoble (2002)
22. Tarpin-Bernard, F., Samaan, K., David, B.: Achieving usability of adaptable software: the AMF-based approach. In: Seffah, A., Vanderdonckt, J., Desmarais, M.C. (eds.) Human-Centered Software Engineering: Software Engineering Models, Patterns and Architectures for Human-Computer Interaction. Springer, Heidelberg (2009)
23. Idoughi, D.: Contribution à un cadre de spécification et conception d'IHM de supervision à base de services web dans les systèmes industriels complexes, application à une raffinerie de sucre (in French). PhD Thesis, University of Valenciennes, France (2008)
24. Li, Y., Shen, W.-m., Ghenniwa, H., Lu, X.: Model-Driven Agent-Based Web Services IDE. In: Wang, S., Tanaka, K., Zhou, S., Ling, T.-W., Guan, J., Yang, D.-q., Grandi, F., Mangina, E.E., Song, I.-Y., Mayr, H.C. (eds.) ER Workshops 2004. LNCS, vol. 3289, pp. 518–528. Springer, Heidelberg (2004)
25. Paolucci, M., Sycara, K.: Autonomous Semantic Web Services. IEEE Internet Computing 7, 34–41 (2003)
26. Yang, H., Chen, J., Meng, X., Zhang, Y.: A Dynamic Agent-based Web Service Invocation Infrastructure. In: Proceedings of the First Int. Conf. on Advances in Computer-Human Interaction, Sainte Luce, Martinique, pp. 206–211 (2008)
27. Bellifemine, F., Caire, G., Poggi, A., Rimassa, G.: JADE: A software framework for developing multi-agent applications. Lessons learned. Information & Software Technology 50(1-2), 10–21 (2008)
28. Nielsen, J.: Usability Engineering. Academic Press, Boston (1993)
29. Hilbert, D.M., Redmiles, D.F.: Extracting usability information from user interface events. ACM Computing Surveys 32(4), 384–421 (2000)
30. Trabelsi, A., Ezzedine, H., Kolski, C.: Architecture modelling and evaluation of agent-based interactive systems. In: Proc. IEEE SMC 2004, The Hague, pp. 5159–5164 (2004)
31. Tarby, J.-C., Ezzedine, H., Rouillard, J., Tran, C.D., Laporte, P., Kolski, C.: Traces using aspect oriented programming and interactive agent-based architecture for early usability evaluation: Basic principles and comparison. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4550, pp. 632–641. Springer, Heidelberg (2007)
32. Ezzedine, H., Bonte, T., Kolski, C., Tahon, C.: Integration of traffic management and traveller information systems: basic principles and case study in intermodal transport system management. Int. J. of Computers, Communications & Control (IJCCC) 3, 281–294 (2008)
33. Tran, C.-D., Ezzedine, H., Kolski, C.: A generic and configurable electronic informer to assist the evaluation of agent-based interactive systems. In: 7th International Conference on Computer-Aided Design of User Interfaces, CADUI 2008, Albacete (June 2008)
BunBunMovie: Scenario Visualizing System Based on 3-D Character

Tomoya Matsuo and Takashi Yoshino

Wakayama University, 930 Sakaedani, Wakayama, Japan
yoshino@sys.wakayama-u.ac.jp
Abstract. There are many text-based contents, such as novels and scripts. Such contents have only a scenario and lack visual information. The purpose of this research is to provide a visualizing environment that can visualize text-based contents easily. Moreover, such an environment can also provide the opportunity to get pleasure out of a scenario. To visualize a scenario, it is necessary to create various motions of characters and to depict various situations. Therefore we propose a motion assortment function to create various motions of characters. The function uses a Japanese dictionary and a thesaurus search. We also propose an associated image display function that uses an image search to depict various situations. Experiments on the motion assortment function show that the proposed method can assort some motions. From experiments based on subjective assessment, we found that some subjects were inclined to use such an easy visualizing environment.

Keywords: Scenario visualizing, 3-D character, motion synthesis.
2 Related Works

Previous studies have proposed several systems that can recreate scenarios only on the basis of input information. Zeng developed a system named "3D Story Visualiser" [1, 2]. This system utilizes visual information in input sentences to create 3D scenes. The input is in the form of natural language and is processed with a tool known as NLS. The tool extracts nouns and prepositions from the input sentences, converts them into VRML form, and then outputs them as 3D images. In our system, the only task performed by the user is the input of sentences into the system. However, the 3D Story Visualiser can only reproduce information as a 3D image; the movements of the characters are not reproduced. The aim of our study is to create a system that produces animated characters.

Aoki developed a system for creating animated objects entitled the "Digital Movie Director" [3]. His system is based on TVML technology [4, 5], which was developed by the NHK Science & Technical Research Laboratories. In this system, the subject, predicate, and object used in the scene are set by the user. Moreover, the user can set camera angles and sound effects. In our system, the only task of the user is to input sentences. However, in the Digital Movie Director, the user first has to develop both the characters and the scene. Hence, a large amount of time is required to develop 3D images using this system. The Digital Movie Director cannot create a 3D movie unless both the scene and the movements of the characters are prepared in advance.
3 System for Visualizing 3D Character Scenarios

3.1 Objective

The objectives of our system are stated as follows:

1. The only task to be performed by the user is the input of sentences describing the scenario. We aim to develop a visualization system wherein no special operations are required of the user. A system that only requires sentences to be input can be easily used by all users. Our system first analyzes the input sentences and then determines the subject and predicate.

2. The system responds to verbs other than the registered verbs. It is difficult to simulate movements corresponding to all verbs. Hence, it is necessary for the system to respond to verbs other than those that are registered. Therefore, we have developed a "dictionary retrieval function" and a "movement synthesis function." These functions allow us to simulate movements that correspond to verbs that are not registered in the system.

3.2 BunBunMovie

The system that we have developed in this study is known as "BunBunMovie." It analyzes input sentences and uses the information in them to develop a 3D movie. Figure 1 shows the execution screen of our system. The upper part of Figure 1 shows
a screen that displays the input sentences. "3D character," "Related image," and "Sentences under analysis" are displayed on this screen. The lower part of Figure 1 shows the screen used to input sentences. The user inputs sentences and pushes the reproduction button, following which the sentences are visualized on the screen. The BunBunMovie system is programmed in C#. We use TVML to create 3D animated objects. MeCab [4] is used for morphological analysis.
(Figure 1 contents: the moving 3D character with its name and action, e.g. "Character: I, Action: Dance," related images of the character and the place, the sentences under analysis, the reproduction button, the input sentence field, and the input history.)
Fig. 1. Execution screen of the system
Flow of the sentence analysis process. We now explain the process for the analysis of the input. Figure 2 shows the flow of the procedure for sentence analysis; a simplified code sketch of this flow is given after the figure. The steps are as follows.
1. First, the system analyzes the input sentences using MeCab (Figure 2 (1)). The nouns that denote the subject and the place are determined from the relation between each noun and its case-marking particle.
2. The system examines whether the noun assumed to be a subject exists in a "subject list" (Figure 2 (2)). If the noun exists in the list, its related image is retrieved (Figure 2 (3)).
3. To analyze the verb, the system examines a "verb list" (Figure 2 (4)). If the verb under analysis does not exist in the list, the system refers to a dictionary (Figure 2 (5), (6), (7)).
4. The system then examines whether the noun assumed to denote a place exists in a "place list" (Figure 2 (8)). If the noun exists in the list, its corresponding image is retrieved (Figure 2 (9)).
5. The subject, the predicate, and the noun indicating the place are converted into TVML format (Figure 2 (10)).
3.3 Functions of the BunBunMovie System
The BunBunMovie system uses three functions, "word list," "dictionary retrieval," and "image retrieval," to recreate the desired scenarios from the information input by the users.
(Figure 2 contents: the input sentences undergo morphological analysis with MeCab (1), which assumes the subject noun and the place noun from the relation between each noun and its case-marking particle; the subject is checked against the subject list (2) and a related image is retrieved via Google image search with the face option (3); the verb is checked against the verb list (4); for unregistered verbs, the RUIGO TAMATEBAKO synonym dictionary (5) and the Yahoo national language dictionary (6) are consulted, their retrieval results are compared (7), and when a national language dictionary result exists among the synonym results, the verb list is examined again; the place noun is checked against the place-noun list (8) and a place image is retrieved via Google image search (9); the result is converted into TVML data (10) and played in the TVML Player.)
Fig. 2. Flow of the sentence analysis process
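The flow above can be summarized in code. The following is a minimal sketch in Python (the actual system is written in C#); the list contents and all helper names are hypothetical stand-ins for MeCab, the word lists, Google image search, and the dictionary retrieval function of Section 3.3, not the authors' implementation.

```python
# Illustrative sketch of the Fig. 2 pipeline; every list entry and
# helper below is a hypothetical stand-in, not original system code.

SUBJECT_LIST = {"I", "he", "uncle"}          # ~1,700 "living thing" nouns
VERB_LIST = {"run", "walk", "say", "jump"}   # 103 verbs convertible to TVML
PLACE_LIST = {"meadow", "mountain"}          # ~6,500 place nouns

def morphological_analysis(sentence):
    # Stand-in for MeCab: naively assume "subject verb place" word order.
    words = sentence.split()
    return words[0], words[1], words[-1]

def image_search(query, faces_only=False):
    # Stand-in for Google image search (with the optional face filter).
    return "image-for:" + query + (":faces" if faces_only else "")

def paraphrase_via_dictionary(verb):
    # Stand-in for the dictionary retrieval function (Section 3.3).
    return "say"

def analyze(sentence):
    subject, verb, place = morphological_analysis(sentence)        # (1)
    scene = {}
    if subject in SUBJECT_LIST:                                    # (2)
        scene["subject_image"] = image_search(subject, True)       # (3)
    if verb not in VERB_LIST:                                      # (4)
        verb = paraphrase_via_dictionary(verb)                     # (5)-(7)
    if place in PLACE_LIST:                                        # (8)
        scene["place_image"] = image_search(place)                 # (9)
    # (10) The analyzed triple would then be converted into TVML.
    return {"subject": subject, "verb": verb, "place": place, **scene}

print(analyze("I whisper meadow"))
```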
Use of "word list". The word list consists of words that already have corresponding images or motions in the system. It comprises three lists: a "subject list," a "verb list," and a "place list." The system checks these lists for words extracted from the input sentences and uses the corresponding resources to recreate the desired scenarios. The detailed description of each list is as follows:
1. Subject List. A noun that is recognized by the system as a subject is registered in the subject list. For example, nouns such as "I" and "he" are present in the subject list. All words registered in the subject list are nouns that fall in the "living thing" category of the Japanese dictionary. The number of registered words in our list is approximately 1,700.
2. Verb List. Verbs such as "run" and "walk" are registered in the verb list. The verbs registered in the verb list can be converted into TVML. There are 103 verbs in our list.
3. Place List. This list contains nouns that indicate a place or location, for instance, "meadow" and "mountain." The total number of words in the place list is approximately 6,500. The registered words are nouns that exist in the "place" category of the Japanese vocabulary dictionary. The place list is required to accurately recreate the location of a scene from its description.
Dictionary retrieval function. In order to recreate a scene properly, we need to recreate the movements of the characters in the scene. However, it is not possible to simulate all types of movements, since the number of verbs registered in the list is limited. This results in a 3D character that remains inanimate even though sentences describing its behavior have been input.
The dictionary retrieval function is used as follows.
1. First, the system obtains synonyms of the unregistered verb from the thesaurus (Figure 2 (5)).
2. Then, the system retrieves the definition of the unregistered verb from the Yahoo online national language dictionary (Figure 2 (6)).
3. The system analyzes the definition of the unregistered verb in the national language dictionary using MeCab.
4. It compares the information obtained from the national language dictionary with the synonyms and examines whether the description obtained from the dictionary matches any of the synonyms (Figure 2 (7)).
5. If a match is found, the system assumes the synonym to be a paraphrase of the unregistered verb.
6. The system examines whether this verb exists in the verb list (Figure 2 (4)).
7. On the basis of the information in the verb list, the appropriate movements are then assigned to the 3D characters.
For instance, if the dictionary retrieval function is used for an unregistered verb such as "whisper," the system paraphrases the verb "whisper" to the verb "say." The system can re-examine the verb list on the basis of the new information obtained and can recreate the desired scene accurately (a code sketch of this procedure is given at the end of this subsection).
The image retrieval function. When sentences are visualized, the movement of the 3D character alone does not provide sufficient visual information; we consider the appearance of the characters and information about the scene to be necessary as well. The system uses the image retrieval function to add this visual information. We use Google image retrieval to obtain visual information. Using the image retrieval function, users can obtain and display images related to the subject and the place described in the input. As an example, consider the following sentence input by a user: "My uncle was in a meadow." The system assumes "uncle" to be the subject and "meadow" to be the location. The images corresponding to this subject and place are then retrieved by the system and displayed. Using this function, further information, such as the role of a character and other information pertaining to the scene, can also be added. Our system also uses Google Image Search's face filter to retrieve only facial images related to a subject. This option displays images containing a particular face by priority. The probability of obtaining inaccurate images of a particular subject using this search option is very low.
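The seven steps of the dictionary retrieval function can be sketched as follows. This is an illustrative reading, not the original code: the synonym lookup, the definition lookup, and the tokenizer are passed in as stand-ins for the thesaurus, the Yahoo online dictionary, and MeCab.

```python
def paraphrase_via_dictionary(verb, verb_list, get_synonyms,
                              get_definition, tokenize):
    """Map an unregistered verb onto a registered one (steps 1-7).

    get_synonyms, get_definition and tokenize stand in for the
    thesaurus, the online national language dictionary, and MeCab."""
    synonyms = set(get_synonyms(verb))                        # step 1
    definition_words = set(tokenize(get_definition(verb)))    # steps 2-3
    # Steps 4-5: a synonym that also appears in the dictionary
    # definition is taken as a paraphrase of the unregistered verb.
    for candidate in synonyms & definition_words:
        # Steps 6-7: re-examine the verb list with the paraphrase.
        if candidate in verb_list:
            return candidate
    return None  # no registered paraphrase found

# Example: "whisper" -> "say", assuming the stubs return plausible data.
result = paraphrase_via_dictionary(
    "whisper",
    {"say", "run", "walk"},
    get_synonyms=lambda v: ["say", "murmur"],
    get_definition=lambda v: "to say something very quietly",
    tokenize=str.split,
)
assert result == "say"
```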
3.4 Development of the Movement Synthesis Function
Even though the dictionary retrieval function allows us to handle several unregistered verbs, it is not possible to process all unregistered verbs with it. Therefore, we have developed the movement synthesis function. This function combines the movements of verbs already registered in the verb list to develop movements for unregistered verbs. In other words, the movement synthesis function first selects registered verbs and then combines their movements.
A verb is selected on the basis of synonyms of the unregistered verb. The flow of the movement synthesis function, illustrated in Figure 3, is as follows (a code sketch of the selection step is given after the figure).
1. We first retrieve synonyms for the verbs that are already registered in the verb list. As a result, synonym tags are added to these verbs.
2. The system then retrieves synonyms of the unregistered verb from a thesaurus (Figure 3 (1)).
3. The system compares the synonym tags of the unregistered verb with those of the registered verbs and counts the number of agreements between them (Figure 3 (2)).
4. It then combines the movements of two or more registered verbs to develop the movement corresponding to the unregistered verb (Figure 3 (3)).
(Figure 3 contents: (1) the unregistered verb UNKNOWN with synonym tags [jump, enjoy, ...]; (2) the registered verbs A [smile, enjoy, ...], B [sad, cry, ...], and C [jump, snap, ...], whose synonym tags are compared with those of UNKNOWN; (3) the verbs whose synonym tags agree with those of the unregistered verb, here A and C, are chosen.)
Fig. 3. Movement synthesis function. UNKNOWN, A, B, and C denote verbs; the words in brackets denote their synonym tags.
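A compact reading of the selection step, using the tag sets from the example in Figure 3. The overlap count as ranking criterion and the choice of exactly two verbs follow the description above; the function and variable names are ours.

```python
def select_verbs(unknown_tags, registered):
    """Choose the two registered verbs whose synonym tags agree most
    with those of the unregistered verb (Fig. 3, steps 2-3); their
    motions are then combined (step 4). 'registered' maps each
    registered verb to its synonym tag set."""
    ranked = sorted(
        registered,
        key=lambda verb: len(registered[verb] & unknown_tags),
        reverse=True,
    )
    return ranked[:2]  # the motions of these two verbs are combined

registered = {
    "A": {"smile", "enjoy"},
    "B": {"sad", "cry"},
    "C": {"jump", "snap"},
}
# UNKNOWN has synonym tags {jump, enjoy}; A and C each share one tag.
print(select_verbs({"jump", "enjoy"}, registered))  # ['A', 'C']
```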
4 Experiments for Evaluation of the Movement Synthesis Function
We carried out an experiment to evaluate the performance of the movement synthesis function. The purpose of the experiment was to verify whether a user can correctly recognize the movements simulated by the system. In the experiment, ten students of Wakayama University observed the characters and movements developed by the system. They then provided feedback by filling out a questionnaire.
4.1 Experimental Procedure
The movement synthesis function generated five character movements: "set up the traps," "hang up," "see off," "amend," and "burst." The test users evaluated these movements. The number of verbs available for combination was approximately 70, and the movement synthesis function chose two of these verbs to develop each movement. The test users answered a questionnaire with a five-point evaluation scale and another description-based questionnaire.
4.2 Results of the Experiment
Table 1 shows the results of the five-point evaluation used in the questionnaire.
A movement with a high average rating was one that was correctly recognized by the users; however, there were also movements with low average ratings. The average ratings of the movements for "burst" and "see off" were high. Table 2 shows the feedback provided by the test users in the description-based questionnaire. The feedback was both positive and negative. One instance of positive feedback stated that the movement of the characters was interesting. On the other hand, we also received negative feedback stating that the movements of the characters were inaccurate.

Table 1. Results of the questionnaire of the experiment evaluating the movement synthesis function

Movement developed by the system | Movements used for synthesis | Average
set up the traps | go out + poke | 3.0
hang up | dig + apologize | 1.8
see off | go to + say no | 3.8
amend | be mad + be worried | 2.4
burst | be mad + jump | 3.4

The values in the "Average" column denote the mean values of the ratings given in response to the question "Is the movement of the character appropriate?" A five-point Likert scale was used: 1: Strongly disagree, 2: Disagree, 3: Neutral, 4: Agree, 5: Strongly agree.
Table 2. Impressions of the movement synthesis function

Positive feedback
・The fact that the system itself develops the movements of the character is interesting.
・Even though a movement of the character may not be recognizable, it is important that the characters react to more words.
・Both appropriate and inappropriate movements were present. The appropriate movements were useful.
Negative feedback
・The character expresses two separate movements for a single verb.
・Several movements of the character were not recognizable.
・The synthesized movements of a monotonous character are difficult to understand.

4.3 Discussion
The results of the five-point evaluation in the questionnaire show that the average ratings for the verbs "burst" and "see off" were high. This is an example of the effectiveness of the movement synthesis function. It should be noted that when one of the two verbs used in the synthesis of a movement is related to the target verb, the average rating is high. For instance, "jump" was used for the synthesis of "burst," and "go to" was used for the synthesis of "see off." This shows that if verbs with similar meanings are used for synthesis, the required movements can be developed accurately. However, the overall accuracy of the movement synthesis function, which is based on synonym information, is not high. One of the negative instances of feedback stated that the character expressed two separate movements for a single verb. In other words, the test user was unable to recognize that the movement of the character corresponded to a single verb. To solve this problem, we plan to improve the system so that it displays more natural movements. To this end, we intend to utilize the composite motions developed by Oshita [5].
5 Trial Evaluation of the System by Users
We performed trial experiments to allow users to evaluate the system. In the experiment, the test users input sentences describing a scene and then viewed the movie developed by the system. The purpose of the experiment was to allow users to evaluate the accuracy of the sentence analysis and the quality of the movie. Ten students from Wakayama University tested our system.
5.1 Experimental Process
The test users first input sentences and then observed the movies developed by the system on the basis of the information provided. The duration of the experiment was 10 minutes. After the experiment, the test users answered a questionnaire based on a five-point evaluation scale and a description-based questionnaire.
5.2 Results of the Experiment
When sentences in which the test user clearly described the subject and predicate, such as "I went to college" and "I climb the mountain," were input, the system accurately developed the movie. However, when only predicates, such as "Went dancing" and "It kicked and knocked it down," were input, the system was not able to develop a movie. Table 3 shows the questionnaire results for the system. Overall, the system was highly rated by the test users. Table 4 shows the test users' requests and impressions of the system. Some of the users requested the ability to develop their own images and characters to be recreated by the system. Other requests involved an improvement in the response of the system to different words and an improvement in the analytical accuracy of the system. Most of the test users felt that the system was interesting, although they stated that it was necessary to increase the number of words that can be translated by the system into images. Overall, most of the feedback that we received was positive.

Table 3. Questionnaire results for the system

Question | Average
1. The movements of the characters are interesting. | 4.4
2. A related image improves interest. | 3.9
3. I want to visually recreate my diary using this system. | 4.1

The values in the "Average" column indicate the mean values of the ratings given in response to the questions. A five-point Likert scale was used: 1: Strongly disagree, 2: Disagree, 3: Neutral, 4: Agree, 5: Strongly agree.

Table 4. Requests and impressions for the system

Requests for the system
・I would like to use sentences that allow the system to recreate images that I have already prepared.
・I would like to develop a character and then use it in the system.
・The system should be able to respond to more words.
・When the system cannot analyze sentences, it should display "We cannot analyze these sentences."
・The speed of the reproduction of images by the system should be improved.
Impressions of the system
・The system is interesting.
・I enjoyed using the system, although the movie was slightly unclear.
・Although the system is interesting, it is necessary to improve it.
・The system can only be used to visualize diaries when it responds to more words.
・If the input sentences are not grammatically correct, the characters do not respond well.
5.3 Discussion
From the results of the questionnaire with the five-point evaluation scale, we found that the rating for the item "I want to visually recreate my diary using this system" was high. However, there were several requests concerning the system: better analysis of the sentences, an increase in the number of words responded to by the system, and improved analytical accuracy. In the experiment, our system was unable to analyze many sentences. Therefore, the test users requested an improvement in the accuracy of the sentence analysis.
6 Conclusion
Conventional visualization systems are not able to recreate scenarios when the words extracted from the input information are not registered with the system. Moreover, these systems are not able to develop movements for each character. To resolve these issues, we proposed a system that uses a dictionary retrieval function and a movement synthesis function to recreate the required scenarios. The performance of our system was evaluated through trial tests by users. The results of our tests were as follows.
1. Developing appropriate movements for a character was a problem in conventional systems. We developed a movement synthesis function that utilizes synonyms to develop the movements corresponding to a particular verb.
To verify the accuracy of the movement synthesis function, we performed experiments that evaluated it. It was observed that many of the movements developed by the synthesis function were not appropriate. However, the use of the movement synthesis function allowed us to develop animated characters. Hence, the proposed technique has good potential.
2. We also performed test experiments that allowed users to evaluate our system. The test users stated that they were satisfied with the system, and we received feedback stating that the users found the system interesting. This demonstrates the potential of our system. However, our system only responds to sentences in which the subject and the predicate are input appropriately. Therefore, there were several requests concerning the accuracy of the analysis of natural sentences. We plan to further improve the analytical accuracy of the system and the movement synthesis function.
References
1. Aoki, T.: Digital Movie Director, http://www.rcast.utokyo.ac.jp/ja/research/pioneers/007/index.html
2. Douke, M., Hayashi, M., Makino, E.: A Study of Automatic Program Production Using TVML, Short Papers and Demos. In: Eurographics 1999, pp. 42–45 (1999)
3. Hayashi, M.: TVML (TV Program Making Language) Make Your Own TV Programs on a PC! In: International Conferences, Virtual Studios and Virtual Production (2000)
4. Kudo, T., Yamamoto, K., Matsumoto, Y.: Applying Conditional Random Fields to Japanese Morphological Analysis. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 230–237 (2004)
5. Oshita, M.: Smart Motion Synthesis. In: SIGGRAPH 2007 Posters (2007)
6. Zeng, X., Mehdi, Q.H., Gough, N.E.: Shape of the Story: Story Visualization Techniques. In: Seventh International Conference on Information Visualization, pp. 144–150 (2003)
7. Zeng, X., Mehdi, Q.H., Gough, N.E.: From Visual Semantic Parameterization to Graphic Visualization. In: Ninth International Conference on Information Visualization, pp. 488–493 (2005)
Augmented Collaborative Card-Based Creative Activity with Digital Pens Motoki Miura, Taro Sugihara, and Susumu Kunifuji School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan miuramo@jaist.ac.jp, sugihara@jaist.ac.jp, kuni@jaist.ac.jp
Abstract. Typically, practitioners of the KJ method use paper labels and four-colored ball-point pens to externalize their thoughts and ideas during the process. A similar approach and method is used in group KJ lessons. However, due to the large paper size required, this approach limits the effective capturing and sharing of outcomes. Considering the merits of the conventional paper–pen approach and the demand for quick sharing of outcomes after a session, we designed and implemented a system to digitize the group KJ session: not just the outcomes but also the details of the creative work processes. We use digital pens to capture the position and orientation of labels, as well as their contents, during the session. We confirmed the efficiency of our system through several KJ sessions. Keywords: CSCW, Creative meeting, Label work, KJ method.
2 Capturing Card Locations by Digital Pens
We used Anoto-based pens to store drawings on paper cards and a base sheet. An Anoto-based pen can recognize the position of drawings by scanning special dotted patterns on the paper. Using the unique features of these patterns, the system can distinguish between drawings made on the cards and those made on the base sheet. Consequently, the drawings can be used not only for handwritten notes but also for describing the relationships between the paper cards and the base sheet. When the user draws a line that covers the sheet and a card (Fig. 1, left), the pen recognizes the line as three drawings (Fig. 1, right). If these drawings are generated at almost the same time, we can infer that, at that time, the paper card was placed so as to connect the three drawings. We call this operation scanning and the connecting points joints.
Fig. 1. A line over the card border is separated into three lines
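A minimal sketch of how such a scan could be detected from time-stamped drawings, assuming each drawing is already attributed to the sheet or to a card by its dot pattern. The data layout and the time tolerance are our assumptions; the paper does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Drawing:
    surface: str   # "sheet" or a card id such as "card-12"
    t: float       # timestamp in seconds
    start: tuple   # (x, y) in the surface's own coordinate system
    end: tuple

def detect_scan(drawings, max_gap=0.3):
    """Look for a sheet/card/sheet triple drawn almost simultaneously,
    i.e. one pen line crossing a card border, split into three drawings
    as in Fig. 1. Returns (card_id, joint1, joint2) in sheet coordinates,
    or None. max_gap is an assumed tolerance."""
    ds = sorted(drawings, key=lambda d: d.t)
    for a, b, c in zip(ds, ds[1:], ds[2:]):
        on_card = b.surface != "sheet"
        if (a.surface == "sheet" and on_card and c.surface == "sheet"
                and c.t - a.t <= 2 * max_gap):
            # The card connects the end of a and the start of c:
            # these two sheet positions are the joints.
            return b.surface, a.end, c.start
    return None

joints = detect_scan([
    Drawing("sheet", 0.0, (10, 10), (20, 20)),
    Drawing("card-7", 0.1, (0, 0), (30, 5)),
    Drawing("sheet", 0.2, (45, 45), (60, 60)),
])
print(joints)  # ('card-7', (20, 20), (45, 45))
```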
By extending this technique, we can recognize the orientation and overlapping state of paper cards if two joints are extracted by scanning (Fig. 2).
Fig. 2. Recognition of orientation and overlapping states
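The paper does not spell out this computation, but with two joints known in both sheet and card coordinates, the card's pose follows from a standard rigid 2-D transform fixed by two point correspondences:

```python
import math

def card_pose(sheet_j1, sheet_j2, card_j1, card_j2):
    """Recover a card's rotation and translation on the base sheet from
    two joints known in both coordinate systems (an illustrative
    computation, not the authors' code)."""
    angle = (math.atan2(sheet_j2[1] - sheet_j1[1], sheet_j2[0] - sheet_j1[0])
             - math.atan2(card_j2[1] - card_j1[1], card_j2[0] - card_j1[0]))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Translation carrying card_j1 onto sheet_j1 after the rotation.
    tx = sheet_j1[0] - (cos_a * card_j1[0] - sin_a * card_j1[1])
    ty = sheet_j1[1] - (sin_a * card_j1[0] + cos_a * card_j1[1])
    return math.degrees(angle), (tx, ty)

# A card rotated 90 degrees: joints (0,0)/(10,0) on the card appear
# at (5,5)/(5,15) on the sheet.
print(card_pose((5, 5), (5, 15), (0, 0), (10, 0)))  # approx. (90.0, (5.0, 5.0))
```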
To eliminate unnecessary pen drawings on the paper during scanning, we can simply use a semi-transparent plastic sheet while scanning (Fig. 3).
Fig. 3. Using a transparent plastic sheet to eliminate drawings
Fig. 4. Grouping (left) and Ungrouping (right) gestures
A similar method was proposed in [3] in research on digitized experiment record notes; however, to edit the card structure, we introduced extra pen gestures called grouping and ungrouping. Grouping is performed by a continuous round stroke from the top-level card to the child cards (Fig. 4, left). Ungrouping is performed by a continuous single stroke from the top-level card to the child cards (Fig. 4, right). These grouping and ungrouping operations can be used in common card-based creative activities, especially for making figures in the KJ method [4, 5]. A digital camera can, of course, store the status of the paper sheet in detail, but the reusability of the card content is crucial for creative tasks, and atomic data should be provided to enhance the process. In particular, the authentic KJ method involves procedures with repetitive tasks for refinement and deepening.
3 GKJ System
We developed a system named GKJ (Group KJ) that handles scanned handwritten drawings and gestures captured by multiple Anoto pens. The GKJ system consists of (1) Anoto pens, (2) an L-Box Digital Pen Gateway System (DPGW), and (3) a GKJ editor. A system overview is shown in Fig. 5. The L-Box DPGW collects pen data from multiple pens simultaneously, via a Bluetooth connection, and sends it to a MySQL table on a PC. The GKJ editor checks for updated data and uses the data to construct a digital representation of the current paperwork status. For further editing tasks, the GKJ editor provides functions for organizing the virtual cards, using a mouse and a keyboard as alternative input devices.
Fig. 5. GKJ system overview and data flow
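The paper does not state how the editor detects updated rows; one simple reading is a polling loop over the MySQL table. The row layout below is a hypothetical stand-in:

```python
import time

def editor_update_loop(fetch_rows_since, handle_stroke, interval=0.5):
    """Sketch of the GKJ editor's update cycle: the L-Box gateway writes
    pen data into a MySQL table and the editor polls for new rows. The
    assumed row layout (id, pen_id, surface, x, y, t) is illustrative;
    the paper only states that pen data is sent to a MySQL table."""
    last_id = 0
    while True:
        for row in fetch_rows_since(last_id):  # e.g. SELECT ... WHERE id > %s
            last_id = row["id"]
            handle_stroke(row)                 # update the virtual card view
        time.sleep(interval)
```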
4 Usage Scenario
Typically, a group KJ method session consists of two stages: card gathering and card unfolding. In the gathering stage, the participants discuss and collect cards with similar meanings or arguments. After that, they add extra cards to those gathered and write an abstract of the gathered cards on the added cards. Then they clip these cards together with paper clips, treating them as a single card. This is repeated until fewer than 6 to 9 cards remain in the stack. In the GKJ editor, the folding operation can be performed by the grouping gesture, and the folded cards are shown in the pile view (Fig. 6, left). In the authentic KJ method, the participants are basically prohibited from referring to the child cards during this stage. They then proceed to the unfolding stage. Usually the participants extract the cards on the base sheet, but this requires special care so as not to destroy the constructed structure of piled cards. Also, in the real world, it is difficult to re-organize the unfolded cards because the area necessary for the cards depends on the number of cards and the layout; a high number of cards prevents a trial-and-error approach. Therefore, we recommend that the participants use virtual cards to estimate a preliminary layout. The GKJ editor provides a function for unfolding virtual cards and pre-organizing the extracted virtual cards by dragging the top-level cards; Fig. 7 (right) shows the unfolded virtual card view. Using this function, the participants can effectively lay out the cards by considering the relationships between them. After the two stages, the participants obtain a figure (Fig. 8) which represents their issues and viewpoints as an outcome. The participants can review the process by replaying the operations with the GKJ editor. They can also export the process data or print figures in PDF format. Incidentally, the curved line in Fig. 8 is a Bezier curve whose control points were generated by the convex hull algorithm; the curved lines are automatically recalculated when the virtual cards are moved (a sketch of the hull computation is given after Fig. 8). As described above, the proposed GKJ system allows participants to freely choose a proper environment (real or digitized cards) for their task, including review by digitized log rollback and distribution of data.
Fig. 6 and Fig. 7. Digitized views of grouping: pile view (Fig. 6, left) and unfolded view (Fig. 7, right)
The high portability of the GKJ system makes it useful in a variety of environments. A group session with the GKJ system can be held with (1) a base paper sheet, (2) paper cards, (3) digital pens, (4) the L-Box, a small Linux box, and (5) a PC.
Fig. 8. Final KJ method figure in the GKJ system
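The curved enclosures mentioned above combine a convex hull with a Bezier curve. A sketch of the hull step, assuming the input points are the corners of the cards in a group, could look like this (Andrew's monotone chain, a standard algorithm; the Bezier evaluation itself is omitted):

```python
def convex_hull(points):
    """Andrew's monotone chain: returns hull vertices in
    counter-clockwise order. In the GKJ editor, the hull vertices would
    then serve as control points of a closed Bezier curve around the
    card group, recomputed whenever a card moves."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

print(convex_hull([(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]))
# [(0, 0), (4, 0), (4, 3), (0, 3)]; the interior point (2, 1) is dropped
```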
4.1 Practice
We used the GKJ system in small courses on collaborative card-based activity (Fig. 9 shows a class at our institute, and Fig. 10 shows a lecture course with city hall staff). We used pre-printed cards as material, and the participants positioned the cards spatially to represent their thoughts and considerations. In this case, the system may not enhance the ongoing work, but the participants enjoyed scanning and checking the digitized data. The instructor could conduct the course in the same manner as conventional courses that use cards (Fig. 9, left). Since the scanning is intuitive, the instructor could easily capture the card locations. The captured data was used to generate PDF files representing the layout after the course. The precise transition log of the cards was helpful for retrospection. In the lecture course at a city hall (Fig. 10), the participants first wrote their thoughts and work-related problems on plain paper cards with pens. Then they classified the cards by hand and discussed the issues. After the cards were organized by placing them on the base sheet, we scanned the card positions.
Fig. 9. Lecture courses with collaborative card-based activity at the university
Fig. 10. Lecture course with city hall staff
We obtained the following findings from observations of the sessions and comments from the participants.
(Advantages)
1. Card writing with a pen was straightforward and intuitive for participants.
2. The participants could naturally organize their cards because they could see other people's behavior.
3. The quick distribution of digitized logs and figures (PDF) is effective for reviewing the session and the discussion. Incidentally, the instructor had been providing digitized logs and session videos to participants even for conventional lectures, but it took a few days to digitize the outcomes.
4. The instructor and some participants mastered the scanning and enjoyed the operation.
5. The positions in the scanned data accurately represented the real figures.
(Drawbacks)
1. Sometimes the scanning failed due to errors. The most frequent mistake was a scanning section error. The GKJ system uses four A2-size sheets to compose an A0-size base sheet. Since the printed dot pattern of each A2 sheet was the same, the user needed to specify the section of the base sheet to the system before scanning by tapping a checkbox. Occasionally the user missed this presetting or scanned with
the wrong pen. To reduce this error, we now prepare four preset pens, one for each A2 sheet section. Even if the error occurs, the user can easily fix the misrecognition by rescanning.
2. Sometimes unnecessary scanning lines appeared on cards and base sheets. The cause was misrecognition during scanning, brought about by weak pen pressure, high scanning speed, or a lack of a gap between cards (less than 1 cm). To solve this issue, we added a "transparent" pen mode, which does not draw unnecessary lines while scanning.
3. Some participants wrote upside down on the cards, because it is difficult to recognize the top and bottom. This caused the card content to be shown wrong side up, and the position to be scanned incorrectly. The issue could be solved by implementing a function that automatically detects when the wrong side is up by considering the handwritten notes, and handles the card accordingly.
4. As described in points 1 to 3, scanning required skill and know-how. The user needed to understand the characteristics of the pen and of GKJ to operate the system adequately. However, the skill could be acquired with a few minutes of training, and a failed scan could be easily recovered by rescanning.
Even with its drawbacks, the GKJ system has the potential to augment conventional paper-based discussion. We also found that most of the drawbacks could be solved by further system refinements.
5 Conclusion
We proposed a method for capturing the location and hierarchical structure of paper cards written on with Anoto-based pens. The method enables the participants to record a precise, atomic transition log of the cards. We also developed a system, based on the proposed method, for instantly digitizing paper-based card organization tasks. Due to the simplicity of pen-based input, the GKJ system is universal; it can be used by office workers as well as the elderly and primary school children. We confirmed the effectiveness of the system through several sessions. We applied it to small group learning sessions of up to 10 persons, but the system is applicable to many participants and groups, since it can handle more than 40 pens at the same time. We will refine the GKJ system to improve its usability, and it should contribute to the effectiveness of group discussions that include various types of participants, such as town meetings.
Acknowledgement
The Digital Pen Gateway System and related technologies are from NTT Comware Tokai Corporation. Our research is partly supported by a Grant-in-Aid for Scientific Research (20680036, 20300046).
References
1. Klemmer, S.R., Newman, M.W., Farrell, R., Bilezikjian, M., Landay, J.A.: The Designers' Outpost: A Tangible Interface for Collaborative Web Site Design. In: Proceedings of UIST 2001, pp. 1–10 (2001)
2. Miura, M., Kunifuji, S.: A Tabletop Interface Using Controllable Transparency Glass for Collaborative Card-Based Creative Activity. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS, vol. 5178, pp. 855–862. Springer, Heidelberg (2008)
3. Ikeda, H., Furakawa, N., Konoishi, K.: iJITinLab: Information Handling Environment Enabling Integration of Paper and Electronic Documents. In: CSCW 2006 Workshop (Collaborating over Paper and Digital Documents) (2006)
4. Kawakita, J.: An Idea Development Method. Chuuko Shinsho, Chuuo Kouron-sha (1968)
5. http://www.mycoted.com/KJ-Method
6. Ohiwa, H., Takeda, N., Kawai, K., Shiomi, A.: KJ editor: a card-handling tool for creative work support. Knowledge-Based Systems 10(1), 43–50 (1997)
7. Munemori, J.: GUNGEN: Groupware for a new idea generation support system. Information and Software Technology 38(3), 213–220 (1996)
8. Misue, K., Nitta, K., Sugiyama, K., Koshiba, T., Inder, R.: Enhancing D-ABDUCTOR Towards a Diagrammatic User Interface Platform. In: Proceedings of KES 1998, pp. 359–368 (1998)
Usability-Engineering-Requirements as a Basis for the Integration with Software Engineering
Karsten Nebe (1) and Volker Paelke (2)
(1) University of Paderborn, C-LAB, Fürstenallee 11, 33098 Paderborn, Germany
(2) Leibniz University Hannover, Appelstrasse 9a, 30167 Hannover, Germany
Karsten.Nebe@c-lab.de, Volker.Paelke@ikg.uni-hannover.de
Abstract. Usability is growing to become an integral quality aspect of software development, but it is not an exclusive attribute of the generated product; it is also a fundamental attribute of the development process itself. The question is how to adapt software engineering processes (or models) in such a way that they can ensure the development of usable solutions. In this paper, the authors present an integration approach pursuing this goal. It draws on so-called 'Compliancy and Key Requirements' that can be used for the definition of software processes (or process models) and thereby support the integration of both disciplines. The requirements are based upon representative standards (DIN EN ISO 13407 and ISO/PAS 18152) but were enhanced by the results of an expert-based survey using interviews and questionnaires. Additionally, the requirements have been verified by experts and represent an evaluated knowledge base for the development of usable products. Keywords: Integration, Software Engineering, Usability Engineering, Standards DIN EN ISO 13407 and ISO/PAS 18152, Process Models, Process Definition, Process Improvement, Assessment.
the goals of SE and UE in a way that allows systematic and predictable implementations to be generated while considering the factors of cost, time, and quality adequately for both SE and UE purposes. In this paper, the authors present an integration approach pursuing this goal.
2 Integration Approaches
In theory and practice, a considerable number of integration approaches with distinct focuses exist [18]. Some of these approaches define common activities and artifacts for both SE and UE and integrate these specific activities into the development process. They aim at a 'soft integration' of UE aspects on a mutual basis, e.g., at interlinking related results (e.g., [17, 5, 2]). Most of these approaches focus on minimal organizational and structural transformation and/or change. Quite similar are approaches that aim at a common specification of activities and artifacts. They are grounded in communication and information exchange using shared definitions (e.g., [1, 21, 20]). These two kinds of approaches can be summarized as a group that aims directly at the operational development processes in organizations. Other integration approaches relate to the level of process definitions and process models (e.g., [6, 11, 3]). These aim to define pre-settings for development and contain both more concrete approaches (focusing on the integration of UE activities into an already existing SE model) and more fundamental aspects of process models (independent of any concrete SE model). In general, these approaches concentrate on the combination of phases, activities, and results (within existing structures) on the level of process models to build the basis for integration. In addition, there is a third group of integration approaches focusing on a higher level of abstraction. These approaches are independent of any specific process model or activities and instead describe organizational measures, principles, paradigms, or meta-models (e.g., [16, 7, 5, 19]). They aim at the definition of general procedures for development, which is comparable to standards in SE and UE at this level of abstraction. Accordingly, strategies for their implementation are abstract and need to be adapted to particular situations. Altogether, these groups of approaches aim to provide systematic procedures for developing usable software. At a closer look, they address three different levels of abstraction:
1. The abstract overarching level of standards in software engineering and usability engineering, serving as a framework to ensure consistency, compatibility, exchangeability, and quality within and beyond organizational borders, and to cover the improvement of quality and communication.
2. The level of process models for software engineering and usability engineering, providing a procedural model that can serve as a framework for an organization, with specific features, e.g., predictability, risk management, coverage of complexity, generation of fast deliverables and outcomes, etc.
3. The operational process level, which reflects the execution of activities and the processing of information within the organization. It is an instance of the underlying model and the implementation of activities and information processing within the organization.
These are related in a hierarchy: standards define the overarching framework, process models describe systematic and traceable approaches within such a framework, and at the operational level the models are tailored to fit the specifics of an organization.
2.1 Integration on the Level of Standards, Process Models and Operational Processes
It can be observed that this hierarchy of standards, process models and processes exists in both disciplines, but there have been few attempts to exploit these similarities for integration. With this goal in mind, the authors analyzed these three levels and presented a holistic approach for the integration of SE and UE [12, 13, 14]. In doing so, the authors identified similarities between SE and UE on the level of standards. The standards' detailed descriptions of processes, activities and tasks, output artifacts, etc. were analyzed and compared. For this, the SE standard ISO/IEC 12207 [8] was chosen for comparison with the UE standard DIN EN ISO 13407 [4]. At a high level, when examining the descriptions of each activity and relating tasks and outputs with each other, similarities could be identified in terms of the characteristics, objectives and proceedings of activities. Based on these similarities, single activities were consolidated into groups of activities (so-called 'common activities'). These 'common activities' are part of both disciplines, SE and UE, at the highest level, that of standards. The result is a compilation of five 'common activities' (Requirements Analysis, Software Specification, Software Design and Implementation, Software Validation, and Evaluation) that represent the process of development from both a SE and a UE point of view [12, 13]. These activities define the overarching framework for the next level, the level of process models. In a subsequent analysis, the authors assessed the maturity of software engineering process models with respect to their ability to create usable products [12, 14]. For that purpose, the authors used a two-step approach to synthesize the demands of usability engineering and performed an assessment of selected software engineering models. To obtain detailed knowledge about usability engineering activities, methods, deliverables and their quality aspects, the authors analyzed the two usability engineering standards DIN EN ISO 13407 and ISO/PAS 18152 [9]. The ISO/PAS 18152 defines detailed base practices that specify the tasks for creating usable products. These base practices were used as a foundation to derive requirements that represent the 'common activities' from a usability engineering perspective. The quantity of fulfilled requirements for each activity of the framework indicates the level of compliance of a software engineering model; it provides an estimate of how well the UE base practices are covered in a given SE model. The results of the assessment provide an overview of the degree of compliance of the selected models with usability engineering demands. It turned out that there is relatively little compliance with the usability engineering activities across all selected software engineering models. This is an indicator that only little integration between usability engineering and software engineering currently exists on the level of process models. The analysis did not only highlight weaknesses of SE models; it also pinpointed the potential for integration between software engineering and usability engineering:
Where base practices are not considered fulfilled, recommendations could be derived that would contribute to their accomplishment. The underlying base practices provide indications of what needs to be considered on the level of process models. This can be used as a foundation for implementing the operational process level. However, during the analysis it became apparent that there is a clear need for more detailed and adequate assessment criteria, by which more objective and reliable statements about process models and their ability to create usable software could be made. Such detailed criteria would also be useful to formalize process requirements that can influence the definition of user-centered SE models and development processes and thereby improve the interplay of SE and UE in practice. With this in mind, the authors performed semi-structured interviews with experts from the domain of UE to identify requirements from the UE perspective. The results have been analyzed and evaluated as described in the following section.
3 UE-Process-Requirements
In order to make software development processes user-centered, there is a need for explicit knowledge about relevant activities, their dependencies, their results, roles, quality aspects, etc. One goal is to develop such a knowledge base using existing findings and to enrich it with experts' knowledge. Therefore, the authors created an interview guideline and questionnaires that correspond to the overall process framework of common activities, particularly with regard to the usability engineering perspective. The analysis is based on the four human-centered design activities of DIN EN ISO 13407 ('context of use', 'user requirements', 'produce design solutions' and 'evaluation of use') and their respective base practices and specifics as defined in ISO/PAS 18152 (i.e., fundamental activities, basic conditions and constraints, relevance of activities, resulting outcomes, type of documentation, and respective roles and responsibilities). The goal was not to evaluate these standards but to add details for further use. A substantial part of the analysis referred explicitly to quality characteristics of the four human-centered design activities. The goal was to identify what constitutes the quality of a certain activity from the experts' point of view and what kind of (potentially measurable) success and quality criteria exist that are relevant on a process level and subsequently for implementation in practice. Examples from the questionnaire are: How can good activities be identified? How can good results or deliverables be identified? How can appropriate roles be identified? What are properties and characteristics of relevance and frequency? How could the progress of an activity or deliverable be measured and controlled? Based on the results, the authors identified activities, deliverables and roles that are necessary to ensure the development of usable products from the experts' point of view. Relevant factors of influence could be, for instance: "When will an activity A not be performed, and why?" or "Under which circumstances will an activity A be performed completely, and when just partly?" Additionally, criteria were sought that allow the progress of the development process to be measured.
It was expected that the results could be used not just as more detailed criteria for an assessment but would also provide an indication of the level of completeness of ISO/PAS 18152 and identify potential areas of improvement. To achieve this, the authors conducted semi-structured interviews and questionnaires with six experts in the field of UE [15]. The experts were well grounded in theoretical terms, i.e., standards and process models, as well as in usability practice.
3.1 Derivation of Requirements
As a result, about 470 statements from the experts were gathered, which were then consolidated and classified by adding references to their source (i.e., the interview partner and the question from the interview guideline); to one of the four activities ('context of use', 'user requirements', 'produce design solutions' or 'evaluation of use'); to whether a statement addresses quality aspects regarding the process, an activity, or a deliverable; to whether it complies with the activities' and base practices' goals (as defined in the two ISO standards); etc. Thus, overarching process and quality characteristics could be identified, which led to findings about the relevance, applicability and necessity of usability activities, methods and artifacts to be implemented in SE. Through several iterations of analysis, similar statements were merged and formalized into 107 'requirements for development processes or process models'. There are two distinct types of requirements: 'Compliancy Requirements' and 'Key Requirements'. Compliancy requirements represent the goals and base practices defined in the standards DIN EN ISO 13407 and ISO/PAS 18152 but refine them with the output of the analysis. The key requirements define core characteristics of the overall framework's usability activities, focusing on the quality of the activities and their results. Together, the requirements define the demands of UE and lead to the systematic creation of usable products. Examples of the resulting requirements are:
• Context analysis is an integral part of the process.
• Analysis takes place early in the process, before conceptual work is carried out.
• Analysis activities are performed iteratively until all incompletenesses and inconsistencies are eliminated.
• Resources and time for the elicitation and evaluation of user requirements are sufficiently provided.
• User requirements are addressed in the system design.
• User requirements are the input for the next process step and accordingly positioned in the development process.
• The requirements of the users of the system are defined.
3.2 Evaluation of Requirements
In a subsequent analysis, both the compliancy and key requirements were evaluated by 13 usability experts using questionnaires (three of these experts were also involved in the previous analysis). The questionnaire included a list of all 107 requirements, grouped by the four activities ('context of use', 'user requirements', 'produce design solutions' and 'evaluation of use'), and scales to rate the correctness of each requirement and its relevance for application in practice. Some examples of the requirements are shown in Table 1.
Table 1. Examples of the requirements for the UE activities 'context of use' (CoU), 'user requirements' (UR), 'produce design solutions' (PDS) and 'evaluation of use' (EoU), and the experts' ratings in terms of correctness and relevance (in practice)

Nr | Activity | Requirement | Correctness | Relevance
2 | CoU | Context analysis is an integral part of the process. | Correct | Very high
17 | CoU | The outcomes of the context analysis serve as the input for the next process step, and the activity itself is anchored within the process model accordingly. | Correct | High
27 | CoU | The characteristics of the intended users and their tasks, including user interaction with other users and other systems, are documented. | Correct | Very high
24 | CoU | The analysis is focused on the original context of the users (their goals, tasks, characteristics of the tasks and the environment, etc.). | Correct | High
33 | CoU | The analysis is independent of any existing solution/implementation; the context information is based on facts and not an interpretation of any situation. | Sufficient | Medium
46 | UR | A sufficient amount of user requirements are the basis for the next process step (PDS). | Correct | Very high
71 | PDS | The development of solutions is carried out in collaboration with the development team. | Correct | Very high
105 | EoU | It is checked that the system is ready for evaluation. | Sufficient | Medium
Looking at the overall results, it turned out that most requirements were rated correct by the majority of experts: 31 requirements by all 13 participants, 29 requirements by 12 experts, 27 requirements by 11, and 6 requirements by at least 10 experts. No requirement was rated incorrect. Altogether, there is a high compliance of the experts' opinions with the requirements. The number of requirements rated correct by at least 10 experts is 93, which represents 87% of all 107 requirements. The rating of the relevance was used to derive recommendations about the priority for application in practice (i.e., for the definition of processes); a small code sketch of this prioritization is given at the end of this section.
1. Those requirements that have been rated as 'correct' and range from a 'very high' to a 'high' scale of relevance (in general: the higher the relevance, the higher the priority).
2. Those requirements that have been rated as 'correct' and show a 'medium' scale of relevance.
3. Those requirements that depict a 'sufficient' scale of correctness.
4. Those requirements that show an 'acceptable' scale of correctness.
5. All remaining requirements.
When applying the requirements in practice, however, it is important to consider requirements of all four activities in equal measure. A partial implementation of selected requirements will not lead to usable products; only using them in a holistic way will support the systematic development of usable solutions. As a result of the analysis and evaluation, the compliancy and key requirements represent an evaluated knowledge basis for the development of usable products. The analysis is based on representative standards of UE, and the requirements add more specific criteria based on experts' knowledge. The requirements account for the integration of SE and UE, as they can be used for the definition and adaptation of SE process models as well as operational development processes.
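The five priority rules above can be stated compactly in code; this is our paraphrase of the rules, not part of the original study:

```python
def priority(correctness, relevance):
    """Priority class for applying a requirement in practice,
    following the five rules listed above (1 = apply first)."""
    if correctness == "correct" and relevance in ("very high", "high"):
        return 1
    if correctness == "correct" and relevance == "medium":
        return 2
    if correctness == "sufficient":
        return 3
    if correctness == "acceptable":
        return 4
    return 5

# Requirement 2 from Table 1 (correct / very high) falls into class 1;
# requirement 33 (sufficient / medium) falls into class 3.
print(priority("correct", "very high"), priority("sufficient", "medium"))
```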
4 Conclusions and Outlook
In summary, many integration approaches exist that aim to provide systematic procedures for developing usable software. At a closer look, they address three different levels of abstraction: standards, process models and operational processes. However, there have been few attempts to exploit integration in a holistic way that includes all three levels. The authors report on such an approach and present a systematic way of integrating usability engineering demands into the software engineering methodology. The results of an expert-based analysis (and subsequent evaluation) have been used to derive two distinct types of requirements: 'Compliancy Requirements' and 'Key Requirements'. Compliancy requirements represent the goals and base practices defined in the standards DIN EN ISO 13407 and ISO/PAS 18152, refined by the output of the analysis. The key requirements define core characteristics of the overall framework's usability activities, focusing on the quality of the activities and their results, and are also based on the analysis' results. The requirements represent an evaluated knowledge basis for the development of usable products. They contribute to an integration of software engineering and usability engineering, as they can be used for the definition and adaptation of software development processes and process models. In the future, we aim to evaluate these requirements in practical projects to observe process changes and their resulting effects on the usability of the products.
References
1. Constantine, L.L., Lockwood, L.A.D.: Software for Use: A Practical Guide to the Models and Methods of Usage-Centered Design. Addison-Wesley (ACM Press), New York (1999)
2. Constantine, L.L., Biddle, R., Noble, J.: Usage-centered design and software engineering: Models for integration. In: IFIP Working Group 2.7/13.4, ICSE 2003 Workshop on Bridging the Gap Between Software Engineering and Human-Computer Interaction, Portland (2003)
3. Düchting, M., Zimmermann, D., Nebe, K.: Incorporating User Centered Requirement Engineering into Agile Software Development. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4550, pp. 58–67. Springer, Heidelberg (2007)
4. DIN EN ISO 13407: Human-centered design processes for interactive systems. CEN European Committee for Standardization, Brussels (1999)
5. Ferre, X.: Integration of Usability Techniques into Software Development Process. In: Bridging the Gaps Between Software Engineering and Human-Computer Interaction, ICSE 2003 International Conference on Software Engineering, pp. 28–35. ACM Press, Portland (2003)
6. Göransson, B., Lif, M., Gulliksen, J.: Usability Design: Extending Rational Unified Process with a New Discipline. In: Jorge, J.A., Jardim Nunes, N., Falcão e Cunha, J. (eds.) DSV-IS 2003. LNCS, vol. 2844, pp. 316–330. Springer, Heidelberg (2003)
7. Granollers, T., Lorès, J., Perdrix, F.: Usability Engineering Process Model: Integration with Software Engineering. In: Proceedings of the Tenth International Conference on Human-Computer Interaction, pp. 965–969. Lawrence Erlbaum Associates, New Jersey (2002)
8. ISO/IEC 12207: Information technology - Software life cycle processes, 2nd edn., 2008-02-01. ISO/IEC, Genf (2008)
9. ISO/PAS 18152: Ergonomics of human-system interaction - Specification for the process assessment of human-system issues. ISO, Genf (2003)
10. Jokela, T.: An Assessment Approach for User-Centred Design Processes. In: Proceedings of EuroSPI 2001. Limerick Institute of Technology Press, Limerick (2001)
11. Kolski, C.: A call for answers around the proposition of an HCI-enriched model. ACM SIGSOFT Software Engineering Notes 23(3), 93–96 (1998)
12. Nebe, K., Zimmermann, D.: Suitability of Software Engineering Models for the Production of Usable Software. In: Proceedings of Engineering Interactive Systems 2007. LNCS. Springer, Heidelberg (2007a)
13. Nebe, K., Zimmermann, D.: Aspects of Integrating User Centered Design into Software Engineering Processes. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4550, pp. 194–203. Springer, Heidelberg (2007b)
14. Nebe, K., Zimmermann, D., Paelke, V.: Integrating Software Engineering and Usability Engineering. In: Pinder, S. (ed.) Advances in Human-Computer Interaction, ch. 20, pp. 331–350. I-Tech Education and Publishing, Wien (2008b)
15. Nebe, K.: Integration von Usability Engineering und Software Engineering: Konformitäts- und Rahmenanforderungen zur Bewertung und Definition von Softwareentwicklungsprozessen. Doctoral Thesis, Shaker Verlag, Aachen (in print, 2009)
16. Pawar, S.A.: A Common Software Development Framework For Coordinating Usability Engineering and Software Engineering Activities. Master Thesis, Blacksburg, Virginia (2004)
17. Schaffer, E.: Institutionalization of usability: a step-by-step guide. Addison-Wesley, Pearson Education, Inc., Boston (2004)
18. Seffah, A., Desmarias, M.C., Metzker, E.: Human Centered Software Engineering, HCI, Usability and Software Engineering Integration: Present and Future. In: Seffah, A., Gulliksen, J., Desmarais, M.C. (eds.) Human-Centred Software Engineering: Integrating Usability in the Development Lifecycle, vol. 8. Springer, Heidelberg (2005)
19. Sousa, K., Furtado, E., Mendoca, H.: UPi: a software development process aiming at usability, productivity and integration. In: Proceedings of the 2005 Latin American Conference on Human-Computer Interaction (CLIHC 2005). ACM Press, New York (2005)
20. Tidwell, J.: Designing Interfaces: Patterns for Effective Interface Design. O'Reilly, Sebastopol (2005)
21. Van Harmelen, M.: Object Modeling and User Interface Design: Designing Interactive Systems. Addison-Wesley, Boston (2001)
Design Creation Based on KANSEI in Toshiba
Yosoko Nishizawa and Kanya Hiroi
Toshiba Corporation, Design Center, 1-1-1 Shibaura, Minato-ku, Tokyo, Japan
{yosoko.takano, kanya.hiroi}@toshiba.co.jp
Abstract. In endeavoring to increase the quality of its designs, Toshiba has outlined a concept of “perceived quality” and evaluates designs on the basis of how high a level of perceived quality they achieve. From the results of a survey of the images users associate with designs, we defined six indices of perceived quality. These six indicators were used in the creation and evaluation of designs, and a number of the resulting products were put on the market and evaluated.
Keywords: KANSEI, design, product, quality of design, evaluation of design.
1 Introduction

As the source of added value shifts from quantity to quality, and from quality to more nebulous factors, we are driven by the necessity to create new value for customers. As one initiative in this direction, Toshiba is engaging in product development with “perceived quality” positioned as an added value of its products. To date, we have attempted to define the nature of perceived value and have derived the terms “perceived quality” and “appealing quality.” How to incorporate these concepts into product development, however, remains an ongoing process of trial and error on the ground. In this paper, we offer examples to illustrate Toshiba’s concept of “perceived quality,” and also discuss the methods by which the concept was derived.
2 The Concept of Perceived Quality

2.1 Derivation of Perceived Quality

“Perceived quality” is quality that can be expressed in terms of an individual’s feelings and the images to which they respond; that is, in terms of the subjective requirements of the individual. For an automobile, examples of such subjective requirements would be “Does it feel good to drive?”, “Is it stylish?”, and the like. Contrasting with this, there are other aspects of quality that can be expressed as objective, physical characteristics [1]. For a car again, examples would be high horsepower, good fuel efficiency, and the like. We can say that “perceived quality” resides in design features that appeal to the emotions, and is something that the customer judges subjectively.

What, then, is a design that appeals to the emotions, and what is a product in which this quality resides to a high degree? First, we studied what types of products appealed to the emotions and how customers evaluated these products. The results of these studies are shown in Figure 1, in which the numbers represent products. The products evaluated were either winners of design awards or products whose designs users rated highly. These results, together with a series of interviews, showed that a design with a high level of perceived quality is one that is beautiful, easy to use, and that offers feelings of security and pleasure. Further, the systemic structural relationships shown in Figure 2 also exist. This shows that design expression can be used to increase quality, enabling the creation of a product that makes a strong impression on customers.

We defined two types of perceived quality: basic perceived quality, and a perceived quality that goes beyond the basic to affect the emotions (appealing quality) (Figure 2). As a prerequisite for the creation of a design that affects the emotions (i.e., that possesses appealing quality), we established that the design must first produce feelings of pleasure (i.e., must possess basic perceived quality). We then searched the results of the above-mentioned user survey for the factors that could serve as indices.
Fig. 1. This is a positioning map. We mapped the results of a correspondence analysis of data from a Web-based questionnaire on design images.
Fig. 2. This figure shows the elements making up perceived quality as defined by Toshiba. A design expression that offers simplicity and ease of use produces feelings of pleasure in users. This can be understood as basic perceived quality, but by itself this is not enough. We must also consider appealing quality, which transcends feelings of pleasure to affect the emotions.
The results of the user survey showed that this basic perceived quality was made up of six elements. The indicators we defined for these six elements are “Aesthetic quality,” “Quality with feeling of warmth,” “Quality in use,” “Universal quality,” “Quality that transmits the ‘message’ of the product,” and “Original quality” (Figure 2).
Fig. 3. The six elements of perceived quality, derived by factor analysis.
2.2 Evaluating Perceived Quality

Using the six indicators to evaluate a group of popular products with excellent design features that were available on the market showed us that it was indeed possible to evaluate such products to some extent. The results of this survey revealed two patterns among products of high perceived quality: either the product was evaluated to some degree on all six indicators (Group B), or it was evaluated extremely highly on only some of the indicators (Group A). Figure 4 shows products that were selected as displaying a high level of perceived quality. Evaluated on the basis of the six indicators of perceived quality, the products in Group A received relatively high evaluations for “Quality with feeling of warmth” and “Aesthetic quality,” but low evaluations for “Quality in use” and “Universal quality.” The products in Group B, by contrast, received balanced evaluations across the entire spectrum of indicators, despite receiving fairly low evaluations for “Quality with feeling of warmth.” Users liked the Group A products more than the Group B products. As this shows, rating highly on all six indicators does not mean that a product will be judged to display a high level of perceived quality. Rather, a product that receives an extremely high evaluation on one axis is more likely to be selected as a product of high perceived quality. Given this, perceived quality can be considered as something that strongly displays a specific tendency rather than something that is balanced overall.
Fig. 4. An example of the evaluation of award-winning designs using the six indicators of perceived quality that we defined, based on the results of a questionnaire given to ordinary users. Both A and B were evaluated highly for only some indicators.
3 Creating Products of High Perceived Quality at Toshiba

Based on the results discussed above, we set the six indicators of perceived quality as shared guidelines for designers, and attempted to create products at Toshiba displaying a high level of perceived quality. This enabled us to develop a variety of such products. Examples include a range of IH cookers, a cellular phone (KOTO), and high-quality home electronic products (washing machines, ovens, and vacuum cleaners). The IH cookers and the KOTO cellular phone (Figure 5) incorporate Japanese-style design, and both received extremely high evaluations on some indicators of perceived quality. The form of the Japanese musical instrument, the koto, was used as a design element in the KOTO cellular phones (see Figure 5B), which were finished in vermillion to project the intended design image. These products were designed to embody “Aesthetic quality,” “Quality that transmits the ‘message’ of the product,” and “Original quality.”
Fig. 5. A is an IH cooker, and B is a “Koto” model cellular phone. Both products were evaluated highly for design.
On the other hand, the IH cookers (see Figure 5A) were evaluated extremely highly for “Quality with feeling of warmth” and “Original quality.” These IH cookers feature an unusual combination of the forms of conventional IH cookers and metal pots, presenting them as an integrated whole. Design efforts have also enabled the cookers to be presented as tableware. The original designs of the cookers incorporate materials traditionally used in different parts of Japan – stainless steel from Tsubame city, Nambu ironware from Mizusawa city, earthenware from Yokkaichi, etc. – and they have been marketed as products with which the feelings of users can find a resonance. In addition to the gold prize of the Japanese G Mark, this design has received Germany’s iF Award and Red Dot Award, indicating how highly it is regarded in Europe. In order to demonstrate the appeal of Toshiba design, we developed advertisements that focused on the products’ perceived quality (Figure 6).
Fig. 6. Examples of Toshiba advertisements that put perceived quality front and center.
In addition, Toshiba now sells many products and systems globally, so design must also be conducted on a global level. Issues for the future will include how to blend design elements having a global appeal with those whose appeal is unique to Japan, how to judge the right proportion of each, and how to incorporate their essence into a design. To respond to these issues, we are at present engaging in further study of perceived quality and revising our six indicators, so that they can function as a globally valid yardstick of perceived quality.
4 Conclusion

In the field of business-to-consumer products, which now represents a mature market, it will be increasingly important in future to use design to create products with originality and
high perceived quality. Toshiba has introduced a “yardstick” of perceived quality as a guide to answering the question of how this originality is to be created. In 2006, the Ministry of Economy, Trade and Industry also launched a program for the development of products that are both original and incorporate a new Japanese style, as an initiative for the creation of perceived value [2]. This year is positioned as a year for the creation of perceived value, and Japanese products falling within this category will be presented in exhibitions in Paris and elsewhere. In line with this movement, and focusing on the perceptions involved in perceived quality, Toshiba is making efforts to develop products that combine a globally resonant sensibility with an original Japanese sensibility.
References
1. The Japanesque Modern Committee: Towards a Japanesque Modern Style – Representing Japanese Tradition to the World, http://www.rieti.go.jp/jp/events/bbl/06041801.html
2. Policy Office for Design and Human Life System, Manufacturing Industries Bureau, METI
3. Kansei Initiative – Proposal of a Fourth Value Axis, IIST WORLD FORUM (June 16, 2008), http://www.iist.or.jp/wf/magazine/0618/0618_E.html
4. Opinions presented at the symposium “Kansei Initiatives” (Initiatives for the Creation of Perceived Value), held by the Japan Industrial Designers’ Association (JIDA) (June 18, 2007)
5. Hiroi, K.: About the sensibility value creation of the design. Research Leader, vol. 10, pp. 43–51. Technical Information Institute Co., Ltd., Japan (2007)
High-Fidelity Prototyping of Interactive Systems Can Be Formal Too
Philippe Palanque, Jean-François Ladry, David Navarre, and Eric Barboni
IHCS-IRIT, Université Paul Sabatier – Toulouse 3, France
{ladry,palanque,navarre,barboni}@irit.fr
Abstract. The design of safety-critical systems calls for advanced software engineering models, methods and tools in order to meet the safety requirements that will avoid putting human life at stake. When the safety-critical system encompasses a substantial interactive component, the same level of confidence is required of the human–computer interface. Conventional empirical or semi-formal techniques, although very fruitful, do not provide sufficient insight into the reliability of human–system cooperation, and offer no easy way to, for example, quantitatively compare two design options. The aim of this paper is to present a method, with supporting tools and techniques, for engineering the design and development of usable user interfaces for safety-critical applications. More precisely, we present the PetShop environment, a Petri net-based tool for the design, specification, prototyping and validation of interactive software. In this environment, models of the interactive application can be interactively modified and executed. This is used to support prototyping phases (when the models and the interactive application evolve significantly, to meet late user requirements for instance) as well as the operation phase (after the system is deployed). The use of the description technique (the ICO formalism) supported by PetShop is presented on a multimodal ground-segment application for satellite control, showing more precisely how prototyping can be performed at the various levels of the architecture of interactive systems.
Keywords: Model-based approaches, formal description techniques, interactive prototyping, reliability, evolvability.
tested with potential users of the system under development, while in SE the product is evaluated by different stakeholders, including the client or customer (the one who pays for or buys the product) and, less often, users (except in user-centered approaches such as task analysis and modelling). At the design stage, HCI approaches promote iteration through the production of prototypes to be presented to and used by “real” users. While such a design process is widely agreed upon, the debate is still vivid as to whether one should use low-fidelity [24] or high-fidelity prototyping [26, 14].

When it comes to complex applications at the interaction level [19] or at the application level [25], low-fidelity approaches only address a small part of that complexity. The outcome is too informal to be exploitable further on in the development process without losing a significant part of it. This limits the use of low-fidelity prototyping to the earlier phases of the development process, where the main design questions are addressed and lower-level ones are left to later phases. The main drawback of high-fidelity prototyping lies in the fact that the iterations are more time-consuming, and thus prevent the exploration of new ideas without jeopardizing the entire project through schedule overruns. Another inconvenience of high-fidelity prototyping is that the product of that phase most of the time corresponds to program code, making its integration into the rest of the application very difficult due to a lack of abstraction.

In this paper, we promote the use of an executable formal approach called Interactive Cooperative Objects (ICOs) within the high-fidelity prototyping phase of interactive systems development. This formal approach solves some of the limitations of the Rapid Application Development (RAD) techniques currently used for high-fidelity prototyping. Indeed, it provides abstraction through models, rapid execution through simulation, and testing through the generation of test cases and scenarios. In addition, when the prototyping phase is terminated, the outcome is not only a partially running prototype, but also a partial formal description of its behaviour that can then be passed on to the team in charge of developing the final system to be deployed.

Previous work we have done in this domain focused on the rapid prototyping of the interactive application [17]; our current work addresses the three levels of interactive systems prototyping: the interaction-technique level (including multimodal interactions with non-standard input devices such as tactile screens), the interactive-component level (including sophisticated widgets such as range sliders or semi-transparent pop-up menus) [16], and the interactive application in complex environments such as cockpits (both military and civil [1]), ground segments for satellite control rooms [20], and Air Traffic Management interactive applications.

This paper focuses on the use of the ICO formal description technique to support rapid prototyping of interaction techniques. More precisely, it presents how an interaction technique can be defined and then how it can “rapidly” evolve according to users’ feedback and users’ performance. Indeed, the tool-support environment for ICOs (called PetShop) has now been extended to provide additional facilities such as model-based logging of events and state changes to support the usability evaluation activities classically interleaved with rapid prototyping.
This paper also addresses how logging support can be used to carry out performance analysis of the interaction technique, thus limiting user testing to interaction techniques that have previously been formally analysed.
This paper is organized as follows. The next section presents some related work and research questions in the field of model-based approaches for interactive systems. The ICO notation is described in Section 3. Section 4 presents the CASE tool PetShop, which allows editing and execution of ICO models. Section 5 presents, on two small examples, how prototyping can be managed with PetShop and ICOs. Section 6 concludes the paper.
2 Model-Based Approaches for Interactive Systems

When formal methods were initially used for interactive systems [21], models were limited to the dialog part, making them less prominent for runtime use, as only one part of the interactive system was taken into account. In order to address issues raised by real-life applications, the current trend in interactive systems engineering is to develop models for all the parts of the system. A parallel track of research has targeted the modelling of new interaction techniques in order to deal with current practice in the field of HCI. To deal with WIMP and post-WIMP interaction techniques, several notations have been proposed, from data-flow-based notations such as Whizz’Ed [7], ICON [6], NiMMiT [27] or InTml [8], to event-based notations such as Marigold [29], HyNets [28] or ICO [19]. Hybrid models integrating both event-based and data-flow-based notations have also been presented in [12] and in [15]. With respect to that latter work, the work presented here extends [15] by removing the data-flow model dealing with input-device configuration and proposing a single event-based notation, described in the next section.

The work presented in this paper is about providing a modelling technique capable of representing the behaviour of an entire interactive application (from physical to functional interaction) using a dedicated Petri net dialect. It also targets new interaction techniques (e.g., multimodal, direct manipulation, ...) such as the ones used in the field of HCI. This paper shows how the CASE tool PetShop [1] embeds the system models (which represent an interactive system from the interaction technique through to the system functional core) using the ICO notation at runtime for:
• Prototyping of models,
• Execution of the application in order to check it,
• Analysis, as a way of supporting model construction by providing additional information about the properties of the models under construction.
3 The ICO Formalism

The ICO formalism is a formal description technique dedicated to the specification of interactive systems [19]. It uses concepts borrowed from the object-oriented approach (dynamic instantiation, classification, encapsulation, inheritance, client/server relationships) to describe the structural or static aspects of systems, and uses high-level Petri nets [10] to describe their dynamic, behavioural aspects.
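To make these ingredients concrete, here is a deliberately minimal Java sketch of a Petri net marking with ordinary input arcs plus the two ICO-specific arc kinds. All names are ours, not PetShop's, and the real OPN dialect is far richer (typed tokens, substitutions, code attached to transitions):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A deliberately simplified Petri net with ICO-style test and inhibitor arcs. */
class SimplePetriNet {
    final Map<String, Integer> marking = new HashMap<>();   // place name -> token count

    record Arc(String place, int weight) {}
    record Transition(String name, List<Arc> inputs, List<Arc> tests,
                      List<Arc> inhibitors, List<Arc> outputs) {}

    /** Enabled if input/test places hold enough tokens and every
     *  inhibitor place holds fewer tokens than its arc weight. */
    boolean isEnabled(Transition t) {
        for (Arc a : t.inputs())     if (marking.getOrDefault(a.place(), 0) < a.weight()) return false;
        for (Arc a : t.tests())      if (marking.getOrDefault(a.place(), 0) < a.weight()) return false;
        for (Arc a : t.inhibitors()) if (marking.getOrDefault(a.place(), 0) >= a.weight()) return false;
        return true;
    }

    /** Firing consumes tokens on input arcs only; test arcs read without consuming. */
    void fire(Transition t) {
        if (!isEnabled(t)) throw new IllegalStateException(t.name() + " is not enabled");
        for (Arc a : t.inputs())  marking.merge(a.place(), -a.weight(), Integer::sum);
        for (Arc a : t.outputs()) marking.merge(a.place(),  a.weight(), Integer::sum);
    }
}
```

A test arc reads tokens without consuming them, while an inhibitor arc blocks firing as long as its place holds enough tokens; ordinary firing consumes input tokens and produces output tokens.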
3.1 Cooperative Objects

The ICO notation is based on Cooperative Objects (COs). A Cooperative Object states how the object reacts to external stimuli according to its inner state. A CO's behaviour, called the Object Control Structure (ObCS), is expressed in a language based on Object Petri Nets (OPN) (see Fig. 1). An ObCS can have multiple places and transitions that are linked with arcs, as in standard Petri nets. As an extension to these standard arcs, ICO provides additional input arcs: test arcs and inhibitor arcs. Each place has an initial marking (represented by one or several tokens in the place) describing the initial state of the system.
Fig. 1. Metamodel of the COs exhibiting runtime features
With respect to “standard” Petri nets, the object-oriented nature of Cooperative Objects supports instantiation. Indeed, every ObCS can be instantiated, allowing multiple executions of the same class as in object-oriented programming languages. These instances can be parameterised by constructor arguments. This parameterisation is used to associate markings with the Petri net describing the behaviour of the instantiated Cooperative Object. For example, in the case of multiple-mouse interaction (e.g., in interactive cockpits such as the Airbus A380), each mouse driver is a distinct instance of an ObCS class with different class parameters (i.e., the number of the mouse), so that the behaviour model of each driver handles its own coordinates, represented in the marking of the instance. For more details about this type of modelling see [1]. Fig. 1 presents a subset of the class diagram of ICOs. As stated above, the main element used for prototyping is the fact that each class can have several instances (as shown on the right-hand side of the figure) and that instances can be played, paused or stopped.

3.2 Interactive Cooperative Objects

To deal with the specificities of interactive systems, the Cooperative Objects formalism has been extended. The resulting notation is called Interactive Cooperative Objects.
An ICO is a 6-tuple (CO, Su, Wid, Event, Act, Rend) where:
• CO is a Cooperative Object as described in Section 3.1,
• Su is a set of user services (a user service is a set of synchronized transitions),
• Wid is a set of interactive widgets (e.g., buttons, list boxes, ...) linked to the ICO class,
• Event is a set of user events coming from items of Wid,
• Act and Rend are the activation and rendering functions described below.
Act: An activation function defines the relationship between events triggered by users while interacting with user-interface objects (by manipulating input devices such as the mouse, the keyboard, voice-recognition systems, ...) and the transitions of the ObCS. When an event is triggered, the related transition is fired if it is fireable (according to the current marking of the Petri net).
Rend: A rendering function defines how state changes in the ObCS influence changes in the presentation (what the user perceives of the application). The state changes are linked to tokens entering or exiting places.
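As a rough, hypothetical illustration (the paper defines Act and Rend mathematically, not as an API), the two functions can be pictured as lookup tables that connect widget events to transitions and place changes to rendering calls:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative encoding of ICO activation and rendering functions.
 *  All names here are invented for this sketch. */
class ActivationRendering {
    /** Act: (widget, user event) -> transition name in the ObCS. */
    final Map<String, String> activation = new HashMap<>();
    /** Rend: place name -> rendering action on the presentation. */
    final Map<String, Runnable> rendering = new HashMap<>();

    void onUserEvent(String widget, String event, ObCSStub obcs) {
        String transition = activation.get(widget + "/" + event);
        // The event only has an effect if the associated transition is
        // currently fireable under the net's marking, as stated above.
        if (transition != null && obcs.isFireable(transition)) {
            obcs.fire(transition);
        }
    }

    void onTokenChange(String place) {
        Runnable render = rendering.get(place);  // update what the user perceives
        if (render != null) render.run();
    }

    interface ObCSStub {
        boolean isFireable(String transition);
        void fire(String transition);
    }
}
```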
4 Prototyping of ICO Models Using the PetShop Tool

To support the manipulation of the ICO notation, a CASE tool called PetShop [1] has been developed. It includes a Java implementation of an object-oriented Petri net interpreter and some analysis tools for verifying properties of the models. The tool is publicly available at http://ihcs.irit.fr/petshop.

4.1 Structure

Fig. 2 represents the high-level structure of PetShop, in which it is possible to edit, execute and analyze the instances of an ObCS. When the user edits an instance, PetShop first updates the ObCS (the class) and then updates all the instances of this class. On the first execution of an instance, the instantiation engine takes the ObCS and creates an instance. This instance is then executed and can be directly managed by the user of PetShop (started, paused and stopped). When the instance is running, PetShop can also analyze the model (currently limited to the calculation of place invariants and transition invariants [10]). An example of the PetShop user interface is presented in Fig. 2.
Fig. 2. High-level structure of PetShop
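The class/instance update cycle described above can be sketched as follows; this is an assumption-laden stand-in, not PetShop's real internals:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the class/instance relationship: editing a class
 *  re-propagates the ObCS to every instance. Hypothetical names. */
class ObCSClass {
    String obcsDefinition;                      // the edited Petri net (stand-in)
    final List<ObCSInstance> instances = new ArrayList<>();

    void edit(String newDefinition) {
        obcsDefinition = newDefinition;          // update the class ...
        for (ObCSInstance i : instances) i.reload(newDefinition);  // ... then all instances
    }

    ObCSInstance instantiate(Object... constructorArgs) {
        ObCSInstance i = new ObCSInstance(obcsDefinition, constructorArgs);
        instances.add(i);
        return i;
    }
}

class ObCSInstance {
    enum State { PLAYED, PAUSED, STOPPED }
    State state = State.STOPPED;
    ObCSInstance(String definition, Object... args) { /* set initial marking from args */ }
    void reload(String definition) { /* rebuild the net for the edited class */ }
    void play()  { state = State.PLAYED; }
    void pause() { state = State.PAUSED; }
    void stop()  { state = State.STOPPED; }
}
```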
4.2 Edition of Models

The CASE tool PetShop allows the user:
• to graphically add Petri net items (places, transitions and the different arcs),
• to modify the initial construction parameters of the class (e.g., editing a set of variables that may have different values for each instantiation),
• to modify the initial marking of each place (corresponding to raw values or to references to the initial parameters of the class),
• to change the executable code in a transition,
• to modify the layout of the Petri net,
• to cut, copy and paste parts of the model,
• to undo and redo any change,
• to navigate through large models via a mini-map, or through a large set of models via a tree.

4.3 Execution of Models

In PetShop, a toolbar allows the user to start, stop or pause an instance of the ObCS. There are two modes of execution of instances:
• a normal execution, in which the user is a spectator and observes the execution of an instance; transitions are fired using random enabling substitutions,
• a step-by-step execution, in which the user can select a substitution to fire the transition.

At runtime, the execution of instances gives the following feedback to the user:
• the marking is shown by the number of tokens present in a place,
• the fireability of transitions is shown by colour changes: purple for fireable, gray for not fireable,
• the firing of a transition and the updating of the marking (through the evolution of tokens in the input and output places of the fired transition).

PetShop also provides observability and controllability services via an API for external programs (in our case, the window manager of the platform handling input devices). Observability services send events to subscribers when markings change, when substitutions change and when events are raised in code associated with the transitions. Controllability services receive events from external sources and fire the related transition of a user service. All execution traces can be logged to an external file, allowing further analysis such as usability evaluation of the interactive system [5].
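The observability and controllability services, together with the trace logging mentioned above, could look roughly like this; the interface and method names are invented for illustration, not PetShop's actual API:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

/** Observability: subscribers are told about markings, substitutions and events. */
interface ExecutionObserver {
    void markingChanged(String place, int newTokenCount);
    void substitutionChanged(String transition);
    void eventRaised(String eventName);     // raised from code inside a transition
}

class ObservableEngine {
    private final List<ExecutionObserver> observers = new ArrayList<>();
    void subscribe(ExecutionObserver o) { observers.add(o); }

    /** Controllability: an external program (e.g. the window manager handling
     *  input devices) asks the engine to fire the transition of a user service. */
    void fireUserService(String externalEvent) { /* look up and fire the transition */ }

    void notifyMarking(String place, int count) {
        for (ExecutionObserver o : observers) o.markingChanged(place, count);
    }
}

/** An observer that logs every notification to a file for offline analysis. */
class TraceLogger implements ExecutionObserver {
    private final PrintWriter out;
    TraceLogger(String path) throws IOException { out = new PrintWriter(new FileWriter(path)); }
    public void markingChanged(String p, int n) { out.printf("marking %s=%d%n", p, n); }
    public void substitutionChanged(String t)   { out.printf("substitution %s%n", t); }
    public void eventRaised(String e)           { out.printf("event %s%n", e); }
}
```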
5 Prototyping Interactive Systems with ICOs

This section presents the prototyping capabilities of PetShop and the ICO notation. These capabilities are presented on two examples extracted from case studies. They
show different aspects, illustrating how prototyping can be performed at different levels of the architecture of interactive systems.

5.1 Prototyping Interaction Techniques

The example in this section shows how the ICO notation can be used to prototype low-level interaction techniques. Such prototyping is critical for increasing the usability of interactive applications, as the fine-tuning of interaction can have a huge impact on the overall performance of users [13].
Fig. 3. ICO model of a mouse driver
The model of Fig. 3 describes a transducer for handling low-level events. It models how events from the input device (in this case a pointing device such as a mouse) are received and how they are transformed according to the needs of the interactive application. Dark transitions represent the transitions that are available according to the current marking of the model. Their black border means that they are connected to events, i.e., even though they are available according to the current marking, they must additionally receive an event to actually be fired. The model can receive four different events: mouseMove, mousePressed, mouseReleased and mouseClick. The current position of the cursor of the input device is stored in the place Currentxy. When a mouseMove event is received, the transducer has to transform the dx, dy parameters it receives into x and y positions in order to reflect the change on the mouse cursor. To keep the cursor inside a set of predefined bounds (for instance, the size of the screen or the size of a portion of a window), the transformation of the x and y values according to the dx and dy parameters has to be constrained. This is the role of the places named Bounds. As a notational aspect, these places are virtual places, i.e., virtual copies of a single place; this is used to reduce the number of arcs when the same place is connected to many transitions.

The code of the transitions mouseClick, mouseReleased and mousePressed contains the Trigger construct. This means that, when one of these transitions is fired, the model raises an event. Other models registered with the current model are then notified of each event triggered. The model in Fig. 4 shows how the previous model can be modified in response to requests for modification (after usability evaluation, for instance).
Fig. 4. Modified ICO model of a mouse driver (acceleration of mouse move events)
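To restate the transducer logic of Figs. 3 and 4 outside the Petri-net formalism, the following plain-Java sketch scales relative dx/dy movements by an acceleration coefficient (the token of the place Coef) and clamps the result to the Bounds places. The class name, method names and the coefficient value are illustrative assumptions; in the actual models this logic lives in the transition code:

```java
/** Plain-Java stand-in for the transducer transitions of Figs. 3 and 4. */
class MouseTransducer {
    private int x, y;                          // place Currentxy
    private final int minX, minY, maxX, maxY;  // places Bounds
    private double coef = 2.0;                 // place Coef (Fig. 4 only; value assumed)

    MouseTransducer(int minX, int minY, int maxX, int maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    /** Transition fired on a mouseMove event: accelerate, then clamp. */
    void mouseMove(int dx, int dy) {
        x = clamp((int) Math.round(x + coef * dx), minX, maxX);
        y = clamp((int) Math.round(y + coef * dy), minY, maxY);
    }

    /** Transitions carrying the Trigger construct re-raise higher-level events. */
    void mousePressed()  { trigger("pressed",  x, y); }
    void mouseReleased() { trigger("released", x, y); }
    void mouseClick()    { trigger("click",    x, y); }

    private void trigger(String event, int x, int y) {
        // Models registered with this one would be notified here.
        System.out.printf("%s at (%d,%d)%n", event, x, y);
    }

    private static int clamp(int v, int lo, int hi) { return Math.max(lo, Math.min(hi, v)); }
}
```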
The modification introduces a new element into the interaction technique: acceleration. Indeed, the movements on the table where the mouse is located are typically much more constrained than the virtual space available to the cursor. For this reason, mouse drivers embed an acceleration mechanism that increases cursor movement according to speed. This is modelled by adding the places Coef to the models and connecting them to the transitions in charge of calculating the new position of the cursor. The code of these transitions shows that the dx and dy parameters are multiplied by the coefficient (stored in the token of the place Coef).

5.2 Prototyping Applications

While the prototyping of interaction techniques is critical for the fine-tuning of interaction, prototyping is also needed at a higher level. This section presents how PetShop and ICO support prototyping at the dialogue level of interactive applications. The prototyping aspects remain the same as for the interaction technique, i.e., models describing the behaviour of the application at the dialogue level can be interactively modified, and the impact of the modifications can be immediately perceived.

The application under consideration here is the Multi-Purpose Interactive Application (MPIA), an application available in the cockpits of several aircraft that handles a number of flight parameters. It is made up of three pages (called WXR, GCAS and AIRCOND). The WXR page is responsible for managing weather-radar information; GCAS is responsible for the Ground Anti-Collision System parameters, while AIRCOND deals with the settings of the air conditioning. Due to space constraints we do not present the interactive modifications of the models in detail, but the interested reader can find the detailed behaviour of this application (in a reconfiguration process after a hardware failure in a cockpit) in [18].
6 Conclusion

This paper has presented the ICO notation for the description of interactive systems via graphical models that can be edited and executed at runtime. The ICO notation, an extension of object Petri nets, has a dedicated CASE tool called PetShop. This runtime capability increases the possibilities of modelling by supporting prototyping, testing and verification. The paper has shown how prototyping of interactive applications can be performed at two different levels: the interaction technique and the dialogue model. The latter was illustrated on an industrial example dealing with cockpit applications in civil aircraft. We have studied the usability of ICOs and PetShop for prototyping phases in an informal way with software engineers involved in the field of Air Traffic Control applications [2]. Informally, we can report that the modification of models worked well, while the creation of models and the connection of models were not performed in a satisfying way. The tool is available for testing at http://ihcs.irit.fr/petshop.

The specific application area that we consider in the paper is ground-segment applications for satellite control, but the results have been applied and are applicable to other application areas with similar requirements.

Acknowledgements. This work is supported by the EU-funded Network of Excellence ResIST, http://www.resist-noe.eu, contract n° 026764, and the CNES-funded R&T Tortuga project, http://ihcs.irit.fr/tortuga/, contract n° R-S08/BS-0003-029. We would also like to thank the reviewers for their in-depth, thoughtful comments.
References 1. Barboni, E., Navarre, D., Palanque, P., Basnyat, S.: Addressing Issues Raised by the Exploitation of Formal Specification Techniques for Interactive Cockpit Applications. In: HCI Aero 2006, p. t.b.p., Seattle (2006) 2. Bastide, R., Navarre, D., Palanque, P.: A Tool-Supported Design Framework for Safety Critical Interactive Systems. Interacting with computers 15(3), 309–328 (2003) 3. Bastide, R., Palanque, P., Duc, L.: Integrating Rendering Specifications into a Formalism for the Design of Interactive Systems. In: DSV-IS 1998, pp. 171–190 (1998) 4. Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley, US (1999) 5. Bernhaupt, R., Navarre, D., Palanque, P., Winckler, M.: Model-Based Evaluation: A New Way to Support Usability Evaluation of Multimodal Interactive Applications In Maturing Usability, Quality in Software, Interaction and Value. In: Human-Computer Interaction Series, pp. 96–119. Springer, Heidelberg (2007) 6. Dragicevic, P., Fekete, J.-D.: Input Device Selection and Interaction Configuration with ICON. In: Proceedings of IHM-HCI 2001, People and Computers XV - Interaction without Frontiers, pp. 543–448. Springer, Heidelberg (2001) 7. Esteban, O., Chatty, S., Palanque, P.: Whizz’Ed: a visual environment for building highly interactive interfaces. In: Proceedings of the Interact 1995 conference, pp. 121–126 (1995) 8. Figueroa, P., Green, M., Hoover, J.: InTml: A Description Language for VR Applications. In: Proceedings of Web3D 2002, Arizona, USA, pp. 53–58 (2002) 9. Fowler, M., Highsmith, J.: The Agile Manifesto. Software Development (August 2001) 10. Genrich, H.J.: Predicate/Transitions Nets. In: Jensen, K., Rozenberg, G. (eds.) High-Levels Petri Nets: Theory and Application, pp. 3–43. Springer, Berlin (1991)
11. Gulliksen, J., Goransson, B., Boivie, I., Blomkvist, S., Persson, J., Cajander, A.: Key principles for user-centred systems design. Behaviour and Inf. Tech. 22, 397–409 (2003) 12. Jacob, R.: A Software Model and Specification Language for Non-WIMP User Interfaces. ACM Transactions on Computer-Human Interaction 6(1), 1–46 (1999) 13. Kabbash, P., Buxton, W.A.: The “prince” technique: Fitts’ law and selection using area cursors. In: Proceedings of the ACM CHI Conference, pp. 273–279. ACM Press, New York (1995) 14. Lim, Y., Pangam, A., Periyasami, S., Aneja, S.: Comparative analysis of high- and lowfidelity prototypes for more valid usability evaluations of mobile devices. In: Proc. of NordiCHI 2006, vol. 189, pp. 291–300. ACM, New York (2006) 15. Navarre, D., Palanque, P., Dragicevic, P., Bastide, R.: An Approach Integrating two Complementary Model-based Environments for the Construction of Multimodal Interactive Applications. Interacting with Computers 18(5), 910–941 (2006) 16. Navarre, D., Palanque, P., Bastide, R., Sy, O.: Structuring interactive systems specifications for executability and prototypability. In: Palanque, P., Paternó, F. (eds.) DSV-IS 2000. LNCS, vol. 1946, pp. 97–120. Springer, Heidelberg (2001) 17. Navarre, D., Palanque, P., Bastide, R., Sy, O.: A Model-Based Tool for Interactive Prototyping of Highly Interactive Applications. In: 12th IEEE International Workshop on Rapid System Prototyping, Monterey, USA, IEEE, Los Alamitos (2001) 18. Navarre, D., Palanque, P., Basnyat, S.: Usability Service Continuation through Reconfiguration of Input and Output Devices in Safety Critical Interactive Systems. In: Harrison, M.D., Sujan, M.-A. (eds.) SAFECOMP 2008. LNCS, vol. 5219, pp. 373–386. Springer, Heidelberg (2008) 19. Navarre, D., Palanque, P., Bastide, R., Schyn, A., Winckler, M., Nedel, L.P., Freitas, C.M.D.S.: A model-based approach for engineering multimodal interactive systems. In: Costabile, M.F., Paternó, F. (eds.) INTERACT 2005. LNCS, vol. 3585, pp. 170–183. Springer, Heidelberg (2005) 20. Palanque, P., Bernhaupt, R., Navarre, D., Ould, M., Winckler, M.: Supporting Usability Evaluation of Multimodal Man-Machine Interfaces for Space Ground Segment Applications Using Petri net Based Formal Specification. In: Ninth International Conference on Space Operations, CD-ROM proceedings, Rome, Italy, June 18-22 (2006) 21. Parnas, D.L.: On the use of transition diagram in the design of a user interface for interactive computer system. In: Proceedings of the 24th ACM Conference, pp. 379–385 (1969) 22. Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs (1981) 23. Reason, J.: Human Error, 302 pages. Cambridge University Press, Cambridge (1990) 24. Rettig, M.: Prototyping for tiny fingers. Commun. ACM 37(4), 21–27 (1994) 25. Risoldi, M., Amaral, V.: Towards a Formal, Model-Based Framework for Control Systems Interaction Prototyping. Rapid Integration of Software Engineering Techniques, 144–159 (2007) 26. Rudd, J., Stern, K., Isensee, S.: Low vs. high-fidelity prototyping debate. Interactions 3(1), 76–85 (1996) 27. Vanacken, D., De Boeck, J., Raymaekers, C., Coninx, K.: NiMMiT: a Notation for Modelling Multimodal Interaction Techniques. In: International Conference on Computer Graphics Theory and Applications, Portugal (2006) 28. Wieting, R.: Hybrid High-Level Nets. In: Proc. of the 1996 Winter Simulation Conference, pp. 848–855. ACM Press, New York (1996) 29. 
Willans, J.S., Harrison, M.D.: Prototyping pre-implementation designs of virtual environment behaviour. In: Nigay, L., Little, M.R. (eds.) EHCI 2001. LNCS, vol. 2254, pp. 91–108. Springer, Heidelberg (2001)
RUCID: Rapid Usable Consistent Interaction Design
Patterns-Based Mobile Phone UI Design Library, Process and Tool
Avinash Raj¹ and Vihari Komaragiri²
¹ Toronto, Canada
avinash.raj@hotmail.com
² Bangalore, India
vihari@gmail.com
Abstract. This paper is based on a research effort at Kyocera Wireless, India, that aimed to overcome limitations in the mobile phone design process by giving designers an improved design and specification tool and helping them deal routinely with some of the more deep-rooted constraints of phone design. The tool extends the idea of templates from simple visual elements to more abstract design components. It adds further value to this modularization of design by taking the approach of an extensive and ever-growing library of patterns to define and refine these components. The components cover most of the low- to medium-level building blocks of design. They are specified in the library as tuples (patterns) of <design problem, design solution, context, constraints>, at each level of the hierarchy. The components are visually represented using standardized shapes with placeholder and help text, and are made available as part of the design work surface of a visual prototyping tool such as MS Visio or Adobe Fireworks.
Keywords: Mobile phone UI design, patterns, architecture, design process, library.
same set of applications over multiple mobile phones of either the same vendor or, in fact, even of different vendors.

1.1 Current Mobile Phone Interaction Design Process

Though design guideline documents deliver coherence among User Interfaces, they cannot be used effectively to communicate how different components of the design will work together and how Users will interact with them. In addition, guidelines can become obsolete or ignored very quickly in the fast-developing world of mobile phones. Furthermore, the interaction designer is limited to the task of specifying the design, which a software engineer then implements in an embedded software development environment that is notorious for its lack of sophisticated APIs for UI creation. This indirection, and the limitations inherent in the development environment, also mean that a lot of design intent and time can be lost in translation. There needs to be a way to put design implementation in the hands of interaction designers, and this needs to be done in a “backward”-compatible manner. Even when new technologies like FlashLite and uiOne become the platforms of UI development, there will still be some phones that require UI development in native code, so the solution will have to support both styles of interaction design and implementation.

While phone vendors have started adopting a platform approach to software development, adding incremental features to existing code bases and builds, this has the bug/feature of perpetuating design from older phones, whether good or bad. There is also no easy way of upgrading a design element for greater usability, because it is difficult to trace a design element across the various features where it is used. A solution to this problem could be to ensure that interaction design is modular to the extent possible and utilizes design elements in a consistent, traceable manner.

1.2 Proposal to Solve Usability, Consistency and Time-to-Market Constraints

The aim of this research is to provide a tool where interaction designers can choose from a pre-packaged design element library and use the appropriate element by mapping the usability constraints and context of the pattern onto the needs and context of the feature being designed. Presented in the form of a Microsoft Visio template-based prototype tool, these patterns can be easily used by designers. After the usual steps of analyzing the design problem, identifying the User goals and then breaking them down into tasks, the paper proposes a change to the design process. Instead of trying to sketch the design from scratch from that point onwards, the designer simply uses the design tool and its template library to look for and reuse design modules that already exist. For the part of the design that does not yet exist, the designer builds newer tasks and flows from existing building-block objects. The designer then adds these newer creations as potential candidates to the pattern template library, to be verified for usability, incorporated, and then used by other designers in creating other features. The designer achieves speed and design consistency with this approach. The modular pattern library allows for reuse across designers and design teams, and for usability refinement, design evolution and backward compatibility as the product evolves. A minimal sketch of such a library entry is given below.
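Under the assumption that a pattern is stored as the <design problem, design solution, context, constraints> tuple described in the abstract (the types and matching rule here are our own illustration, not the actual tool's data model), a library entry and its lookup might look like this in Java:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

enum PatternLevel { WIDGET, PRIMITIVE, COMPOUND, FLOW }

/** One library entry: the <problem, solution, context, constraints> tuple. */
record DesignPattern(String problem, String solution, String context,
                     List<String> constraints, PatternLevel level) {}

class PatternLibrary {
    private final List<DesignPattern> patterns;
    PatternLibrary(List<DesignPattern> seed) { this.patterns = new ArrayList<>(seed); }

    /** The designer maps the feature's context onto the library and gets back
     *  candidate patterns at the requested level of the design hierarchy. */
    List<DesignPattern> candidatesFor(String featureContext, PatternLevel level) {
        return patterns.stream()
                .filter(p -> p.level() == level)
                .filter(p -> p.context().toLowerCase().contains(featureContext.toLowerCase()))
                .collect(Collectors.toList());
    }

    /** Newly built designs are proposed back to the library as candidate
     *  patterns, to be verified for usability before other designers reuse them. */
    void propose(DesignPattern candidate) { patterns.add(candidate); }
}
```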
2 RUCID Basics

In this section, we present a novel formulation of mobile phone interaction design architecture and build on this framework using the patterns-based approach inspired by Christopher Alexander [2]. Some samples of mobile phone interaction design patterns are presented here to illustrate the concept. We draw on the work of Alan Cooper [1] to derive an interaction design architecture for mobile phones. The mobile phone has many input Triggers (typically the 12-key keypad, plus five-way navigation keys and so on). The context of use of the mobile phone is much different from that of the mouse, and hence its interaction design structure is quite different as well. There is a need for a Primitive action (for example, a key press) to achieve not just generic input/output or application-specific commands but to directly address User goals. We address this in our model as follows.

The User’s goals in using an application can be broken down into some generic tasks common across applications. These generic tasks precede and succeed specific tasks called into existence by the needs catered to by the feature being designed. For example, “starting an application or closing it” are typically generic tasks; “playing a music track” or “composing an SMS message” are feature-specific tasks. Generic tasks are made up of Flows of input interaction in conjunction with the output – actions, symbols, graphics, and other feedback information – expressed on the screen (and speakers, vibrations etc.) of a mobile phone. The Flows are represented as Idioms on the left-hand side of the Alan Cooper inverted pyramid, while the output is represented on the right-hand side. The output shown on screen can be further divided into information, widgets and graphics. A Flow can be thought of as a sub-task, or a sequence of Primitive and/or Compound actions, that results in an application-specific function being executed. A Compound action in turn constitutes a sequence of Primitive User actions and phone reactions that achieves a User’s sub-objective. In a typical mobile phone design, a Flow that achieves a User objective, or a Compound action that achieves a sub-objective, may consist of just one Primitive action; for example, a press-and-hold of the hash key can by itself achieve the User goal of locking the phone. The same goal can also be achieved by accessing settings from the menu, choosing the keypad lock menu option and then enabling the keypad lock option.

The architecture (Fig. 1) of interaction design in mobile phones is at the heart of our pattern exploration. It anchors our search for interaction design patterns in mobile phones and also provides the means to organize, link and document them. Since this model also articulates the typical top-down process of design, it lends itself to very practical application, as evidenced by the prototype tool that we created. In the following pages we look at sample patterns generated at the Widget, Primitive, Compound and Flow levels, one sample each. The Compound, Primitive and Flow patterns that follow are tailored to the tool rather than to the Alexandrian pattern form. Altogether, 23 Primitive, 3 Compound, 15 Flow and 29 Widget patterns were captured during the course of this research.
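Before turning to the sample patterns, the hierarchy of Fig. 1 and the keypad-lock example can be sketched as data; the types and trigger names below are illustrative only, not part of the RUCID tool:

```java
import java.util.List;

/** Sketch of the design hierarchy of Fig. 1: a user goal is achieved by
 *  Flows built from Compound and Primitive actions on input Triggers. */
sealed interface Action permits Primitive, Compound {}
record Primitive(String trigger, String gesture) implements Action {}  // e.g. press, press-and-hold
record Compound(List<Action> steps) implements Action {}               // a sequence achieving a sub-objective
record Flow(String userGoal, List<Action> actions) {}                  // achieves an application function

class KeypadLockExample {
    public static void main(String[] args) {
        // A Flow may consist of a single Primitive: press-and-hold '#' locks the phone.
        Flow shortcut = new Flow("Lock keypad",
                List.of(new Primitive("HASH_KEY", "press-and-hold")));

        // The same goal via the menu: a Compound sequence of key presses
        // (the exact navigation steps are assumed for illustration).
        Flow viaMenu = new Flow("Lock keypad",
                List.of(new Compound(List.of(
                        new Primitive("MENU_KEY", "press"),
                        new Primitive("NAV_DOWN", "press"),      // navigate to Settings
                        new Primitive("SELECT_KEY", "press"),    // open Keypad Lock option
                        new Primitive("SELECT_KEY", "press"))))); // enable it

        System.out.println(shortcut + "\n" + viaMenu);
    }
}
```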
Fig. 1. “A New Mobile UI Design Architecture Model” that details the hierarchy of design levels, starting from “User goals” all the way down to “Action-Triggers”

Table 1. Sample Widget pattern (Soft key Window)

Problem: User needs to access additional functions that can be performed on a screen.
Context: For a given Screen, a User has more possible actions than the maximum number of Soft keys.
Solution: One of the options accessible through a Soft key can provide a gateway to multiple options. The User can move to these options and select the desired one.
Rationale: A limited number of Soft keys can be displayed at any one time. A dedicated key cannot be assigned to each option because: 1) the options and their number keep changing depending on the screen; 2) the surface area of the mobile phone is small and limited. Using a single key to access a variable-sized list allows any number of items to be accommodated.