Grading Student Achievement in Higher Education
The grading of students in higher education is important for a number of reasons, not least because it can strongly influence whether particular career opportunities can be pursued. In contemporary higher education systems, students are expected to demonstrate not only high standards of academic achievement but also their broader employability. Many aspects of employability are difficult to grade, however, and some students may lose out because their particular strengths are not sufficiently acknowledged by current summative assessment practices. Drawing on evidence from Australia, the UK and the US, Grading Student Achievement in Higher Education appraises the way in which summative assessment in higher education is approached and demonstrates that current practices are of questionable robustness. Topics discussed include:

• the fuzziness of many grading criteria;
• the difficulty of achieving reliability in grading;
• aspects of student achievement that are resistant to numerical grading;
• differences between subject areas regarding the outcomes of grading;
• weaknesses inherent in the statistical manipulation of grades;
• variation between institutions in the regulations that determine overall grades.
The book also discusses critically, and with respect to a new analysis of data from the UK, ‘grade inflation’, showing that grades may rise for reasons that are not necessarily deplorable. Grading Student Achievement in Higher Education argues that there is a need to widen the assessment frame if the breadth of valued student achievements is to be recognised adequately and meaningful information is to be conveyed to interested parties such as employers. Concluding with suggestions towards resolving the problems identified, the book will appeal to researchers, managers and policymakers in higher education, as well as those involved in quality assurance and the enhancement of teaching and learning. Mantz Yorke is Visiting Professor in the Department of Educational Research, Lancaster University, UK.
Key Issues in Higher Education series Series Editors: Gill Nicholls and Ron Barnett
Books published in this series include:

Citizenship and Higher Education
The role of universities in communities and society
Edited by James Arthur with Karen Bohlin

The Challenge to Scholarship
Rethinking learning, teaching and research
Gill Nicholls

Understanding Teaching Excellence in Higher Education
Towards a critical approach
Alan Skelton

The Academic Citizen
The virtue of service in university life
Bruce Macfarlane
Grading Student Achievement in Higher Education
Signals and shortcomings
Mantz Yorke
First published 2008 by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
Simultaneously published in the USA and Canada by Routledge, 270 Madison Ave, New York, NY 10016
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2008 Mantz Yorke
This edition published in the Taylor & Francis e-Library, 2007.
“To purchase your own copy of this or any of Taylor & Francis or Routledge’s collection of thousands of eBooks please go to www.eBookstore.tandf.co.uk.”
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.
British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library
Library of Congress Cataloging in Publication Data
Yorke, Mantz.
Grading student achievement in higher education: signals and shortcomings/Mantz Yorke.
p. cm. – (Key issues in higher education series)
Includes bibliographical references and index.
ISBN 978-0-415-39396-6 (hardback)
1. Grading and marking (Students) 2. Grading and marking (Students) – Case studies. I. Title.
LB2368.Y67 2007
371.27´2–dc22
2007008677
ISBN 0-203-93941-7 Master e-book ISBN
ISBN10: 0-415-39396-5 (hbk)
ISBN10: 0-203-93941-7 (ebk)
ISBN13: 978-0-415-39396-6 (hbk)
ISBN13: 978-0-203-93941-3 (ebk)
To past and present members of the Student Assessment and Classification Working Group, in appreciation of their colleagueship and support in difficult times. Also to the memory of Peter Knight, great friend and collaborator on many projects, who died as this book went to press.
Contents
List of figures  viii
List of tables  ix
Acknowledgements  xi
Abbreviations  xii
Prologue: through a glass, darkly  1
1 The complexity of assessment  10
2 Grading and its limitations  31
3 Variations in assessment regulations: three case studies  68
4 UK honours degree classifications, 1994–95 to 2001–02: a case study  81
5 How real is grade inflation?  105
6 The cumulation of grades  134
7 Value added  155
8 Fuzziness in assessment  172
9 Judgement, rather than measurement?  182
Epilogue: beyond the sunset  207
References  209
Index  231
Figures
2.1 Scattergram of gain/loss in percentage mark, for 791 students in a new university in the UK  65
4.1 Illustrations of a rising trend, but at a relatively weak level of statistical significance, and of a rising trend, but at a much higher level of statistical significance  90
4.2 Trends in entry qualifications and exit performance in Biological Sciences for Russell Group and other pre-1992 universities  95
6.1 The distribution of mean percentages for candidates’ honours degree classifications  136
6.2 A hierarchy of professional practice, based on Miller (1990) and Shumway and Harden (2003), and related to possible assessment methods  141
6.3 Mean percentages for 832 students, with confidence limits set at 2 times the SE(M) either side of each observed mean  148
6.4 An illustration of the loss of information as a consequence of cumulating assessment outcomes  153
8.1 Frequency of mention of three aspects of achievement related to categories of honours degree outcome  177
9.1 An approach to the assessment of portfolios  189
Tables
1.1 Purposes of assessment  11
1.2 Criteria for assessment, and some related questions  20
1.3 Two contrasting models of educational assessment  27
1.4 Realist and relativist perspectives on assessment  27
2.1 Grading of undergraduate work at the University of Derby  36
2.2 An illustration of the diversity of reporting grades in Europe  37
2.3 Contrasting perspectives on the number of scale divisions  38
2.4 A matrix relating assessment outcome to competence in practice  39
2.5 Illustrative statistical data from a selection of modules  57
2.6 Percentages of A and B grades in a selection of high-enrolment courses  58
3.1 Grades outside A to F which can appear on students’ records at the University at Buffalo  69
3.2 Approaches to the reporting of student achievement  70
3.3 Number of modules that can be assessed on a pass/fail basis for each term/quarter/semester  71
3.4 Number of modules that can be assessed on a pass/fail basis during a student’s academic career  71
3.5 The lowest passing letter grade  71
3.6 Institutional rules regarding the retaking of modules  72
3.7 The number of times a single module can be repeated  72
3.8 Degree classification in some Australian universities  78
3.9 Grade points at the University of Canberra  79
4.1 Conflicting interpretations of the percentage of ‘good honours degrees’  82
4.2 An implausible ‘run’ of awards  85
4.3 Sharp changes in the number of unclassified degrees awarded  87
4.4 Trends in the percentage of ‘good honours degrees’ by subject area, 1994–95 to 2001–02  89
4.5 Trends in ‘good honours degrees’ by institutional type, 1994–2002  91
4.6 Trends in ‘good honours degrees’ in pre-1992 universities, 1994–2002  92
4.7 Rising and falling trends in the three different types of institution  93
4.8 Entry and exit data for Russell Group (RG) and non-Russell Group pre-1992 universities  94
4.9 A comparison of the proportion of good honours degrees awarded in selected subject areas in three groups of institutions  97
4.10 Levels of part-time employment reported by first year full-time students from contrasting socio-economic backgrounds  99
5.1 Mean percentages of A and B grades of undergraduates who graduated from high school in the stated years  113
5.2 Percentages of bachelor’s degrees gained in the UK, by broad subject area, summer 2005  118
5.3 The effect of +/– grading on undergraduate grades at North Carolina State University  131
6.1 Variations on an ‘original’ set of marks, and their implications for the honours degree classification  144
6.2 The distribution of overall percentages of 832 students between bands of the honours degree classification  149
7.1 An excerpt from the feasibility study showing the performance of students  160
7.2 An excerpt from the feasibility study showing the employment profile of students  160
7.3 Broad benefits accruing from participation in higher education  165
8.1 An adapted excerpt from a document informing students of assessment issues  175
8.2 The grading of work according to the extent to which expected learning outcomes have been achieved  176
9.1 Categories of educational objectives  184
9.2 Aspects of ‘graduateness’  188
9.3 The categories of the SOLO taxonomy mapped against those of professional development according to Dreyfus and Dreyfus  191
9.4 A fuzzy approach to the honours degree classification and GPA  198
Acknowledgements
This book owes a great deal to the Student Assessment and Classification Working Group in the UK, which has been active, particularly in respect of the ‘macro’ side of assessment, since 1994. Most of the past and present members of SACWG are listed in references to multiple-authored papers whose first authors are Bridges, Woolf and Yorke. It also owes a lot to others whose names appear at intervals in the text, and also to the encouragement of Vic Borden and Sally Brown. I am grateful to the following for permission to include material: the University of Derby, in respect of Table 2.1; the University of Canberra, in respect of Table 3.9; the Higher Education Funding Council for England, in respect of Tables 7.1 and 7.2; the Institute for Higher Education Policy, in respect of Table 7.3; the Quality Assurance Agency for Higher Education, in respect of Table 9.2; and the Higher Education Academy, for permission to draw on material originally commissioned to support the work of the Burgess Group. The data analysed in Chapter 4 were provided over a number of years by the Higher Education Statistics Agency (HESA), which requires authors who use its data to point out that HESA cannot accept responsibility for any inferences or conclusions derived from the data by third parties. None of the above is responsible for the content of this book, of course: any sins of commission and omission are mine alone.
Abbreviations
AACRAO  American Association of Collegiate Registrars and Admissions Officers
ACER  Australian Council for Educational Research
ACT  American College Testing
APL  assessment of prior learning
AQA  Assessment and Qualifications Alliance
AVCC  Australian Vice-Chancellors’ Committee
CAE  Council for Aid to Education
CEM  Curriculum, Evaluation and Management (Centre, Durham University)
CEQ  Course Experience Questionnaire
CLA  Collegiate Learning Assessment (Project)
CMU  Campaign for Mainstream Universities (now Coalition of Modern Universities)
CNAA  Council for National Academic Awards
CPD  continuing professional development
DEST  Department of Education, Science and Training
DfES  Department for Education and Skills
ECTS  European Credit Transfer and Accumulation System
ESECT  Enhancing Student Employability Co-ordination Team
GCSE  General Certificate of Secondary Education
GPA  grade-point average
HEFCE  Higher Education Funding Council for England
HEFCW  Higher Education Funding Council for Wales
HEQC  Higher Education Quality Council
HESA  Higher Education Statistics Agency
HSR  High School Record
IHEP  Institute for Higher Education Policy
LTPF  Learning and Teaching Performance Fund
LTSN  Learning and Teaching Support Network
MRSA  Measuring and Recording Student Achievement (Scoping Group)
NAB  National Advisory Body for Public Sector Higher Education
NCIHE  National Committee of Inquiry into Higher Education
NCPPHE  National Center for Public Policy and Higher Education
NGA  National Governors Association
NUCCAT  Northern Universities Consortium for Credit Accumulation and Transfer
NVQ  National Vocational Qualification(s)
OECD  Organisation for Economic Co-operation and Development
OFSTED  Office for Standards in Education
OIA  Office of the Independent Adjudicator for Higher Education
OSCE  Objective Structured Clinical Examination
PCFC  Polytechnics and Colleges Funding Council
PDP  personal development planning
PISA  Program for International Student Assessment
QAA  Quality Assurance Agency for Higher Education
RAE  Research Assessment Exercise
SACWG  Student Assessment and Classification Working Group
SAT  Scholastic Aptitude Test
SCoP  Standing Conference of Principals (now GuildHE)
SED  Scottish Education Department
SEEQ  Student Evaluation of Educational Quality
THES  The Times Higher Education Supplement
TQi  Teaching Quality Information
UCAS  Universities and Colleges Admissions Service
UUK  Universities UK
VAM  value added modelling
Prologue
Through a glass, darkly
There is no general agreement in higher education regarding how student performances (whether in coursework or in formal examinations) should be graded, and no general understanding of the detail of how grades are cumulated into an overall index of achievement. This presents those who use grades with an interpretive challenge, the scale of which is rarely appreciated. What can be inferred from the fact that an applicant for a job claims to have an upper second class honours degree or a grade-point average at bachelor’s level of 3.24? Not a lot, since, even when the subject(s) of study are named, there is often relatively little detail of the content or about the circumstances under which the student’s achievements were produced. A recruiter will, if the applicant passes the first selection filter, delve into the background of the achievements in order to tease out more information. Some of that information will be grades achieved in components of the programme of study: the odds are that these will not be subjected to much scrutiny. A score of, say, 62 per cent in a relevant module may be deemed adequate for the recruiter’s purposes, without clarification of the assessment tasks involved or how such a percentage stands in relation to those of the student’s peers. The number is accorded a robustness that it does not merit.

There is, in addition, variation in the scales that institutions use in grading work. In Australia, the Australian Vice-Chancellors’ Committee reported in 2002 that there were at least 13 different grading scales in operation for reporting students’ overall achievement at bachelor’s level.1 In a number of universities, the grading methodology was determined at faculty level. Across 21 European higher education systems, there is a similar amount of variability in respect of grading students’ overall performances at bachelor’s level (Karran, 2005). The challenges to the international recruiter are obvious. The Australian report Striving for quality (DEST, 2002) expressed concerns that resonate beyond that continent’s shores when it observed:
1 The use of the words ‘at least’ indicated the opacity of practice across the Australian system. See http://www.avcc.edu.au/documents/universities/key_survey_summaries/Grades_for_Degree_Subjects_Jun02.xls (accessed 19 May 2006).
The variability in assessment approaches has the potential to cause domestic and international students considerable confusion. Many school leavers are coming to higher education with an experience, understanding and expectation of assessment that is seldom the same as that they experience in higher education. Most overseas students will be arriving with other frames of reference in terms of assessment. The lack of consistency also serves to complicate the application of a common understanding of standards across the sector.
(DEST, 2002: para 152)

There is plenty of room for confusion and misunderstanding, whether on the part of a student making the transition into higher education or of an employer in the process of recruitment. A lot hangs on the robustness and interpretability of the grades that are awarded: a contention in this book is that grading is asked to bear considerably more weight than it can validly sustain. Geisinger (1982: 1139) perhaps inadvertently pointed to the problem when writing ‘The chief function of marks is to carry information concisely, without needless detail’. The difficulty is that the conciseness of the mark is accompanied by the loss of the detail that allows the receiver to appreciate with some richness the achievement that has been made, unless supplementary information is made available. Too often the shorthand of the raw mark or derived grade is taken synecdochically for a fuller report.

Richard James, who was a protagonist in a major survey of assessment practices in Australia,2 concluded that

Assessment is possibly one of the least sophisticated aspects of university teaching and learning.
(James, 2003: 197)
Disarray

Summative assessment in higher education is, as Knight (2002) argues, in disarray – indeed, in such disarray that James’ words constitute something of an understatement. With very few exceptions, such as in the case of Medicine, summative assessment is probably the least sophisticated aspect of provision in higher education. More broadly, the Quality Assurance Agency for Higher Education in the UK has consistently shown in its reports that assessment is the aspect of curriculum that stands in the greatest need of development. The disarray is detectable at a number of points in the systems in use for summative assessment, which are brought together under the five broad headings below and are discussed in the chapters that follow.
2 See the extensive bank of resources compiled by James et al. (2002a).
1 Variation in assessment practices and regulations. There is in the literature testimony to the variation in assessment practices between and within institutions (and, beyond this, between and within national systems of higher education), but rather less about the variation in assessment regulations which exert an often unappreciated influence on assessment outcomes.
2 Lack of technical robustness. The concern here is typically in respect of the reliability of the assessment process. Reliability may be achievable only at prohibitive cost. Validity can be problematic, especially where the student is expected to demonstrate the ability to deal with complex situations of various kinds. The duality of validity, in respect of past and future performance, is a longstanding problem.
3 Concerns about grading. Grading patterns vary across subject disciplines. Assessment criteria are often fuzzy. Some aspects of performance are less amenable to grading than others, yet assessment regulations often require performances to be fitted into a grading system for which they are ill-suited. A further concern about grading relates to the way in which arithmetical manipulations are often used inappropriately to produce an overall grading for a student’s achievements. Grade inflation is perceived by many to be a problem.
4 Lack of clarity about the student’s performance. Even when expected learning outcomes are stated in some detail, the extent to which these cover the domain of study is not always apparent, and the student’s performance may be based on a different profile of achievement from the profile specified for the programme of study. Further, the assessment demand may influence the ‘curriculum in use’.
5 Problems in communication. Assessment (whether formative or summative) is a complex signalling system in which breakdown of communication may occur because the recipient of information about grading may misinterpret its meaning – for example, by being unaware of the conditions under which the performance was achieved.

The purpose of this book, which concentrates on assessment at the level of the bachelor’s degree, is to explore a number of the problems with grading and summative assessment in general, in order to demonstrate that it is in need of radical reform, and to suggest some ways forward. The system is broke, and needs fixing.
What prompted this book

The stimulus for this book derived from my membership of the Measuring and Recording Student Achievement (MRSA) Scoping Group established by Universities UK and the Standing Conference of Principals in 2003, which reported as UUK and SCoP (2004). The MRSA Scoping Group was charged with reviewing the recommendations from the government White Paper The Future of Higher Education (DfES, 2003) relating to recording student achievement, value added, degree classifications and credit systems. This aspect of the White Paper probably had its origins in the Dearing Report (NCIHE, 1997: para 9.37ff), in which the assessment and the recording of student achievement were discussed. The Dearing Report noted that the evidence it had received from a substantial minority of contributors indicated that they took the view that the honours classification system had outlived its usefulness, and went on to observe that those who held this view felt that,

while the classification made sense in a small homogeneous system where general classifications said something meaningful about a student’s achievements, it no longer provided useful information, given the varying aims of degree programmes.
(ibid.: para 9.44)

In other words, the diversity inherent in a massified system of higher education undercut the validity and value of a single index of achievement – a point that is germane beyond the shores of the UK. With the adoption of a more detailed approach to the recording of student achievement, the report envisaged that the degree classification system would become increasingly redundant (ibid.: para 9.52). The MRSA Scoping Group was charged with considering in particular:

• The relationships that potentially exist between recording student achievement, measuring value added, degree classification and credit.
• Existing work and research that could inform the work of the Group and the taking forward of its recommendations and the input of experts from the sector.
• The diversity in missions of providers of Higher Education, and their students and the autonomy of their systems and processes.
• International implications of both the issues and suggested outcomes within the groups [sic] remit, particularly in relation to the Bologna process.
(UUK and SCoP, 2004: 51)
In the context of this book, the most important areas of the MRSA Scoping Group’s work (rephrased here) were:

• reviewing current approaches to the recording of student achievement;
• identifying robust ways of describing, measuring and recording student achievement;
• evaluating research on the honours classification system;
• reviewing progress on the use of transcripts and personal development portfolios;
• developing more sophisticated ways of measuring ‘value added’.
As the Scoping Group’s work unfolded, it became apparent that the dimensions of ‘the assessment problem’ were so extensive that the time available to deal with it was insufficient. Much that might have been brought into consideration was left unexamined. Notwithstanding these difficulties, a key finding of the Group was that,

whilst the UK honours degree is a robust qualification which continues to serve us well, the existing honours degree classification system has outlived its usefulness and is no longer fit for purpose. There should be further investigation of alternative systems for representing achievement which better meet the needs of different audiences and a set of criteria need to be identified and agreed for the purposes of evaluating such a system. There is merit in incorporating some of the existing initiatives in this area including the higher education Transcript, the Progress File and Personal Development Planning. Account must also be taken of developments elsewhere in the UK, in other sectors and European developments such as the Diploma Supplement and the Europass.
(UUK and SCoP, 2004: Executive Summary)

Given even a cursory examination of assessment methodology, this was not a difficult conclusion to reach. The greater challenge of making proposals to deal satisfactorily with the problem of the honours degree classification was left to its successor (and differently constituted) group, ‘the Burgess Group’. The Burgess Group came up with a proposal (also foreshadowed in the Dearing Report3) for a three-category approach to the honours degree – distinction, pass, fail – backed up by transcript material which would have to be consistent with the requirements of the Bologna Declaration in Europe (UUK and SCoP, 2005). This proposal received a mixed response, and after further work, including research on grading systems in the UK and elsewhere (see Chapter 3), the Burgess Group issued a further consultation document which included a proposal that honours degrees should be awarded on a pass/fail basis but supplemented with information in transcript form that would satisfy the expectations of the Bologna Declaration (UUK and GuildHE, 2006). At the time of writing, one of the possibilities being considered by the Burgess Group was that classification should be discontinued, with the emphasis shifting to the provision of transcripts recording the achievements of students (which might be credit that could be set against future studies, sub-degree awards and the honours degree itself). This is a matter that is likely to continue to be debated for some time before a sector-wide resolution is reached.

3 See NCIHE (1997: para 9.52).
Not a parochial wrangle

Using the wrangling over the honours degree classification as a starting point might lead readers to think that what follows is a rather parochial discussion of assessment in higher education in the UK. It is not. Most of the issues discussed are relevant to assessment in other systems (those in Australia and the US are emphasized in this book), even if there are differences in precise details of the way in which assessment is treated and the outcomes of assessment are recorded. This book follows in the footsteps of others such as Milton et al. (1986), who have argued, with varying degrees of persuasiveness, that grading processes are generally suspect, and that summarizing performances in a single index is particularly so. Although the use of transcripts of achievement mitigates the worst failings of mono-indexing, transcripts are incapable of overcoming a number of the weaknesses in grading, and of warranting some aspects of performance.

A challenge to assessment has crept up on higher education systems as governments at national and state level have sought to tie higher education more closely to a ‘human capital’ approach to policy-making. Higher education institutions around the world are having to come to terms with the development in students of ‘employability’ (see Chapter 1). This takes assessment beyond the subject discipline into areas of achievement that are particularly difficult to assess, such as students’ performances in workplaces and other environments in which academic capabilities are put to practical use. The shift in policy towards a more instrumental view of higher education exacerbates a tension in assessment that has always existed – that between the academic (what Oakeshott, 1962, somewhat misleadingly termed ‘technical knowledge’) and the more overtly practical capability that contributes significantly to success in workplaces and life in general.4 Ryle (1949) succinctly – and more appropriately – summed up the contrast as between ‘knowing that’ and ‘knowing how’. The latter is the more difficult to accommodate in the kinds of grading scale that are typical of higher education.

It might be thought that increasing the tightness of definition of instructional objectives (or, in more contemporary terms, of intended learning outcomes) would be complemented by greater tightness in grading practice. In some respects, the suggestion is well founded; in others – the more wide-ranging kinds of demand made in higher education – it is not.

Assessment and grading are, at root, social practices into which new colleagues are inducted, with varying degrees of formalism. A preliminary investigation in the UK (Yorke et al., 2000) showed that a majority of teachers had gained their understanding of marking students’ work from internal and external colleagues, and a small number said that they based their approach on what had been done to them when they were students. Some had developed their understanding of assessment through workshops run by the university’s educational development unit. Contact with colleagues elsewhere suggests that this picture may not be atypical, though the recent emphasis in the UK on academics’ professional development as teachers is likely to have shifted the balance towards formal learning about assessment methodology and away from informal and non-formal learning. It is unlikely, given the competing pressures on academics’ time, that many will have found the time to study assessment practices in any depth. Given the complexity inherent in assessment, it is probably fair to suggest that academics undertake relatively little developmental work in the area – a point which Rosovsky and Hartley (2002: 14) make about the United States.

4 One might wish, following Schön (1983), to dispute Oakeshott’s contention that practical knowledge is not reflective.
Purposes

This book has three main purposes:

1 to draw the attention of academics and other interested parties to the complex and problematic nature of summative assessment;
2 to suggest some ways in which summative assessment might be developed in order to respond to changing expectations of higher education;
3 to provide a resource for those who hold responsibilities for the development of assessment practices at institutional and supra-institutional levels.

It also has a relevance beyond the ambit of higher education. Ebel (1972) implicitly acknowledged a concern regarding the general level of expertise in the practice of assessment when he wrote:

The more confident a teacher is that he [sic] is doing a good job of marking, the less likely he is to be aware of the difficulties of marking, the fallibility of his judgments, and the personal biases he may be reflecting in his marks.
(Ebel, 1972: 309)

David Boud, a couple of decades later, expressed a similar concern regarding the insouciance with which assessment has at times been approached:

There is often a gap between what we do in teaching as academics and what we do in other aspects of our professional practice. This is particularly marked in our approach to assessment. We place a high value on critical analysis in our own work, but we are in general uncritically accepting of our assessment practices.
(Boud, 1990: 101)

Ebel (1972: 310) remarked that there was a need to recognize shortcomings in assessment. This recognition, he said, was the beginning of wisdom: the cultivation of wisdom implied the need to work to rectify the shortcomings. On the evidence presented in these pages, work is still needed. This book aims to plant a few varied seeds in that under-explored corner of the curricular garden in which are found the grading and reporting of student achievement.
Navigating this book

Readers will approach this book with differing purposes in mind. Some will want to follow the argument through from the beginning to the end. More are likely to want to dip into the book for particular aspects of grading. The main chapters are 2 (Grading and its limitations), 6 (The cumulation of grades), 8 (Fuzziness in assessment) and 9 (Judgement, rather than measurement?): the titles of the chapters hint at the trajectory of the argument of the book as a whole. Elsewhere, Chapter 1 is a brief package trip, rather than a more extended tour d’horizon, around the complexity that is inherent in assessment. Chapter 3 illustrates, with reference to three differing case studies of assessment regulations, that considerable care needs to be taken in interpreting grades, and to a limited extent it foreshadows the discussion in Chapter 6 of the cumulation of grades into a single index of achievement. Chapter 4 is a case study of honours degree classifications in England, Wales and Northern Ireland, which shows that the proportion of ‘good honours degrees’ tended to rise during the period 1994–2002. Such findings may be considered – not necessarily accurately – to be evidence of grade inflation, which is the theme of Chapter 5. Chapters 4 and 5 both point to the need for considerable caution before the words ‘grade inflation’ are used. Chapter 7, on ‘Value added’, is a digression from the main argument, though the political attractiveness of the concept is such that it could not be ignored. A short epilogue concludes the book.

In seeking to accommodate the needs of those who will be dipping into the book, there is a little duplication of content here and there. This may be a minor irritant to those who read the text through from start to finish, but perhaps will be condoned in the interest of a wider utility for the book.

A note on terminology

Module, course and programme

The term ‘module’ is used generically in this book for a component of a whole programme of study (e.g. leading to a bachelor’s degree), and ‘programme’ is used in a similarly generic fashion for the full complement of components. The use of the terms ‘module’ and ‘programme’ avoids the ambiguity associated with the word ‘course’, which can apply to both (though not simultaneously). The word ‘course’ has however been retained where it appears in a specific quotation, and in that context is probably not ambiguous.

Marks and grades

The potential ambiguity of ‘marks’ and ‘grades’ is not so cleanly resolved. ‘Marks’ refers to the raw scores awarded to items of work, typically a percentage. ‘Grades’ refers to what is reported: thus a mark of 93 per cent is, in usage typical of the US, converted into a grade of ‘A’. The difficulty arises when grades are awarded for achievements, as occurs in a minority of institutions in the UK. Provided the reader is aware of the context of use of the term ‘grade’, the ambiguity may not be much of a problem.
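As a minimal illustration of the marks/grades distinction just described, the sketch below converts raw percentage marks into US-style letter grades and then cumulates them into a grade-point average. The cut-offs (90/80/70/60) and the 4-point scale are common conventions assumed purely for illustration, not a description of any particular institution's scheme.

```python
# A minimal sketch of the marks/grades distinction: raw percentage marks
# ('marks') are converted to US-style letter grades ('grades'), which are
# then cumulated into a grade-point average. The cut-offs and grade points
# are common conventions assumed purely for illustration.

GRADE_BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def to_letter(mark: float) -> str:
    """Convert a raw percentage mark to a letter grade."""
    for cutoff, letter in GRADE_BANDS:
        if mark >= cutoff:
            return letter
    return "F"

def gpa(marks: list[float]) -> float:
    """Unweighted grade-point average of the corresponding letter grades."""
    return sum(GRADE_POINTS[to_letter(m)] for m in marks) / len(marks)

module_marks = [93, 78, 85, 61]              # raw marks
print([to_letter(m) for m in module_marks])  # ['A', 'C', 'B', 'D']
print(round(gpa(module_marks), 2))           # 2.5: a single cumulated index
```

The point of the sketch is how much detail the single figure discards: quite different profiles of marks can yield the same GPA, which is part of the interpretive challenge described in this Prologue.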
Honors and honours

The spellings in use in the US, and in Australia and the UK, have been retained – probably to advantage, since the terms have different meanings in each context.
Chapter 1
The complexity of assessment
Introduction

Student achievement is assessed for a variety of purposes, some of which are in tension with others. This multiplicity of purposes engenders compromises that are not always helpful to the assessor. This chapter clears the ground for more detailed examination of grading in the subsequent chapters. There are plenty of books covering various aspects of assessment (e.g. Brown and Knight, 1994; Brown et al., 1997; Walvoord and Anderson, 1998; Heywood, 2000; Knight and Yorke, 2003; Walvoord, 2004) to which the reader can turn for discussions of approaches to assessment; because the emphasis of this book is on grading and its implications, this chapter is limited to a brief overview of a number of the main aspects of assessment. These include formative and summative assessment, with emphasis being given predominantly to the latter; norm- and criterion-referencing; and technical issues in assessment. The approaches that are adopted in respect of summative assessment, and their technical quality, influence the approach taken in this book towards the long-running debate on the extent to which achievements can be measured, or have to be judged.
Purposes of assessment

Students are assessed for three main reasons: to promote learning; to certify achievements; and to provide data that can be used for quality assurance (sometimes quality control) purposes (Table 1.1). Boud (2000: 159) refers to assessment doing ‘double duty’ – the ostensible and the tacit. His elaboration and Table 1.1 suggest the multiplicity of purposes of assessment under both the ostensible and the tacit. Hounsell (2007) summarizes neatly the tensions inherent in assessment.

[Assessment] is called upon to be rigorous but not exclusive, to be authentic yet reliable, to be exacting while also being fair and equitable, to adhere to long-established standards but to reflect and adapt to contemporary needs, and at one and the same time to accommodate the expectations not only of academics, their students and the university in which both are engaged, but also of government and government bodies, . . . employers, professional and accrediting organisations, subject and disciplinary associations, parents, and the public at large.

Hounsell goes on to suggest that the most challenging tension is probably that between summative and formative assessment.

Table 1.1 Purposes of assessment

Learning
• To motivate students
• To diagnose strengths and weaknesses
• To provide feedback
• To consolidate work done to date
• To help students develop their capacity for self-assessment
• To establish the level of achievement at the end of a unit of study

Certification
• To establish the level of achievement at the end of a programme of study
• To pass or fail a student
• To grade or rank a student (with reference to norms and/or criteria)
• To underwrite a ‘licence to practise’
• To demonstrate conformity with external regulations, such as those of a professional or statutory body
• To select for employment, further educational activity, etc.
• To predict future performance

Quality assurance
• To assess the extent to which a programme’s aims have been achieved
• To judge the effectiveness of the learning environment
• To provide feedback to teachers regarding their personal effectiveness
• To monitor levels of achievement over time
• To assure interested parties that the programme or unit of study is of an appropriate standard
• To protect the relevant profession
• To protect the public

Note: This table is from Yorke (2005), and draws on Atkins et al. (1993), Brown et al. (1997: 11), Yorke (1998a: 178) and Nicklin and Kenworthy (2000: 108–109).
Summative and formative assessment

Summative assessments are couched in what Boud (1995) terms ‘final language’, since they sum up the achievements of students. The certification of achievement is a summative, ‘high stakes’ matter for students in that it possesses direct implications for their futures. Such certification has to be robust in that it has to demonstrate such technical qualities as high validity and reliability.

Some summative assessments may not, in themselves, be particularly ‘high stakes’ in character. They may count relatively little towards an overall grade computed in respect of a whole course. A group presentation, for example, may be given a percentage mark, yet be weighted to a relatively small extent in the grade awarded for a module. Further, marks for some kinds of task cluster closely and may have a very limited influence on an overall module grade.

Formative assessment does not necessarily have to reach the level of technical quality that is expected of summative assessment, since its primary purpose is to encourage the learner, in one way or another, to develop their capacity to meet the challenges that face them. Greenwood et al., for example, say that formative assessment

implies no more (and no less) than a discerning judgement about [a] learner’s progress; it is ‘on-going’ in the sense that it goes on all the time; and it is formative in so far as its purpose is forward-looking, aiming to improve future learning (as distinct from the retrospective nature of summative assessment).
(Greenwood et al., 2001: 109)

Formative assessment is dialogic, conversational in intent, seeking to engage the student in identifying ways in which performance can be improved – and acting on the enhanced understanding. For some students, the challenge may be to repeat a task on which they have been adjudged to have failed; for others, it may be to do better on the next task that faces them. Formative assessment is, in principle, ‘low stakes’ since it is concerned with development much more than it is with grading – indeed, formative assessments may not involve any grading.

However, some assessments are both formative and summative. Assessments that take place within modules may be both formative, providing feedback on performance, and summative, in that they count towards the grade to be awarded for performance on the module as a whole. Examples of such assessments are multiple-choice tests, formal class quizzes and short assignments (which may cumulate to fulfil the assessment requirements in a manner such as the ‘Patchwork Text’ described by Winter, 2003). It is for this reason that seeing high stakes as relating to summative, and low stakes to formative, assessment is an over-simplification. This book focuses on summative assessment, which has become increasingly challenging as the expectations placed on higher education have evolved.
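To illustrate the earlier point that a component such as a group presentation may be given a percentage mark yet carry little weight in the module grade, here is a minimal sketch of weighted cumulation. The component names, marks and weights are invented for illustration.

```python
# Minimal sketch: a module grade computed as a weighted mean of component
# marks. Component names, marks and weights are invented for illustration.

def module_mark(components: dict[str, tuple[float, float]]) -> float:
    """components maps name -> (mark out of 100, weight); weights sum to 1."""
    return sum(mark * weight for mark, weight in components.values())

components = {
    "examination":        (58.0, 0.60),
    "essay":              (64.0, 0.30),
    "group presentation": (75.0, 0.10),  # a good mark, but lightly weighted
}
print(round(module_mark(components), 1))  # 61.5

# A 40-mark fall in the presentation shifts the module mark by only 4 points:
components["group presentation"] = (35.0, 0.10)
print(round(module_mark(components), 1))  # 57.5
```

The weighting scheme, as much as the marks themselves, shapes the reported outcome, a theme taken up in Chapter 6 (The cumulation of grades).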
Some issues in summative assessment

Coursework v. examination

It must be borne in mind that ‘coursework’ and ‘examination’ are portmanteau terms within which there can be considerable variation. Coursework can span a range from extended projects to short pieces of work such as a 500-word review, and examinations can include not only the traditional unseen examination paper, but also papers made available to students before the day of the examination, when they have to respond without resources to hand, and ‘open book’ examinations, all of which are conducted under tightly controlled conditions.

The point is frequently made that coursework tends to attract higher grades than examinations, and there is empirical data to this effect (e.g. Bridges et al., 1999, 2002; Yorke et al., 2000; Simonite, 2003). Simonite’s study pointed up the significance of the difference when she noted that, in the case of Biology and Molecular Sciences, if four modules that counted towards the honours degree classification switched from a mixture of examinations and coursework to coursework only, this would on average raise a student’s mean mark by 0.7 of a percentage point – enough to influence a number of classifications across the student cohort. A second effect of making more use of coursework in assessment is a tendency to narrow the range of marks, as Simonite points out. However, whereas some students might gain from a shift towards coursework, others might fare less well. She raises the question of what constitutes fairness in assessment.

The widening frame

A central issue in higher education is what summative assessment is expected to cover. In the UK of the 1960s and 1970s, summative assessment focused upon academic achievements related to the subject discipline being studied. Gradually the terms of reference of assessment widened, under governmental prompting based on human capital theory, to include concepts such as enterprise and employability. Whereas it may have been relatively easy (though not as easy as some seem to believe) to classify performances at bachelor’s level in terms of academic achievement, it becomes much more difficult to do this when the assessment requirements cover a much wider spectrum of achievements, some of which may not be amenable to reliable representation in grades. Academic and vocational education have often been depicted as different within higher education, yet there are many examples of vocational programmes which are accepted as ‘academic’ (those in Medicine, Law and Engineering are three). In others, such as Teacher Education and Social Work, the perception is arguably of less academicism and more vocationalism.
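Returning to Simonite's finding in the 'Coursework v. examination' subsection above: the reason a shift of 0.7 of a percentage point in a student's mean mark can matter is that UK honours classifications are banded. The sketch below assumes the conventional 70/60/50/40 class boundaries purely for illustration; institutional rules differ in detail.

```python
# Sketch of why a small shift in mean mark can change a classification.
# The 70/60/50/40 boundaries are the conventional UK bands, assumed here
# purely for illustration; institutional regulations differ in detail.

def classify(mean_mark: float) -> str:
    if mean_mark >= 70:
        return "First"
    if mean_mark >= 60:
        return "Upper second"
    if mean_mark >= 50:
        return "Lower second"
    if mean_mark >= 40:
        return "Third"
    return "Fail"

before = 59.6
after = before + 0.7      # Simonite's average coursework effect
print(classify(before), "->", classify(after))  # Lower second -> Upper second
```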
Employability

Stephenson (1992, 1998) argued a case for ‘capability’ in higher education at a time when academics in the UK had, in the main, yet to come to terms with the government’s promotion of ‘enterprise’ through its Enterprise in Higher Education initiative. Stephenson saw capable people as possessing confidence in their ability to take effective and appropriate action; to explain what they were seeking to achieve; to live and work effectively with others; and to continue to learn from their experiences, both as individuals and in association with others, in a diverse and changing society. He made the point that capability was a necessary part of specialist expertise, and not separate from it. Capable people not only knew about their specialisms, they also possessed the confidence to apply and develop their knowledge and skills within varied and changing situations.

If Stephenson’s conception was ahead of its time, it strongly influenced thinking in the UK about employability, seen by the Enhancing Student Employability Co-ordination Team (ESECT) as

a set of achievements – skills, understandings and personal attributes – that make graduates more likely to gain employment and be successful in their chosen occupations.1
(Yorke, 2004/06: 8)

ESECT developed an account of employability, given the acronym USEM, which linked Understanding (of subject disciplines and situations); Skilful practices in context; Efficacy beliefs and personal qualities; and Metacognition. ESECT’s approach differed from Stephenson’s presentation of capability in that it was able to draw on a range of theoretical and empirical work to give it the kind of academic foundation which academics could respect.2

Competence

‘Competence’ is a term that is used widely in discussions of vocational programmes, and rather less in respect of academic programmes. A difficulty is that the meanings ascribed to it vary: in some contexts, such as in North America and in management, it is taken to refer to a personal attribute or quality (as is the related term ‘competency’), whereas elsewhere it refers to social expectations or judgements relating to performance.3 Either way, it is a social construct which is inflected with values (Kemshall, 1993; Lum, 1999; Hager, 2004a).

1 Elsewhere, the development of graduates’ employability is subsumed under ‘workforce development’ (Voorhees and Harvey, 2005), or similar terms.
2 For discussions of employability and USEM see Knight and Yorke (2004), Yorke and Harvey (2005) and the range of resources available on the website of the Higher Education Academy: the series Learning and Employability and other relevant material can be reached directly via www.heacademy.ac.uk/resources/publications/learningandemployability, and other relevant material by searching from www.heacademy.ac.uk/resources, inputting ‘employability’.
3 See, for example, Chambers (1998); Watson et al. (2002); Eraut (2004a); Hager (1998, 2004a,b).
Hager (2004a) points to the need to differentiate between three particular aspects of competence: performance and its outcomes; its underpinning constituents (i.e. capabilities, abilities and skills); and the development of people to be competent performers. In the UK, there are two variations on the theme of competence: the first is the subdivision of performance, through functional analysis, into the plethora of detailed competences that characterized the system of National Vocational Qualifications (NVQs) in the 1990s (see Jessup, 1991, for a detailed account); the second is a broader interpretation in which components of performance are ‘bundled together’.

Some see competence as subsuming more than can be ‘measured’ through assessment processes. Worth-Butler et al. (1994) exemplify this when they describe competence in terms of

the mastery of requirements for effective functioning, in the varied circumstances of the real world, and in a range of contexts and organizations. It involves not only observable behaviour which can be measured, but also unobservable attributes including attitudes, values, judgemental ability and personal dispositions: that is, not only performance, but capability.
(Worth-Butler et al., 1994: 226–227)

Slightly more narrowly, Hager and Butler (1996) describe competence in terms of the ability of a person to respond adequately to the range of demands that constitute a role. Jessup (1991: 27) makes much the same point. These descriptions of competence point to skilfulness in practice as encapsulated in the ‘S’ of USEM (see above).

The literature bears witness to debate about the theory underlying notions of competence. Hyland, in a number of publications (e.g. Hyland, 1994), argued that competence was rooted in behaviourism, but Hager (2004a) criticizes him for not differentiating between behaviour (the manifestation of competence) and behaviourism. Hager’s argument is consistent with an earlier defence of competence-based assessment (Hager et al., 1994) which offered Jessup’s (1991) approach to competence (with its strong affinity with the behavioural objectives approach espoused by Mager, 1962, and others) the prospect of redemption by suggesting that, if Jessup’s statements of outcomes of performance were construed as performance descriptors, they would possess an abstractness that took them some distance away from the narrowness perceived in them by most critics of NVQs. Hager et al. may have pressed their interpretation farther than can be justified.

Graduates entering the labour force do so, in most instances, as young professionals who are expected to be able – at least at a beginning level – to deal with the complex and ‘messy’ problems that life tends to throw at them. This complexity is far removed from the narrow approach favoured by Jessup (1991) and others concerned to decompose performance into narrowly focused skills to be demonstrated across a defined range of situations.
Although highly disaggregated competences, such as those that were introduced in NVQs, have value in developing an understanding of the dimensions of workplace performance, there is more general support for seeing ‘competence’ in much broader terms.4 Eraut (2004b: 804), for example, writes:

treating [required competences] as separate bundles of knowledge and skills for assessment purposes fails to recognize that complex professional actions require more than several different areas of knowledge and skills. They all have to be integrated together in larger, more complex chunks of behaviour.

Others have made the point that competence or competency frameworks derived from functional analysis are inadequate for assessment purposes, on the grounds that they miss some subtleties of performance (e.g. Owens, 1995, in respect of social work; Jones, 2001; Coll et al., 2002; and Cope et al., 2003, in respect of teaching; Jones, 1999, in respect of vocational education and training; and Lang and Woolston, 2005, in respect of policing in Australia). The importance of judgement in the assessment of complex achievements is emphasized by van der Vleuten and Schuwirth:

As we move further towards the assessment of complex competencies, we will have to rely on other, and probably more qualitative, sources of information than we have been accustomed to and we will come to rely more on professional judgement as a basis for decision making.
(van der Vleuten and Schuwirth, 2005: 313)

A further complication is that some (e.g. Hays et al. 2002; Schuwirth et al. 2002) draw a distinction between ‘competence’ and ‘performance’. The former represents a person’s achievement under test conditions, knowing that they are being challenged to demonstrate knowledge, attitudes and skills (and is often implicitly taken to be the best that they can achieve, though not everyone gives their best performance under the stress of formal testing), whereas the latter is what the person achieves on a day-to-day basis. ‘Competence’ in these terms might be seen metaphorically as a peak whereas ‘performance’ might be seen as a broad col below peak level. Although one might perform at peak on specific items of assessment (coursework and/or examination) and be assessed accordingly, on an extended task such as a ward placement or teaching practice the assessment will quite probably integrate over the whole of the engagement and hence come closer to signalling the day-to-day level of performance.

4 Some of the debate over competence has arisen because the protagonists have not made clear the level of analysis that they were applying in respect of the term. Hager et al. (1994) offer a spirited defence of competence-based assessment.
Norm-referenced and criterion-referenced assessment

The assessment of student achievement implies some frame of reference against which judgements are made. If reference is made to the achievements of other students (whether in the same cohort or multiple cohorts) then norm-referencing is to the fore. If the reference is to stated objectives or expected learning outcomes, then criterion-referencing5 is of key importance.

Norm-referenced assessment

Norm-referenced assessment is relativistic, in that, in typical practice in higher education, it seeks discrimination amongst students by placing their achievements in order of merit rather than by setting them against the kinds of norms that are developed for psychological and mass educational testing. An assumption sometimes made is that the observed performances are distributed approximately normally, i.e. that the frequencies of different categories of performance fit the normal distribution, and that the performances can be grouped in bands that fit a normal distribution reasonably well (one example is given below). Although this might work fairly well when the number of students is large, when the number is small the deviations from ‘normality’ can be quite noticeable and the application of a rigid approach to categorization, such as ‘grading on the curve’, would be particularly inequitable.

Whatever the institutional grading system being used (norm- or criterion-referenced), or type of grading scale, for credit transfer purposes in Europe there is a recommendation that the performance of the student is also given an ECTS grade. The official document ECTS – European Credit Transfer and Accumulation System6 emphasizes the norm-referenced nature of the system:

The ECTS grading scale ranks the students on a statistical basis. Therefore, statistical data on student performance is a prerequisite for applying the ECTS grading system. Grades are assigned among students with a pass grade as follows:
A best 10%
B next 25%
C next 30%
D next 25%
E next 10%
A distinction is made between the grades FX and F that are used for unsuccessful students. FX means: “fail – some more work required to pass” and F means: “fail – considerable further work required”. The inclusion of failure rates in the Transcript of Records is optional.
5 Some prefer the term ‘criteria-referencing’, in acknowledgement that multiple criteria are very often involved. 6 See http://ec.europa.eu/education/programmes/socrates/ects/index_en.html#5 (accessed 18 September 2006).
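The ECTS bands quoted above are defined by rank position among students awarded a pass. A minimal sketch of that norm-referenced allocation follows; the marks are invented, and the handling of ties, rounding and small cohorts is a simplification of what any real scheme would need.

```python
# Sketch of the norm-referenced ECTS allocation quoted above: passing
# students are ranked, the top 10% receive A, the next 25% B, the next 30% C,
# the next 25% D and the final 10% E. Marks are invented; ties, rounding and
# small cohorts are handled crudely here.

def ects_grades(passing_marks: list[float]) -> list[tuple[float, str]]:
    ranked = sorted(passing_marks, reverse=True)
    n = len(ranked)
    bands = [("A", 0.10), ("B", 0.25), ("C", 0.30), ("D", 0.25), ("E", 0.10)]
    allocated, i = [], 0
    for letter, share in bands:
        count = round(share * n)
        allocated += [(mark, letter) for mark in ranked[i:i + count]]
        i += count
    allocated += [(mark, "E") for mark in ranked[i:]]  # any rounding remainder
    return allocated

marks = [72, 68, 65, 63, 61, 60, 58, 55, 52, 50]
print(ects_grades(marks))
```

Run on a small cohort such as this one, the percentages translate into very few students per band, which is one reason the text above notes that rigid norm-referencing can be inequitable when numbers are small.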
The normative approach to student achievement adopted by the ECTS sits uncomfortably alongside national systems in which criterion-referencing is privileged. The tension between a ‘local’ assessment system and that of the ECTS is evident in a statement from the University of Helsinki that ‘its students are graded on their individual performances and hence the distribution of outcomes will not necessarily parallel the normal distribution of the ECTS passing grades’.7

Criterion-referenced assessment

In criterion-referenced assessment, the issue is the extent to which the student achieves the intended learning outcomes that have been specified. In some circumstances the student cannot pass the assessment unless they achieve some outcomes deemed essential: David Baume dryly makes the point that he would be unhappy to be flying with the pilot who had passed a training course without being able to demonstrate competence in landing. In criterion-referenced assessment it is theoretically possible for all students to achieve all of the intended learning outcomes and end up with ‘A’ grades, since the students know what is expected of them and can focus their efforts appropriately. In practice, however, the effect of criterion-referencing is typically to skew the distribution of grades towards the upper end of the grading scale.

The validity of criterion-referenced assessment depends upon the precision with which the criteria are stated and applied. A study of the grading of dissertations by Webster et al. (2000), discussed in Chapter 2, suggested that, whilst the specified criteria were applied in the assessment process, other criteria were also being used that had not been articulated. In other words, the use of criterion-referencing had not fully overcome the problem of the archetypal assessment item that in effect asks the student to guess what the assessor really wants.

The problem of using criterion-related assessment in a programme containing both academic and workplace activities is nicely raised by James and Hayward’s (2004) small-scale qualitative study of the education of chefs. This programme was at further education level, and locked into a tightly specified set of learning outcomes necessitated by the system of National Vocational Qualifications then in use in the UK. However, the issues raised by this study have a relevance to some aspects of higher education. Briefly, there was a tension between the formally specified learning outcomes of the academic institution and a different set of expectations in the workplace which had a lot of tacit and practical understanding behind them. A student might be successful in terms of the atomized assessment of the academic part of the programme but fail to achieve success in the much more fluid environment of the workplace. A difficulty for the trainee chefs was that the differences in the kinds of expectation were not worked through. An important issue for criterion-referencing is the balancing of criteria when the underlying (and, in the case of the workplace, implicit) models of learning are in some tension.

7 See www.helsinki.fi/exchange/credgrad.htm (accessed 17 September 2006).
The complexity of assessment 19 a high level of precision – but this applies only in particular circumstances. Criterion referencing can be more general, and hence encompass a wider range of educational achievements, but it then becomes more open to interpretation. The wider the latitude in criteria, the more difficult the issue of standards becomes. Sadler’s (2005) foray into standards shows how complex the issue of stating standards is. The rather more heroic suggestion is floated from time to time that it would be desirable to have a set of descriptors that would be common across subject areas, but its impracticability becomes apparent as soon as one thinks about differences both between subject disciplines (English Literature, Computer Science and Social Work, for instance) and even within them (examples being Nursing and Business Studies, where practical performance, quantitative calculations and discursive writing are all likely to figure). As Cope et al. (2003: 682) observe in respect of teacher education, the problem of variability in assessment will not be solved by increasing the clarity of written descriptors, since inference always obtrudes. Their point seems generally applicable. Whereas detailed discussion might in theory lead to greater shared understanding, seeking to develop a shared understanding across a system of any size, given the spread of subject disciplines involved, is a challenge that would seem to be as forbidding as that of sweeping a beach clear of sand. In practice, a fuzzy distinction The distinction between norm- and criterion-referenced assessment is not sharp, since normative considerations bear upon the criteria that are selected. In assessing, it is unlikely that, where judgement is required about the extent to which a student has achieved the expected outcomes, assessors can wholly detach themselves from normative assumptions. Indeed, the balance between norm-referencing and criterion-referencing in an assessment is often left implicit and is hence unclear. However, as is argued in Chapter 2, the balance struck in the various curricular components could have implications for the overall grade awarded to a student on completion of their programme of study. Entry to postgraduate medical education in the UK involves a mixture of norm- and criterion-referenced assessment. Although bachelor’s degrees in Medicine are typically not classified, graduates from each medical school are placed in quartiles, which are given points scores of 30, 35, 40, and 45. In applying for the Foundation Programme (which bridges between medical school and specialist/general practice training) they must answer a series of key questions, and their answers are scored by panels against nationally-determined criteria to produce a total out of 40. The maximum score obtainable is therefore 85 points. The higher the score, the better are an applicant’s chances of obtaining their preferred location for the Foundation Programme.8 The fuzziness of the distinction between norm- and criterion-referencing pervades assessment practice. Modules and programmes differ in the balance that 8 See www.mtas.nhs.uk/round_info/53/score_detail.html (accessed 26 November 2006). I am grateful to Peter McCrorie for directing me to this material.
they strike between the two, which makes both comparisons and the cumulation of assessment outcomes problematic. The ‘big picture’ is more Jackson Pollock than Piet Mondrian.
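To make the mechanics of norm-referencing concrete, the sketch below applies the ECTS distribution quoted earlier in this chapter (A best 10%, B next 25%, C next 30%, D next 25%, E next 10%) to a small cohort of passing students. It is a minimal illustration only: the cohort data are invented, and the treatment of ties and of band boundaries that do not fall on whole students is a local implementation choice that the ECTS document does not specify.

```python
# Cumulative shares of passing students for ECTS grades A-E, taken from the
# ECTS document quoted above.
ECTS_BANDS = [("A", 0.10), ("B", 0.35), ("C", 0.65), ("D", 0.90), ("E", 1.00)]

def ects_grades(passing_marks):
    """Assign ECTS grades to {student: mark}, for students who have passed.

    Students are ranked by mark (highest first); the top 10% receive A,
    the next 25% B, and so on. Tie-breaking and rounding at band boundaries
    are not prescribed by the ECTS document and are treated here as local choices.
    """
    ranked = sorted(passing_marks.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ranked)
    grades = {}
    for position, (student, _mark) in enumerate(ranked, start=1):
        share = position / n
        for grade, cumulative_share in ECTS_BANDS:
            if share <= cumulative_share:
                grades[student] = grade
                break
    return grades

cohort = {"s1": 82, "s2": 74, "s3": 68, "s4": 67, "s5": 61,
          "s6": 58, "s7": 55, "s8": 52, "s9": 49, "s10": 41}
print(ects_grades(cohort))
# With only ten students the 10/25/30/25/10 split cannot be matched exactly:
# here the bands come out as A:1, B:2, C:3, D:3, E:1.
```

With a cohort of this size the prescribed percentages cannot be realized exactly, which echoes the point made above about the inequity of applying a rigid curve when student numbers are small.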
Technical aspects of assessment Traditionally, assessment ideally has to satisfy a number of criteria (Table 1.2). In addition, an assessment method needs to be low in ‘reactivity’ – that is, its use should not unduly distort the student’s behaviour. Assessments almost always influence student behaviour, so some reactivity is inevitable: however, if the answer to the third question against ‘Reliability’ were a strong ‘yes’, then clearly the assessment would be exhibiting an undue reactivity. Where the assessment is ‘high stakes’, as in most summative assessment, validity and reliability have to be sufficiently high to command confidence. Where the stakes are lower, the reliability criterion can be relaxed to some extent. If reliability drops too far, then the validity of the assessment is seriously compromised. This is an over-simplification of the situation, however, since both validity and reliability can be invested with different emphases (as is implicit in the illustrative questions in Table 1.2). They can also be treated in considerably greater depth than is possible here, as has been done for validity by Messick (1989) and for reliability by Feldt and Brennan (1989).
Table 1.2 Criteria for assessment, and some related questions

Criterion: Illustrative question(s)

Validity: Does the assessment deal with what we think we are assessing? Are we considering judgements of past performance or predictions of future performance?
Reliability: Do assessors agree on their judgements/gradings? Is the student’s performance replicable? Is the student’s performance a ‘one-off’ produced solely for the assessment occasion?
Fairness: Does the chosen assessment approach disadvantage particular groups of students?
Efficiency: Is the desired assessment approach valid, reliable and also affordable? Is the ratio of benefits to costs acceptable?
‘Cheat-proofness’: How vulnerable to plagiarism and other forms of cheating are the assessment methods? How has vulnerability to cheating been minimized?
Generalizability: To what extent can the results from assessment be generalized to different situations?
Utility: Are the assessment outcomes useful to those who will be drawing on them?
Intelligibility: Are the assessment outcomes readily understood by those who need to rely on them?
The complexity of assessment 21 Validity Validity is socially determined, in that it reflects the preferences of interested parties – or, rather, of those who exert power in respect of the assessment system. Where there is more than one focus of interest, validity is multi-valued according to the party making the judgement. Hence an assessment that is a valid representation of a student’s good performance on a degree programme will not be construed as valid by an employer who finds the student inadequate in the workplace. Cronbach and Meehl (1955) offered a fourfold categorization of validity in psychometric testing: 1 predictive, in which the candidate’s subsequent performance is set against the outcomes of the test that has been taken; 2 concurrent, determined by the correlation between the performances on the test and another, previously established, test; 3 content, which depends on the extent to which the test can be shown to be a representative sample from the relevant universe of possible test items; 4 construct, which depends on the extent to which the test can be taken as a measuring instrument for a particular attribute or quality. Assessment in higher education is generally not concerned with concurrent validity, because there is rarely a parallel ‘test’ that can act as a yardstick. The most prevalent aspect of validity is content validity, since curriculum designers have to show that the assessments align (in Biggs’, 2003, sense) with the expected learning outcomes, the subject content, and the pedagogical methods that have been chosen. The challenge for curriculum designers is to show that the sampling of the curricular intentions in the assessment demand is representative. Sometimes this sampling is insufficient to satisfy penetrating scrutiny, but perhaps sufficient to justify the use of a cognate and arguably weaker term, ‘face validity’. Construct validity overlaps to some extent with content validity, if one makes the assumption that the demonstration of content mastery is a representation of academic ability, intelligence, and so on. There is some evidence that employers invest assessments with meanings that may not be merited, such as conscientiousness (Pascarella and Terenzini, 2005). Employers and others also implicitly make assumptions about the predictive validity of assessments – as when, in the UK, they often make an upper second class honours degree a marker for acceptability. What is missing from the list produced by Cronbach and Meehl – and generally from textbooks dealing with validity – is any conception of validity in terms of the capacity of the test to reveal something unexpected. An apparently unwitting example of this can be found in Fransella and Adams’ (1966) study of an arsonist, in which repertory grids revealed to the researchers something that they had previously not considered.9 In assessing students’ performances, there may be a need to go beyond the prescribed assessment framework in order to accommodate an aspect of achievement that was not built into the curriculum design (creativity is 9 This is discussed more fully in Yorke (1985).
22 The complexity of assessment particularly challenging in this respect). This does, of course, create problems for grading systems, especially when they are seen as measuring systems. Reliability One can produce highly reliable assessments where tasks admit of unambiguously correct responses, as is the case with well designed multiple-choice questions, or when they are trivial in character. Gibbs and Simpson (2004–5: 3) point out that ‘The most reliable, rigorous and cheat-proof assessment systems are often accompanied by dull and lifeless learning that has short-lasting outcomes’ – surely undesirable collaterals for a higher education worthy of the name. Van der Vleuten et al. (1991) suggest that some subjective judgements can be at least as reliable as assessments based on more ‘objectified’ methods. Where the assessment demand is complex, it is more difficult to demonstrate high reliability. Assessors may disagree on the merits of the performance, even when they have been required to take part in exercises designed to narrow the variability of interpretation (Brooks, 2004, reviews a number of studies of assessment in public examinations in which inter-marker agreement was lower than would be expected for a ‘high stakes’ examination). Even when assessors have undergone training for particular assessment tasks, the reliability of assessment can be lower than for a multiple mark (Britton et al., 1966). Double marking is often used in higher education. Dracup (1997) offered four reasons: 1 to minimize the effect of random distortions in marking; 2 to promote shared standards as colleagues mark across the subject discipline (in this case, Psychology); 3 to allow for the possibility of a different perspective on the piece of work being assessed (although discrepant perspectives invite concern about reliability); 4 to provide data that can enable the reliability of the assessment process to be evaluated. However, Cannings et al. (2005), who undertook two studies of double marking, found that it would need an impractical number of markers to reach an acceptable level of reliability. Newton (1996) showed that, whereas high reliability could be achieved in the assessment of General Certificate of Secondary Education (GCSE) scripts in Mathematics, the marking of scripts in English was more troublesome. This is not surprising, since the Mathematics scripts required responses that were essentially correct or incorrect, and hence there was little scope for interpretation on the part of the assessor. Newton wrote: in English the task is generally, not so much to distinguish right from wrong, but to evaluate the quality of the work. Where quality is to be assessed there is more emphasis on interpretation, and with more scope
The complexity of assessment 23 for interpretation comes more scope for genuine difference of opinion. No marking scheme or standardisation meeting could ever hope to arbitrate for every possible difference of opinion regarding a candidate’s work. Hence it is inevitable that the reliability of marking in such subjects will be lower. (Newton, 1996: 418) Newton remarked that a public examinations board had to make a trade-off between reliability and cost-effectiveness. Although it could, in theory, enhance reliability by multiple marking, the constraints of time and resources militated against doing more than single marking with trained markers, backed up by a hierarchy of more senior assessors able to ensure a normative approach across the whole body of assessors.10 The more ‘open’ the task is, the more difficult it is to demonstrate high reliability, because different assessors could weight any stated broad criterion in different ways, as was shown in Huot’s (1990: 250ff) review of the research that had studied influences on the rating of writing. The same point was strongly implicit in comments by Mike Cresswell, now Director General of the examining board Assessment and Qualifications Alliance (AQA) in the UK but previously a contributor to the research literature (Cresswell, 1986, 1988), when, in the context of a news item on appeals for regrading in school examinations, he observed that in assessing ‘there is room for legitimate difference of opinion which will lead to some differences in marks’.11 Indeed, Wiseman (1949: 206, n3) had earlier argued, with reference to the assessment of pupils at age 11, that the lack of a high intercorrelation between assessors was desirable, since it illustrated a diversity of viewpoint in the judgement of complex material: the total mark of four independent assessors would ipso facto give a truer depiction of the individual pupil’s achievement. This would be similar to the methodology used in judging performance in sports such as diving, gymnastics and ice skating, in which scores from multiple assessors are combined (save for those from the most extreme assessor at either end).12 Multiplicity of assessment on these kinds of scale, however, is generally impracticable in the circumstances of higher education. Where a second marker is privy to the first marker’s gradings, then inter-marker reliability is likely to be high in statistical terms. However, the situation injects a bias in favour of the first marker’s assessment,13 and so the validity of such a reliability figure is doubtful. When second marking is ‘blind’, the inter-marker reliability is potentially lower. If double marking of student work lacks adequate 10 Newton (1996: 406) outlines the hierarchy of examiners that was adopted in the UK at the time of his study for publicly organized examinations for school pupils. 11 On the BBC Radio 4 Today programme, 25 November 2006. 12 As anyone who is acquainted with the outcomes of Olympic panel judgements will be aware, the ‘trimming’ of outliers does not necessarily resolve problems of intentional bias, misperception of what the athlete was attempting, and so on. Looney (2004) has explored the potential of Rasch analysis to identify atypical behaviour in the judging of sporting achievement. 13 Though this could be affected by a power differential between assessors.
reliability, then questions are implicitly raised about awarded grades and the combination of grades (Chapters 2 and 6). Where complex behaviour is being assessed, as in the interaction between a medical student and a ‘standardized patient’, holistic judgements of the student’s professionalism can produce quite respectable reliabilities (e.g. Hodges et al., 1997; Regehr et al., 1998; Friedlich et al., 2001) that have in some instances been shown to be higher than those from checklist-based assessments. Holistic judgements should not automatically be eliminated from the assessor’s armoury because of presumptions of unreliability.

Although high validity is important in assessment, it can be compromised by poor reliability. Some of the potential problems with reliability can be minimized by standardizing the conditions under which assessment takes place. The scoring process can also be standardized through mark schemes or computer-marking programs. This is why procedures in public examinations and psychometric testing are tightly controlled. The aim is that no student is unduly advantaged or disadvantaged. This works up to a point, but students who suffer from a disability – dyslexia, say – may need more time than other students in order to have an appropriate chance of demonstrating their capabilities. The achievement of ‘fairness’ (see below) is a complex matter.

Variability in grading

It is well understood that grading, except in some specific circumstances, is subject to variability which stems from the assessor’s interpretation of the assigned task and the criteria against which the student’s performance is to be judged, the assessor’s alertness, the assessor’s knowledge of the student, and so on. It is also a ‘local’ process (Knight, 2006) unless an external body is involved, which means that gradings in one institution may not be replicated were the assessments to take place in another. Pascarella and Terenzini (1991: 62–63; 2005: 65–66) list a number of variables that affect performance, which are combined here as:

• nature of the institution;
• major field of study;
• predominant mode of course implementation;
• academics’ attitudes to, and policies for, course grading;
• status of the teacher/assessor;
• pedagogic style and personality;
• situational constraints such as stress and workload.
Acknowledging that the grades awarded to students are influenced by academic ability and intelligence, Pascarella and Terenzini (1991: 388) note that they are also influenced by a range of student-centred considerations, including motivation, organization, study habits, quality of effort, and the dominant peer culture. Later studies have emphasized programmatic interventions of various kinds, not least because of their significance for retention and completion (Pascarella and Terenzini, 2005: 398ff).
The complexity of assessment 25 With large-scale public examinations, such as those for the General Certificate of Secondary Education and Advanced Level in England, Wales and Northern Ireland (and cognate examinations in Scotland), the ‘error variance’ is reduced by training the examiners and using comparisons of graded scripts to refine the collective understanding of norms. In higher education, where cohort numbers are much smaller and assignment briefs are interpreted variably, the procedures adopted by the examining boards are impractical. Training workshops for assessors are a partial response that can help to reduce some of the variation in the grading process. Hence the reliability of grading remains a problem in summative assessment. Other criteria regarding assessment Fairness Fairness, as Stowell (2004) shows, is a complex issue involving considerations of both equity (in which ‘fairness’ is not equivalent to ‘sameness’) and justice. Stowell discusses the implications with reference to two ‘ideal types’ of approaches to academic standards, equity and justice, one based on professional judgement and the other on outcomes, whose implementation in assessment boards can be far-reaching for some students. Her article demonstrates that homogeneity in assessment practice can actually prejudice fairness. Waterfield et al. (2006) report some developments on this front. Efficiency Efficiency refers basically to the benefit/cost ratio of assessment, and by extension to the amount of resource that can reasonably be utilized for the purposes of assessment. An ideal assessment – for example, multiple observation in a group environment – may be impractical on the grounds that it would consume too high a proportion of the resources available. Considerations of efficiency impact on the strength of the warrant that can be attached to assessments. ‘Cheat-proofness’ Reliance cannot be placed on assessments that are vulnerable to cheating. With concerns about plagiarism rising, it is not surprising to find institutions reverting to assessments conducted under examination conditions. Carroll and Appleton (2001) indicate some more imaginative approaches to the problem. Generalizability Generalizability presents a number of challenges, especially if the ‘local’ nature of assessment (Knight, 2006) is taken into account. What a student can do in one situation is not necessarily transferable to another. Campbell and Russo (2001) observe that, whereas differences between individuals can be reliable in very
26 The complexity of assessment specific settings, once the setting is varied the reliability of such differences may not be maintained: setting × person interactions are dominant. Further, the notion that ‘skills’ are transferable across settings has been questioned by Bridges (1993). Although it is in theory possible to assess a student’s achievements across a variety of situations, the practicability of this falls into question once the issue of efficiency (see above) is brought into consideration. Utility and intelligibility Utility and intelligibility refer mainly to the needs of stakeholders from outside higher education who need to make sense of what students and graduates claim to have achieved. A grade-point average (GPA) or an honours degree classification has a utility value (largely for selection purposes). It is doubtful whether its utility value extends much beyond this. Although some stakeholders may believe that a GPA or degree classification constitutes an intelligible index of a person’s ability, such belief is – as this book aims to show – largely misconceived.
Measurement or judgement? Shepard (2000) observed that, although approaches to learning had moved in the direction of constructivism, approaches to assessment had remained inappropriately focused on testing, which she saw as a legacy of the behaviourism strongly promoted during the twentieth century, with its connotations of scientific measurement and social efficiency. Although testing may have been prominent in parts of programmes, it is doubtful that its role was ever as overweening as Shepard implies, since students in higher education have always been expected to undertake some work which involves imagination, creativity and/or coordination of knowledge that extends beyond the compass of mere testing. Shepard is on stronger ground when she refers to measurement, since there has been a persistent use of grading systems that are implicitly (though, as this book will argue, erroneously) treated as measuring systems, with all that implies for the mathematical manipulation of ‘measures’. The issue of whether achievements can be measured surfaces most strongly where the expectations of students include aspects of performance as a beginning professional, though it pervades the whole of higher education. Hager and Butler (1996) distinguished between educational assessment as scientific measurement and educational assessment as judgement, seeing these as manifestations of two different epistemological approaches. The contrasting perspectives, labelled by Hager and Butler as ‘models’, are shown in Table 1.3 as Weberian ‘ideal types’ which provide the outer edges of a frame within which the characteristics of assessment can be discussed. Neither of the ideal types is likely to occur in pure form in actual practice, and there is perhaps a risk that assessment will be construed in over-polarized, Manichaean terms. Hager and Butler contrast an impersonal, theory-led approach with one in
Table 1.3 Two contrasting models of educational assessment

Scientific measurement model: Practice derived from theory
Judgemental model: Practice and theory (loosely) symbiotic

Scientific measurement model: Knowledge is a ‘given’ for practical purposes
Judgemental model: Knowledge is understood as provisional

Scientific measurement model: Knowledge is ‘impersonal’ and context-free
Judgemental model: Knowledge is a human construct and reflects context

Scientific measurement model: Discipline-driven
Judgemental model: Problem-driven

Scientific measurement model: Deals with structured problems
Judgemental model: Deals with unstructured problems
Source: from Yorke (2005), after Hager and Butler (1996).
which context and human engagement are acknowledged as key influences, with theory playing more of an interpretive role. One is reminded of the distinction made by Gibbons et al. (1994) between ‘Mode 1’ and ‘Mode 2’ approaches to knowledge generation, the former being driven through separate disciplines, the latter by multidisciplinary engagement with ‘messy’, ‘real life’ problems. There is a partial correlation with the distinction between realism and relativism, with the scientific measurement model being closer to realism and the judgemental model being closer to relativism (Table 1.4 likewise presents the distinction in terms of ‘ideal types’). As with Table 1.3, the distinction is not as cut and dried as presented. One might press the correlation a little further, by pointing out the respective connections to correspondence and coherence theories of truth.

Table 1.4 Realist and relativist perspectives on assessment

Realist: Standards are objectively defined
Relativist: Standards are normative, consensual

Realist: Performances can be measured against these standards
Relativist: Performances are assessed with reference to these standards

Realist: The assessor/judge is objective, dispassionate
Relativist: The assessor interprets the extent to which the performance relates to the standards

Realist: Values play no part in assessment
Relativist: Value positions are embedded in the norming of standards

Realist: Considerations relating to students’ situations play no part in assessment
Relativist: Assessor may take into account students’ and/or institutional circumstances

Realist: Explicit assessment rubrics
Relativist: Broad statements of expectation

Realist: Measurements are true and reliable representations of achievement
Relativist: Assessments are judgements of the extent to which achievements relate to expectations

Realist: Tasks set by assessors/examiners
Relativist: Tasks selected by students to suit their strengths and interests

The scientific measurement model is not without its uses. For example, students need to demonstrate knowledge of facts and principles; to be able to say whether calculations give sensible answers or whether actions are consistent with espoused principles; and to be able to perform necessary routines such as constructing financial accounts or conducting analyses of variance. There is an essential corpus of knowledge and application that has to be acquired. However, these activities are set in a human context, and the demands of practical action in the world beyond the academy move the focus of attention towards an area where the application of formulaic routines is inadequate. Whereas an academic exercise might lead towards an ideal outcome, in employment or in life more generally the most satisfactory outcome might be the best result possible in the prevailing circumstances (in which information is incomplete and/or rather rough and ready) and not the best possible result in the abstract. Success in ‘real world’ situations is often more a matter for judgement than for measurement, as Cope et al. (2003) observe in connection with the assessment of teaching practice, and Ashworth et al. (1999) likewise remark regarding nursing. Since a high proportion of assessments under the ‘scientific measurement’ model involves at least some degree of inference, the scientific measurement model is more open to challenge on this ground than is the judgemental model (in which it is inbuilt), and hence is of limited utility. It is perhaps no surprise that recent discussion of the honours degree classification in the UK has included a consideration of assessment approaches that move away from what is implicitly a kind of measurement towards the inclusion of ‘softer’ methods of recording achievement such as progress files, portfolios and transcripts (see Chapter 9).
Subject benchmarks Subject benchmarks have been developed under the aegis of the Quality Assurance Agency for Higher Education in the UK. These state expectations of student achievement in the subject areas covered, in terms that hover between the realist and relativist descriptions of Table 1.4. At the time of writing there are benchmark statements for some 70 subject areas at undergraduate level and three subjects at postgraduate (master’s) level. They are not statements of academic standards, but act as reference points for curriculum design and implementation, and for employers and others who need to appreciate what they can expect graduates to know and to do. In some instances, the benchmarks have been developed with the involvement of the relevant professional or statutory body. The benchmarks, however, are not uncontentious. They are essentially pragmatic documents developed out of the normative knowledge and understanding possessed by members of the relevant subject community. They lack an explicit theoretical rationale with respect to the epistemology of the subject discipline and to pedagogy. Since they have been developed within subject communities,14 their 14 In some cases, the notion of subject community has been stretched considerably – particularly so in the coalescence of Hospitality, Leisure, Sport and Tourism within a single, if subdivided, subject benchmark statement.
The complexity of assessment 29 terminology and meanings may vary across statements: does ‘critical thinking’, for instance, mean the same in Engineering and in English? The subject-centred approach to development has resulted in very different approaches to the content of the statements, as Yorke (2002a) showed in respect of the first set of statements to be published. These statements varied in the extent to which they addressed expected levels of student achievement – ‘threshold’, typical or ‘modal’ (by which was meant the level expected of the most frequent honours degree classification, typically an upper second class degree), and ‘excellent’. They also varied in respect to which they addressed the levels of the original Bloom (1956) Taxonomy of educational objectives: particularly noticeable was the apparent lack of the creative dimension (in Bloom’s terms, ‘synthesis’) in 11 of the 25 subject benchmark statements analysed.15 If this suggests that performances have to be interpreted with respect to the subject discipline, a further complication arises when the institutional dimension is taken into account. Institutions take different approaches to the development of programmes of study, reflecting the kinds of students they hope to enrol and the kinds of employment outcomes they envisage for these students when they graduate. So Biological Sciences, for example, may be very academic in one institution and much more applied in character in another. Neither is necessarily ‘better’ than the other – they are seeking to achieve different kinds of outcome. This potential for variation along two dimensions – the subject discipline and the institution – points up the necessity of interpreting student performances with reference to the conditions under which they were achieved, i.e. their local contexts (Knight, 2006). To add to this, the marking method chosen contributes to the outcome as represented in a grade. Knight and Yorke (2003: 47) point out that the kinds of marking template suggested by Walvoord and Anderson (1998) are subject to the teacher’s decision on the relative valuing of assessed components. An example given by Walvoord and Anderson of ‘primary trait analysis’ suggests equal weighting for the various components of the report of an experiment in Biology: others might choose to privilege some aspects of the reporting over others. There is a further point to be made here. In interpreting student achievements, the interpreter needs to ascertain the extent to which a student has received guidance towards producing an outcome. In some instances the student is shown the steps that need to be taken to achieve a particular outcome and obtains a good result by merely following the guidelines. The problem being dealt with is, in the terminology of Wood et al. (1976), heavily ‘scaffolded’. In other circumstances, a student may have had to work out the problem without much in the way of scaffolding. Yet both might attract similar grades for their output, and the signalling power of the grade is too weak to convey the considerable difference in the two students’ achievements.
15 There were 22 actual subject benchmark statements, but two of them were split to give 25 statements in practice for analysis.
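The point made above about the relative valuing of assessed components can be shown with a small calculation. The component names, marks and weights below are invented for illustration and are not drawn from Walvoord and Anderson’s example; the sketch simply shows that the weighting decision alone can move the same profile of marks across a grade boundary.

```python
# Marks (out of 100) awarded to one student for the components of a piece of
# assessed work. The component names, marks and weights are invented.
component_marks = {"design": 72, "analysis": 48, "interpretation": 55, "presentation": 80}

def weighted_total(marks, weights):
    """Combine component marks using weights that sum to 1."""
    return sum(marks[component] * weight for component, weight in weights.items())

equal_weights     = {"design": 0.25, "analysis": 0.25, "interpretation": 0.25, "presentation": 0.25}
analysis_weighted = {"design": 0.20, "analysis": 0.40, "interpretation": 0.25, "presentation": 0.15}

print(round(weighted_total(component_marks, equal_weights), 2))      # 63.75
print(round(weighted_total(component_marks, analysis_weighted), 2))  # 59.35
```

If, say, 60 per cent marked the boundary between two grade bands, the same profile of component marks would fall on different sides of it purely as a result of the assessor’s weighting decision.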
Grades as predictors Having reviewed a very extensive range of studies (predominantly from the US), Pascarella and Terenzini (2005) conclude that, despite the variability in computing grade-point averages, undergraduate grades are the best predictors of future success, whether success is construed in terms of persistence or academic attainment. They note that grades awarded in the first year of study may be particularly important in reducing the incidence of withdrawal or intercalation, and increasing the chances of graduation within a reasonable time. Although Pascarella and Terenzini have a point, there is a lot of unexplained variance. Sternberg’s (1997) discussion of ‘practical intelligence’ suggests where some of the unexplained variance might be identified – the ability to achieve success in ways which do not necessarily correlate with academic achievement. Hudson’s (1967) discussion of the success of ‘distinguished Englishmen’ includes a note that ‘poor degree classes were quite frequent’ in their curricula vitae (ibid.: 133). His argument relates to the inability of conventional education (and by inference the assessment methods used) to detect characteristics in the individual that may be harbingers of a successful career. This book could be seen as a search for some of the unexplained variance in the predictive value of grades. Some, it will be argued, can be found in technical weaknesses in grading methodology, and some in the incapacity of grading to represent achievements in a way that will be optimally useful to those who use grades as indicators.
Chapter 2
Grading and its limitations
Grading is a complex activity

Grading is not an issue that attracts a great amount of reflective thought in higher education. Whereas publications that deal with the practice of grading gain a fair amount of exposure, the few that subject grading to critical review receive far less attention. This chapter shows that grading is – to put it mildly – neither simple nor without problems, and foreshadows the discussion in later chapters of the implications for the recording and reporting of student achievement.
Critical comment

Over the years there has been a trickle of critical descriptions of grades and grade-point averages, sometimes expressed in colourful terms, such as the following:

An inadequate report of an inaccurate judgment by a biased and variable judge of the extent to which a student has attained an undefined level of mastery of an unknown proportion of an indefinite material.
(Dressel, 1983, quoted in Milton et al., 1986: 23 and 212)

A grade is a unidimensional symbol into which multidimensional phenomena have been incorporated, a true salmagundi [i.e. hotchpotch].
(Milton et al., 1986: 212)

Grades are inherently ambiguous evaluations of performance with no absolute connection to educational achievement.
(Felton and Koper, 2005: 562)

The colourfulness of the language should not distract attention from the serious points being made. There are two issues that seem generally to afflict the practice of grading: a misplaced – perhaps implicit – belief that grading can be precise, and reporting systems that are also built upon a misplaced assumption of precision. In this
chapter the first of these issues is addressed; the second is discussed in Chapters 6, 8 and 9.

Dale (1959), with a holistic perspective on grading, questioned the capacity of assessors to grade with precision, and criticized their failure to take account of research that indicated that accuracy in assessment was difficult to achieve.

The calm assurance with which lecturers and professors alike believe that they can carry around in their heads an unfailingly correct conception of an absolute standard of forty percent as the pass line, is incomprehensible to anyone who has studied the research on the reliability of examinations.
(Dale, 1959: 186)

Kane’s (1994) discussion of passing scores and performance standards illustrates that what is often implicitly claimed for the accuracy of a mark or grade proves to be illusory. With articles such as Kane’s in mind, Dale’s later point gathers force:1

There is more than a touch of the ironical in a situation where university staff, often research leaders in their own field, either ignore or are ignorant of research which directly affects their own work
(Dale, 1959: 191)

Much of what is graded in higher education relates to achievements that are multidimensional, as grading rubrics for assignments quickly reveal. Although the grading process can be subdivided into component parts, this does not necessarily resolve the difficulties, as is argued later in this chapter and in Chapter 8. Even when an item of performance being graded can be graded unambiguously, there are always questions to be asked about the sampling of items and the combination of item scores – matters in which the value judgements are often left unarticulated.

Ebel and Frisbie (1991), drawing on work by Stiggins et al. (1989), point to three aspects of grades in which there are deficiencies:

• clarity regarding what grades mean;
• adequacy of an evidential base (sufficiency, relevance and objectivity) for the assignment of grades;
• generally accepted definitions of what the various grades in use actually mean.
They claim that grading standards and the meanings of grades vary at all levels of the educational system – the individual teacher, the course, the department and the school. In other words, grading is inherently unreliable and subject to value 1 Ashby (1963) is more often quoted as making this point.
Grading and its limitations 33 judgements. There are no grounds for appealing to research findings for answers to questions such as ‘What should an A grade mean?’ and ‘What percent of the students in a class should receive a C?’ (Ebel and Frisbie, 1991: 266). Although Ebel and Frisbie are focusing on grading in high school, their points are of significance for higher education – especially in a climate in which the issues of standards and their comparability are invested with political interest.
Grading scales

One advantage percentage grades have over letter grades is they are simple and quantitative, and they can be entered as they are into many statistical calculations. [. . .] Percentage grades are often interpreted as if they were perfectly valid and reliable and without measurement error.
Division of Evaluation, Testing and Certification (2000: 44)

The numbers associated with letter grades do have the properties of numbers, however. They can be manipulated through the use of arithmetic processes. Also, one can convert letter grades to numbers or numbers into letter grades as the need arises.
(Ibid.: 45) [the distinction is made with qualitative, ‘rubric’, scores]
34 Grading and its limitations noted by Dalziel. In the US, the conversion of the variants of the A to F scale into a scale running from 4 to 0, with divisions to a number of decimal places, invites calculations to a precision that the quality of the data simply does not merit – an invitation readily accepted in calculations of GPA. The reverse conversion, of numbers – typically percentages – into letters, offers a broad indication of level of performance, and carries the implicit acknowledgement of the inherent fuzziness of educational measurement. Ebel and Frisbie (1991: 272) suggest that the use of letter grades implies evaluation rather than measurement. The distinction is overdrawn, since much that might be labelled ‘measurement’ in assessment is actually judgemental, and hence evaluative. Perhaps more contentious is their implicit assumption (no doubt coloured by Ebel’s background in psychometrics) that student achievements can be ‘measured’ in a manner analogous to approaches used in the sciences. The percentage scale, of 101 categories ranging from 0 to 100, is widely used in higher education in Anglophone countries, though the grading bands typically used in the US, Australia and the UK are rather different, with the bands tending to be highest in the US and lowest in the UK. A mark of, say, 70 per cent might attract a letter grade of ‘D’ in the US, a signification of upper or lower second class level in Australia (depending on the university), and be regarded as just within the ‘first class’ band in the UK (see Chapter 3). Within national expectations, percentages are reached via marking schemes that vary for a number of important reasons including the subject tradition; the assessment specification for the work in question; and the predilections of the individual marker. In the UK it is widely appreciated that, in some subjects, percentages above about 80 are hard to obtain.3 This could disadvantage students when percentages are run through the degree-award algorithm, because of a ‘ceiling effect’ that limits the chances of an outstanding mark counterbalancing marks a little below the borderline between first and upper second class honours. Bridges et al. (1999) produced evidence in support of their argument that the use of percentages could disadvantage some students, depending on the subjects they were studying. Whereas this evidence (derived from grading scales considerably shorter than the percentage scale) showed that percentages could disadvantage at the level of the module or study unit, the effect does not seem to carry across to the level of the degree award (Yorke et al., 2002). The reasons for this discrepancy are unclear, but may be attributable to the way in which the honours degree classification is determined. There is a view that the ceiling effect would be mitigated if performance descriptors were attached to grading levels. This could be expected to weaken the reluctance of some assessors to use the full range of the scale, and to allow them to better assess student performances against levels that would be expected at their stage of the higher education programme. In part because of considerations of this sort, some institutions in the UK have adopted linear scales of between 16 and 25 3 The same applies at the bottom of the percentage scale, but is generally of little practical importance since the discrimination of levels of failure below those for which compensation is available is irrelevant.
Grading and its limitations 35 points in length which are often criterion-referenced, though the descriptors are very general and have to be interpreted at a local level within the institution. The grading scale in use at the University of Wolverhampton, for example, runs from A16 (outstanding performance) down to D5 (satisfactory performance), E4 (compensatable fail) and ultimately to F0 (uncompensatable fail).4 The letters reflect broad bands of achievement, whereas the numbers represent a finer calibration: for passing performances, each band subsumes three numerical grades and hence C10, C9 and C8 all signify ‘average to good’. Middlesex University uses a 20point scale, with the highest grade being 1 and the lowest 20. The grading scale adopted for undergraduate programmes at the University of Derby, although sharing with others the provision of broad descriptors against grade levels, is of particular interest in that the relationship between report grade and numerical grade is non-linear (Table 2.1). At the middle of the Derby scale the report grade is stretched out relative to the numerical grade scale, thus accentuating the effect of relatively small differences in the grades awarded for work. At the time of writing, the grading scale was under review. In discussing grading practices, an important distinction is that between the mark that is awarded for a piece of work (often a percentage) and what is recorded for the purposes of reporting achievement. In the US, the typical procedure is for percentages to be converted into letter grades (often with + and – affixes) and then into grade points. In the UK, percentages or other grades are converted into the categories of the honours degree classification (or, at school level, into grades in the national examinations). Reporting grades in different higher education systems vary considerably, as Karran (2005) has shown for a variety of European systems: Table 2.2 gives some indication of the variation. Reporting grades may even differ within a system, as is currently the case in Sweden, though alignment with the European Credit Transfer and Accumulation System is likely to press systems to greater harmonization (see Buscall, 2006).
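The double conversion described above (percentages into letter grades, and letter grades into grade points) can be sketched in a few lines. The band boundaries and grade-point values used here are illustrative of common US practice rather than any particular institution’s rules, and the + and – affixes are omitted for brevity.

```python
# An illustrative US-style conversion; actual band boundaries and the use of
# plus/minus modifiers vary between institutions.
def letter_grade(percent):
    if percent >= 90: return "A"
    if percent >= 80: return "B"
    if percent >= 70: return "C"
    if percent >= 60: return "D"
    return "F"

GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

marks = [91, 82, 80, 79, 71]                 # five module percentages for one student
letters = [letter_grade(m) for m in marks]   # ['A', 'B', 'B', 'C', 'C']
gpa = sum(GRADE_POINTS[letter] for letter in letters) / len(letters)
print(letters, round(gpa, 2))                # ['A', 'B', 'B', 'C', 'C'] 2.8
```

A one-mark difference (79 against 80) becomes a whole grade point, yet the resulting GPA is routinely reported to two decimal places, a precision that the underlying judgements do not possess.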
Is there an optimum length for a grading scale? Grading scales used in the assessment of student work range from two points (pass/fail, or satisfactory/unsatisfactory), through short sequences of letter grades (such as those used for reporting in the US and Australia) and longer sequences of some 16–25 letters and/or numbers (used in some institutions in the UK), to 101 (where the percentage scale ranges from 0 to 100). A derived scale, such as the grade-point average, can notionally have a larger number of scale points – for example, the use of two decimal places in reporting GPAs implies a scale of 401 divisions (0.00 to 4.00). There are two contrasting approaches to scale length: the first approach favouring few divisions in the grading scale, the second favouring many (see, amongst others, Ebel, 1969; Please, 1971; Willmott and Nuttall, 1975; Cresswell, 1986; 4 See ‘Introduction to academic regulations’ at www.wlv.ac.uk/default.aspx?page=6932 (accessed 18 September 2006).
Table 2.1 Grading of undergraduate work at the University of Derby

Grade descriptor                                 Numerical grade   Report grade   Honours classification
Outstanding, exceptionally high standard         24                A+             First class
Excellent in most respects                       22                A              First class
Very good to excellent                           18                A–             First class
Very good standard                               17                B+             Upper second
Very good in most respects                       16                B              Upper second
Good to very good                                15                B–             Upper second
Good standard                                    14                C+             Lower second
Good in most respects                            13                C              Lower second
Satisfactory to good                             12.5              C–             Lower second
Satisfactory standard                            12                D+             Third
Satisfactory in most respects                    11                D              Third
Satisfactory: minimum pass standard              10                D–             Third
Unsatisfactory: some significant shortcomings    9                 MP             Fail
Unsatisfactory: some serious shortcomings        7                 MP–            Fail
Very poor but some relevant information          5                 F              Fail
Exceedingly poor: very little of merit           2                 F–             Fail
Nothing of merit                                 0                 Z              Fail

Data are extracted from a table that includes other qualifications. Source: www.derby.ac.uk/qa/3Rs/S14%20Assessment%20Regulations%20for%20Undergraduate%20Programmes%20UG.C.pdf (accessed 18 September 2006, and reproduced with permission).

Notes: This scale is designed to apply to the assessment of individual pieces of work including examination answers. The Report Grade is determined by the assessor and reported to the student. MP and MP– are used to indicate achievement of the minimum standard of performance in the work concerned, which may or may not lead to the award of credit. The corresponding Numerical Grade is used to calculate the overall module grade. Classifications apply only to the final determination of Honours. At module level they are indicative only.
Heywood, 2000). Points for and against each approach are summarized in Table 2.3. Cresswell (1986) discusses the loss of information when the number of grades is decreased. However the example he gives seems to make the implicit assumption that the measures being used are sufficiently robust to allow the kinds of arithmetical manipulation that are often applied. As Dalziel (1998) points out in his critical review of the treatment of numerical grades, such an assumption might be difficult to justify despite the work of public examination boards to standardize marking. Any challenge regarding robustness is strengthened when assessments are local to a subject area within an institution, as is typically the case in higher education. Ebel (1969: 221) qualifies his argument that accuracy or reliability is raised by increasing the number of scale divisions when he remarks: ‘if the estimates [of achievement] are extremely inaccurate there is not much point in reporting them on an extremely fine scale’ – a judgement shared by Willmott and
Table 2.2 An illustration of the diversity of reporting grades in Europe

UK: First class honours; Upper second class honours; Lower second class honours; Third class honours; Non-honours degree (‘Pass’ or ‘Unclassified’)

France: 21-point scale, running from 20 (highest) to 0; pass grade is normally 11

Germany: 1 (Excellent); 2 (Good); 3 (Satisfactory); 4 (Sufficient); 5 (Unsatisfactory); 6 (Poor)

Denmark: 10-step scale covering the range from 13 (highest) to 00; minimum pass is 6

Sweden (engineering institutions): 5 (80% and above); 4 (60–79%); 3 (40–59%); U – Underkänd (Fail)

Sweden (Stockholm University): VG – Väl godkänd (Pass with distinction: 75% and above); G – Godkänd (Passed 50–74%); U – Underkänd (Fail: below 50%)
Table 2.3 Contrasting perspectives on the number of scale divisions

Few divisions: Fewer errors in categorizing a performance, but when there are errors they may have considerable consequences
Many divisions: More errors in categorizing, but the errors are less likely to be large, and hence to have serious consequences

Few divisions: Loss of information about the performance
Many divisions: Finer distinctions can be made

Few divisions: Reliability of grades is problematic
Many divisions: Reliability of grades becomes higher with increasing number of divisions

Few divisions: Users may need to supplement grade data with information from other sources
Many divisions: Users may believe that the fine gradings provide sufficient signalling for their purposes

Few divisions: Users of grades may take them at face value. Having fewer grade-bands prevents a user from placing excessive meaning on gradings from a finer scale
Many divisions: It may be motivational to have gradings that are capable of signalling improvement, even if the improvement is relatively small (but see below)
Nuttall (1975: 52) who say much the same thing: ‘the lower the reliability of grading, the higher the rate of misclassification into grades no matter how many grade points are used’. However, Ebel qualifies his qualification by suggesting that in most courses grading can be undertaken with an accuracy sufficient to justify reporting in terms of a scale of 10 to 15 divisions, though he provides no evidence in support. Mitchelmore (1981: 226) concluded that no particular scale length could be advocated, since the optimum length would depend on the prevailing assessment circumstances. A central issue, then, is the accuracy with which grading can be conducted: this is discussed in Chapters 8 and 9. Pass/fail grading Ebel (1972: 335–36) was highly critical of grading on a pass/fail basis, arguing inter alia that the non-differentiation in the ‘pass’ band was unhelpful to those who wanted to interpret performance and provided little stimulus to the student to study hard. Although Ebel had a point, his argument is perhaps not as powerful as he supposed, and is taken up below and elsewhere in this book (especially in Chapter 9). The pass/fail boundary The pass/fail boundary is arguably the most important boundary in the assessment process. Brown et al. (1997: 12) suggest that norm-referenced assessments pose particular problems for assessors in determining whether a student has passed or failed. Whereas other boundaries may have considerable significance for students’ futures, the removal of ‘above-pass’ gradations (as, for example, advocated by Winter, 1993, and Elton, 2004; 2005, in respect of the honours degree in the UK) is not necessarily the problem that some take it to be, since the gradation issue can be tackled by other means, as is discussed in Chapter 9.
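The interplay between marker unreliability and a sharp cut-off can be illustrated with a rough simulation. Everything in the sketch is an assumption made for the purpose of illustration: the ‘true’ level of achievement is treated as knowable, marker error is modelled as normally distributed with an arbitrary spread, and the cohort is imagined to cluster near the pass mark. The point it makes is the one developed below and in Table 2.4: with an unreliable measure, any fixed boundary produces both false positives and false negatives, and moving the operational cut-off trades one against the other.

```python
import random

random.seed(1)
PASS_MARK = 40          # the standard the assessment is intended to protect
MARKER_SD = 5.0         # assumed spread of marker error, in percentage points

def misclassification_rates(operational_cut, trials=100_000):
    """Estimate false-positive and false-negative rates at a given cut-off."""
    false_pos = false_neg = 0
    for _ in range(trials):
        true_mark = random.uniform(25, 55)                  # a cohort clustered near the boundary
        awarded = true_mark + random.gauss(0, MARKER_SD)    # the mark as actually recorded
        if awarded >= operational_cut and true_mark < PASS_MARK:
            false_pos += 1
        elif awarded < operational_cut and true_mark >= PASS_MARK:
            false_neg += 1
    return false_pos / trials, false_neg / trials

for cut in (40, 43, 46):
    print(cut, misclassification_rates(cut))
# Raising the operational cut-off (a 'conservative' pass mark) reduces false
# positives at the price of more false negatives, the trade-off shown in Table 2.4.
```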
The boundary is highly important to students, given the stigma that attaches to failure. In some areas, it is invested with a public significance – for example, the health-related professions, social work and education, in which there is an important public safety element. An institution would be failing in its public responsibility to pass a student who it believed was a risk to the public. Hence in areas in which public safety is an issue, assessors need to adopt a conservative approach to grading (see, e.g., Newble et al., 1994 in respect of Medicine,5 and Cowburn et al., 2000 in respect of Social Work) which minimizes ‘false positives’ – those who pass their assessments, but should not (Table 2.4). The cost of minimizing false positives is, however, an increase in the chances of failing students who might well become competent practitioners.

Determining an acceptable standard

Norcini and Shea (1997) remark that pass/fail distinctions in licensing to practise as a member of a profession depend on educational and social criteria. There are problems in establishing what an acceptable standard actually is, and how a score can be said to be an indication that the standard has been achieved. They suggest the following as important for standard setting.

• Setters of standards should be knowledgeable and credible (some might be lay interested parties).
• Many setters of standards should be involved, since it is important to have a variety of perspectives (Wiseman’s, 1949, point about the marking of essays comes to mind).
• There should be absolute standards and criterion-referencing.
• Standards should be based on the judgement of experts – following Angoff (1971), using a method that provides examples of student achievement and asks the experts to judge whether the exhibited performance exceeds the criterion level for a pass (a sketch of one common form of the Angoff procedure follows the end of this list).
Table 2.4 A matrix relating assessment outcome to competence in practice

                           Assessment outcome
Performance in practice    Fail                    Pass
Not yet competent          Accurate prediction     False positive
Competent                  False negative*         Accurate prediction
Source: Yorke (2005). * If the student failed their assessment, then they would of course not be permitted to enter actual practice. 5 Newble et al. suggested that assessment could be conducted as a filtering process in which cheap and efficient tests (such as multiple-choice questions) might differentiate those who were well above threshold levels (and who could therefore move on without further ado) from those whose level of achievement might need more extensive testing. This ‘decision-tree’ approach to assessment is logical, but it would probably not suit the formalized structure of assessments that are built into many validated curricula.
• Appropriate diligence should be applied to the task of setting standards, but avoiding the unreasonableness of practices such as multiple iterations of judgement.
• The method of setting standards should be supported by research that will undergird credibility.
• The proposed standards should be evaluated for their realism in relation to the context of their use. (The award of credentials should align with competence in practice (see Table 2.4); group pass rates should be reasonable in the light of other markers of competence; standards should be consistent with judgement of stakeholders.)
• Parallel forms of the assessment should be mutually consistent.
• The possibility should exist of scaling assessment performances in order to produce equivalent outcomes.
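The reference to Angoff (1971) in the list above can be unpacked a little. In one widely used form of the procedure, each judge estimates, for every item in an assessment, the probability that a minimally competent (‘borderline’) candidate would succeed on it; each judge’s estimates are summed, and the cut score is the average of these sums across judges. The figures below are invented, and the sketch is a simplification of the procedure as used in practice.

```python
# Each row holds one judge's estimates, per item, of the probability that a
# borderline (minimally competent) candidate would answer that item correctly.
# The figures are invented for illustration.
judge_estimates = [
    [0.6, 0.8, 0.4, 0.7, 0.5],   # judge 1
    [0.5, 0.7, 0.5, 0.8, 0.6],   # judge 2
    [0.7, 0.9, 0.3, 0.6, 0.5],   # judge 3
]

def angoff_cut_score(estimates):
    """Sum each judge's item estimates, then average the sums across judges."""
    per_judge_totals = [sum(row) for row in estimates]
    return sum(per_judge_totals) / len(per_judge_totals)

print(round(angoff_cut_score(judge_estimates), 2))   # 3.03, out of a maximum of 5 items
```

The resulting figure is only as defensible as the judges’ shared understanding of what a borderline candidate can do, which returns the argument to the problems of judgement discussed in this chapter.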
Although there is a lot of merit in these suggestions, the application of a 'reality check' to the generality of summative assessment in higher education would raise a number of doubts about the extent to which they could be made operational. In particular areas of higher education, such as Medicine, the setting of standards transcends the individual institution and collective engagement has a greater potential for bringing the suggestions into play. In the majority of subject areas, however, much more depends on voluntarism. The 'subject benchmarks' promoted by the Quality Assurance Agency in the UK,6 although not explicitly statements of standards, can be seen as having emerged from a process of officially prompted voluntarism.

6 See www.qaa.ac.uk/academicinfrastructure/benchmark/default.asp.

The determination of whether a person is considered worthy of a licence to practise in a profession depends upon 'how much competence' the assessee is expected to demonstrate as an entrant to the profession. The concept of 'good enough' to enter the profession is not always clear, as a number of writers (such as Kemshall, 1993; Murrell, 1993; Stones, 1994; Redfern et al., 2002; and Furness and Gilligan, 2004) have pointed out. Eraut (2004a) suggests that professional competence should be construed in terms of a trajectory that acknowledges a lifelong learning perspective, in which is embedded the professional's commitment to continuing professional development. 'Good enough' competence for entry to the profession would therefore relate to the levels of achievement reasonably expected from first degree study, with the concept of 'good enough' steadily expanding as the person advanced through the profession – perhaps through the stages of professional development set out by Dreyfus and Dreyfus (2005).

Can every aspect of performance be graded?

Ebel specialized in test construction and psychometric theory, and conducted his major work at a time when the 'scientific measurement' paradigm was dominant. If an achievement could be measured, then it was consequently scalable
and performances could be reported against the scale. Some aspects of performance lend themselves to scaling, even though the scaling might not fulfil all the technical requirements of psychometrics. Examples are the recall of facts and principles, the use of routines to solve problems (though such problems, with their predetermined right answers, are probably better characterized as puzzles), and so on – the kind of learning that Gibbs and Simpson (2004–5: 3) see as 'dull and lifeless' but, from the point of view of summative assessment, advantageous because it can be assessed rigorously, reliably, and with minimal risk of cheating.

However, the scientific measurement paradigm cannot accommodate the 'softer' judgements of achievement that have become more important in undergraduate education, which have emerged in recent years as higher education has widened its view about what should be valued in student achievement, having been influenced by government policies stressing the importance of developing 'human capital'. Contemporary students are expected to demonstrate a host of attributes, qualities and skilful practices that can be grouped under headings such as 'employability' or 'workforce development' (Knight and Yorke, 2004; Voorhees and Harvey, 2005). These are not easily measured, even with the resources of an assessment centre, let alone the much more limited resources available for this purpose in higher education. They may be broadly gradable, but in some instances it may be possible for an institution only to warrant that they have achieved a passing level of performance. Pass/fail grades are permitted in some modules, but classifications and GPAs do not incorporate them because they do not fit the formal algorithms for summing up performance in a single index. Reporting practice, therefore, may not always live up to the rhetoric regarding student achievement.

It is even seriously open to question whether the scientific measurement paradigm can cope with some central intentions of higher education, such as the development of critical thinking. Taking critical thinking as an exemplar of the problem, Knight and Yorke (2003: 53ff) argue that, although it can be made the topic of standardized testing (with all that implies for the technicalities of measurement), the standardized test by its very nature cannot accommodate critical thinking outside pre-specified circumstances, such as when critical faculties are brought to bear on the unanticipated – perhaps unbounded – situations that can occur in, for example, Fine Art, History, Sociology or Philosophy. When one inquires what 'critical thinking' means to academics, one does not get homogeneity in response – there are variations at local level analogous to the physiological variations in tortoises on the different islands of the Galapagos.7 Further, when assessments are reported, the interpreter of that report is unable, unless a very special effort is made, to appreciate whether an 'A' grade represents a performance of true originality on the part of the student or one that has been so subject to tutoring and perhaps 'scaffolding' (Wood et al., 1976) that it is a demonstration of competence in following rules rather than of the ability to strike out on their own. Measurement as Ebel envisaged it, then, cannot cope with the variability inherent in higher education.

7 See Webster et al. (2000: 77) for a similar point regarding 'analysis'.
Different purposes need different scales

Cresswell (1986) also makes the point that different purposes suggest different numbers of grades in the scale. A grading approach that suits one purpose may not be ideal for another. For example, public examinations in the UK tend to report on a relatively short grading scale, whereas percentages are often used internally in schools (even if reports to parents are couched in broad categories of performance). Further, what a stakeholder claims to want (e.g. precise grading) may not be what the stakeholder really needs (e.g. broad indications of achievement and capability).

However, reducing a larger number of scale divisions to a smaller number (say, for the purposes of reporting) can on occasion produce inequities. Thyne (1974) constructed an example to illustrate the point. Whereas Student A scored 27 and Student B 25 out of a total of 76 (from four marking scales running from 0 to 19), when the marks were collapsed into five bands of equal interval, Student B came out ahead of Student A. The discrepancy occurred because some of the performances of Student B were low within the bands (and hence banding worked to B's advantage), whereas the opposite was the case for Student A.

Grades as signals

Students need clear signals of their levels of achievement, so that they know where they stand and whether there is a need to make an effort to improve their performance in the area in question. Grading finely has more 'signalling potential' than grading coarsely. Heywood (2000) suggests that fine grading has 'motivation value', in that it can give students numerical signals regarding improvement (making, of course, the implicit assumption that the grading is sufficiently reliable for this purpose).

The argument may not be as strong as Heywood supposes, as far as modular schemes in the UK are concerned. Students can work out, from the grades that they have attained, whether there is any realistic chance of improving their honours degree classification: if there is not, the motivation value of the grading system vanishes. In the US, where grade-points are cumulated into a rolling average which is more finely grained, the motivation potential of the grading system may be less affected – but for some purposes, such as the gaining of honors or of entry to a graduate-level programme, the attainment of the required GPA threshold may or may not be a realistic ambition, with possible consequences for motivation. If a student can work out that, even if they work very hard, they cannot move up a grade band (even if their GPA can be edged upward), there is little that the grading system can do to provide extrinsic motivation (Bressette, 2002: 37, gives an example).

Employers and educational institutions use grades (and other contextual information, such as the institution attended by the student) in sifting through applications. Where pressure on places is high, the grade achieved by a student becomes very significant, and fine divisions in the grading system take on considerable importance. However, the extent to which a grade represents a 'true score' is always problematic – as is the concept of 'true score' itself, since this will reflect value judgements regarding what is to be subsumed within the grading process.
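Returning to Thyne's (1974) banding illustration earlier in this section, the effect can be shown with a minimal sketch. Thyne's own component marks are not given above, so the figures below are hypothetical, chosen only to be consistent with the totals cited (27 and 25 from four scales running from 0 to 19, collapsed into five bands of equal interval).

# Hypothetical component marks consistent with Thyne's (1974) totals: Student A's marks
# sit high within their bands, Student B's sit low within theirs.
def band(mark, band_width=4):
    # Collapse a 0-19 mark into one of five equal bands, scored 1 (lowest) to 5.
    return mark // band_width + 1

student_a = [7, 7, 7, 6]   # raw total 27
student_b = [8, 8, 8, 1]   # raw total 25

print(sum(student_a), sum(student_b))        # 27 25 -> A ahead on the raw marks
print(sum(band(m) for m in student_a),       # 8
      sum(band(m) for m in student_b))       # 10    -> B ahead once the marks are banded

The reversal arises solely from where each component mark happens to fall within its band, which is precisely the inequity to which Thyne was drawing attention.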
Grading and its limitations 43 By itself, grading contains only the weakest of pointers for future action, in that it often signals ‘do better next time’ whilst providing nothing to indicate how that improvement may be made. The connection between the signifier (the grade) and the performance (the signified) is tenuous unless the assessment is strongly criterion-referenced (and even then, as Wolf, 1995, has shown, criteria are often insufficient without the accompaniment of examples to show how the criteria are to be interpreted and used). For grading to be formatively effective, it needs to be accompanied by a commentary that covers both strengths and weaknesses and also points to the way in which the student can improve their work – whether in repeating a failed assignment or in moving on to a new assignment. Grading in this formative way is, of course, only successful if the student acts on the information that has been provided.8
Problems associated with grading Sampling from the curriculum Sampling of curriculum content for the purposes of assessment contains a risk of bias. Since it is usually unrealistic to attempt to assess all of the objectives or intended learning outcomes for a unit of study, a selection has to be made and ideally is as representative of the curriculum segment in question as can be achieved. The choice of assessment task(s) exerts an influence on the marks that are awarded, depending on the level of the assessment demand. So before marking takes place, the grades awarded are likely to have been influenced: this is demonstrated from personal experience, as in the following true account.9 As part of a subsidiary Mathematics course in my first degree studies, the students were given an examination on differential equations. The examination required the candidates to solve, using well-established rules, a number of differential equations. The marks awarded ranged up to 99 per cent, and the teaching department took the view that these were so elevated that the students should take a different – and harder – paper over the Christmas vacation. Enthusiasm for this extra task was understandably not high, and the mark profile was shifted some 20 or so percentage points lower since students took the view that the requirement was unfair and were only prepared to put in sufficient effort to secure a pass. 8 See the model of formative assessment in Knight and Yorke (2003, Chapter 3). 9 Contrast this with the roughly contemporaneous student Thompson’s behaviour at the University of Kansas. Thompson was awarded a mark of 78 but told the professor that the students wanted another examination. This action was interpreted as Thompson trying to raise his grade at a second attempt, and the other students indicated that they were going to ‘do something’ to ensure that he did not turn up for the second examination, and hence would not gain the advantage he was presumed to be seeking (Becker et al., 1968: 107). Frustratingly, Becker et al. do not provide the denouement of this episode.
Had this taken place under contemporary assessment systems in the UK, the students would probably have appealed en masse that they had eminently satisfied the examiners and that hence their original results should stand, but in those days the university was able to set without challenge whatever tasks it deemed appropriate. The problem, as far as assessment was concerned, was that the sampling of the content of this mathematics course was inadequate, and this only came to light when the results attracted attention in the teaching department. Something like this would be unlikely to occur on such a scale in contemporary higher education, given the emphasis on curriculum content and assessment in quality assurance systems, but the possibility of a more modest sampling bias still remains.

How the scale is used

A second issue is the kinds of task that are being assessed. If the curriculum content is amenable to students providing unambiguously correct responses to assigned tasks, as is the case with some technical subjects, then, using the percentage scale, the full range of percentages from 0 to 100 is available to the marker. The question 'percentage of what?' is not a problem because the 'what' has been determined by the sampling of the curriculum content for assessment. There is in operation an implicit notion of the possibility of a perfect performance, even if the student has to produce something new, such as a piece of computer programming in which functionality and algorithmic elegance can be judged.

Where 'correctness' is difficult to identify, the question 'percentage of what?' becomes much more difficult to answer, and in the UK percentages well above the threshold for first class honours, 70 per cent, are relatively rare. For example, subjects in which students are expected to argue a point of view tend to limit percentages to a range of the order of 30 to 80, with only exceptional performances at both ends lying outside this range. An attempt to avoid some of the problems with percentage grading, such as implicit notions of an unachievable perfection in discursive subjects, is to use a criterion-referenced scale in which the top grade can be seen as representing an excellent achievement for a student at that particular stage in their programme. There is some evidence to suggest that this 'works' at the level of the study unit (Yorke et al., 2002), but the potential of the approach may be realized only partially if assessors tacitly work on the basis of a percentage scale and only after judging the work convert the notional percentage into a point on the scale being officially used.

Additive or holistic marking?

The third issue is the way in which the marker uses the notion of a percentage scale. There seem to be two basic kinds of approach – additive and holistic:

1 Additive (or 'menu marking' with aggregation in Hornby's, 2003, terms). The
Grading and its limitations 45 assessment task is subdivided into segments carrying a particular number of marks, with marks being awarded for component parts. The segment scores are built up into a total score. This is typical of criterion-referenced assessments with varying degrees of specificity. 2 Holistic. The marker assesses the overall performance with reference to broad categories (often levels of the honours degree classification), and then refines the assessment into a finer categorization. For example, a performance judged initially to be at upper second class level is subsequently assigned to the lower part of the band and then to a mark, say, of 62 per cent. It should be noted that this approach is very weakly referenced against criteria, rather than completely lacking in such referencing. Hornby (2003) suggests a third approach – criterion-referenced – but this can apply to both of the approaches outlined above, with the criteria for the latter likely to be broader than those for the former. The use of the additive approach sometimes produces an overall outcome that is at variance with a holistic assessment of the merit of the work. Wiseman (1949: 205) noted that he sometimes found the ‘best’ essay not top of the list when using the additive (he uses the term ‘analytical’) approach to marking, and Shay (2003: 80) reported that two markers in her study felt that the marking memorandum they had used gave marks that were too high and adjusted their marks downward to align better with their ‘gut-feel’ of the merits of the work. Another of her respondents observed, in resisting the constriction of criteria, that ‘I don’t have a mind that splits into seven bits’ (Shay, 2003: 97). It is not unknown for the outcome of additive marking to be a higher mark than that from holistic marking. This may be attributable – at least in part – to the way in which the short mark-scale available for components of the task is used. When the scale is short, there may be a tendency for markers to give the student the benefit of the doubt, and hence in effect to round up the mark. If this is done a number of times across a piece of assessed work, then the roundings-up are cumulated and the overall mark reflects this. Another possibility is that there is a difference of emphasis between the assessment of the individual parts and that of the whole. A holistic assessment of an essay may pay more attention to the structure, coherence and elegance of the argument than would a more atomized assessment process, where the emphasis might be on whether the student has covered the ground for the respective sections, and perhaps cited the expected amount of source material. As Wiseman (1949: 205) put it: ‘the total gestalt is more than the sum of the parts’. A problem for the holist assessor is the amount of material (or, psychologically, the number of ‘chunks’) that can be processed at the same time. Miller (1956) suggested somewhat speculatively that the number was in the region of seven: more recently Cowan (2000), after reviewing an extensive literature, has concluded that the number is closer to four, although the particular circumstances may edge this up to a figure closer to Miller’s. Whatever the figure, it is derived from psychological experimentation and its validity in respect of the assessment of student work is a
46 Grading and its limitations matter for further exploration: at the moment, Cowan’s analysis is more a prompt to reflection about the assessment process than an indication of practical limits. In reality, matters are usually not as sharply polarized between the additive and holistic approaches: marking schedules for essays typically give marks for structure and presentation, so that what happens is a mixture of the additive and the holistic. Further, there are aspects of assessment (discussed briefly below with reference to creativity) for which the divalent perspectives of additivity and holism seem inadequate. In science-based subjects, there is often the need to demonstrate that one has learned and can apply principles, formulae and so on, and questions tend to have unambiguous answers. Hence it is easy for an assessor to tot up the marks for the whole of the assessed piece of work. The same principle, in a weaker form, may apply in respect of the set of criteria used in judging the standard of a more discursive offering. This is ‘positive marking’, in that marks are awarded for correct or good points. ‘Negative marking’ can occur when a performance exhibits lapses from excellence, and marks are deducted as a consequence (and is acknowledged in Cameron and Tesoriero, 2004, quoted below). As a side-issue to the approach taken to marking, Wiseman (1949) noted that marking school examination essays holistically was much more rapid than approaching the task additively. One could therefore reduce the time devoted to marking (and hence the costs), or use multiple markers for the same amount of total time. By extension, one might infer, following Wiseman’s discussion of the advantage of multiple perspectives in marking (see Chapter 1), that the total or mean mark from multiple assessments would provide a better index of an essay’s worth than any individual mark. From the perspective of assessment in higher education, some caution is obviously necessary regarding Wiseman’s argument. In higher education, one is less likely to be marking a whole slew of essays addressing the same question in much the same way and, in any case, the essays are likely to be much longer and more complex than those produced in a school examination. However, if the need for precision in marking is not as great as it is often taken to be, then Wiseman’s argument should give pause for thought. Grades and levels Hornby (2003: 441ff) describes a matrix of achievements in which the grade awarded and the academic level (first year Scottish degree, etc.) are represented. He indicates that a student’s performance for each of the four years of the programme can be located in a cell of the matrix. This, he says, ‘provides greater clarity and transparency for all the stakeholders in the assessment process’ (ibid.: 444). It seems, though, that Hornby is being a little optimistic, since the matrix entries involve some form of combination of grades from different modules, and the grade descriptors (ibid.: 442–443) are open to considerable interpretation, even if academics’ understanding of them reaches a higher level of intersubjectivity than Morgan et al. (2004) found in their study of the understanding of levels within undergraduate programmes in Australia. Unarticulated in Hornby’s pres-
Grading and its limitations 47 entation of the approach and the grade descriptors are the actual standards of performance against which reference is being made: there is an acknowledgement of a ‘threshold standard’, but what this might be can only be determined by each subject discipline involved in the assessment. Hence the matrix, as presented by Hornby, goes only a little way beyond the cumulation of performances across a programme into a single index. Hornby (2003: 450) refers to the implicit pressures on academics not to stick their necks out as regards the use of grades at either end of the scale – something given a clear reality by Johnston (2004: 405–406) in her vignette of experiences as a marker trying to come to terms with normative expectations in an educational system with which she was unfamiliar.
Going beyond boundaries: the problem posed by creativity It is when the demonstration of creativity is a curricular expectation that the problems with the percentage scale are at their greatest. Elton (2005) remarks that ‘[f]or work to be considered creative, it has to be – within its context and, in the case of students, at a level appropriate for them – both new and significant’. This perspective is deliberately context-bound, so any notion of absolute perfection (valued at 100 per cent) has to be backgrounded. A student can receive a grade representative of excellence for outstanding work at the relevant stage of the programme. The most challenging word in the quotation is probably ‘significant’. How does one judge significance? Should not the truly creative person break through boundaries? If so, then how can the assessment specification allow for this? The only way is to state expected learning outcomes (or similar) in a manner that is open to interpretation – a curriculum model incorporating tight pre-specification of learning outcomes and assessment does not serve creativity well. One can only judge creativity after the event, by the assessor bringing all the knowledge and understanding they possess to bear on the judgement process. Eisner (1979) referred to this as ‘connoisseurship’, and this emphasizes interpretation and judgement by experts whilst reducing the ‘measurement’ aspect of assessment to at best a background role. Cowdroy and de Graaf (2005) discuss the assessment of creativity, inter alia arguing that an important part of the assessment must be determining the thought processes in the creator that led to the creative production, otherwise the assessment would merely represent the (detached) judgement of the assessor. Amongst a number of pertinent points, Elton (2005) argues that, when dealing with creativity, one cannot apply a ‘one size fits all’ approach to assessment. Fairness requires, not homogeneity in assessment methodology, but an approach that enables each person to demonstrate their individual – and perhaps idiosyncratic – achievements. The use of a portfolio in which the student’s work is showcased is not new in subject areas such as Art and Design, since applicants are often required to produce such a collection when attending for interview for a place on a degree or other programme, and the ‘degree show’ is essentially a portfolio-based presentation. Although portfolios may have high face validity, they run into some
48 Grading and its limitations problems regarding the reliability of their assessment (much of the evidence is reviewed by Baume and Yorke, 2002) and their usability. Judgements in the creative arts are not always shared. Many in the field will have been present at an assessment board at which strong debate has occurred over whether a student’s achievements merit a high or low grade. (I recall observing one assessment board in the area of Art and Design at which a candidate’s work was rated ‘A’ by the department and ‘E’ – a fail – by the external examiner. The debate was very heated, and eventually resolved in stereotypically British fashion by compromising on the award of a ‘C’.) An approach to the assessment of creativity which sits uncomfortably with formal grading methodology is to get the creator to comment on their creative inspirations. Balchin (2005) claims to have found self-reporting to be superior in a number of respects to other approaches to assessment, and there are hints of commonality with the line taken by Cowdroy and de Graaf (2005). As is the case with Eisner’s connoisseurship, self-reporting is difficult to align with unidimensional grading scales. Although assessment of creative achievement is arguably the most difficult to interlock into formal grading schemes, there are elements of ‘the creativity problem’ in other subjects as well.
How is grading undertaken?

Advice

There are many books that deal with assessment (a number were listed in the introduction to Chapter 1), some of which offer ideas on 'how to do it'. Readers interested in the actual practice of grading are invited to consult texts such as these. As an example of the diversity of approach that can be found in grading, Walvoord and Anderson (1998: 93ff) suggest three approaches, recognizing that the choices made reflect values that are explicitly or implicitly being brought to bear.
• Weighted letter grades, in which the various components of the assessment are kept completely separate, and weighted according to a prior specification. Thus, if a student obtains a 'B' average grade for a module component which is weighted at 30 per cent of the total for the module, this B is treated as 30 per cent of the overall grade (whose computation requires some translations between letters and numbers).
• Accumulated points, in which compensation is permitted between components of assessment, so a student gaining a low score in one component can offset this by a higher score elsewhere in the module.
• Definitional system, in which there is a threshold of performance that has to be attained for the grade to be awarded. Where the module includes both graded and pass/fail components, this becomes cumbersome to operate and may be unforgiving of weaknesses since compensation between the two kinds of assessment is not allowed. The extension of the approach, in which
Grading and its limitations 49 grades are awarded in respect of module components (the example given is of test and examinations on one hand, and laboratory reports on the other) seems more practical. However, the tightness that seems to be a characteristic of the suggested approaches is undercut when Walvoord and Anderson, writing of the ‘Accumulated points’ approach, suggest awarding a total of 120 points yet awarding an A grade for, say, the points range 92–100 (ibid.: 95), or when they propose that you can hold back a “fudge factor” of 10 percent or so that you can award to students whose work shows a major improvement over the semester. Or you may simply announce in the syllabus and orally to the class that you reserve the right to raise a grade when the student’s work shows great improvement over the course of the semester. (Walvoord and Anderson 1998: 99) Empirical findings Variation in approach There is a relatively slender base of research on the way that academics in higher education go about the task of assessing students, which probably reflects the methodological difficulties involved. A survey conducted by Ekstrom and Villegas (1994) found that academics tended to use a variety of approaches to grading: 81 per cent claimed to use criterion-referenced assessment; 57 per cent to use norm-referenced assessment; and 44 per cent to use a self-referenced approach. When the same academics were asked which of the three approaches they used the most, 64 per cent indicated criterion-referencing; 29 per cent norm-referencing; and 8 per cent self-referencing. Clearly, from the evidence provided by Ekstrom and Villegas, many academics use more than a single approach to grading, a point that Hornby (2003) acknowledged following his interviews with lecturers who were asked to identify one of three approaches (criterion-referenced, holistic, menu marking) as representing their approach to marking. In converting test scores to the letter-based system of grading in Virginia Polytechnic Institute and State University, respondents to a survey of 365 teachers were evenly divided between direct conversion according to percentage bands (e.g. C awarded for a core in the range 71–80) and an approach that took into account factors such as test difficulty, the individual student, and even ‘natural breaks’ in the score distribution (Cross et al., 1993). A tiny minority described their practice as, effectively, ‘grading on the curve’. Pilot investigations reported by Yorke et al. (2000) involved workshops in which some 60 academics with varying degrees of experience were asked about the way(s) in which they went about the task of assessment. They were presented with a number of approaches including the following:
• a 'platonic' model, in which the assessor had a clear idea of what was to be expected of the student, and assessed against it;
• intuitive approaches derived from the assessor's experience of having had their own work assessed (prior to becoming an academic);
• marking against a set of pre-specified criteria (in Hornby's, 2003, terms, 'menu marking').
The majority of this opportunity sample claimed to use a marking scheme or template. In science subjects this tended to be precise; in other subjects it was more general. It was easier in the latter group to make allowance for the inclusion of relevant material that had not been part of the original assessment specification. Other assessors were more holistic in their approach, claiming to recognize the broad level of a performance and using the grading scale to narrow down their judgement to a more precise grade. Most of the sample had learned about the practice of grading from colleagues, whereas others relied on recollections of having had their work assessed.10 A few had developed their expertise as assessors through attending institutional workshops on assessment. They had mixed views about the accuracy with which they felt that they could judge student work, some saying that their inexperience and uncertainty was mitigated by double-marking, moderation or the contribution of external examiners.

Investigating practice

Baume et al. (2004) suggested four ways in which investigations could get closer to the actual practice of assessment and hence to minimize the chances of post hoc rationalization:

• sitting alongside an assessor, and asking them to think aloud as they worked;
• asking assessors to audiorecord their thoughts as they worked;
• asking assessors to write down, during the assessment process, their reasons for their judgements;
• interviewing or surveying them on completion of the assessment, and asking for the reasons for their decisions.
Baume et al. opted to ask assessors to comment on their recently completed assessment by responding to an on-screen pro-forma. They found occasional examples of assessors ‘bending the rules’ to fit outcomes that they thought were merited. In her study of assessor behaviour, Orrell (2004) asked assessors to think aloud whilst they worked, finding some differences between what they said they
10 Most of the respondents to Hand and Clewes’ (2000) study said that they learned about grading by simply doing it: there was at that time no formal training in assessing work.
Grading and its limitations 51 did and what they actually did in practice. There was a tendency to assess more holistically than was implied by the marking scheme to which they were expected to work. Assessors also paid particular attention to aspects of student writing that had been left unstated: the quality of the introductory paragraph; the graphic quality of the text; and the student’s use of academic writing conventions. Hand and Clewes (2000) undertook a qualitative study of the grading of final year dissertations. They were surprised to find variation between assessors regarding their adherence to formal guidelines, and commented on the unfairness that this could engender as well as the inconsistency compared with the expectations that students should work with guidelines in mind. They also noted the inflection by their respondents of their own glosses on provided criteria (as was also noted by Webster et al., 2000). These findings chime with those of Baume et al. (2004). Webster et al. (2000) examined 80 completed assessment forms relating to dissertations undertaken in the School of Social Sciences and Law at Oxford Brookes University. They described their rather mixed findings about the use of criteria as follows: The good news for students is that . . . it could be inferred that assessors do indeed use the published criteria in marking. However, the potentially bad news is that not all of them are necessarily applied. The possibly even worse news is that these are not always the only criteria used: comments were found which did not seem to relate to any published criteria (or if they did it was not explained how). These ranged from: “This is a poor copy of something published elsewhere”, through “This sort of dissertation could have been written on a brief cultural holiday”, and “It’s a planning report not a dissertation”, to “The summary tables should be embedded in the text!” Furthermore, the analysis suggested that judgements are sometimes related to and influenced by the orientation of the assessor towards wider value systems. Thus some markers would have wanted dissertations to address empirical issues while others would have liked to see more theory. However, that this variation in outlook might be encountered is nowhere made explicit to the student. (Webster et al., 2000: 76) The survey conducted by Ekstrom and Villegas (1994) involving 542 usable responses from individual academics showed the variation in departmental approaches to grading that one might expect – a bias towards objectivity in sciencebased subjects, and a recognition of the potential for disputation and alternative views inherent in humanities and programmes based on the social sciences. Ekstrom and Villegas also noted variation in approach within departments, in that differential attention was paid to matters such as attitude and effort, improvement,
class participation, attendance and personal circumstances (some of which were probably left unstated as criteria for assessment).11

Word and grade

Echoing Wolf (1995), Webster et al. (2000) showed that academics assessing dissertations varied in the meanings that they gave to terms such as 'analysis' and 'evaluation'. They also pointed to an occasional lack of congruence between the words used and the grade awarded, citing the following:

• 'Real awareness of the various perspectives': mark awarded, 46%
• 'this is a clear, well presented [dissertation] . . . which fulfils its specific aims': mark awarded, 49%
• 'results section unclear': mark awarded, 57%
(Webster et al., 2000: 76)
The first two seem to be awarding a grade well below what the words are signifying, whereas the last seems to be doing the reverse. Wherein lie the truths? Social considerations Anticipating the findings of Webster et al. (2000) that unarticulated criteria could creep into the assessment process, Cross et al. (1993) found, from their survey of teachers at Virginia Polytechnic Institute and State University, that some adjusted borderline grades by taking into account factors unrelated to achievement. They indicated that, in their opinion, some of this adjustment might have taken place for social reasons such as ‘political correctness’. In these various studies, then, divergences were found between intentions and practices which cast shadows over the accuracy with which student work was graded. The social aspect of assessing is particularly strong when assessors work together to reach a judgement about student work. The ‘group crit’ in Art and Design involves assessors collectively appraising artefacts and negotiating an appropriate grade. Orr (2005) observed group marking or moderation on six occasions which in total occupied some ten hours of interaction. She observed that views were exchanged, leading to a broad grade which was subsequently refined into an agreed numerical mark. Methods of moving towards agreement included offering a mark range rather than a specific mark (hence allowing some ‘wriggle room’ in negotiation); averaging the marks suggested by assessors; and moving towards the mark proposed by one assessor (who may have been perceived as the most powerful). Orr detected, during her observations, a mixture of norm-referenced, criterion-referenced and ipsative approaches to assessment.
11 Although there seem to be some peculiarities in their statistics, the general findings probably hold up.
Grading and its limitations 53 Anonymity in assessment Is grading affected by knowledge of the student? Concern over the possibility of bias in respect of gender, ethnicity, and other demographic variables has led many institutions in the UK to adopt an approach to marking in which scripts are anonymized. This succeeds up to a point, though there is debate (summarized by Cameron, 2004) about the extent to which scripts contain clues to the nature of their author. However, where the student has to be observed in a practice situation (for example, on teaching practice or in a hospital ward), or produces a unique response to an assessment demand (such as a dissertation or project report), anonymity cannot realistically be maintained. Another response which is ‘blind’ to the identity and characteristics of individuals is to use computer-based assessments of various kinds, but this general approach is necessarily limited to certain types of assessment. A disadvantage of not being able to assess on an anonymous basis is the pressure on assessors not to fail students, or to give them higher passing grades than their work merits. The passing of students whose work does not clearly meet the prescribed criteria is discussed in Chapter 5, as is the awarding of inflated grades as an act of kindness to students. These issues indicate that the social context of assessment can be influential in determining outcomes as well as the more ‘technical’ aspects of the assessment process. Multidimensionality Grading is often multidimensional. In the medical arena, some assessments employ both checklists relating to the performance of particular actions (for example, Reznick et al., 1997: 228, note the significance of spacing sutures at between 3 and 5 millimetres in bowel surgery) and global rating scales regarding the conduct of surgical procedures. In some instances, the two kinds of assessment are combined according to a weighting procedure (e.g. Hodges et al., 1997; McIlroy et al., 2002): this is to some extent problematic, since the constructs being assessed are different. The difference is exacerbated when technical checklists are used in respect of objective structured clinical examinations (OSCEs) and the global ratings relate to broader aspects of communication and patient management. On the other hand, there may well be some overlap in the two kinds of assessment, and scoring on one kind of instrument may influence scoring on the other (Hodges et al., 1997: 719). A problem with the use of checklists in OSCEs is that they may undervalue professional expertise. Hodges et al. (1999) used checklists with 14 medical practitioners at each of three different levels of experience (in ascending order, clinical clerks, family practice residents and family physicians) who were faced with two standardized patients. Scores on the OSCE checklist declined with increasing experience. Hodges et al. suggest that the checklist consists of the set of questions that a relative novice might need to ask in order to cover the diagnostic ground, whereas a practitioner with greater expertise would be able to diagnose without
54 Grading and its limitations having to ask all the listed questions because they would integrate situational and observational data and use their understandings in support of hypotheses formed on the basis of a smaller number of focused questions. This seems to be consistent with the levels of professional expertise proposed by Dreyfus and Dreyfus (2005).12 When Hodges et al. used global ratings with the same practitioners, the most experienced scored much more heavily than the others, which the Dreyfus brothers’ conception would predict.13 The limited empirical evidence on the actual practice of grading suggests that the variability in assessment behaviour – even if it can be mitigated by engaging second opinions in one way or another – introduces an often unacknowledged aspect of unreliability into the assessment process. This contributes to a formidable challenge regarding summative assessment, that of maximizing reliability (taking validity as a given) in a context in which real expertise in assessment appears to be rather thinly spread. Holroyd (2000) asked the question ‘Are assessors professional?’: his discussion of the various aspects of the question points to the difficulty of giving a wholly affirmative answer. The problem of professionalism in assessment in higher education can be posed baldly in the following terms. ‘Academics have professional expertise in respect of the subjects they teach; they are increasingly pressed to develop a second aspect of their professionalism as teachers; and the challenges of assessment imply further development of their expertise in the direction of educational assessment. Is it feasible, given the many competing demands on their time, to expect them to develop expertise in assessment? If not, then what are the implications for the robustness of summative assessments?’
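To illustrate the weighting procedure mentioned above in relation to checklists and global ratings, the arithmetic of combining the two kinds of score can be sketched as follows. The marks, scale lengths and weights are hypothetical rather than drawn from any of the studies cited; the sketch shows only the mechanics of the combination, and one of the conversion decisions that make such combinations problematic when the constructs being assessed differ.

# Hypothetical OSCE station result: a technical checklist and a global rating, each
# rescaled to a percentage before a weighted combination is formed.
checklist_done, checklist_items = 17, 22     # checklist items completed, out of 22 listed
global_rating, rating_maximum = 4, 5         # global rating on a 1-5 scale

checklist_pct = 100 * checklist_done / checklist_items   # about 77.3
# Dividing by the scale maximum treats a rating of 1 as 20 per cent rather than as the
# floor of the scale - one of several judgement calls hidden inside such conversions.
global_pct = 100 * global_rating / rating_maximum        # 80.0

# Weights are set by the assessment designers; 60/40 is used purely for illustration.
station_score = 0.6 * checklist_pct + 0.4 * global_pct   # about 78.4

print(round(station_score, 1))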
12 Dreyfus and Dreyfus identify five stages in professional development which they label Novice; Advanced beginner; Competence; Proficiency; and Expertise.
13 There is in this study an implicit assumption that the more experienced should ipso facto be 'better': what could be being identified is a gradation of professional style.

The challenge of reliability in grading

Reliability in the grading of student work has long been acknowledged as troublesome (e.g. Edgeworth, 1890a,b; Starch and Elliott, 1912, 1913a,b; Hartog and Rhodes, 1935, 1936). A number of subsequent studies show that the problems have not been eliminated. In practice, the elimination of all the problems is an unrealizable ideal. Fishman (1958) commented on the problem of unreliability in grading:

The unreliability of grades within departments and the variability of grading standards across departments is [sic] apparent to all who have looked into these matters. (Fishman, 1958: 341, emphases in the original)

Smith (1992) suggested that, for grades to be considered reliable, they should be reasonably stable across courses in the discipline concerned. This assumes a
degree of homogeneity within the requirements of the discipline and/or within the student that may not be justified. For example, programmes in the field of biosciences may require students to be both experimental scientists and writers of reflective essays. In early work by the Student Assessment and Classification Working Group it was noticed that the occasional modular grade distribution differed quite markedly from the norm for this subject area, and discussion with a knowledgeable academic suggested that the type of demand in the subject – and the grading behaviour of the academics – might explain the differences that had been observed.

Some subject areas may be treated fairly casually as 'disciplines' even though they are multidisciplinary in nature. The area of business studies constitutes a case in point, with its combination of mathematical (e.g. in quantitative methods modules) and social scientific foundations. Students who perform strongly in one aspect of the subject area might not perform well in another. A discipline-based approach to the reliability of grades may show a decline in reliability over time, as Humphreys (1968) found to be the case across eight semesters. This should perhaps occasion little surprise, since students develop their capabilities in different ways as they progress through higher education.

The studies conducted by McVey (1975, 1976a,b) have often been cited in discussions of the reliability of assessment. In the first of these papers, McVey (1975) investigated marker consistency in electrical engineering, sometimes under 'blind' conditions, and sometimes when the second marker had sight of the marking of the first marker. With markers using a model answer template, there were the high correlations that one might anticipate: McVey observed that 'once a schedule of marking has been prepared, markers are interchangeable' (ibid.: 207). However, this may apply only in circumstances in which the correctness of the student's answer is readily identifiable, as was the case with the majority of the problems that McVey posed: it is very unlikely to apply in any subject in which the student may choose how to approach the academic challenge, such as in the humanities, social sciences, and the creative and performing arts. Despite the high correlations observed by McVey, there were some quite large inter-marker differences – nearly 20 per cent of the mark-pairs differed by 11 percentage points or more. The larger discrepancies appear to have been associated with questions requiring a more descriptive answer, as one would expect.

In the second paper, McVey (1976a) reported two experiments in which a total of 37 students sat parallel forms of the same examination in Electronic Engineering. The first sitting was the real examination. On the second occasion, as an inducement to take the examination seriously, the students were paid a sum of money related to their performance. Whereas McVey found correlations greater than 0.95 between markers for the same examination paper, the correlations of student marks across examinations were lower (between 0.63 and 0.77). McVey's interpretation of this difference – that the examiners did not achieve high precision in examining despite high precision in marking – does not stand up to scrutiny.
For example, there is no indication of whether the sums of money paid for performance were sufficient to encourage commitment to the task (one can imagine some students being satisfied with a relatively low payment, and others
56 Grading and its limitations striving to maximize the payment), and McVey seems to have overlooked the statistical tendency for marks to regress towards the mean (which would tend to lower the correlation between the scores on the two examinations). Baume and Yorke (2002) examined the assessments from archived portfolios from a course in teaching in higher education. The portfolios had been graded by trained assessors and, where there had been a significant discrepancy, a third assessor had been employed to resolve the differences. Baume and Yorke found that, although the general run of assessments showed a high level of consistency, the level of consistency was reduced where there was greater latitude for interpretation on the part of the assessors as regards whether the criterion had been met. Aspects of the course that were particularly susceptible to divergence in assessment were participants’ evidence in respect of ‘reflection’ and of engagement with ‘equal opportunities’, where the criteria turned out to be more general than the course designers had probably intended. A subsequent study (Baume et al., 2004) involved the re-assessment, by pairs of trained assessors, of ten archived portfolios, with the assessors also providing a commentary on why they made the judgements that they did. The commentaries revealed that, at times, the assessors were interpreting the guidance they had been given with some flexibility, sometimes bending the ‘rules’ in the interests of what they construed as exhibiting fairness to the candidates. This could be seen, for example, in the occasional decision to see the candidate’s performance holistically instead of as a series of separate components.
The challenge of consistency

Sadler (2005: 183) comments that the adherence by some universities to a uniform mark range across the institution is claimed by them to produce a high degree of comparability of standards across schools. Such a claim, if ever it were made, is palpable nonsense. A common mark range might give rise to commonality in the spread of marks, but in no way guarantees comparability of standards. One cannot sensibly claim that a mark of, say, 61 per cent in English indicates an equivalent standard to that of an identical score in Engineering – the two subjects are too different. In fact, the weaker claim of comparability of marks is not sustainable, given the unevenness that many commentators have observed (see below and Chapter 5) in respect of marks and grades across different subjects.

Inconsistency between subjects

Table 2.5, which presents a sample of real data from first year programmes in a university in the UK, exemplifies how different subjects can produce different profiles of percentage marks. Subjects in which it is possible to perform to a criterion (implicit or explicit) of 'correctness' tend to produce a wider range of scores than subjects in which knowledge and understanding are more likely to be contested. These data parallel earlier findings by Yorke et al. (1996) from five post-1992 universities in the UK that Mathematics and Statistics and also Computer Studies had the widest spreads of grades, whereas the spread of grades for Fine Art tended to be narrow.
Table 2.5 Illustrative statistical data from a selection of modules

Module title                                   Mean    SD      Max   Min   N     SE(M)
Information Analysis                           57.27   13.75    89    13   323   0.77
Computer Systems                               52.63   14.20    85     6   134   1.23
Microcomputing for European Business           66.58   23.07    97    13    33   4.02
Introduction to Statistics                     58.09   20.95    97     4   113   1.97
Quantitative Analysis for Business             56.88   21.87    97     9    48   3.16
Introduction to Business and its Environment   55.65   10.40    78    16   320   0.58
Introduction to Business                       52.92    8.30    72    22   151   0.68
Introduction to Sociology                      54.58   10.59    71     5   106   1.03
Britain's Environment and Society              59.36    9.18    73    33    67   1.12
Groups, Discrimination and Empowerment         53.21    4.16    60    40    29   0.77

Note: zero scores and missing scores were taken to indicate that the student did not attempt the assessment, and have been ignored.
The confidence that one can place in a mean mark is indicated by the 'standard error of the mean' (SE(M)), which connects the spreads of marks – represented by their standard deviations in Table 2.5 – to the number of gradings that have been taken into account. The lower the SE(M), the greater the confidence that can be placed in the mean mark.14

Table 2.5 also points to the variation in mean percentage for different modules. The earlier work by Yorke et al. (1996) across six post-1992 universities in the UK, echoing the work of Prather et al. (1979) in the US, and of others, showed that there was a tendency for marks or grades to be higher in some subjects than in others. Although there was considerable variation between the universities in the rank order of the mean score for eight subject areas, Sociology tended to have a high rank (i.e. to have a high mean grade) whereas Law ranked low in all six universities. Data were not collected on entry qualifications (which might have been expected to have an influence), but it is unlikely that the entry profile for Law (for which competition is strong) would have been lower than that for Sociology. Although the grading approach may be normative within the particular study units represented in Table 2.5 (and, by extension, particular subject areas), the normativeness becomes problematic as one moves across the subject spectrum.

14 There is a probability of just higher than 95 per cent that a band running from 2 SE(M) above the observed mean to 2 SE(M) below will contain the true mean (the observed mean refers to the limited number of observations that have actually been made, rather than to the universe of all possible observations, from which the true mean would in theory be computed). If the band is narrowed to +/–1 SE(M), then the probability that it contains the true mean drops to just above 68 per cent. The use of confidence intervals makes some statistical assumptions about the randomness of drawing the observed performances from the population of possible performances.
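The SE(M) column in Table 2.5 is obtained by dividing the standard deviation of the marks by the square root of the number of gradings. A short sketch, using two rows of the table, shows the calculation, together with the approximate 95 per cent band described in note 14.

import math

def se_of_mean(sd, n):
    # Standard error of the mean: the SD of the marks divided by the square root of N.
    return sd / math.sqrt(n)

# Two rows from Table 2.5
print(round(se_of_mean(13.75, 323), 2))   # Information Analysis -> 0.77
print(round(se_of_mean(10.40, 320), 2))   # Introduction to Business and its Environment -> 0.58

# Approximate 95 per cent band for the Information Analysis mean (57.27 +/- 2 SE(M))
mean, se = 57.27, se_of_mean(13.75, 323)
print(round(mean - 2 * se, 1), round(mean + 2 * se, 1))   # roughly 55.7 to 58.8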
Whereas variability in a student's profile of grades may be relatively unimportant within a subject area's boundary, once boundaries are crossed – as in combined programmes – it could exert an effect on a student's overall honours degree classification. Foreshadowed in Learning from audit (HEQC, 1994: 66), recent reports from the Quality Assurance Agency indicate some of the difficulties associated with classifying joint and combined honours degrees (QAA, 2006a: para. 20; QAA 2006b: paras 53ff).

Heterogeneity in subject grading profiles was found by Adelman (2004) in his analysis of grade profiles for high-enrolment courses for the class of 1992 (a sample is provided in Table 2.6). Adelman suggests that the data, which are based on a large number of institutions, ranging from 438 institutions (Organic Chemistry) to 1,105 (Introductory Accounting), prompt further research into grade patterning across higher education in the US.

Table 2.6 Percentages of A and B grades in a selection of high-enrolment courses

Subject                                   Percentage of A grades   Percentage of B grades
Introductory Accounting                   18.1                     27.6
Introductory and Intermediate Spanish     27.3                     29.5
Technical Writing                         34.0                     20.6
Microbiology                              19.1                     31.8
Calculus                                  21.8                     15.5
Ethics                                    39.1                     29.7
Organic Chemistry                         25.7                     34.5
US Government                             30.8                     29.5

Source: Adelman (2004: 82).

There is a mixed picture regarding student performances in subjects underpinned by mathematics and/or science. Data on degree classifications from the Higher Education Statistics Agency in the UK align quite well with those from grading at the study unit level: in these subjects, the distribution of grades is flatter (i.e. there are proportionately more grades at the extremes) than for other subjects. In the US there has been a suggestion that subjects involving mathematics or science produce lower grades than do the more discursive subjects (Johnson, 2003): however Kuh and Hu (1999) showed that self-reported grades for mathematics- and science-based subjects fell on average between those of other subject groupings, and Adelman (2004: 78) showed that grades awarded for majors at bachelor's level differed relatively little across the spectrum of subject areas.

Quality assurance

The evidence from HESA statistics at the 'macro' level of honours degree classifications (see Table 5.2), and from studies of marks awarded at subject level (noted earlier in this chapter), points towards variation in assessment practice across and even within disciplines. Most, but not all, institutions in the UK now have assess-
Grading and its limitations 59 ment regulations that apply across the range of disciplines that they cover, homogeneity of approach having been encouraged through the institutional auditing processes operated by the Quality Assurance Agency. For example, the existence in Goldsmiths College of two parallel approaches to assessment, one for Art and Design, and the other for unit-based programmes, attracted critical comment from the QAA (see QAA, 2002: paras 36–42) which was given a prominence in the educational press that was no doubt unwelcome to the institution. Reflecting the variability of assessment practice within some Australian universities (AVCC, 2002, and this volume, Chapter 3), Cameron and Tesoriero (2004) indicate that, in their School at the University of South Australia, there was no consistent position on marking approach: the vignette below, extracted from their paper, indicates something of the complexity and variation in grading practice. In 2003 a study of grade distributions within the courses in our two undergraduate programs was instituted to provide a basis for comparison. Inconsistency in grading has been voiced in student complaints in CEQ scores and other internal feedback mechanisms. We were aware that different philosophies amongst our staff underpinned numerical assessments – e.g. whether students start with 100 and then lose marks for what is wrong/done badly, or they begin with nothing and gain marks for what is right/done well. There is not a School position on this. As well, there are varying processes for monitoring and managing tutors’ marks/grades/levels and the School has no clear documentation of the different processes in place. Consequently we employed a research assistant to collate the grade profiles of all courses over a two year period (2001–2002) and to compare these. The report from this study indicated considerable variation in grade distribution from year to year within the same course and especially between courses. [. . .] [The] results indicated a need for closer monitoring of grades to limit the more marked variations on grade distribution. For example the grade profiles of some courses indicated 70% of students were awarded a credit or above, whereas other courses showed a rate of 30% or less. In two courses over 25% of students were awarded a Distinction (marks of 75% or above) whereas on several others this figure was between 2% and 6%. These variations appear to be related to both expected standards of students’ work and the internal monitoring of these with the tutorial staff involved – usually a team of sessional tutors. (Cameron and Tesoriero, 2004: 6) Lewis (2006), writing of Harvard University, remarks that he does not recall participating in a faculty meeting where an effort was made to coordinate grading practices. Grading was, in effect, a matter private to the academic concerned. The same can be drawn as an inference from Walvoord and Anderson’s (1998) suggestions of different ways of approaching the task of grading. The search for consistency across disciplines as regards grading raises complex
60 Grading and its limitations issues for quality assurance in institutions. Consistency goes some way towards satisfying the desires for comparability of academic standards and for equity in respect of performances on programmes that combine subjects. However, when one digs into what different subjects expect regarding student performances, commonality of grading scale and of descriptors of performance show themselves to be superficial characteristics. As was noted earlier in this chapter, an analysis of the first 22 ‘subject benchmark’ documents in the UK showed how variable these were, without having penetrated into the actual meanings given in different disciplinary areas to terms such as ‘problem solving’ (Yorke, 2002b). Standardization Standardization has often been suggested as a way of rendering comparable the variances of different sets of marks. However, the recommendation is at times rather airily given (e.g. Cox, 1994: 97; Newble and Cannon, 2000: 205). Cox, whilst pointing to the inappropriateness of simply adding together marks from different kinds of assignment, says The usual procedure is to convert each raw sub-assessment mark to a standardized score, suitable for averaging across the different assessment methods. (Cox, 1994: 97) It is doubtful whether this ever was, or is, a ‘usual’ procedure in higher education, save in the occasional specific circumstance. Some writers have amused themselves by constructing examples to demonstrate that a rank order of candidates can be completely overturned by manipulating (highly varied) raw marks so as to bring means and ranges into alignment. Bowley (1973) includes examples of such statistical playfulness, the last of which is based on the addition of ranks such that all 12 notional candidates end up with exactly the same total.15 In reality, however, the effects of standardization are much less dramatic. At first glance, standardization has its attractions. It smoothes out differences that may be attributable to local grading practices, and hence appears fairer. It also can illustrate in a single measure (albeit one whose meaning may not be easily understood by a lay person) where a student’s grade stands in relation to the grades of others who have taken the same assessment. However, second and subsequent glances reveal some problems. 1 Standardization ideally requires large numbers of candidates, since small numbers may be atypical of the student body in general. A group of students 15 See Bowley (1973: 111–121). These mathematical musings were reprinted from the November 1958 and January 1959 issues of The AMA [The Journal of the Incorporated Association of Assistant Masters in Secondary Schools]. One of the examples also appears in Richmond (1968: 181–183).
Grading and its limitations 61 may adventitiously be academically strong, in which case a student with a respectable grade may be given a standardized grade below the mean, which would reflect their standing relative to their peers but would do them an injustice by misrepresenting their actual level of achievement. This is particularly likely where criterion-referenced assessment is employed, because of the likelihood that grades will be skewed towards the upper end of the grading scale. A similar argument applies, mutatis mutandis, to academically weak cohorts. 2 Standardization loses transparency. In part, this is a consequence of (1) above, but the broader problem is faced by those, such as employers, who use grades in their selection processes. Even if unstandardized grades are troublesome as regards their meaning, those who draw on them have at least a ‘folk-knowledge’ of what they might mean from their own experiences in the education system, if not from elsewhere. For many, standardization would add a layer of mystification to assessment outcomes, a point made by Tognolini and Andrich (1995: 171) in respect of non-linear transformation of grades/marks in general. 3 Standardization makes failure more difficult to handle. Failure has to be determined with reference to the relevant assessment criteria. If the assessment is entirely norm-referenced, then failure is simply determined by the selected cut-off point. The difficulty noted in (1) above remains relevant. However, once criterion-referencing is brought into play, the pass/fail decision should rest on the extent to which the student has met the specified criteria. In other words, absolute, rather than relative, standards should be invoked. Standardization adds no value to the identification of failing performance. Standardization requires that performances can be graded numerically according to a scale to which statistical calculations can legitimately be applied. Setting aside the difficulties in manipulating percentage and other grades, some aspects of valued performance (for instance, in employability, the ability to demonstrate emotional intelligence when working with clients, to work collaboratively, and so on) may not sensibly map on to the grading scale used for other aspects of the curriculum. What, then, might be the solution? The achievements could be simply graded as pass or fail, but experience from the US shows that pass grades are ignored in computations of GPA (they do not fit the dominant grading paradigm). In the UK, institutions have found difficulty in giving more than token weightings to the outcomes of sandwich placement years in some programmes, though where professional practice is formally built into a programme (for example, in nursing, social work and teacher education) there has been less of a problem. Standardization, then, cannot be viewed as the cavalry coming over the hill to rescue a position that is all but lost. There is a tension between desires for homogeneity in grading and the heterogeneity of subject disciplines. Dialogue and debate may clarify the extent of diversity in subject disciplines, but may be unable to narrow this to the point
62 Grading and its limitations where consistency can properly be proclaimed. Quality assurance processes face a considerable challenge.
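By way of illustration of the ‘usual procedure’ quoted from Cox above, the minimal sketch below (Python, with invented marks and an invented standardize helper) converts two sets of raw component marks into standardized z-scores before averaging them, so that each component contributes on a common scale.

    from statistics import mean, stdev

    def standardize(marks):
        """Convert raw marks to z-scores (mean 0, standard deviation 1)."""
        m, s = mean(marks), stdev(marks)
        return [(x - m) / s for x in marks]

    # Invented marks for the same five students on two assessment components
    essay = [48, 55, 62, 70, 58]
    exam = [35, 60, 72, 80, 55]

    # Equal-weight average of the standardized scores, student by student
    combined = [(a + b) / 2 for a, b in zip(standardize(essay), standardize(exam))]
    print([round(c, 2) for c in combined])

As the numbered points above make clear, the standardized result locates each student relative to that particular cohort; it says nothing about absolute levels of achievement.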
The motivational value of grades The potential motivational value of grades was noted in Table 2.3. In their study of student behaviour at the University of Kansas, Becker et al. (1968) illustrated the power of grades to influence students’ behaviour, even in aspects of college life not immediately connected to academic work, such as fraternities and sororities. The influence of grading on behaviour is widely acknowledged in the literature (e.g. Snyder, 1971; Ramsden, 1992; Laurillard 1997; Rust, 2002) and is implicit in accounts such as those of Newble and Jaeger (1983) and Newstead (2002) that point to the instrumentality of student engagement in higher education. Although academics might downplay the importance of grades, students often through their behaviour testify to their motivating power. With grades being important motivators for students in higher education, ‘playing the game’ to best advantage is a consequence. ‘Cue-seeking’ (Miller and Parlett, 1974) is one relevant way of behaving, as one student, responding to an investigation by Laurillard (1997), seems to have done: I decided since X was setting the question, block diagrams were needed. (Laurillard, 1997: 132) However, as Maclellan (2001) found in her survey of students and staff at one institution, the students’ perceptions of assessment may be rather different from those of the staff, so the cues that students identify may not be what staff think they ought to be identifying. Maclellan’s study echoed that of Pollio and Beck (2000), who found that a sample of psychology students was more oriented towards getting grades than they would have preferred, and that this related to their perceptions of the focus of their professors’ attention. However, their professors were found to espouse a much more learning-oriented perspective than their students had picked up. It is unclear whether the discrepancy might be due to misperception by the students or to the professors’ espoused theory relating to the development of student achievement not being borne out in practice. Covington (1997), reviewing a number of earlier studies, notes that, when competitive-based grades are used as motivators for students, task engagement and performance may be adversely affected, especially amongst students at risk of failing. The ‘at risk’ students construe their poor performance in terms of a basic inability to cope with academic demand. If higher grades are awarded, these are often not credible to such students because they do not expect to be successful, or the grades are attributed to extrinsic factors such as luck. The work of Dweck (1999), Pintrich (2000) and Elliot (2005), amongst others who have studied the contrasting effects of performance and learning goals,16 aligns with Covington’s
Grading and its limitations 63 analysis in that students who are concerned with performing (i.e. ‘getting the grade’) rather than learning are more likely to find their self-esteem threatened by failure.17 Lest the reader be tempted to tut-tut about the inappropriateness of students’ instrumentalism and short-termism, a change of focus to the behaviour of institutions faced with funding opportunities, research assessments and ranking tables might engender an enhanced appreciation of students’ responses to the pressures that they perceive to be bearing on them. It is often asserted that, in addition to grades having motivation value for students, if there were no grades students would be less likely to exert themselves in their studies. Many students arrive in higher education directly from school or college systems in which grading is the norm, and so – the argument goes – this is the approach to assessment which they expect on arrival. Hence the system of grading reproduces itself. The prevalence of grading in higher education means that evidence for the motivational value of grading is hard to find since there is a relative dearth of studies comparing non-graded with graded higher education. Deci et al. (1999), in a substantial meta-analysis of studies of reward-contingent behaviour that covered school pupils and college students,18 found inter alia the following: • • •
• rewards undermined free-choice intrinsic motivation and self-reported interest;
• positive feedback enhanced free-choice behaviour and self-reported interest;
• verbal rewards were less enhancing for children than for college students.
When people who were in a performance-contingent rewards group obtained less than the maximum rewards, their free-choice intrinsic motivation was undermined to a greater extent than that of people in any other reward-contingent group. In a study of Jewish Israeli children, Butler (1987) found that the giving of marks led children to compare themselves with others (exhibiting ego-involvement), whereas those who received only comments were more stimulated to improve (exhibiting task-involvement). The latter group out-performed the former. There is a striking similarity with the work of Dweck and co-workers whose work with schoolchildren on performance goals and learning goals showed the latter to have greater educational benefit (see Dweck, 1999). There are hints in these findings that the assessment regime adopted for a
16 Some – e.g. Harackiewicz et al. (1998) – use the term ‘mastery goals’ as an alternative. 17 Pintrich (2000) was able to show empirically that there were circumstances under which learning goals and performance goals gave rise to no significant difference in student performance, provided that the performance goals were construed in terms of achieving mastery rather than avoiding being shown up as inadequate. 18 Their review sparked off a sharp argument in the subsequent pages of the same issue of the Psychological Bulletin.
64 Grading and its limitations programme may be more supportive of long-term achievement if it downplays the importance of grading and uses feedback to encourage learning. Alverno College in the US is well known for its avoidance of grading and its commitment to the provision of feedback designed to assist students in their learning. Mentkowski and Associates (2000) have brought together a number of studies of student learning at the college, amongst which can be found comment to the effect that some students found it easy to take to the use of assessment criteria and feedback, whereas others took a relatively long time to appreciate the virtues of the non-graded assessment regime being used. A few students were recorded as going through their programmes of study wishing that their work had been graded, in the belief that grading would have informed them more clearly of the extent to which they were succeeding in their studies. Yet the reliance on a grade awarded by an authority implicitly encourages dependence and discourages the metacognitive internalization of standards and criteria. The absence of grades was not just a feature of the relationship between the student and Alverno College. One older student, Wanda, is recorded as noting the external expectation that grades would be awarded: My mother would still ask me, ‘Did you get your report card? What was your grade point?’ My boss says the same thing. (Mentkowski and Associates, 2000: 90) In discussing grading, the expectations of, and implications for, those outside higher education cannot be left out of consideration. The Student Assessment and Classification Working Group (SACWG) in the UK took a different tack in examining the possible motivation value of grades. Dave Scurry, a member of SACWG and Dean of Undergraduate Modular Programmes at Oxford Brookes University, suggested that one way to investigate the issue would be to set the difference between the students’ mean percentages for the final and penultimate years against those of the penultimate year. The ‘Scurry hypothesis’ was that students whose mean performance level fell just below an honours degree classification borderline at the end of their penultimate year would make a special effort to gain the marks that would enable them to move to the higher classification band. If the students did this, the effect might be sufficiently evident to suggest that the problems associated with ‘gain scores’ (see e.g. Lord, 1963) were being to some extent overridden. SACWG was able to test the Scurry hypothesis by drawing on data from a new university that covered performances from 791 students who had been awarded honours degrees. The scattergram of performances is presented in Figure 2.1. If the Scurry hypothesis were true, it would be expected that those students whose mean marks fell just below a classification boundary (and who would know this before entering their final year) would tend to show greater gains than others for whom the chances of improving their classification would be smaller. The mean performances of the students in their penultimate year were ‘sliced’ into bands that were two percentage points wide, and the mean increase or decrease
[Figure 2.1 Scattergram of gain/loss in percentage mark, for 791 students in a new university in the UK. Vertical axis: gain (up to +20) or loss (down to –15) in percentage marks; horizontal axis: penultimate year mean, from 40 to 80 per cent; plotted markers show the position of the gain mean for each 2% band up to 74%.]
in the students’ means was calculated for each of these bands. These ‘means of means’ provide no evidence for the Scurry hypothesis, as is readily apparent from Figure 2.1. Despite the considerable scatter, the Figure shows that a majority of students made gains, as judged by their marks in the two years, but that this tendency became weaker the higher the mark in the penultimate year. Indeed, the means for two of the bands at the upper end of the performances in the penultimate year were lower in the final year. The data are only suggestive that the Scurry hypothesis should be rejected. The data subsume a variety of single-subject and combined programmes, and it is possible that the hypothesis could be valid for sub-groups of student performances. The hypothesis implicitly treats the students as rational actors whose behaviour is influenced by their calculation of the odds in respect of improving their position, which may apply for some but not others – and some students may be rational actors on a broader front, satisficing in respect of their studies because of competing demands on their attention, as McInnis (2001) suggested. Further, rational action – if present – may not be limited to those who perceive a reasonable chance of improving their position. Some students whose results are just above a classification boundary at the end of their penultimate year may act ‘defensively’ to safeguard their position. The data may well be affected by ‘regression towards the mean’ – i.e. those who begin with high scores may not always maintain them, and those with low scores may increase them. This is a well-known problem with ‘gain scores’. Figure 2.1
66 Grading and its limitations shows a negative correlation between mean mark for the penultimate year and gain/loss score, which is suggestive that such regression may have played a part. The potential motivation value of grades is likely to be influenced by the kinds of goal that a student holds in respect of their time in higher education. As noted earlier, the work of Dweck (1999), Pintrich (2000) and Elliot (2005) testifies to the significance for student performance of ‘learning goals’ and ‘performance goals’. For present purposes it is sufficient merely to note that learning goals are primarily adopted by students who see learning as their main reason for engaging in higher education, whereas performance goals are primarily adopted by those who value grades and their social implications (such as being believed to be clever, or not appearing to be weak intellectually).19 One can act rationally in respect of either kind of goal, and work by Harackiewicz et al. (1998) indicates that the context of the action is important, and that the adoption of learning goals does not necessarily lead to superior outcomes to the adoption of performance goals. Particularly if the assessment approach is norm-referenced, an emphasis on performance goals may offer the prospect of a better pay-off. The Scurry hypothesis – if valid – might gain stronger support in respect of those whose motivation lies primarily in the gaining of grades. Human capital approaches to the relationship between higher education and national economies tend to press students towards an instrumental approach to their time in higher education. ‘Getting the grades’ may, for some rational actors, overshadow the desirability of learning. Possible consequences are an aversion to taking risks with learning and hence ‘playing safe’, and focusing narrowly on the expected learning outcomes for modules at the expense of the broader studying traditionally expected in higher education. Claims for the premium deriving from the acquisition of a degree focus attention on the benefits and costs of higher education, though the benefit/cost ratios of today may not be sustained in the longer term, as international competition increases (Sennett, 2006).
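The banding calculation used to examine the Scurry hypothesis can be sketched as follows: a minimal Python illustration with invented pairs of penultimate-year and final-year means, not the SACWG data set.

    from collections import defaultdict
    from statistics import mean

    # Invented (penultimate year mean, final year mean) pairs, for illustration only
    students = [(58.4, 61.2), (59.1, 58.7), (62.3, 64.0), (67.8, 66.5), (55.0, 59.3)]

    gains_by_band = defaultdict(list)
    for penultimate, final in students:
        band = int(penultimate // 2) * 2  # 2-percentage-point band, e.g. 58 to 59.9
        gains_by_band[band].append(final - penultimate)

    for band in sorted(gains_by_band):
        print(f"{band} to {band + 1.9}: mean gain {mean(gains_by_band[band]):+.1f}")

Under the hypothesis, bands ending just below a classification boundary (58 to 59.9, for example) would be expected to show larger mean gains than their neighbours.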
Fuzzier than acknowledged? Grading is a complex process which is subject to a number of influences, including the curricular sample being assessed, the assessment method, the nature of the assessee’s response, the assessor(s), the time available for the assessment, and so on. Broadfoot (2002) also touches on the complexity of the assessment process when she writes that Any kind of data on student attainment . . . is the product of the interaction of people, time and place, with all this implies in terms of a complex web of understandings, motivations, anxieties, expectations, traditions and choices. (Broadfoot, 2002: 157) 19 Performance goals have been divided into ‘approach’ and ‘avoidance’ varieties, with the former concerned with excelling and the latter with avoiding being shown to disadvantage (see inter alia Pintrich, 2000, and Elliot, 2005.).
Grading and its limitations 67 Anyone who has been faced with assessing a pile of student assignments or examination papers, especially when deadlines are tight, knows how challenging it can be to give full consideration to what is in front of them unless assessment of the material is straightforward in character. The (perhaps controversial and unpalatable) consequence of the various pressures on assessment is that the quality of the assessment often meets the criterion of ‘good enough’, but not that of ‘ideal’. H.A. Simon (1957) referred to this as ‘satisficing’ – being prepared to accept something as being satisfactory under the prevailing circumstances even though it was not the best outcome that could possibly be achieved. The unrelated bridge player S.J. Simon (1945: 90) had previously said much the same thing when advising players to aim for the best practical result in the circumstances (bearing in mind one’s partner, the opposition, the state of the game, and so on) rather than the perfect theoretical play: ‘The best result possible. Not the best possible result.’ The criterion of ‘good enough’ implies a greater amount of fuzziness in the assessment process than many of the procedures that operate on grading outcomes acknowledge. The implications of fuzziness in assessment are discussed in Chapters 8 and 9.
Chapter 3
Variations in assessment regulations Three case studies
Orientation In this chapter, the potential for variation in assessment outcomes is examined from the perspective of overall performance at bachelor’s degree level in three contrasting countries – the US, the UK and Australia. This examination scratches beneath the surface of apparent similarity within the approaches to assessment in the three countries in order to reveal some of the variation that underlies them – a variation that is probably unappreciated by many users of grade-point averages and honours degree classifications. Consistent with the ‘macro’ perspective taken in this book regarding assessment, the variability that is inherent in the actual grading of assignments and examinations is not explored.
Grading in the US Grading practice in the US is more complex than many may appreciate. Numerical grades (normally percentages) are converted into the well known five-letter (A B C D F) scale or its variants. Percentage ranges are generally much higher than are typical of the UK, where the minimum passing grade is normally 40 per cent. Walvoord and Anderson (1998), whose work on grading is well regarded in the US, suggest one model (amongst others) for assessment in which up to 40 points can be awarded for test performance; 30 for a field project; 20 for the final exam; and 10 for class participation. The suggested conversion of the total points into letter grades is as follows: 92–100 = A; 85–91 = B; 76–84 = C; 69–75 = D; 68 and below = F. These kinds of figure are not untypical of grade-point conversions in the US. However, Ebel and Frisbie (1991) argue that such assignment of marks to grades needs to take the circumstances of the module into account. An ‘A’ might be merited for a mark of, say, over 80 on a relatively difficult module – but the cut-off points for grades are a matter of professional judgement. Although the five-letter scale is widely known, there are many other grades
Variations in assessment regulations 69 representing different kinds of achievement or non-achievement. The University at Buffalo (part of the State University of New York) uses the five-letter grading system, together with + and – affixes. The other grades used are indicated in Table 3.1. There are variations in grading schemata, with not all letters easily conveying a meaning. Clifford Adelman, formerly of the US Department of Education, tells of being puzzled by a grading of Z until he was informed that it merely meant ‘zapped’ or ‘zeed out’ (Adelman, forthcoming). The most recent study of institutional grading practice in the US was conducted by Brumfield (2004) for the American Association of Collegiate Registrars and Admissions Officers (AACRAO). A survey was carried out via e-mail and the World Wide Web, attracting 417 usable responses from some 2,400 higher education institutions. The previous survey, conducted in 1992 (Riley et al., 1994) using a paper-based questionnaire, achieved a far higher response rate of 1,601 usable responses.1 Although the response rate for the 2004 survey is disappointingly low, the data nevertheless probably give a reasonable indication of the extent to which the various practices are used. The data reported below are derived from the AACRAO study, but exclude institutions that are solely concerned with professional education or graduate
Table 3.1 Grades outside A to F which can appear on students’ records at the University at Buffalo Grade (grade) H I/default grade J N P R S U W *** @ #D+ #D #F
Interpretation Failure for Reason of Academic Dishonesty Grade points for the grade indicated prior to the H Honors Incomplete* Reporting error (temporary grade) No Credit-Official Audit (arranged at time of registration) Pass Resigned Officially Satisfactory Unsatisfactory Administrative Withdrawal No Credit/No Points Course Repeated for Average Fresh Start Program-Credit Hours Not Counted Fresh Start Program-Credit Hours Not Counted Fresh Start Program-Credit Hours Not Counted
Adapted from http://undergrad-catalog.buffalo.edu/policies/grading/explanation.shtml, accessed 2 August 2006. * There is a set of rules which relate to the rectification of a grading of ‘incomplete’ (see the cited URL). 1 The AACRAO surveys take place roughly every ten years.
70 Variations in assessment regulations programmes. Hence there are some minor differences between the data reported here and the summary data made available on the World Wide Web.2 Grading approach The vast majority of institutions use either a letter scale (A B C D F) only or inflect the letters by using + and – affixes (Table 3.2). The conversion of letter grades to grade-points is A = 4, B = 3, C = 2, D = 1, with + and – affixes adding and subtracting, respectively, approximately one third of a grade-point. Some institutions use one decimal place, so rounding gives for example B+ = 3.3 and A– = 3.7, whereas others use two decimal places, giving 3.33 and 3.67, respectively. At least two institutions (the University of Dayton and Indiana University – Purdue University Indiana (IUPUI)) work with the apparent precision of four decimal places, the latter rounding to three in calculating grade-point averages. Overall, the trend since 1992 has been towards affixing + and – to letter grades. Narrative reporting of achievement is employed by a tiny minority of institutions. Table 3.2, however, conceals the variation between two-year and four-year institutions. Nearly three quarters of two-year institutions use a simple letter system whereas only around a quarter of four-year institutions do. The vast majority of institutions cap their grade-point range at 4.0 (in other words, A and A+ each count as 4.0 grade points), with only a handful allocating more to an A+ (usually 4.3). In a small minority of institutions, grading practice was reported as varying between their component undergraduate sections – a finding that is consistent with the survey of 1992. Assessment on a pass/fail (or equivalent) basis Around two thirds of institutions (235 out of a responding total of 360) allowed students to have their performance on a module assessed on a pass/fail basis. Institutions vary in the extent to which they permit this option to be exercised. The trend since 1992 has been to increase the use of student option in respect of grading. A student could aim merely to pass on a pass/fail assessment (tactically astute if the module is a difficult one) whilst concentrating on getting high grades in other modules. Table 3.2 Approaches to the reporting of student achievement
Grading approach                               Number of institutions
Letter only                                    146
Letter with + or –                             11
Letter with + and –                            191
Narrative only                                 2
Narrative in addition to letter or numeric     2
Numeric only                                   3
2 At www.aacrao.org/pro_development/surveys/Grades_and_Grading_Practices_Report_2004. pdf (accessed 18 May 2006).
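Returning to the conversion described under ‘Grading approach’ above, the arithmetic is sketched minimally below (Python, with an invented transcript): letter grades are mapped to grade-points, with + and – affixes worth roughly one third of a point and the scale capped at 4.0, and the points are then weighted by credit hours.

    # Two-decimal variant of the conventional letter-to-grade-point conversion
    GRADE_POINTS = {
        "A+": 4.00, "A": 4.00, "A-": 3.67,
        "B+": 3.33, "B": 3.00, "B-": 2.67,
        "C+": 2.33, "C": 2.00, "C-": 1.67,
        "D+": 1.33, "D": 1.00, "D-": 0.67,
        "F": 0.00,
    }

    def gpa(transcript):
        """Credit-weighted grade-point average from (letter grade, credit hours) pairs."""
        total_points = sum(GRADE_POINTS[grade] * credits for grade, credits in transcript)
        total_credits = sum(credits for _, credits in transcript)
        return total_points / total_credits

    print(round(gpa([("A-", 3), ("B+", 4), ("C", 3), ("A", 2)]), 2))  # invented transcript

A module assessed on a pass/fail basis would simply be left out of both sums, a point taken up below in relation to the grades that are excluded from the GPA.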
Variations in assessment regulations 71 Institutions vary in the extent to which they permit students to exercise the pass/fail assessment option. The trend since 1982 has been in the direction of liberalism regarding the availability of this option. Tables 3.3 and 3.4 indicate the extent of institutional variability in respect of the shorter and longer terms. A majority of reporting institutions (147 out of 243) limited the option of assessment by pass/fail to elective modules only. Again, the trend since 1992 has been towards liberalization. When the pass/fail option is invoked, work may nevertheless be graded before the determination of whether the student has passed or failed. There is variation between institutions regarding what they count as the minimum passing grade (Table 3.5). Since 1982 the use of D– as the minimum passing grade has risen from about 1 in 14 institutions to roughly 1 in 4. Institutions generally record a failing grade on a student’s academic record, but there is a near-even split as to whether the fail grade is incorporated into the grade-point average. As part of a much larger study of student achievement based on cohorts starting in 1972, 1982 and 1992, Adelman (2004) compared results from cohorts starting in 1982 and 1992, and showed that there was an increase over time in the proportion of ‘pass’ grades and ‘withdrawn/no credit’ outcomes. These increases may have contributed to apparent rising levels of GPA because the (possibly or actually) weaker outcomes were not drawn into the GPA computations. The point being made here is that the GPA is influenced by not only those grades that are included, but also those that are excluded.
Table 3.3 Number of modules that can be assessed on a pass/fail basis for each term/quarter/semester
Number of modules per term/quarter/semester    Number of institutions
1                                              102
2                                              22
3                                              2
4                                              4
No limit                                       101

Table 3.4 Number of modules that can be assessed on a pass/fail basis during a student's academic career
Number of modules during an academic career    Number of institutions
1 to 3                                         40
4 to 6                                         85
7 to 9                                         31
10 and up to a stated limit                    11
Unlimited                                      69

Table 3.5 The lowest passing letter grade
Minimum passing grade      Number of institutions
C or above                 56
C–                         50
D                          55
D–                         64
Instructor determines      23
Note: Presumably D+ was not a plausible option to offer in the survey.
72 Variations in assessment regulations Opportunity to repeat the module in order to raise the grade Students generally have the opportunity to retake a module in order to raise their grade. Some institutions however limit this to the lower grades of pass (Table 3.6). Institutional practice varies regarding the number of times a module can be repeated (Table 3.7). There has been a slight shift since 1992 towards limiting the number of repeats. Repeating modules has implications for a student’s GPA, since institutions may include the latest grade only (183 out of 357 responding institutions), the highest grade only (91) or more than one grade (52).3 These data are consistent with previous surveys. Disregard of past grades In 2004, close to one half of the responding institutions possess a policy under which past grades can be disregarded after a period of time – the somewhat pejorative term ‘academic bankruptcy’ and the more generous term ‘forgiveness’ are variously used in this connection. The proportion of institutions possessing such a policy has grown from roughly a quarter in 1992. Incorporation into the GPA of grades from other institutions In a credit-oriented academic culture such as that of the US, it is perhaps surprising to find that in computing GPAs only around one sixth of institutions (60 out of the 354 that responded) include grades from institutions from which the student has transferred. Table 3.6 Institutional rules regarding the retaking of modules
Repeat not permitted                       11
Only failed modules                        18
If lower than D at first attempt           87
If lower than C at first attempt           39
Only modules required for the major        1
Except modules required for the major      0
Any module                                 204
Table 3.7 The number of times a single module can be repeated
Number of repeats permitted    Number of institutions
1                              61
2                              49
3                              21
More than 3                    21
Unlimited                      201
3 In the AACRAO report, the table from which this figure was drawn referred to ‘both’ grades, implying only two attempts at the module. Table 3.7 however indicates that some institutions allow more than two attempts. There may have been a minor flaw in the design of the survey.
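To illustrate how the repeat policies reported above can bear on the GPA, the sketch below (Python, invented grade-points and an invented counted_points helper) shows what would enter the calculation for a repeated module under each of the three policies in the survey; treating ‘more than one grade’ as an average of all attempts is an assumption made purely for illustration.

    from statistics import mean

    def counted_points(attempts, policy):
        """Grade-points entering the GPA for one repeated module.

        attempts: grade-points for successive attempts, in the order taken.
        policy: 'latest', 'highest' or 'all' ('all' is read here as averaging
        every attempt, which is only one possible reading of the survey category).
        """
        if policy == "latest":
            return attempts[-1]
        if policy == "highest":
            return max(attempts)
        return mean(attempts)

    attempts = [0.0, 3.0, 2.0]  # invented: F at the first attempt, then B, then C
    for policy in ("latest", "highest", "all"):
        print(policy, round(counted_points(attempts, policy), 2))

With these invented figures the three policies yield 2.0, 3.0 and 1.67 grade-points respectively for the same module.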
Variations in assessment regulations 73 Graduation with honors A large majority of responding institutions (326 out of 357) allow students to graduate with honors: a majority of this order has obtained for at least 20 years. The modal ranges of GPA for the three grades of honors are • • •
• cum laude: 3.50 to 3.59;
• magna cum laude: 3.70 to 3.79;
• summa cum laude: 3.90 and above.
However, the GPA needed to graduate with honors at the different levels varies between institutions, and there is in practice considerable overlap between the distributions of the ranges of the three categories of honors. For example, the lowest reported GPA for the award of summa cum laude is within the range 3.25 to 3.49 whereas three institutions are reported as requiring at least 3.90 for the award of cum laude. Of 338 responding institutions, only 69 included grades awarded by other institutions when determining a student’s eligibility for honors, with a further 36 including them provided that they met certain criteria. The proportions are consistent with the previous survey in 1992. What lies beneath the GPA? What is left almost entirely out of consideration in the literature is the way in which the numerical grade (percentage) is reached. Walvoord and Anderson (1998), in offering teachers a variety of models for the assessment of a module, imply that much is up to the teacher(s) concerned. In his study of grade inflation, Johnson (2003) suggested that a variety of considerations might influence actual marking, including subject difficulty; subject discipline tradition regarding grading (as is also apparent in the UK – see Chapter 5); and the teacher’s desire to receive good student evaluations (especially where tenure might be a consideration). In order to gain an appreciation of the subtleties of grading, it would be necessary to delve into the specifics of the grading of assignments, in-class quizzes, tests and examinations – an undertaking that is potentially enormous if broadly generalizable results were to be sought. Students’ selection of modules can influence GPA. Students may choose an easier module in order to enable them to concentrate on modules in which it is less easy to obtain high grades.
Assessment regulations in UK higher education Those whose work brings them into detailed contact with institutional assessment regulations are aware of the variation that exists between institutions in this aspect of their work. Work by the Student Assessment and Classification Working Group (SACWG), an informal group of academics and administrators that
74 Variations in assessment regulations has been studying assessment in the UK since 1994, showed a decade ago that a student’s honours classification at the level of the bachelor’s degree would be influenced by the rules adopted by an institution regarding classification (the ‘classification algorithm’): in an early study Woolf and Turner (1997) suggested that perhaps 15 per cent of honours degree classifications could be affected. A number of subsequent modelling studies using real data sets have demonstrated in various ways the potential variability inherent in the outcomes of different institutional honours classification algorithms – i.e. their methods of determining classifications (Simonite, 2000; Yorke et al., 2002, 2004; Heritage et al., 2007). Whereas the potential influence of the classification algorithm has been identified and described, the potential inherent in assessment regulations to influence award outcomes has attracted less attention. A survey conducted for the Northern Universities Consortium for Credit Accumulation and Transfer (NUCCAT) showed that there were some marked inter-institutional variations in the ways in which honours degree classifications were determined (Armstrong et al., 1998). A subsequent survey (Johnson, 2004) showed that there had been some convergence in practice, but that some of the variation noted by Armstrong et al. remained. Johnson’s (2004) study covered a wide range of issues relating to academic credit, and hence gave matters of relevance to this book only a rather cursory examination. The examination by the Burgess Group of the robustness of the UK honours degree classification provided an opportunity for SACWG (mainly in the persons of Harvey Woolf, formerly of the University of Wolverhampton, and Marie Stowell of the University of Worcester) to explore, in greater detail than Johnson did, the variation that existed between institutions as regards their assessment regulations in respect of programmes at bachelor’s degree level.4 An aspect of assessment regulations that was not examined, however, was the variation deriving from the requirements of a range of professional and statutory regulatory bodies, in which pass marks and the retrieval of failure, to give two examples, may be treated differently from the way that they are treated in the general run of institutional assessment regulations. An opportunity sample of 35 varied institutions from across the UK provided details of their assessment regulations. These were analysed with reference to three main themes: • • •
• the classification algorithms;
• the rules used for compensation or condonement of performances that fall narrowly below the level appropriate to a pass;
• regulations regarding the resitting of assessments, the retaking of modules and ‘personal mitigating circumstances’.
4 This study was funded by the Higher Education Academy, whose support is gratefully acknowledged. The findings are reproduced with permission. I am also grateful to Harvey Woolf and Marie Stowell for their willingness for me to use this material.
Variations in assessment regulations 75 Interpretations were, where possible, confirmed with the respective institutions. Some assessment regulation documents were found to be complex and difficult to interpret, and some were written in relatively generic terms that appeared to rely on experiential and tacit knowledge for interpretation and action. Classification algorithms The majority of institutions had institution-wide approaches to the determination of honours degree classifications, in part because of pressure from the Quality Assurance Agency for Higher Education (QAA) for this (and perhaps as a consequence of occasional high-profile comment in the educational press when diversity of approach has been criticized in QAA audits). However, some institutions permitted their organizational units to choose between methodologies for determining classifications. Percentages were generally preferred to other forms of grading. In credit-rated programmes, the expectation was that the student would have gained 360 credits for the award of an honours degree, of which 240 would normally need to be earned at a level consistent with the final two years of fulltime study (120 credits in each).5 An unclassified degree could be awarded when the total number of honours-level credits gained falls a little short of 240. Honours degree classifications were normally based on student performances in the final 240 credits gained (corresponding to the penultimate and final years of full-time study, or their part-time equivalents). The weighting given to results from the penultimate and final years varied from 1:1 to 1:4, and in some institutions varied between organizational units. A high weighting towards the final year’s performance favours what is often termed the student’s ‘exit velocity’: a few institutions have chosen to base classifications on the final year performance alone. Classification algorithms were typically based on the aggregation or averaging of percentage marks in which the following classification bands were used: • • • •
• first class: 70 per cent and above;
• upper second class (or 2.1): 60 to 69.9 per cent;
• lower second class (or 2.2): 50 to 59.9 per cent;
• third class: 40 to 49.9 per cent.
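A minimal sketch of the averaging-and-banding approach just described is given below (Python, with invented marks and an invented classify helper). The 1:3 weighting of penultimate to final year is an illustrative choice (the survey found ratios from 1:1 to 1:4), the band boundaries follow the typical pattern listed above, and actual algorithms add rounding, borderline and profiling rules of the kinds discussed next.

    def classify(penultimate_mean, final_mean, w_penultimate=1, w_final=3):
        """Weighted mean of the two years' percentages, mapped to an honours band."""
        overall = (w_penultimate * penultimate_mean + w_final * final_mean) \
                  / (w_penultimate + w_final)
        if overall >= 70:
            return overall, "first class"
        if overall >= 60:
            return overall, "upper second class (2.1)"
        if overall >= 50:
            return overall, "lower second class (2.2)"
        if overall >= 40:
            return overall, "third class"
        return overall, "below the honours bands"

    print(classify(58.0, 63.5))  # invented marks: (62.125, 'upper second class (2.1)')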
Unclassified degrees could be awarded for performances whose average is a little below 40 per cent but for which a sufficient volume of credit has been gained. In some institutions, ‘rounding upwards’ of average percentages means that the category boundary was, de facto, 0.5 of a percentage point below the formal classification boundary. Most institutions determined ‘borderline’ categories below the boundaries of each classification band. The norm was for the borderline to be set at 2 percentage points below the boundary, but the borderlines in the institutions sampled ranged 5 For part-time programmes, the durations have to be adjusted appropriately.
76 Variations in assessment regulations from 0.5 to 4 percentage points below. Atypically, one institution calculated borderlines individually for each programme each year, and in another, the borderline extended both below and above the threshold mark for the class. In some institutions the borderline was either discretionary or variable. When the student’s performance fell just below a classification boundary (having taken any automatic rounding into account), a second part of the algorithm was often invoked, in which the profile of a student’s performances was subjected to ‘mapping rules’ which decided the issue. For example, such a student could be required to have a majority of module marks in the higher classification band (in some institutions, with a further stipulation about the lowest module mark) if they were to be moved up to the next classification band. Where an institution used grades other than percentages, the classification algorithm was normally based on a profiling approach. Roughly half of the institutions permitted some ‘discounting’ of modules in the classification process: in other words, the classification was based on ‘the best X from Y modules’ although the student was expected to have gained the full tariff of credits required for the honours degree. Institutions varied on whether they permitted students to take more modules than the minimum required for the award. Compensation and condonement ‘Compensation’ refers to the offsetting of weaker module performances against stronger ones, so that the unevenness in performance is, in effect, flattened out in the averaging process.6 Probably the most significant compensation occurs when a marginal fail is offset by a pass at above threshold level elsewhere. ‘Condonement’ refers to the practice of disregarding marginal failure in the context of a student’s overall performance, but there is no attempt to average out the marks. These terms are not always clearly differentiated in institutional assessment regulations. Most institutions offered some form of dispensation in respect of failed modules but some – mainly in the new universities and colleges sector – offered neither compensation nor condonement. In the determination of the honours classification, some institutions included the original, failed, grade (though, where the algorithm allowed the weakest performance(s) to be dropped from the calculation, the fail grade(s) could be eliminated). Two of the responding institutions reduced a student’s classification by one level if condonement had been applied. Institutions varied in the number of credits for which compensation or condonement could be considered, and whether such action was permitted in respect of the final year of full-time study (or its equivalent). In some institutions there were tightly specified rules for determining whether compensation and condonement could be considered, whereas in others the principle relating to the dispensation of failure was expressed in general terms (thereby leaving it to assessment
6 Compensation can be applied both within and between modules.
Variations in assessment regulations 77 boards to decide whether to invoke compensation or condonement, or possibly to determine the outcome through the build-up of case law). Resitting, retaking and ‘personal mitigating circumstances’ Failure may be retrieved by resitting some or all of the formal assessment for a module, or by retaking the whole module (with attendance). Almost all institutions permitted students to retrieve failure in a module by resitting the failed assessments, and in most cases the grade or mark achieved by the student in a successful resit was ‘capped’ at the minimum pass grade. There were however a small number of institutions in which ‘capping’ was not imposed. Where a student retook the whole module, the use of ‘capping’ was more varied. Most institutions did not permit students to resit module assessments or to retake modules already passed in order to improve grades, although a small number did permit this if the students were able to produce convincing evidence that their performance had been adversely affected by unforeseen events such as personal illness or misfortune (in the language of assessment regulations, these mishaps are termed ‘personal mitigating circumstances’). Institutions varied in the way that they treated students who claimed successfully that there were mitigating circumstances in respect of their underperformance. Some had discretionary powers to raise a student’s classification, whereas others simply offered students a further opportunity to take failed assessments as if for the first time (i.e. without the penalty of ‘capping’). Some other influences The assessment regulations provided by the 35 institutions contained some other aspects of the assessment process that are likely to have some impact on the honours degree classification, amongst them being: • • • • •
• the rules for rounding of marks at the various stages of the assessment process;
• penalties that can be levied for late submission or non-submission of assignments, for poor attendance and for academic misconduct;
• the weighting and/or prominence given to final year dissertations or projects;
• whether grades obtained from other institutions, or from the assessment of prior learning (APL), are included;
• the possible use of oral (viva voce) examinations in the final examination process (though this is nowadays used exceptionally).
Residual uncertainties Other than some general perspective on the standards attained by the student, there is no clear collective view of what the honours degree classification represents.
78 Variations in assessment regulations One might ask whether it is intended to indicate the student’s ‘best’ performance or some conception of ‘average’ performance. Moreover, it is by no means clear how or why these differing approaches to the regulations for classification have come to be as they are.
The classification of honours degrees in Australia In Australia, three-year programmes lead to a non-classified degree at bachelor’s level, with the fourth year being the ‘honours year’. This is akin to the structure for bachelor’s degrees in Scotland, but not to that generally in use in the rest of the UK. The classification system is superficially very similar to that used in the UK, but in practice exhibits some marked differences. Thirty-three out of 38 universities provided data for a survey which was conducted in 2002 for the Australian Vice-Chancellors’ Committee (AVCC). The overwhelming majority offered honours programmes for which the classification categories were similar to those used in the UK (first class; upper second class; lower second class; third class; fail), though the terminology used for the second class degrees varied. Around one third of the responding universities delegated classification to faculty level, and hence there was no university-wide banding of classifications. A small minority of universities used grade-point average systems which are different from that used in the US (one example is given below), but grade-points were converted into classifications in broadly the same way that percentages were. The percentage bands for classifications varied between Australian universities, as shown in Table 3.8. The percentage bands fell roughly midway between those used in the US (which tend to be higher) and the UK (where a first class degree is typically awarded for a mean percentage of 70 and fail marks fall below 40). It is evident from Table 3.8 that three universities indicated a particularly nar-
Table 3.8 Degree classification in some Australian universities
Number of universities    First           Upper second    Lower second    Third         Fail
4*                        85 and above    75 to 84.9      65 to 74.9      50 to 64.9    Below 50
8*                        80 and above    70 to 79.9      60 to 69.9      50 to 59.9    Below 50
2                         80 and above    75 to 79.9      70 to 74.9      65 to 69.9    **
1                         80 and above    75 to 79.9      65 to 74.9      50 to 64.9    Below 50
1                         Percentage ranges not specified
4                         Other grading systems used (in 2 instances, GPA explicitly stated)
Source: www.avcc.edu.au/documents/universities/key_survey_summaries/Grades_for_Degree_ Subjects_Jun02.xls (accessed 22 November 2006). * The Law School in one university from this group used GPAs, rather than percentages, in determining classes. ** One of these two universities awarded a pass degree for the percentage range 50 to 64.9; the other failed students achieving less than 65%.
Variations in assessment regulations 79 row range for the upper second class degree (75 to 79.9 per cent), and two of these for the lower second class degree as well (70 to 74.9 per cent). The rationale for these narrow divisions was not apparent and might have been forgotten with the passage of time. The GPA approach in the Australian universities that use it is different from that used in the US, and varies between institutions. The GPA system in use at the University of Canberra is outlined in Table 3.9 below.7 The GPA at Canberra is calculated on subjects undertaken since the student enrolled on the module, but also includes the grades for any prerequisite modules and grades from assessments not specifically required to satisfy module requirements. Grades from all attempts at assessments are included. Module assessments are weighted by their credit point value, but no weighting is applied in respect of the year or level of study. To count in the calculation of GPA, grades must be recorded on a transcript of the University; grades obtained from other sources (e.g. from the granting of advanced standing, or from modules external to the University) are only included if they are recorded at the University. Ungraded passes are excluded from the GPA calculation, as are those for grades that are withheld for various reasons (and whose codings are omitted from Table 3.9).
Commentary The three examples of grading practice in this chapter show that there are variations within national approaches to assessment that are quite subtle and may easily pass unnoticed. Large differences in performance, as signalled in grades and classifications, will of course be robust. However, the ‘error variance’ (or statistical ‘noise’) in overall gradings makes of doubtful significance the smaller differences that are thrown up by methods of academic signalling. This is of particular importance to
Table 3.9 Grade points at the University of Canberra
Grade             Description                                                  Grade points
HD                High Distinction                                             7
DI                Distinction                                                  6
CR                Credit                                                       5
P                 Pass                                                         4
PX (or P*)        Conceded Pass (Does not meet pre-requisite requirements)     3
UP                Ungraded pass                                                Excluded
NX, NC, NS, NN    Codes for various aspects of failure                         0
7 See The policy on the grade point average (GPA) at the University of Canberra at www.canberra. edu.au/uc/policies/acad/gpa.html (accessed 19 May 2006). I am grateful to John Dearn of the University of Canberra for providing the descriptions for the grades quoted.
80 Variations in assessment regulations those, such as employers, who may be tempted to screen applicants on the basis of their GPA or honours degree classification. The analysis contained in this chapter points to the need for more detailed information about a student’s achievements than is encapsulated in a single ‘measure’, a matter that is discussed in Chapter 9.
Chapter 4
UK honours degree classifications, 1994–95 to 2001–02 A case study
Orientation This chapter presents a set of analyses of official statistical data relating to UK honours degree awards over an eight-year period. The analyses, which focus on the proportion of ‘good honours degrees’ awarded, are presented by broad subject area and by institutional type. They show an overall rising trend in the proportion of ‘good honours degrees’ but that the trend varies between institutional types. The trends foreshadow the discussion of grade increase and inflation that appears in Chapter 5.
The importance of a ‘good honours degree’ First degrees in the UK are typically awarded ‘with honours’ which are – as noted in Chapter 3 – classified in four bands: first class; upper second class (2.1, for short); lower second class (2.2); and third class. Degrees are also awarded without honours in a variety of circumstances. In Scotland, where school-leavers typically enrol at the age of 17 rather than 18, many students choose to leave with a bachelor’s degree after three years of higher education rather than to stay on for a fourth year in order to gain honours: a rather similar structure exists in Australian higher education. Elsewhere in the UK, some programmes – particularly part-time programmes – are designed as non-honours programmes leading to unclassified bachelor’s degrees, but the number of these has declined markedly in recent years. The bachelor’s degree may be awarded without honours (this is also termed a ‘pass degree’) for a performance that falls short of meriting honours, either because the profile of grades is too low or because the student has not gained sufficient credit points to enter the honours classification process. In official data recording student achievements during the period of the empirical work reported in this chapter, the distinction between ‘pass’ and ‘unclassified’ degrees is not as
sharp as it might be because there has been an inconsistency in the use by HESA of the term ‘unclassified’.1
The honours degree classification is important in the UK. A ‘good honours degree’ – a first class or upper second class degree – opens doors to opportunities that those with lower classifications can find resistant to their pressure. Such opportunities include research for a higher degree and first jobs in prestigious ‘blue chip’ companies. In other words, the boundary between a 2.1 and a 2.2 is of high significance. The robustness of this dividing line is discussed in Chapter 6; in the present chapter, data from English, Welsh and Northern Irish higher education institutions are analysed with reference to the proportion of ‘good honours degrees’. Although similar patterns are evident in data from Scottish institutions, the different higher education system in Scotland makes the assimilation of these data into a UK-wide analysis difficult. The primary questions addressed in this chapter are:
• In England, Wales and Northern Ireland, is there a trend discernible in the percentage of ‘good honours degrees’ in (a) subject areas and (b) institutions?
• And, if so, why?
In the UK, the percentage of ‘firsts and 2.1s’ combined (also labelled as ‘good honours degrees’) appears as a variable in ‘league tables’, or rankings, of institutions that appear in the press and in guidebooks offering advice to prospective students.2 Institutions are, understandably, sensitive to their position in such tables, and may be tempted to find ways of presenting their performances in ways that enhance their ranking (even though smallish shifts in ranking are of little significance, a rise can be trumpeted as a success). The percentage of ‘firsts and 2.1s’ is, however, an ambiguous indicator, as is indicated in the possible interpretations shown in Table 4.1.

Table 4.1 Conflicting interpretations of the percentage of ‘good honours degrees’

                           High percentage                                        Low percentage
Positive interpretation    High level of achievement and/or good quality of      High standards have been maintained
                           student experience
Negative interpretation    Standards have been eased                             Poor level of attainment and/or poor quality of
                                                                                 student experience
1 When submitting data to HESA, institutions were required to code unclassified honours degrees as a separate category, yet the term ‘unclassified’ in HESA’s tabulations of awards subsumed all degree awards that did not fall in the four categories of honours degree.
2 League tables attract a level of attention far beyond their methodological merit or their limited practical utility, and their technical quality has been severely criticized by a number of authors (among them McGuire, 1995; Machung, 1995; and Gater, 2002, in the US, and Morrison et al., 1995; Yorke, 1997, 1998a; Bowden, 2000; and Yorke and Longden, 2005, in the UK) – but to little practical effect. After all, league tables sell newspapers and magazines (which is where their real value is to be found), so why should the proprietors change an apparently successful product?
The interpretation given to this indicator is likely to depend upon the interpreter’s standpoint regarding the elusive concept of academic standards. The issue of standards is not likely to be at the forefront of press or political attention when the percentage of good honours degrees remains relatively stable. In 2000, The Times Higher Education Supplement (THES) carried a story on its front page about a letter from a vice-chancellor to his university’s external examiners, in which he expressed concern that graduates from his university were being awarded a lower percentage of ‘good honours degrees’ than those in other similar institutions which had comparable intakes (Baty, 2000). The THES story opened up the question of whether the vice-chancellor’s letter was a coded invitation to the university’s external examiners to be more lenient. The vice-chancellor responded vigorously in the THES’s letters page, rebutting the charge that he was inviting external examiners to be more generous in awarding classifications, and noting that one of the purposes of his university’s external examining system was to ensure that the standards adopted in his institution were congruent with those in cognate institutions (Cooke, 2000). This exchange reinvigorated discussion about grade inflation in UK higher education.
In the following year, Bright et al. (2001) showed that, across the sector as a whole, there had been a gentle rise in the proportions of ‘good honours degrees’. Some institutions seemed to be showing a trend upwards, whereas it was difficult to discern a substantive trend in others. Their analysis, based on whole institutions, was of limited value since it made no allowance for ‘subject mix’, which Johnes and Taylor (1990) had found to be an important variable in their analysis of institutional performance indicators, and ignored the signals from earlier work (HEQC, 1996a) suggesting that disaggregation by subject area might prove fruitful. This HEQC study was based on data from English, Welsh and Northern Irish universities,3 and looked at trends over the period from 1973 to 1993 in eight varied subjects: Accountancy, Biology, Civil Engineering, French, History, Mathematics, Physics and Politics. It concluded that the modal class of honours degree had risen from lower second to upper second class, with the trend having steepened since 1980. The reasons for this shift were unclear, but may have included changes in approaches to assessment (Elton, 1998).
Yorke (2002b) examined data on degree performances collected by the Higher Education Statistics Agency (HESA) in the UK for the five academic years 1994–95 to 1998–99. His analyses covered universities in England, Wales and Northern Ireland, and 16 of the 18 broad subject areas designated by the Higher Education Funding Council for England and in addition the category that covered combined degree programmes of various sorts.4 Statistically robust upward trends in the percentage of ‘good honours degrees’ were found in seven subject areas (in descending magnitude of trend: Education; Engineering and Technology; Architecture, Building and Planning; Languages; Physical Sciences; Humanities; and Mathematical Sciences). Yorke also found variations between universities in the pattern of subject area trends, with a few universities showing a marked predominance of rising trends. Whereas his analyses supported the HEQC’s finding that the modal classification of honours degree was ‘upper second’ in what had by now come to be known as the pre-1992 university sector, they suggested that this was the case for a minority of subject areas in the post-1992 universities.5

3 It left largely out of consideration the then polytechnics and colleges. The 1992 Education Act enabled the polytechnics and a few large colleges of higher education to become universities.
4 Clinical subjects (Medicine and Dentistry) and Veterinary Science were excluded because it is typical in those subject areas not to award first degrees with honours.
Trend analyses

The availability of data from HESA for an eight-year span offered the opportunity to extend the original analyses. The data set used for this chapter covers the academic years 1994–95 to 2001–02, and all higher education institutions in England, Wales and Northern Ireland. For the academic year 2002–03 the categorization of subjects by HESA underwent substantial changes, preventing any straightforward further extension of the trend. The analytical methods adopted followed closely those adopted by Yorke (2002b), and are based on the percentage of good honours degrees that were awarded.6

(a) Subject areas

The first set of analyses focuses on the 16 subject areas noted above, plus combined subjects, for the totality of institutions, irrespective of the number of awards made in individual institutions. These provide a picture of trends in broad subject areas. The numbers of awards in most subject areas are very large, and any errors in reporting and collating data are likely to be insignificant in respect of a sector-wide analysis.7

(b) Institutions

The second set of analyses took a similar general form, but were undertaken at the finer level of the institution. For each institution, the criterion for inclusion in the analyses was 40 awards per subject area per year in at least six of the eight years covered by the data, lower numbers of awards being disregarded.8 Trends were computed in respect of those years in which the number of awards was 40 or more.
5 The Education Act of 1992 increased the number of universities by dissolving the binary distinction between the then universities and the polytechnics (and a few large colleges of higher education). ‘Pre-1992 universities’ would describe themselves as research-intensive, in contrast to those designated as universities in or after 1992. The distinction is, however, slowly becoming blurred as the higher education system in the UK evolves.
6 These analyses make no provision for student non-completion of programmes of study.
7 Some institutional ‘runs’ of data exhibit fluctuations that are improbable unless, for the year(s) in question, low numbers of awards are being compensated by reportage of awards under different subject categories. At the gross level of the subject area, these fluctuations are of little significance: however, they prejudice analyses at the level of the individual institution, reducing the reliance that can be placed on the computed trend.
Data quality

In the analyses that follow, the measure used is the proportion of good honours degrees awarded, i.e. the ratio of the top two categories of honours degrees to the total number of degrees falling into the four categories used by HESA to report awards of degrees with honours (first class, upper second class, lower second class8 and third class/pass). ‘Pass’ degrees are degrees awarded without honours although the student has followed an honours programme, and can be considered as ‘fallback’ awards. ‘Unclassified’ degrees in the HESA data sets cover all other degrees and, like degrees whose classification was labelled as ‘unknown’, have been eliminated from the analyses.
Doubts arose regarding the accuracy of the reporting by institutions of unclassified and pass degrees when the raw data were inspected. There were very occasional instances in the raw data of implausible ‘runs’ of awards in an institution, exemplified by the following numbers of awards in one subject area (Table 4.2).

Table 4.2 An implausible ‘run’ of awards

Year      1994–95  1995–96  1996–97  1997–98  1998–99  1999–2000  2000–01  2001–02
N awards  273      304      47       286      235      393        243      269

The atypical numbers for 1996–97 and 1999–2000 may have arisen from differences in reporting practice or from errors somewhere in the process of data entry. Examples of this kind of discrepancy were not frequent, but sufficiently in evidence in the HESA data sets to blur the recorded distinction between ‘pass’ and ‘unclassified’ degrees.
When inquiries were made of a few institutions that appeared to have reported unusually high numbers of unclassified degrees, they revealed a number of possible reasons beyond an institutional decision simply to report data in a different way from one year to the next. The possible reasons for the high numbers of unclassified degrees included the following: the unclassified degree was
• the appropriate award for a student who had opted for a non-honours route in the final year;
• an award from a part-time programme which did not have an honours option;
• a fallback award for a student who had failed an honours course (however, HESA’s specification for the submission of data indicates that these should be categorized as ‘pass’ degrees);
• an award from a non-honours programme franchised out to a further education college, after which the student might ‘top up’ the award with honours at the franchisor higher education institution.

7 This methodological choice strikes a balance between two opposing needs – for large numbers to ensure reasonable reliability in the trends, and for maximum coverage of the institutions. The chosen criterion also offers a conservative approach to the institutional statistics that allows for minor errors or oddities in the HESA data, and compensates to some extent for the uncertainty inherent in the data (as described above).
8 ‘Undivided’ second class honours degrees are treated as lower second class honours degrees in the HESA data. Since only one or two universities use the undivided second category (and hence the impact within subjects will be small), and because within-university trends would be unaffected, no attempt has been made to partition the undivided seconds between the upper and lower second categories.
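As a minimal sketch of the measure described under ‘Data quality’ above, the proportion of good honours degrees can be computed as follows; the function name and the figures are invented for illustration and are not HESA data.

```python
# Proportion of 'good honours degrees': firsts and upper seconds as a fraction of
# all classified honours awards; unclassified and 'unknown' awards are excluded.
def proportion_good_honours(first, upper_second, lower_second, third_or_pass):
    classified = first + upper_second + lower_second + third_or_pass
    return (first + upper_second) / classified

# Illustrative (invented) figures for one subject area in one year
print(round(100 * proportion_good_honours(40, 120, 100, 30), 1))  # 55.2 per cent
```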
There might have been a marked change in an institution’s provision in a particular subject area. For example, in successive years two institutions reported the following numbers of unclassified awards in a particular subject area (Table 4.3). It would seem likely that Institution A introduced at least one new programme whose first intake graduated in 1999–2000, whereas Institution B discontinued at least one non-honours programme after 1997–98.

Table 4.3 Sharp changes in the number of unclassified degrees awarded (total numbers of awards in brackets)

Institution  1994–95   1995–96    1996–97   1997–98   1998–99   1999–2000  2000–01   2001–02
A            0* (358)  5 (309)    4 (408)   22 (401)  13 (394)  59 (369)   53 (412)  77 (384)
B            70 (304)  112 (352)  98 (337)  78 (312)  0 (305)   0 (300)    0 (332)   0 (338)
* 42 awards, however, were recorded as ‘unknown’.

During the period covered by the data, some institutions were amalgamated with others, and some franchised programmes out to other institutions. The occasional trend in a subject area was clearly influenced by institutional changes which could be reflected in numbers of awards and/or a step-difference in the proportion of good honours degrees awarded. Where one institution was assimilated into another, it was not possible to incorporate its hitherto separate award data into the trends shown by the assimilating institution. Nor was it possible to allow for franchises and other collaborative engagements that may have had a bearing on the trends, since the relevant data were not available. The trends summarized in this chapter will, in some cases, have been influenced by institutional mergers, and by the engagement in, or disengagement from, franchise partnerships with other institutions whose students should appear in the lead institution’s returns to HESA.
For this collection of reasons, the ratio of good honours degrees awarded to all degrees awarded would be of doubtful validity for some sequences of data. Omitting the unclassified degrees from the denominator does not remove all of the uncertainty inherent in the data, but minimizes the chance that a major error will be incorporated into the analyses. The baseline numbers of students who were studying for honours in the final year cannot be determined with accuracy: however, intra-institutional trends will be little affected provided that the institutional portfolio of programmes is assumed to have remained much the same over time.10 As a consequence of the problems with data quality (which are believed not to be large), ‘noise’ will have been introduced into a few trend analyses, since the data are insufficiently finely structured to enable their effects to be brought under statistical control. Where there are step-changes in provision (or in the reporting of awards), these are likely to weaken the robustness of computed trends, and hence to increase the conservatism of the analyses (see later discussion relating to Figures 4.1a and 4.1b).

10 This assumption has been made in the analyses and discussion that follow.
The computation of trends

Linear regression was used to determine the trends, according to the equation:

Percentage of good degrees = constant + m * (year of award)

where m is the slope of the regression line, and is the trend (positive if the percentage of good degrees is rising, negative if it is falling). A trend was taken to be statistically significant where the probability that the computed trend could be attributed to chance was no higher than 1 in 20 (i.e. p < 0.05).
The analyses reported in this chapter are institutionally ipsative – that is, they compare the percentages of good honours degrees over time for each university separately, like with like – as far as institutional evolution allows. The same applies in respect of the totality of relevant English, Welsh and Northern Irish universities. Data provided by HESA show that, for the pre-1992 universities, there has generally been a rise in entry qualifications over the eight-year span of the analyses reported here (which reflects the rise in recorded performances at A-level in the schools and further education colleges). For the post-1992 universities and colleges of higher education, such a comparison is of doubtful validity, simply because A-levels are used to a lesser extent as entry qualifications in those institutions.

Results

In reading the following tables, it should be noted that the statistical significance of a trend is influenced by the closeness of the data-points to a straight line. In Figures 4.1a and 4.1b, the two hypothetical series of data-points show the same linear trend. Each figure shows the ‘line of best fit’ with the set of data points. However, it is readily apparent that the points for Subject Area B are much closer to the line of best fit than those for Subject Area A. Whereas the trend for Subject Area B reaches a high level of statistical significance, that for Subject Area A does not, simply because, although it is evidently rising, the points make too much of a zigzag for a high level of confidence to be placed in the trend line. When the numbers of awards are small, the data points are particularly vulnerable to random fluctuations, as therefore is the trend derived from them.

(a) Subject areas

Across all higher education institutions in England, Wales and Northern Ireland, the trends over eight years in the subject areas analysed are shown in Table 4.4. Fifteen of the 17 trends are rising, statistically significantly. The sole falling, non-significant, trend is in Agriculture and related subjects, in which numbers of awards are quite small.
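A minimal sketch of the trend computation described under ‘The computation of trends’ above, using SciPy’s linear regression, is given below; the yearly percentages are invented for illustration (the published analyses used the HESA data).

```python
# Fit 'percentage of good degrees = constant + m * year' and test m against p < 0.05,
# as described in the text. The data here are illustrative only.
from scipy.stats import linregress

years = [1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002]       # year of award
good_pct = [46.2, 47.1, 47.0, 48.3, 49.1, 49.8, 50.4, 51.6]    # invented percentages

fit = linregress(years, good_pct)
print(f"trend m = {fit.slope:.2f} percentage points per year, p = {fit.pvalue:.4f}")
print("statistically significant" if fit.pvalue < 0.05 else "not statistically significant")
```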
Figure 4.1 Top (Subject Area A): Illustration of a rising trend, but at a relatively weak level of statistical significance. Bottom (Subject Area B): Illustration of a rising trend, but at a much higher level of statistical significance. (Both panels plot the percentage of good honours degrees against year of award, ’95 to ’02, with a line of best fit.)
Subdividing the results by institutional type shows that the rising trends are more often to be found, and are more pronounced, in the pre-1992 universities (Table 4.5). There are 11 statistically significant rising trends in the pre-1992 university group compared with six in the post-1992 university group, and eight rising trends and one falling trend in the colleges of higher education. Further, when the pre-1992 universities are subdivided into two categories (the elite ‘Russell Group’ and the other pre-1992 universities), the rising trends are stronger in the former (Table 4.6). Fifteen of the 17 subject areas show statistically significant rising trends in the Russell Group universities compared with nine in the other pre-1992 universities.
Table 4.4 Trends in the percentage of ‘good honours degrees’ by subject area, 1994–95 to 2001–02. The pattern of enrolment in the subject areas is also shown

Subject area                              Trend    8-year trend in number of awards
Subjects allied to Medicine               0.66**   Doubled to 19,141
Biological Sciences                       0.22     50% rise to 16,144
Agriculture and related subjects          –0.41    Modest rise to 1,867
Physical Sciences                         1.44**   Slight decline to 10,946
Mathematical Sciences                     1.78**   Essentially flat at 3,264
Computer Science                          0.33*    Near-doubling to 12,905
Engineering and Technology                1.37**   Slight decline to 18,162
Architecture, Building and Planning       1.15**   25% decline to 5,267
Social, Economic and Political Studies    0.61**   10% rise to 20,325
Law                                       0.61*    Essentially flat at 9,243
Business and Administrative Studies       0.56*    Near 25% rise to 28,906
Librarianship and Information Science     0.85**   135% rise to 4,674
Languages                                 1.02**   Very flat at 14,224
Humanities                                1.19**   Small decline to 8,449
Creative Arts and Design                  0.63**   Rise of two-thirds to 21,999
Education                                 0.95**   Recovery to 11,213 following slight decline
Combined                                  0.91**   Essentially flat at 22,285

Notes
Stated numbers of awards in the final column are for academic year 2001–02. Trend is the change in the percentage of ‘good honours degrees’ per year, in percentage points. **Significant at p < 0.01; *significant at p < 0.05. Significant trends are in bold.
The trends summarized in Tables 4.4, 4.5 and 4.6 cannot be expected to continue indefinitely. As the percentage of good honours degrees approaches 100, there will be no room for further increases, and concerns expressed about ‘grade inflation’ are likely to exert some influence on institutional behaviour, as has occurred with grading at Princeton and Harvard Universities (see, respectively, Faculty Committee on Grading, 2005, and Lewis, 2006).

(b) Institutions

The disaggregation of Table 4.4 into Tables 4.5 and 4.6 points towards a further disaggregation – to the level of the institution. Table 4.7 shows, for the three main groupings of institutions, the number of statistically significant rising and falling trends. It presents a different analysis of the data that led to Table 4.5.

A pattern in the trends?

Were the trends concentrated in particular institutions? Two large institutions had four statistically significant falling trends out of 14 computed trends, and one had three out of 15. No other institution had more than two falling trends.
Table 4.5 Trends in ‘good honours degrees’ by institutional type, 1994–2002

                                          Trend
Subject area                              Pre-1992      Post-1992     Colleges   Notes
                                          universities  universities
Subjects allied to Medicine               0.86**        0.57          0.28
Biological Sciences                       0.81**        –0.09         –2.09**    A handful of awards in colleges
Agriculture and related subjects          –0.06         0.12          –0.05
Physical Sciences                         1.60**        0.25          –0.02
Mathematical Sciences                     1.76**        0.55                     No awards in colleges
Computer Science                          1.28**        0.19          2.56**     Few awards in colleges
Engineering and Technology                1.55**        1.29**        0.87**     Few awards in colleges
Architecture, Building and Planning       1.75**        1.10**        1.48       Mainly in post-1992 universities; very few in colleges
Social, Economic and Political Studies    0.80**        0.47*         –0.16
Law                                       0.70*         0.26          1.33       Few awards in colleges
Business and Administrative Studies       0.62          0.29          1.17**     Dominated by post-1992 universities
Librarianship and Information Science     –0.05         0.26          1.31*      Dominated by post-1992 universities
Languages                                 1.30**        0.27          –0.36      Dominated by pre-1992 universities
Humanities                                1.19**        0.32          1.04**     Dominated by pre-1992 universities
Creative Arts and Design                  –0.11         0.33*         0.85**     Dominated by post-1992 universities and colleges
Education                                 0.18          1.42**        1.03**     Dominated by post-1992 universities and colleges
Combined                                  0.95**        0.19          1.02**

Note: Trend is the change in the percentage of ‘good honours degrees’ per year, in percentage points. **Significant at p < 0.01; *significant at p < 0.05. Significant trends are in bold.
Table 4.6 Trends in ‘good honours degrees’ in pre-1992 universities, 1994–2002

                                          Trend
Subject area                              Russell Group   Other pre-1992
                                          universities    universities
Subjects allied to Medicine               1.16**          –0.03
Biological Sciences                       1.52**          0.27
Agriculture and related subjects          1.46            –1.19
Physical Sciences                         1.70**          1.26**
Mathematical Sciences                     1.75**          1.74**
Computer Science                          1.87**          0.96**
Engineering and Technology                1.96**          1.00**
Architecture, Building and Planning       1.89**          1.59**
Social, Economic and Political Studies    1.15**          0.53*
Law                                       1.01**          0.36
Business and Administrative Studies       1.69**          0.09
Librarianship and Information Science     #               0.14
Languages                                 1.58**          0.97**
Humanities                                1.47**          0.80**
Creative Arts and Design                  1.00**          –0.61
Education                                 1.23**          –0.30
Combined                                  0.89**          0.80*

Note: Trend is the change in the percentage of ‘good honours degrees’ per year, in percentage points. **Significant at p < 0.01; *significant at p < 0.05. Significant trends are in bold. #Very low number of students, hence the trend is not reported.
If the criterion for rising trends in an individual institution is taken as half or more of the trends rising to a statistically significant extent, no college of higher education with five or more computed trends met the criterion; two post-1992 universities out of 37 met it; one pre-1992, non-Russell Group, university out of 34 eligible institutions met it; and 10 of the 16 included Russell Group universities met it. This analysis excludes the smaller, specialist institutions which offered programmes in relatively few subject areas, and in which there were a number of statistically significant rising trends.
The striking feature of the analyses is the concentration of the statistically significant rising trends amongst the Russell Group universities, with the ratio of the number of rising trends to that of all computed trends in five of them exceeding 0.7.11 Why might the percentage of ‘good honours degrees’ be accelerating to a greater extent in this particular group of universities than in others? The Russell Group,12 established in 1994, is generally regarded as comprising what are often called ‘top’ universities. Its establishment might have led to a greater concentration of very well qualified entrants which in turn could have led to a greater proportion of ‘good honours degrees’.

11 In descending order, these five exhibited: 6 such rising trends out of 6; 8 out of 10 (though one falling trend has to be offset against these); 8 out of 11; 7 out of 10; and 10 out of 14.
12 For details, see http://www.hero.ac.uk/sites/hero/uk/reference_and_subject_resources/groups_and_organisations/russell_group3706.cfm (accessed 13 September 2006).
Table 4.7 Rising and falling trends in the three different types of institution

                                          Pre-1992 universities   Post-1992 universities   Colleges
Subject area                              N     ↑     ↓           N     ↑     ↓            N     ↑     ↓
Subjects allied to Medicine               27    8     0           32    8     2            10    3     2
Biological Sciences                       42    18    1           31    6     3            3     0     1
Agriculture and related subjects          6     0     0           4     0     0            2     0     0
Physical Sciences                         38    19    1           25    1     1            1     0     1
Mathematical Sciences                     28    8     0           5     1     0            0     0     0
Computer Science                          34    9     0           36    3     4            2     0     0
Engineering and Technology                39    14    3           34    12    0            7     1     1
Architecture, Building and Planning       14    3     0           23    6     3            0     0     0
Social, Economic and Political Studies    46    14    1           33    6     3            10    2     2
Law                                       33    9     1           31    3     3            2     0     0
Business and Administrative Studies       30    8     0           37    5     5            10    1     1
Librarianship and Information Science     4     0     0           15    5     1            6     3     0
Languages                                 44    20    0           26    4     3            5     0     0
Humanities                                37    12    1           15    1     0            7     0     0
Creative Arts and Design                  20    7     0           33    6     1            27    5     1
Education                                 11    4     2           23    6     2            25    5     1
Combined                                  45    12    0           30    2     2            14    0     1
Total                                     498   165   10          433   75    33           131   20    11
% of Total N                                    33.1  2.0               17.3  7.6                15.3  8.4
HESA provided the mean entry score (in terms of A-level points) for a large number of institutions, disaggregated by subject area.13 A-level points are a reasonable index of entry standards in pre-1992 universities, but are much less useful for post-1992 universities and colleges since these institutions tend to enrol large numbers of students who have qualified for entry in other ways.14 Studies by HEFCE (2003; 2005) testify to the positive relationship between A-level grades and performance at graduation: the issue here is whether differential trends in these variables might suggest an explanation for the marked difference between the two groups of pre-1992 universities that is observed in honours degree classifications.
The differences in entry qualifications and graduation performance between the Russell Group universities and the other pre-1992 universities, where sufficient data were available, are indicated in Table 4.8. It is readily apparent that the Russell Group attracts students with considerably higher qualifications at entry, and that the performance on graduation is commensurate. Where data were available for both categories of pre-1992 university, and there were sufficient numbers of these universities to allow for meaningful analysis, trends were computed for both the run of entry qualifications over the five years 1994–95 to 1998–99 for each of the subject areas15 and the run of percentage of ‘good honours degrees’ in those subject areas for the five years 1997–98 to 2001–02. These computations enabled the relationship between entry and exit performances to be indicated in a rough and ready way (‘exit’ was presumed for the purpose of this analysis to be three years later than entry).16 If strong increases in entry qualifications were associated with strong performances at graduation, then it would be expected that the Russell Group institutions would tend to have larger positive trends for both variables.

13 Some institutional data were not available.
14 The adoption, by the Universities and Colleges Admissions Service (UCAS), of a ‘tariff’ in which various qualifications are awarded points has been too recent for the historical data considered in this study.

Table 4.8 Entry and exit data for Russell Group (RG) and non-Russell Group pre-1992 universities
Subject area                              Non-RG entry      Non-RG exit          RG entry          RG exit
                                          (mean A-level     (percentage of       (mean A-level     (percentage of
                                          points)           ‘good degrees’)      points)           ‘good degrees’)
Biological sciences                       19.7              65.0                 23.9              73.5
Physical sciences                         18.6              50.4                 23.2              59.0
Computer science                          18.5              47.1                 22.7              56.4
Engineering and technology                16.6              47.0                 21.5              58.1
Social, economic and political studies    19.6              57.8                 24.7              71.4
Law                                       23.6              56.4                 27.1              70.6
Business and administrative studies       20.3              58.6                 24.2              64.3
Languages                                 20.5              70.0                 24.8              78.5
Humanities                                20.0              66.5                 24.1              76.2
Notes
Entry data were not available for all universities. A-level points totals were calculated on the basis that an A grade counted 10 points, B counted 8 and so on down to E counting 2. For any intermediate AS level results (if relevant) the pointages were halved. Means are at institutional level and are unweighted for numbers of students.

15 Biological Sciences; Physical Sciences; Computer Science; Engineering and Technology; Social, Economic and Political Studies; Law; Business and Administration; Languages; and Humanities. In some instances, entry ‘pointages’ were not available for the whole of the five-year period. Trends were calculated when there were three or more pointages available: however, the reliability of trends decreases with decreasing number of data-points.
16 This does not take into account programmes whose length exceeds three years, and hence it can only very roughly approximate the connection between entry and exit.
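A minimal sketch of the A-level points calculation described in the notes to Table 4.8 is given below; the function name and the example grades are invented for illustration.

```python
# A-level points on the pre-UCAS-tariff scale used here: A=10, B=8, C=6, D=4, E=2;
# any AS-level grades count half the corresponding A-level points.
A_LEVEL_POINTS = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}

def entry_points(a_level_grades, as_level_grades=()):
    full = sum(A_LEVEL_POINTS[g] for g in a_level_grades)
    half = sum(A_LEVEL_POINTS[g] / 2 for g in as_level_grades)
    return full + half

# Hypothetical applicant: A-levels at grades A, B, C plus an AS level at grade B
print(entry_points(["A", "B", "C"], ["B"]))  # 28.0
```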
Figure 4.2 Trends in entry qualifications and exit performance in Biological Sciences for Russell Group and other pre-1992 universities. (Scatterplot of the trend in A-level points against the trend in the percentage of ‘good honours degrees’, with Russell Group and other pre-1992 universities distinguished.)
Figure 4.2 illustrates the findings for Biological Sciences, which was the subject area that exhibited the greatest tendency for separation between the Russell Group and other pre-1992 universities. Figure 4.2 shows that there is a tendency for the Russell Group universities to cluster in the upper right hand quadrant, and for the other pre-1992 universities to be scattered across the figure.17 This tendency however is not particularly strong. Although a similar tendency was found in the other subject areas for which sufficient data were available, it was in each case weaker than that for Biological Sciences. In these findings, then, there is a hint that an influence on the enhancement of student performance in Russell Group universities is a rise in the level of qualification of their entrants.18 However, the hint is sufficiently weak to suggest that there may be other influences in operation.

17 Some of the trends used in the construction of Figure 4.2 were not statistically significant, and so for this and other methodological reasons the figure should be treated as suggestive rather than indicative.
18 A much more rigorous multi-level analysis involving universities, subject areas and individual students would be needed to establish whether the hinted relationship stands up. However, it would involve a considerable amount of work, which might not make a good use of resources.

Care has to be taken in assessing the contribution of entry performance to grades on exit from bachelor’s degrees. Burton and Ramist (2001) collated a number of studies in which the predictive value of SAT scores and/or high school record (HSR) has been studied. When they combined the results of these studies they found raw correlation coefficients in the range 0.4 to 0.5, suggesting that around 20 per cent of the variance in performance in higher education can be attributed to SAT performance and/or HSR (the proportion of variance explained being the square of the correlation coefficient, i.e. roughly 0.16 to 0.25).19
They note the reduction of the correlation coefficient in elite institutions due to the restriction of the range of students’ entry scores: an elite institution draws from a very selective sample of the universe of students, and hence the students’ SAT scores will be broadly of the same order. In such specific circumstances, other factors are likely to exert much more influence on student outcomes than entry scores. In studies conducted by HEFCE in the UK, the relationship between entry and exit performance is rather stronger (see for example graphs in HEFCE, 2003, para 26 and HEFCE, 2005, para 16), but this may well reflect the closer tie between entry and exit qualifications than exists in the analyses conducted by Burton and Ramist. Within the depicted general trends in the HEFCE studies there is variation with type of school attended.
Across all subject areas, the proportion of good degrees awarded is higher in the pre-1992 universities and lower in the post-1992 universities and general colleges.20 This would be expected from differences in entry qualifications, though statistical analysis is not possible because the new universities and colleges tend to admit many students with qualifications other than A-level. Where enrolment numbers were reasonably high, the differences in 2002 between pre-1992 and post-1992 universities exceeded 15 percentage points in Law; Languages; Humanities; Biological Sciences; Physical Sciences; Business and Administrative Studies; and Social, Economic and Political Studies. The differences were less than 10 percentage points in Subjects allied to Medicine; Engineering and Technology; and Education. The proportion of good honours degrees exceeds 50 per cent for all subject areas in the pre-1992 universities, whereas it is above 50 per cent in only nine of the subject areas (six of which appear in Table 4.9) in the post-1992 universities. These data are consistent with the findings of HEQC (1996a) that the modal level of award in the pre-1992 universities was the upper second honours degree, but suggest that, for the post-1992 universities, the same conclusion holds for only about half of the subject areas.
Influences on the proportion of good honours degrees

There are many influences on the proportion of good honours degrees, some operating predominantly at the level of the higher education sector, and some predominantly at the level of the institution. Individual institutions are likely to have been affected differentially by the influences considered below. Changes in one or more of the following may have been influential at the level of the institution: the student cohort and related demographic variables; curriculum structures and processes (including pedagogy and assessment methodology); the algorithm (i.e. the computational mechanism) through which honours degree classifications are determined; and aspects of national policy. Supra-institutional considerations will have had some influence upon developments within institutions.

19 Correcting the correlation coefficients for restriction of range and other variables produced a highest coefficient of 0.76, implying a maximum variance explained of around 50 per cent.
20 Specialist colleges by their nature tend to be selective regarding entry, though A-levels are not necessarily a key criterion.
Table 4.9 A comparison of the proportion of good honours degrees awarded in selected subject areas in three groups of institutions

Subject area                              Pre-1992        Post-1992       Colleges
                                          universities    universities
Subjects allied to Medicine               66.1            59.8            54.1
Biological sciences                       68.9            49.2            40.6
Physical sciences                         62.9            45.3            38.5
Computer science                          57.0            47.3            48.6
Engineering and technology                59.3            52.8            46.0
Social, economic and political studies    65.8            49.3            42.6
Law                                       67.6            41.5            Not applicable
Business and administrative studies       63.7            46.6            37.2
Languages                                 78.2            53.6            51.8
Humanities                                76.0            54.7            53.3
Creative arts and design                  69.5            56.4            57.8
Education                                 61.1            54.0            49.6

Note: Numbers in some types of institution are too small for reliable comparisons in Agriculture and related subjects; Mathematical Sciences; Architecture, Building and Planning; and Librarianship and Information Science.
The student cohort

Student performance in the A-level examinations has steadily improved over the years, accompanied by the annual ritual of argument about whether the improvement is genuine. It might be expected that a rise in entry ability accounts for the rise in proportion of good honours degrees. Simonite (2004) analysed data for Mathematics, covering the years 1994–2000 and focusing on the characteristics of the individual student, and found that there was no statistically significant evidence of a rise in degree performance (first class or ‘good’ degrees) after making allowance for entry characteristics (including A-level grades). Although Simonite included data from 98 institutions (including those from Scotland), the trend in enrolment has, over the period 1994–2002, been to concentrate enrolments in Mathematics in the pre-1992 universities, and so Mathematics is untypical of the UK higher education sector as a whole.21 However, Simonite’s work is a reminder that the relationship between entry qualifications and degree performance may be of importance in understanding trends: the problem is that similar analyses covering other subject areas, where numbers are considerably greater and entry qualifications are more varied in institutions outside the pre-1992 universities, will be complex and time-consuming.

21 The same applies in respect of the physical sciences, where Physics and Chemistry as main subjects have steadily become concentrated in a decreasing number of pre-1992 universities.
Further, the pool of data will be rendered turbid by those students whose programmes involve more than a single subject.
Since the Labour government came to power in the UK in 1997, there has been a political initiative to increase the numbers of students entering higher education from disadvantaged backgrounds: numerically, those from lower socio-economic groups have been the major target in this respect. Data from the Department for Education and Skills (DfES, 2003) show that, although the participation of young people from disadvantaged backgrounds has increased, the gap between their enrolment rate and that of their more advantaged peers has remained much the same. A component of this initiative that has attracted particular attention is the political desire for more students from disadvantaged backgrounds to be enrolled in ‘top universities’. This was given a sharp push by the widely publicized rejection in 2000 by Magdalen College, Oxford, of a very well qualified school leaver from a state school, and dispute as to the merits of the case. Although the numbers of disadvantaged students in ‘top universities’ have grown in recent years, they are shown by HESA statistics22 to be both generally lower than the calculated benchmark expectations for them and also relatively small in comparison to the numbers enrolling in other institutions (and particularly the post-1992 universities and the colleges of higher education).
Institutional cohorts are likely to have been affected by the consequences of the decision by the Labour government on taking office to reshape the financial aspects of student enrolment in higher education in England. From 1998, students were asked to pay ‘up front’ a contribution (which was means-tested) to their tuition fees, and the system of maintenance grants was phased out in favour of one of loans. (The system changed again in the autumn of 2006, with ‘top-up fees’ at roughly triple the size of the ‘up front’ fees they replaced, but payment being deferred until after graduation and when the graduates’ earnings exceed a given annual amount.) The economics of studenthood seem to have resulted in a greater proportion of students living at home, where part of the cost could be met by the household – a more attractive option for some than incurring all the costs associated with living in a flat or shared house. The changes have indirectly strengthened the regional character of the post-1992 universities and the colleges offering a broad range of higher education programmes, especially in respect of the enrolment of students from disadvantaged backgrounds.
Two opposing influences may also be playing a part. The pressures to ‘do well’ in higher education have increased in a context in which the mere possession of a degree is no longer an almost certain passport to the kinds of job to which graduates aspire. Students therefore are perhaps more diligent in their studies than they were two or three decades ago, when they found time for engagement in activities (sometimes quite radical) unrelated to their academic work. On the other hand, students are, compared with their predecessors, more involved in ‘earning as they learn’ because of the need to fund their progress through the higher education system to a greater extent than was the case when fees were paid for them in full and grants were available for those from low-income backgrounds.

22 See, for example, the statistics on widening participation at www.hesa.ac.uk/pi/0405/participation.htm.
The need to earn means that students are increasingly attending their institutions for their formal commitments but not staying to engage more widely in the academic and social opportunities on offer. Students who come from relatively unfavoured financial backgrounds are likely to need to devote a greater proportion of their day to part-time work. A survey conducted early in 2006 of the experience of full-time first year students23 showed that students from managerial and professional backgrounds tended to undertake less part-time employment whilst studying than those from supervisory, technical, manual and other similar backgrounds (Table 4.10). The difference between the two groups reflects that of a survey by Brennan et al. (2005) (though in their study, which surveyed final-year students’ situation between 2000 and 2002, the levels of paid work were rather lower), and also findings from earlier studies (Barke et al., 2000; Connor et al., 2001). Recently, Salamonson and Andrew (2006) found that, for second-year students of nursing in a regional university in Australia, part-time working for more than 16 hours per week affected their academic performance adversely.
The puzzle of the widening gap between the Russell Group and other institutions may in part be explained by the synergy resulting from a concentration of academically gifted, predominantly young, and relatively privileged students in a relatively resource-rich environment. The synergy may be accentuated by such students’ lesser need to engage in part-time employment (because of the support available from relatively well-off families) and their considerable opportunities for interaction on and around the campus. This picks up a point made in Australia during the late 1990s, where the ‘Group of 8’ most prestigious universities were concerned that the Course Experience Questionnaire (CEQ), used by the then Graduate Careers Council of Australia to solicit graduates’ opinions regarding their courses, did not cover the full spectrum of the student experience. This led to the development of a revised instrument (McInnis et al., 2001) in which the concept of ‘the student experience in higher education’ was treated more broadly.

Table 4.10 Levels of part-time employment reported by first year full-time students from contrasting socio-economic backgrounds

                                      Number of hours of part-time work per week
Background                            None    1–6    7–12    13–18    More than 18
Managerial/professional (N = 1272)    49%     9%     18%     14%      10%
Supervisory etc. (N = 434)            35%     7%     21%     23%      14%
23 This study, conducted by the author and Bernard Longden of Liverpool Hope University, was funded by the Higher Education Academy, and attracted 7,109 usable responses from 23 institutions across the UK (see Yorke and Longden, 2007).
Demographic variables

Demographic variables play some part in the differences that are observed between the award of good honours degrees in different types of institution. Readily available performance statistics from HESA show the variation in enrolment from students of different socio-economic backgrounds across the higher education system in the UK, and this is evident in groups of institution of broadly similar type. The HESA performance indicators, however, do not highlight the correlation of A-level examination qualifications with social class.24
Smith and Naylor (2001) analysed data from the cohort of students who graduated from the then university sector in 1993 and found that women had outperformed men at both ends of the performance spectrum (in terms of good honours degrees and failure). They also cut their data into four age-bands, finding that age was broadly associated with superior outcomes, and that, for younger students, women’s performances outstripped those of men. McCrum (1994) had earlier found a bias in favour of male students at Oxford and Cambridge (but not across other universities), which he attributed inter alia to the particular circumstances prevailing in those two universities at the time. Hoskins and Newstead (1997), having analysed student performance data from the University of Plymouth, came to the conclusion that gender was a relatively weak predictor of degree outcome compared with age (where older students showed up with superior classifications).
Woodley and Richardson (2003) undertook a larger analysis of UK-resident students who graduated with honours degrees (203,104 in number) from the whole of the UK higher education sector in 1996. They found that age tended to be associated positively with degree classification (save for some countervailing evidence from the smallish number who graduated at ages lower than 21 and over 50).25 The distribution of classifications by gender showed that the male profile was wider and flatter than that for women (i.e. men obtained proportionately more first class and third class degrees than women, but fewer second class degrees) but that 57.7 per cent of women obtained a good honours degree against 51.2 per cent of men. The odds ratio in favour of women obtaining a good honours degree was 1.30, the gender disparity being statistically significant at all ages save the extremes. There were also some variations of the general pattern of gender-related outcomes in some subject areas.

24 This correlation is strongly implicit at the level of the local education authority in England, where the proportion of pupils entitled to free school meals (a proxy for low family income) is inversely related to the average number of ‘points’ gained in the A-level examinations: in 2001 the correlation was –0.54 for the whole of England. The coefficient was of much greater magnitude in areas in which there were many metropolitan authorities, reaching –0.76 in the north-east of England (Yorke, 2003).
25 In Smith and Naylor’s study age-bands (age under 24 at graduation; 24–27; 28–33; 34 and above) were arguably not the most useful choices. It would have been preferable had the lowest age-band been ‘age under 22’. This would have limited the group to those who had entered higher education mainly at 18 or 19, and been fairly close to restricting it to those who entered higher education direct from school. A similar point can be made with respect to that conducted by Woodley and Richardson, where the ‘cut’ at the lower end of the age range was made between under 21 years at graduation (who were presumably very highly talented students and exceptional in that they will have enrolled at an age younger than 18) and 21–25 years of age.
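As a minimal check of the odds ratio quoted above, it can be recovered from the reported 57.7 per cent and 51.2 per cent; the short script below is illustrative only.

```python
# Odds ratio for women versus men obtaining a 'good honours degree',
# computed from the percentages reported by Woodley and Richardson (2003).
women_good, men_good = 0.577, 0.512
odds_women = women_good / (1 - women_good)   # about 1.36
odds_men = men_good / (1 - men_good)         # about 1.05
print(f"{odds_women / odds_men:.2f}")        # 1.30
```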
In their review of the earlier literature, Woodley and Richardson (2003) noted the lack of differences in outcomes according to gender. However, Simonite (2005) showed that there were differences between institutions in terms of gender-related performance in Mathematical Sciences, something she termed an institution’s ‘gender added’. Woodfield et al. (2005) take issue with the often expressed opinion that female students, in contrast with males, both favour and are favoured by the use of coursework as opposed to unseen examinations in mode of assessment arrays, showing that in a sample of 638 students at the University of Sussex females outperformed males in both forms of assessment.26 There may be two gender-related changes over time that impacted synergically on honours degree classifications in UK higher education: a shift towards female predominance in participation coupled with a shift towards proportionately greater success (with the gaining of a good honours degree as the criterion).

Curriculum structures and processes

Curricula in institutions in the UK have increasingly become modularized and/or semesterized, leading to summative assessments of modules (or study units) which take shorter periods of time than the normative year-long curricular components of yesteryear.27 This places pressure on students, once they have entered that part of the programme that enables them to qualify for honours,28 to achieve the best result of which they are capable. The honours degree classification is often determined in the first instance by the mean mark for the qualifying modules (study units), which implies the undesirability of letting performances slip appreciably from the level of which the student is capable, and to which they can reasonably aspire.
The emphasis on learning outcomes in programme and module specifications is helpful to students in one respect: it provides them with information that can be very helpful in guiding them to what they need to do to obtain a good grade.29 If the teaching focuses narrowly on these, then the effect of learning outcomes is accentuated. At its narrowest, the situation could approach ‘teaching to the test’, which is likely to bias results upwards, though possibly without leading to the insightful performances that are supposed to characterize a ‘first class’ performance. The use of learning outcomes as a basis for curriculum design shifts assessment further towards a criterion-referenced approach, in which the performances are quite likely to be skewed towards the upper end of the distribution of grades, as would be expected in respect of a curriculum focusing on ‘competence’ or ‘mastery learning’.

26 Some of the findings of Kornbrot (1987), who drew on official statistics from the then UK university sector in the early 1980s, show that women outperformed men in areas in which the opposite might have been expected.
27 This discussion focuses on full-time students, though it can be extended (with adaptations) to part-time study.
28 In England, Wales and Northern Ireland, usually after the first year of full-time study. The situation in Scotland was noted at the beginning of this chapter as being different.
29 Though the strictures of Wolf (1995) regarding the need to elaborate precepts in terms of examples of practice need to be borne in mind.
The impact on assessment outcomes of mastery, performance-based, and competence-based curricula has been acknowledged for at least two decades in the US (see, for example, Goldman, 1985; Sabot and Wakeman-Linn, 1991; Lanning and Perkins, 1995), but has received little attention in the UK.
The emphasis on learning outcomes may be less helpful in another respect, in that a tight focus on what is necessary to succeed in a module may blinker the student from the broader learning that is desirable in the longer term (and beyond higher education), and which might help them to make the leaps in learning that are necessary to obtain first class honours. A tactical approach to studying and assessment, then, can be the enemy of more strategic aims.
In the UK there has, over the last couple of decades, been a shift away from examinations and in favour of ‘coursework’ (something of an umbrella term for work completed outside examination-room conditions), though there are signs that this is being reversed because of growing concerns about plagiarism. Elton (1998) suggested that the increased reliance on coursework may have contributed to the rise in honours degree classifications. It was noted in Chapter 1 that empirical evidence has emerged that broadly supported Elton’s suggestion (Bridges et al., 1999, 2002; Yorke et al., 2000; Simonite, 2003). In the increase in use of learning outcomes and of coursework in assessment, there are two influences that are potentially powerful in raising the grades achieved by students, and hence their honours degree classifications.

Award algorithms

The algorithms or mechanisms for converting students’ performances into honours degree classifications vary from institution to institution. This variation, discussed more fully in Chapters 3 and 6, can be seen, for example, in the basic classification methodology; the weighting given to final year performances; and the number of performances that can be ‘dropped’ in arriving at the classification. To these should be added less obvious contributory features such as the grading practices being adopted; institutional policy on condonement of, and compensation for, performances not meeting threshold expectations;30 policy regarding resitting failed assessments; the ‘capping’ of grades for repeated assessments; and the way in which claims that performances have been affected by extraneous factors (‘personal mitigating circumstances’) are handled. Following up the early work of Woolf and Turner (1997), a study by SACWG modelled the effects of dropping different numbers of modules from the classification process, and of changing the weighting of final year performances. This indicated that such changes could influence a substantial number of classifications across UK higher education (Yorke et al., 2004).

30 See Johnson (2004: 17ff) on this issue.
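The kinds of regulation mentioned above can be made concrete with a small, purely hypothetical sketch: the weightings, the number of dropped marks and the classification boundaries below are invented for illustration and do not represent any particular institution's algorithm.

```python
# Hypothetical honours classification algorithm illustrating two of the features
# discussed above: dropping the weakest module marks and weighting the final year.
def classify(year2_marks, year3_marks, drop=2, final_year_weight=2.0):
    """Return an honours classification from percentage marks (illustrative rules only)."""
    weighted = [(m, 1.0) for m in year2_marks] + [(m, final_year_weight) for m in year3_marks]
    weighted.sort(key=lambda pair: pair[0])   # ascending by mark
    kept = weighted[drop:]                    # discard the weakest marks
    mean = sum(m * w for m, w in kept) / sum(w for _, w in kept)
    if mean >= 70:
        return 'first class'
    if mean >= 60:
        return 'upper second (2.1)'
    if mean >= 50:
        return 'lower second (2.2)'
    return 'third class'

# Example profile: the weighted mean after dropping the two weakest marks is 63.75
print(classify([48, 55, 58, 62, 64, 66], [59, 61, 63, 65, 67, 70]))  # upper second (2.1)
```

Because both the number of dropped marks and the final-year weighting are parameters, a sketch of this kind also makes it easy to see how the modelling reported by SACWG could shift borderline candidates across a classification boundary.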
UK honours degree classifications 103 Anecdotal evidence has indicated that institutions benchmark (with varying degrees of formality) their classification practices against those of comparable institutions. Where there has been a belief that institutional practices are disadvantaging their students, these have been adjusted in the interests of equitable treatment. The consequence, therefore, is that awards will be edged upwards. Institutional considerations The recruitment of students in the UK has become increasingly competitive during the last couple of decades. The more prestigious institutions have found recruitment buoyant, whereas some of the less prestigious institutions have had to work harder to enrol the numbers of students that they need for viability. The competitiveness has been particularly evidenced in research, where successive research assessment exercises (RAEs) have had the effect of concentrating research activity and funding council funding in the more prestigious institutions. Institutions that have been keen to develop their research profile from a relatively limited base at the beginning of the 1990s have found their ambitions at first encouraged by the fact that they could enter the RAE process and gain rewards from so doing, but later discouraged (save for specialist niches) as the implications of the tightening of funding based on RAE performance began to take effect. The RAE has had effects on institutions’ portfolios of subjects. In recent years there has been a steady trickle of departmental closures where institutions have judged that their RAE performances and/or recruitment have not met expectations. Those in Mathematics and the physical sciences have been highlighted because of their implications for the national science base. In a number of cases, the provision has been reassigned to, and given a more applied emphasis in, areas in which the institution’s performance has been stronger. The effect on the analyses in this chapter may have been to concentrate the most able students (as judged by entry qualifications) in those institutions in which these subjects are strongest – i.e. the most elite institutions – and hence to accentuate the rising trend of ‘good honours degrees’ in them. The past decade or so has seen the inception and spread of ‘league tables’ (rankings) of institutions in the UK. Strong student performances, now typified by the ‘good honours degree’ although in earlier tables the proportion of first class degrees was used as a measure, are a feature. The fuss that arose following the THES’s revelation, noted above, that a vice-chancellor had written to external examiners about the proportion of ‘good honours degrees’ has to be seen in this context. Institutions understandably want to be seen in the best light possible, and there is anecdotal evidence that some have sought to improve their position by taking actions that will lead to this – for example, by spending more on library and computer provision. The scope for improving the profile of student classifications is more limited, since the external examiner system is charged with ensuring that standards are upheld. However, external examiners have to work within institutionally determined parameters, and have become more remote from students’ actual work as programmes of study have become modularized and their role has
shifted away from looking at samples of students' work and towards assuring the integrity of the processes for assessment and award.
No simple explanation The data in this chapter demonstrate a general rise in the proportion of good honours degrees, and that there are differences between universities as regards the trends of their degree results. The reasons for these findings are likely to be complex, deriving from a number of changes in the higher education system. There is always a temptation to make the simplistic leap of concluding that the rise in proportion of good honours degrees is a manifestation of an unwarranted inflation in the awarding of grades, but this should be resisted. In particular, it should be resisted because of the changes in demand on students that have taken place over a couple of decades – the modularization of curricula, the emphasis on learning outcomes, and the shift towards coursework and away from examinations are three influences that could have contributed in considerable part to the observed trend. The emphasis that is given to honours degree classifications by both the academy and other stakeholders is misplaced. What is really important is what students have learned from their time in higher education, and what they can do as a result. The honours degree classification is a poor index of both.
Chapter 5
How real is grade inflation?
The perception of grade inflation
Grade inflation is perceived internationally to be a problem. It turns up frequently in press reports of educational performances at all levels, usually with 'shock, horror' connotations, perhaps prompted by reports of various kinds that grade-point averages, school examination grades or honours degree classifications are rising. Rising grades are almost automatically equated with 'inflation', which is seen as undesirable. This chapter discusses perceptions of rising grades as inflationary before moving on to consider factors that could account for rising grades, some of which may not deserve the pejorative connotation of 'inflation'.
When B+ stands for bad . . .
In the year 2000 the Education Policy Committee at the University of North Carolina at Chapel Hill submitted a report to Faculty Council on grade inflation. The report, unusually for such documents, began with an exchange between a faculty member and a former student who had been operating a cash register at Foster's Market:
I took your course last year and it was the worst experience of my life.
Oh?
Well, I mean, I enjoyed the course and I learned a lot, but it just about destroyed my GPA.
(Fearing the worst) What grade did you receive?
A B+.
(Education Policy Committee, 2000: 1)
Clearly, for this student anything less than an 'A' grade of some sort was a near-disaster.
Rising grades Grades in the US school sector have generally been rising. Carr (2005) notes that, from 1990 to 2000, the mean GPA of high school graduates increased from 2.68 to 2.94, and this was visible across a range of demographic variables. However, scores in Mathematics and Science had declined since 1996. In the UK, school pupils’ performances in external examinations have risen steadily since the 1990s, with a widening ‘performance gap’ favouring girls over boys. The ‘marker’ indices are the percentages of school pupils aged 15 gaining five General Certificate of Secondary Education (GCSE) passes at grades A* to C, and the percentages of students aged 16–18 gaining at least two passes at A-level or equivalent at school or further education college.1 What often escapes notice, however, is that curriculum structures and assessment approaches have evolved over time. There is an annual tussle regarding the A-level results from schools and further education colleges: are they an improvement in performance or evidence of an easing of standards? The popular press prefers the latter as an explanation, but perhaps simplistically. The same general kind of argument takes place about grades and standards in higher education. A number of writers in the US have pointed to a general upward trend in grades. For example, Juola (1980), who analysed data from 180 colleges, found that the mean GPA increased by nearly half a letter grade between 1960 and 1974, but levelled off during the following four years. Levine and Cureton (1998), who compared data from surveys conducted in 1969, 1976 and 1993, found an upward skew in grading which increased the percentage of As from 7 to 26 and decreased the percentage of Cs by a roughly corresponding amount (from 25 to 9). Kuh and Hu (1999), who analysed 52,000 grades from the mid-1980s and mid-1990s, found an upward shift of close to a third of a letter grade, though the rise varied between particular educational contexts – in sciences, for example, the rise was lower than for the humanities. Rojstaczer’s compilation of data from 29 institutions2 shows an increase of around 0.6 of a letter grade over the period 1967–2001, though the sharpest increases appear in the 1960s to mid-1970s and after the mid-1980s, with little increase in the intervening period. The Education Policy Committee (2000) at the University of North Carolina – Chapel Hill [UNC–CH] produced a graph almost identical in shape to that on Rojstaczer’s website, using data drawn from the university’s archives. Rojstaczer reported that the rise in GPA was greater in private institutions than in their public counterparts – a matter of concern to UNC–CH, whose Education Policy Committee commented: 1 Data can be found on the Trends in education and skills website of the Department for Education and Skills, www.dfes.gov.uk/trends, by following the link to ‘Attainment and outcomes’. 2 Institutions whose data are included on Rojstaczer’s website www.gradeinflation.com: Alabama; Cal-Irvine; CSU-Hayward; CSU-San Bernardino; Carleton; Dartmouth; Duke; Florida; Georgia Tech; Hampden-Sydney; Harvard; Harvey Mudd; Kent State; Minnesota; Nebraska-Kearney; N Carolina; N Iowa; N Michigan; Pomona; Princeton; Purdue; Stanford; Texas; Utah; U Washington; UW LaCrosse; Wheaton; Williams; Winthrop.
How real is grade inflation? 107 The flood of As from the private colleges could well have an impact on our graduates’ success in gaining graduate school admission or prestigious jobs. We will never be able to win a grade escalation war with our brethren in private colleges and universities, who clearly must be using high grades as a partial justification for their students’ paying tens of thousands of dollars for an education available at UNC–CH for a fraction of the cost. (Education Policy Committee, 2000: 10: emphasis as in the original; footnote omitted) Rosovsky and Hartley (2002) noted that Scholastic Aptitude Test (SAT) scores in the US had declined by 5 per cent and yet grades in higher education had risen: prima facie evidence, they claimed, that grade inflation was occurring. However, matters may not be as simple as they appear. Adelman (forthcoming) points out that SAT scores are an inadequate entry measure because they relate to general learning and not to the specifics of learning in a particular programme of study.3 Further, many students do not have SAT or American College Testing (ACT) scores because they have entered higher education via varied routes in which SAT and ACT scores play little part. Hence the use of such scores as baseline reference points will introduce a bias into considerations of the existence of grade inflation. Analyses showing a rise in grades have prompted host of writers in the US to argue (in some cases, rail) that grade inflation and ‘dumbing down’ are pernicious threats to the integrity of higher education, amongst them Bloom (1988), Stone (1995), Rosovsky and Hartley (2002), Johnson (2003), and Rojstaczer via his website www.gradeinflation.com. Marcus (2002) reported concerns over the bandwagon effect of grade inflation, and the desire of some institutions to bring it to a halt. Subsequently, he reported on efforts by Harvard University to rein back on the award of the highest grades (Marcus, 2003) – efforts which Lewis, former Dean of Harvard College, suggests quickly petered out (Lewis, 2006). Elsewhere, Brendan Nelson, then Minister for Education, Science and Training in Australia, referred to claims of courses lacking intellectual rigour, the dumbing down of universities, and ‘softmarking’ (Nelson, 2002: para 97). James (2003), also writing of Australia, referred to pressure from academic managers to pass underachieving fee-paying students – a charge echoed in a press report about overseas students being given ‘top-ups’ to their grades at an institution in the UK (Baty, 2006). Walshe (2002) reported that the National University of Ireland (NUI) was considering a small reduction in the threshold mark for upper and lower second class degrees in order to align classification bands with levels consistent with UK practice, having been informed that its graduates were being awarded fewer such classifications than their UK equivalents. The university felt obliged to deny that this would be equivalent to grade inflation. 3 Shavelson and Huang (2003) discuss, inter alia, the tension between general and domain-specific in so far as assessment is concerned.
108 How real is grade inflation? In the UK, the proportion of ‘good honours degrees’ (i.e. first class and upper second class degrees) has risen over the period 1992–93 to 2004–05, the recent flattening of the rise being accompanied by a slight shift towards a greater number of first class degrees.4 Earlier research had shown that the modal class of honours degree rose from lower second class to upper second class over the period 1973–93 (HEQC, 1996a). This study related largely to the university sector of the time, covering what are often now termed ‘pre-1992 universities’.5 Analyses not reported in this book show that the modal class of honours degree in post-1992 universities edged upwards between 1995 and 2002. For some, grade inflation is a reality that is beyond question. Walvoord and Anderson (1998) remark, as if with a shrug of the shoulders, that it exists and nothing much can be done about it: Grade inflation is a national problem and must be addressed by institutions in concert at the national level. Individual teachers cannot address the problem in isolation; all you can do is use the coin of the realm. (Walvoord and Anderson 1998: 12) Later in this chapter it will be argued that the situation regarding grade inflation is more complex than much of the proffered argument and commentary suggests.
Definitions of grade inflation
What actually is grade inflation? The term has been given different colourings by different writers, and there are a number of definitions of grade inflation in the literature. The following give an indication of the range. McKeachie's (2002: 103n) definition – the fact that the average grades in American colleges are now higher than they were 40 years ago – is clearly inadequate, and the inadequacy is exposed when, in another footnote on the following page, he acknowledges that there has been no inflation subsequent to the 1970s. Milton et al. (1986: 29) define grade inflation as 'when a grade is viewed as being less rigorous than it ought to be', with grade deflation being associated with 'more' rigour. This leaves completely open the nature of the judge and the criteria
4 See www.dfes.gov.uk/trends/index.cfm?fuseaction=home.showChart&cid=5&iid=34&chid=147.
against which the grade is awarded, and hence renders the definition all but meaningless. One might infer that, by 'less rigorous', Milton et al. really mean 'more generous',6 which leads to the kind of definition found in a number of sources, amongst them being:

an increase in reported grades unwarranted by student achievement (Stone, 1995)

an upward shift in students' grade-point averages without a similar rise in achievement (Kohn, 2002)

a process in which a defined level of academic achievement results in a higher grade than awarded to that level of achievement in the past. (Birnbaum, 1977: 522)

These miss out the possibility that, whilst grading might be steady, the actual level of student performance may be lower. This could be a consequence of the funding of institutions on the basis of the achievements of their students ('outcomes-based funding').7 Also missing from the definitions is the possibility of grades and student performances improving, but at differing rates. Prather et al. (1979: 14) had earlier defined grade inflation as

a consistent, systematic increase in average grades received by students in a particular course, after taking into account the students' major fields and background.

They go on to explain their use of 'systematic' as a rising trend in grades, arguing that, if the trend is maintained over a period of five years, 'there is stronger evidence of inflation in grading practices, as compared to annual fluctuations' (ibid.: 14). The possibility of other influences on grades, such as better teaching, is not considered. There is a similarity between the findings of Prather et al. and those of Duke (1983), who in a detailed single-institution study found increases in GPA that were 'large and stable across time' (ibid.: 1026), though there were variations at the component college level and greater variations at the level of the subject discipline. Young (2003) defines grade inflation as

a practice among universities and colleges to deflate the actual, real value of an A, so that it becomes an average grade among college and university students.

6 In any case, the rigour should refer to the process of grading, rather than to the grade itself.
7 It has been suggested by an anonymous, but authoritative, source that wānanga [tertiary education institutions providing programmes in the Maori context] in New Zealand may provide a case in point.
110 How real is grade inflation? There is an implicit hint of conspiracy to deceive in Young’s definition, which may not have been intended. Indeed, it is doubtful whether institutions consciously seek to deflate the value of their grades (if an ‘A’ is deflated in value, then other grades will be similarly affected), since in the longer term it would serve their interests ill: a more plausible interpretation is that, where grade inflation exists, the grades suffer collateral damage arising from policy and practices adopted for other purposes. To get round the variations possible in grade inflation, it is better to construe grade inflation as an increasing divergence between the grade awarded and the actual achievement, with the former exceeding the latter. Of course, this carries implicit assumptions about demographic equivalence, the baseline for measurement, the relationship between achievement and grade, and the stability of what is being measured. Ideally, and impossibly, a judgement about the existence of grade inflation requires stability in students’ characteristics at entry into higher education, stability in curriculum content, stability in assessment instrumentation, and stability in assessors’ grading behaviour. Since all of these parameters change with time (and because society evolves, this should be an expectation and not an aberration), and because statistical control is problematic, judgements about the existence of grade inflation become highly inferential. A major problem with allegations of grade inflation is that what may have been an adequate yardstick in the past may be inappropriate for the present. The knowledge base of any subject develops over time and hence curricula have to respond. This means cutting out some material in order to permit the inclusion of new material. Such changes make comparisons over time difficult. For example, Kahn and Hoyles (1997) found that, between 1989 and 1996, the content of single honours programmes in Mathematics in England and Wales had been broadened at the expense of depth, and that the basis of assessment had changed to include more interim assessments. Curricula in the UK generally seem to have taken a more instrumental turn in which there has been a greater emphasis, pressed by government, on the development of students’ ‘employability’. This has changed, in an almost unnoticed way, the nature of many bachelor’s degree programmes towards greater breadth, with postgraduate study being used to develop specific aspects of the subject in greater depth.
Adverse effects If there is grade inflation, or even if there are merely unwarranted perceptions that grade inflation is occurring, the currency of grades and awards becomes distrusted, and likewise the educational system becomes distrusted. The adverse effects of grade inflation (real or perceived) include the following. Entry to higher education Adverse effects may occur at either end of the ability spectrum, using performances prior to higher education as the yardstick. Students entering higher educa-
How real is grade inflation? 111 tion from school (or equivalent college) will have greater difficulty if their credentials lead to institutions inferring greater ability than is warranted. Institutions catering for students with weaker, but apparently satisfactory, entry qualifications are increasingly having to put on classes designed to prepare students for the rigours of study in higher education, sometimes using pre-entry summer schools and sometimes using ‘level zero’ or ‘remedial’ study units. In respect of the US, for example, Carr (2005) noted some concern regarding the inconsistency between high school grades and students’ preparedness for higher education: it was paradoxical that the average high school student graduated with an average grade of B, whereas 28 per cent of college entrants were enrolled in remedial classes in reading, writing or Mathematics. A major concern of some is that, as grades rise (for whatever reason), they have less discriminating power, particularly as regards selection for a further stage of education or employment (e.g. Rosovsky and Hartley, 2002: 4). As regards grades obtained at school, the ‘top’ universities in the UK claim that it is getting harder to identify the very best performers at A-level because of the large number of students obtaining A grades. Their pressure for better methods of discriminating has been manifested in a prompt towards the inclusion of extra questions on the A-level examination papers that would allow the most academically talented to demonstrate their capabilities. In England, a number of elite law schools felt it necessary to propose a separate entry test in order to provide greater discrimination (making the questionable assumption that the most effective lawyers come from the ranks of the most academically gifted).8 Lessened motivation If those from a disadvantaged background who intend to enter higher education understand that some compensating allowance can be applied in respect of their entry qualifications, they might decide that a good performance is sufficient, and not push themselves to achieve the excellent level of performance that might be within their compass (see comments by Nancy Weiss Malkiel in interview with Hedges, 2004). This is part of a hotly debated issue in the US about access to higher education, as the courts have seen in high profile cases such as Allan Bakke’s challenge to the University of California Medical School at Davis that began in 1974, and the case of Barbara Grutter regarding her denial of admission to the University of Michigan’s Law School in 1997. Advocates of affirmative action (e.g. Bowen and Bok, 1998) are challenged by those who argue that affirmative action does not always deliver results consistent with expectations (e.g. Dale and Kruger, 2002). The same kind of argument bears upon entry to graduate school.
8 See Berlins’ (2004) tart comment to this effect on p. 165, which recalls Hudson’s (1967) findings about successful Englishmen (quoted in Chapter 1 above).
112 How real is grade inflation? Less meaningful information Grade inflation changes the codings of performances compared with those of the past. An A today does not necessarily convey what an A did in the past. Academics by and large understand the shifts of meaning that have taken place, but this understanding may not extend to students and interested parties outside the academy (primarily employers). Consequences of misunderstanding may be misplaced ambition on the part of the student and unsatisfied expectations on the part of employers, and ultimately disappointment. As Birnbaum (1977: 537) foreshadowed from the US, some employers in the UK hark back to a graduate’s A-level grades and other indicators rather than trust the degree classification that they have gained, on the presumption that because they are externally set and marked they are a more objective measure of a student’s abilities, and implicitly demeaning whatever gains they have made as a consequence of their time in higher education. An alternative open to the betterresourced employers is to conduct supplementary assessments of various kinds. One approach, used by a number of ‘blue chip’ companies, is to use an assessment centre in which applicants are subjected to a battery of tests and activities in order to judge their suitability. However, Brown et al. (2004, Chapter 7) suggest that chance plays more of a part in the selection process for candidates who are not outstandingly strong or weak than might naively be supposed. Another approach is to place greater weight on informal evaluations, references and letters of recommendation, but these are particularly subject to unreliability: informal evaluations may lack appropriate criteria, and expressions of support may also be susceptible to linguistic inflation – perhaps because the writer is operating a pre-emptive defence against possible legal redress. Completion and retention rates If grades are inflated, some students pass study units when their performances might not merit it. The unwarranted successes will carry over into retention rates, and ultimately into graduation or completion rates, which contribute to rankings in the institutional ‘league tables’ that appear in the press (Gater, 2002: 6). An additional issue arises when students are given good grades at the beginning of their programmes as an encouragement, only to find that their sense of security is misconceived when a more rigorous assessment regimen operates later on. Deferred failure is costly, in more than one sense.
How valid are perceptions of grade inflation?
As far as grade inflation is concerned, it is often the case that percipi is esse – to be perceived is to be – in an inversion of Bishop Berkeley's dictum. If grades are rising, two questions need to be asked of perceptions or claims of grade inflation:
• Are they simplistic, making the assumption that a rise is necessarily inflationary?
• Are they inflected (or should the word be 'infected'?) by value positions that could be political in character?
In much of the writing on grading that has been cited above, the answers are two fairly strong affirmatives. There is a need to subject the perceptions to analyses that are less intemperate than those presented by some rather shrill voices. A cooler approach to the evidence suggests that the claims of grade inflation may not be as sustainable as some advocates would have it. Adelman (forthcoming) sharply criticizes much of the work alleging grade inflation on the grounds of bias in the selection of evidence and other methodological inadequacies. He draws on data from his earlier report (Adelman, 2004) to show that, for the classes of 1972, 1982 and 1992, and using evidence from transcripts, there were relatively small movements in mean GPA in higher education in the US. The percentages of A and B grades are given in Table 5.1.

Table 5.1 Mean percentages of A and B grades of undergraduates who graduated from high-school in the stated years

Grade   Class of 1972   Class of 1982   Class of 1992
A            27.3            26.1            28.1
B            31.2            32.8            29.9

Source: Adelman (2004: 78).

Adelman's data suggest that fluctuation in the percentages of Pass, Credit, Withdrawal and No-Credit Repeat grades appears to have contributed indirectly to the fluctuation in the percentages of A and B grades. A shift to pass/fail grading for a number of low-credit tariff and remedial classes may have eliminated a number of what might have been A grade outcomes. When Adelman compared data from different types of institution for the classes of 1982 and 1992, the situation was clouded by much higher percentages of Pass grades, coupled with withdrawals and no-credit repeats, for the class of 1992. The percentage of A grades went up for highly selective, selective and nonselective institutions, and fell slightly in the 'open door' institutions. In all four groups of institutions, however, the percentage of B grades fell (Adelman 2004: 80). However, if the Pass grades, Withdrawals and No-Credit Repeats are eliminated from calculations of the percentages of grades A to F, there is a 5 percentage point rise in the total proportions of A and B grades combined for the selective and nonselective institutions, and zero rise for the highly selective and open door institutions. The selective and nonselective institutions constitute around two thirds of the institutions in Adelman's dataset. There are hints of a partial similarity with Birnbaum's (1977) analysis of data collected at the University of Wisconsin Oshkosh, in which student performances in 1969 and 1974 were compared. Birnbaum concluded that the only plausible explanation of the grade increase (the mean GPA rose from 2.44 to 2.86) lay in the increased use of pass/fail assessments and non-penalizing withdrawal.9
9 Singleton and Smith (1978) likewise speculated that the introduction of pass/fail grading at the University of California Riverside might have contributed to a marked rise in GPA.
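The arithmetic underlying Adelman's recalculation can be made concrete with a toy example. The figures below are invented, not Adelman's data, and the sketch shows only the within-cohort effect of the denominator: removing Pass, Withdrawal and No-Credit Repeat entries raises the apparent share of A and B grades even though no letter grade has changed.

```python
# Toy illustration (invented figures) of how excluding non-letter outcomes from
# the denominator changes the apparent share of A and B grades.
from collections import Counter

outcomes = Counter({'A': 260, 'B': 310, 'C': 200, 'D': 60, 'F': 40,
                    'Pass': 50, 'Withdrawal': 20, 'NoCreditRepeat': 10})
LETTER_GRADES = {'A', 'B', 'C', 'D', 'F'}

def share_of_a_and_b(counts, letter_grades_only=False):
    """Percentage of A and B grades, with or without non-letter outcomes counted."""
    if letter_grades_only:
        denominator = sum(n for grade, n in counts.items() if grade in LETTER_GRADES)
    else:
        denominator = sum(counts.values())
    return 100.0 * (counts['A'] + counts['B']) / denominator

print(round(share_of_a_and_b(outcomes), 1))                           # 60.0
print(round(share_of_a_and_b(outcomes, letter_grades_only=True), 1))  # 65.5
```

The same mechanism, operating differentially across types of institution and across cohorts, is what complicates the between-cohort comparisons discussed above.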
However, the demographic context is likely to have played a part in shaping the data from these studies, and it should be noted that GPAs tended to be rising for the class of 1992 from a dip a decade earlier (Adelman, 2004: 81). Adelman (2004) subtitled Part 6 of his report 'a complex story', which seems accurately to sum up the situation. An important point, highlighted by Adelman's analysis, is that a failure to include grades other than A, B, C, D and F (and, where appropriate, the + and – affixes) risks biasing interpretation and conclusions. For example, it is impossible to be sure, on the evidence presented by Manhire (2005), that grading has continued to rise at Ohio University between 1993 and 2004, or that it has flattened off in the Russ College of Engineering and Technology at the same university, since the analyses are limited to the five listed grade levels.
A methodological artefact
Adelman (forthcoming) was able to make a fairly close, but not exact, comparison between the student self-reported grades from between the mid-1980s and the mid-1990s that were analysed by Kuh and Hu (1999) and transcript data from the High School & Beyond/Sophomore cohort (students who were in the tenth grade of school in 1982, whose college transcripts were collected in 1993) and the National Education Longitudinal Study of 1988 (students who were in the eighth grade in 1988, whose transcripts were collected in 2000). Whereas Kuh and Hu reported GPAs of 3.07 and 3.34 for the mid-1980s and mid-1990s groups respectively, the GPAs from the contemporary transcript samples were 2.79 and 2.99. Though GPAs rose with time, Adelman draws attention to the differences between the self-reports and the transcripts, making the point that self-reporting is likely to lead to higher grades than transcripts show. Although the use of transcripts is not entirely free of problems, Adelman points out that, especially when the transcripts are collected in a way that precludes the screening-out of cases, their validity and reliability are higher than those of self-reported grades. In other words, considerable caution is needed when evidence is presented in the form of self-reported grades, even when studies are otherwise statistically robust.
Origins of grade inflation Rosovsky and Hartley (2002) suggest that there is a widely held view that grade inflation in the US began in the 1960s, when academics opposed to the Vietnam war and the associated draft into the armed forces were reluctant to award low grades to young men since, if low grades forced these students to leave higher education, they would be required to undertake military service. Whilst there may be an element of truth in this, it may be as accurate an observation as the poet Philip Larkin’s that sexual intercourse began in 1963. The view can be challenged from two directions. First, systematic data were not collected in the US until the early 1970s (Adelman, 2001; forthcoming), so any data purported to represent the
How real is grade inflation? 115 national picture preceding that time are suspect. Second, where long-term intrainstitutional trend data can be obtained, there is some evidence that grades have been rising since well before the 1960s – though this may apply only in particular institutions. Lewis (2006) quotes the concern of the Committee on Raising the Standard at Harvard in 1894 that in the present practice Grades A and B are sometimes given too readily – Grade A for work of no very high merit, and Grade B for work not far above mediocrity. (Lewis, 2006: 115) and he draws on archives at the University to argue that grade inflation had been occurring throughout the twentieth century (and implicitly that this had been true of other comparable institutions; see Lewis, 2006, Chapter 5). Adelman’s (2004) study of the classes of 1972 to 1992 draws attention obliquely to the low proportion of the total number of grades awarded that derive from highly selective institutions (3.8 per cent in 1992): much of the argument relating to grade inflation is fuelled by the grades in this small but high-profile segment of the US higher education system, with what is going on in the less selective institutions passing relatively unnoticed. Adelman (2001) cautions against using what may be happening in elite institutions and using this as a proxy for the national picture. Elsewhere (Adelman, forthcoming) he takes issue with commentators such as Rojstaczer for the selectivity and hence bias in their sampling. Earlier, he had made his point in the following terms: The situation is analogous to what happened in the Boston Marathon when the rules changed, and the pool expanded by lottery. The size of the field is nearly double what it was when my friends and I stood along Beacon Street near Washington Square in Brookline, with oranges and water and applause for those who made it that far. What do you think has happened to the mean time of completion of the race? To the standard deviation of that time? The answers are just as common-sensical as those about grading and student performance in a system of higher education that has expanded by 40 percent since the early 1970s. (Adelman, 2001: 25)
What could inflate grades? There are many possible influences bearing on the rise in students’ grades, which is often – and uncritically – taken to signal grade inflation. They are to some extent intertwined, and unravelling the tangle of causes is a challenge comparable to that of untying the Gordian knot, with no Alexandrian sword available to resolve the issue unequivocally. If grades have been inflated, then possible causes include the following:
• grading practices;
• student choice;
• easing of standards;
• avoidance of low grades;
• giving students a helping hand;
• student opinion;
• politics and economics;
• relativism.
Grading practices An overall grade, such as a degree classification or a GPA, depends on factors such as the kind of work undertaken for assessment (e.g. coursework or examination), how work is graded, the weighting given to component grades, and what is included and what excluded from consideration. The more an assessment is prespecified (for instance, by statements of intended learning outcomes), the more likely the profile of grades is to be skewed towards the upper end because students can better gauge what is expected of them and adjust their efforts accordingly. Grading schemes for pieces of work also have a bearing, in that totting up the marks for different component items can produce a different result from a more holistic approach to grading – the whole may be more or less than the sum of the parts. There is evidence from the UK (Johnson, 2004) that institutional assessment practices at the ‘macro’ level of awards have evolved over time to suit contemporary needs, with, for example, the range of modules that can be legitimately ‘dropped’ from the determination of honours degree classification narrowing as ‘outliers’ realized that their practices were some way distant from the norm. There is also some reliable but informal evidence that institutions have adjusted their assessment regulations when they have perceived themselves to have been out of line with cognate institutions, for example, in the number of ‘good’ honours degrees awarded. Adelman’s (2004) study of the grades obtained by the classes of 1972 to 1992 showed that there had been a trend towards a greater number of pass/fail grades over time, and this has an effect on calculations of GPA. Another aspect of grading practice that affects outcomes (though probably more cross-sectionally than longitudinally) is the differences that exist between subject disciplines. Johnson (2003) noted that research (Goldman and Widawski, 1976; Strenta and Elliott, 1987; Elliott and Strenta, 1988) had shown that the mean GPA in Mathematics and the sciences tended to be lower than the mean for other subjects. Further, Goldman and Widawski had shown that students’ GPAs from higher education showed strong negative correlations with measures at high school level. Johnson inferred that grading was more stringent in subject areas that attracted students of higher general ability, and more lenient in areas attracting weaker students. He concluded from his study of grading that ‘Grading practices differ systematically between disciplines and instructors, and these disparities cause serious inequities in student assessment’ (Johnson, 2003: 237).
In addition to Johnson's reportage, there is plenty of evidence for disciplinary variation, including observations by Dale (1959); the single institution study conducted by McSpirit and Jones (1999); data analysed by Kuh and Hu (1999); data from a study at the American University in Cairo (Berenger, 2005); the single module data in Adelman (2004)10 and in this volume, Chapter 4; and the profiles of first degree classifications produced by the Higher Education Statistics Agency in respect of UK higher education (Table 5.2 presents data for awards made in 2005, the general picture of subject differences varying little over time). The data in Table 5.2 draw attention again to differences between subject areas as regards the distribution of marks or grades in higher education in the UK (see for example Yorke et al., 1996; Bridges et al., 1999; Yorke et al., 2005; and Chapter 2). The unitization of curricula, coupled with flexibility of choice on the part of the student, has brought to the fore the implications, for the honours degree classification, of combining subjects which may differ (amongst other things) in both marking tradition and type of assessment demand. For example, modules that produce wide 'spreads' of grades will exert greater 'leverage' over aggregated results than those whose grade spreads are narrow. Hence there are obvious problems regarding comparability of performance. The variation between subject areas in Table 5.2 is quite marked. However, if a calculation along the lines of the US GPA is undertaken, giving the five classification levels weights of 4, 3, 2, 1 and 1 respectively,11 the averaging process diminishes the disparities quite considerably because the effects of spread (but not skew) are smoothed out.12 Languages, Historical and Philosophical Studies, and Mathematical Sciences have the highest GPAs, at 2.79, 2.78 and 2.71 respectively, whereas the lowest GPAs are found in Business and Administrative Studies, Architecture, Building and Planning, and Computer Science, at 2.36, 2.39 and 2.41 respectively. The point of this digression is to illustrate that comparing subject areas on the basis of their mean grade-points may hide significant distributional effects and thereby subdue the comparisons. Data from modular schemes in post-1992 universities shows that, at the finer level of the module, there is variation between and within subject areas regarding mean percentage mark, spread of marks, and skew of marks (Yorke et al. 1996; 2000; this volume, Chapter 2). These kinds of variation are well known by those who attend assessment boards in which marks from a range of subjects are presented. They reflect normative disciplinary practices and also the type of demand within a subject area (for example, in the area of Business Studies, modules on quantitative methods tend to exhibit wider mark spreads than those that set more discursive assignments).
10 See for example the comparisons of grades for modules with high enrolments in Adelman (2004: 82).
11 Both 'pass' and 'unclassified' degrees are treated here as being positive outcomes, though the former is typically awarded as a 'fall back' for students who do not satisfy the criteria for third class honours.
12 The subject areas which normally award unclassified degrees are not included in the following analysis.
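The 'leverage' point can be shown with a deliberately artificial example. The marks below are invented, not drawn from any of the studies cited; the sketch simply shows that, with equal weighting, the module with the wider spread of marks dominates an aggregated result.

```python
# A toy example (invented marks) of 'leverage': with equal weighting, the module
# with the wider spread of marks dominates an aggregated result.

# (narrow-spread module mark, wide-spread module mark)
student_x = (62, 45)   # stronger in the narrow-spread module, by 4 marks
student_y = (58, 75)   # stronger in the wide-spread module, by 30 marks

mean_x = sum(student_x) / 2   # 53.5
mean_y = sum(student_y) / 2   # 66.5

# The 4-mark advantage in the narrow-spread module is swamped by the 30-mark
# difference in the wide-spread module, so the averages diverge sharply.
print(mean_x, mean_y)
```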
Table 5.2 Percentages of bachelor's degrees gained in the UK, by broad subject area, summer 2005

Subject area                              Number of   Class of degree (percentage of total number of awards)
                                          awards      First   Upper second   Lower second   Third and Pass   Unclassified
Medicine and dentistry                      7,445       4.5       13.1            2.0             5.0            75.4
Subjects allied to medicine                27,880      11.4       40.7           27.8             6.7            13.4
Biological sciences                        27,200      10.6       48.2           31.9             6.4             2.9
Veterinary science                            690       4.3        8.7            3.6             2.2            81.2
Agriculture and related subjects*           2,225      10.8       40.9           33.3             7.0             7.9
Physical sciences                          12,530      17.4       41.5           29.0             8.7             3.4
Mathematical sciences                       5,270      26.0       33.6           25.6            11.5             3.3
Computer science                           20,095      13.0       34.5           32.7            12.9             6.9
Engineering and technology                 19,575      17.3       36.8           28.4             9.3             8.2
Architecture, building and planning         6,565       8.4       39.6           34.9             8.1             9.1
Social studies                             28,825       8.8       49.2           32.0             6.1             4.0
Law                                        13,735       5.0       49.2           36.6             6.2             3.1
Business and administrative studies        42,190       6.9       39.3           37.1            10.4             6.3
Mass communications and documentation       8,890       7.2       51.3           33.6             4.3             3.5
Languages                                  20,025      12.8       57.7           24.7             3.2             1.6
Historical and philosophical studies       15,480      12.0       58.7           24.7             3.3             1.3
Creative arts and design                   30,610      11.6       47.6           31.7             6.5             2.5
Education                                  10,615       7.7       42.7           37.2             6.8             5.6
Combined subjects                           6,510       2.2       12.4            8.9             3.5            73.0

Source: the Higher Education Statistics Agency at www.hesa.ac.uk/holisdocs/pubinfo/student/quals0405.htm (accessed 15 August 2006).
Note: Medicine and dentistry, veterinary science and combined subjects are atypical in that their norm is to award unclassified degrees.
* Total does not sum to 100 per cent because of rounding.
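The GPA-style calculation described before Table 5.2 can be reproduced directly from the tabulated percentages. The sketch below is illustrative only: it applies the weights of 4, 3, 2, 1 and 1 to the six subject areas singled out in the text, and yields values that round to the figures quoted there.

```python
# A sketch of the GPA-style index described in the text: classification percentages
# from Table 5.2, weighted 4, 3, 2, 1 and 1 (third/Pass and unclassified both
# treated as positive outcomes) and averaged.

TABLE_5_2 = {
    # (first, upper second, lower second, third and Pass, unclassified)
    'Languages':                            (12.8, 57.7, 24.7,  3.2, 1.6),
    'Historical and philosophical studies': (12.0, 58.7, 24.7,  3.3, 1.3),
    'Mathematical sciences':                (26.0, 33.6, 25.6, 11.5, 3.3),
    'Business and administrative studies':  ( 6.9, 39.3, 37.1, 10.4, 6.3),
    'Architecture, building and planning':  ( 8.4, 39.6, 34.9,  8.1, 9.1),
    'Computer science':                     (13.0, 34.5, 32.7, 12.9, 6.9),
}
WEIGHTS = (4, 3, 2, 1, 1)

def gpa_style_index(percentages):
    """Weighted mean of the classification percentages, on a 1 to 4 scale."""
    return sum(w * p for w, p in zip(WEIGHTS, percentages)) / 100.0

for subject, percentages in TABLE_5_2.items():
    print(f'{subject:38s} {gpa_style_index(percentages):.3f}')
# Rounded to two decimal places, these reproduce the figures quoted in the text:
# 2.79, 2.78 and 2.71 for the three highest, and 2.36, 2.39 and 2.41 for the three lowest.
```

As the text notes, this averaging smooths out differences in spread, which is why the marked disparities visible in Table 5.2 appear more muted when expressed on a grade-point scale.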
How real is grade inflation? 119 Criterion-referenced assessment approaches may be thought to be relatively immune from idiosyncrasy because the criteria are laid out for all to see. Wolf (1995) and Webster et al. (2000), however, have shown that criteria are understood in different ways by different people, allowing more subjectivity into the assessment process than had been anticipated. Studies by Hawe (2003) and Baume et al. (2004) have demonstrated that assessors can, at times, interpret the criteria in ways that favour the students – occasionally to the extent of providing an outcome that they felt was appropriate to the student despite overriding the outcome that the strict application of the assessment methodology would have required. An aspect of UK higher education that has been little studied is the way in which performances near to the division of degree classification boundaries are treated. If a student’s average mark is just below a boundary, the student may be awarded the higher classification automatically (by effectively ‘rounding up’ an average such as 69.6 per cent to 70.0), or by applying a secondary methodology based on the profile of the marks gained (Chapter 3). These actions are properly codified in institutional assessment regulations. Where the ‘rounding’ issue is less clear is when a student’s marks within a module of study fall into a similar borderline category. Student choice Where the grades awarded in respect of modules are made available to students, the student may factor the awarding record for the module of study into the choices that they make. The Education Policy Committee (2000) at UNC–CH and Johnson (2003) both make this point. Johnson evaluated a considerable amount of empirical evidence gathered from studies at Duke University and concluded that ‘The influence of grading policies on student course selection is substantial’ (Johnson, 2003: 193). Easing of standards Populist politicians and commentators find opportunities for pejorative comment on the devaluation of standards in higher education and the introduction of programmes seeking to address contemporary interests. Margaret Hodge, then Minister for Higher Education in the UK Government, reported by Woodward (2003), referred to ‘mickey mouse’ courses and Phillips (2003) to ‘the proliferation of absurd degrees like golf course studies’. For some, like Phillips, these ‘absurd degrees’ are associated with the policy of widening participation in higher education, and represent a new twist on Kingsley Amis’ (1960: 9) vehement assertion regarding the growth of numbers in higher education that ‘MORE WILL MEAN WORSE’. A decade or so ago, degree programmes in Media Studies were similarly pilloried but have subsequently proved not only to be popular with students and academically respectable, but also to be a good basis for gaining employment. The moral is that commentators need to look beneath programme titles to see what the demand on students actually is – and it should be remembered by
120 How real is grade inflation? commentators that roughly half of the advertisements for graduate-level jobs in the UK do not specify a particular subject of study (Purcell and Pitcher, 1996), on the grounds that the first degree develops a broad set of capabilities that can be usefully applied to a range of employment opportunities. From time to time, commentators claim that the knowledge and understanding of contemporary students are not as high as they used to be. Bloom (1988), for example, bewailed the decline in appreciation of the background of western civilization and culture. Against this perspective must be set the different kinds of expertise possessed by those who are, for example, fluent in the use of electronic media, or who have developed to a high level the practical skills associated with the ‘people industries’. However, the teaching approach adopted could have implications for grade inflation. The Education Policy Committee at UNC–CH pointed to the possibility of teaching as ‘spoon-feeding’. That may be a little strong, but when there is a tight alignment between expected learning outcomes, subject content and assessment there is always a risk that the student will undertake sufficient work to perform well on the assessments whilst paying relatively little attention to the ‘broadening’ that is generally expected from time spent in higher education. Avoidance of low grades Hawe (2003) found assessors reluctant to award ‘fail’ grades, because to do so had consequences for the assessor such as additional work, being subjected to rancour of the failed students, being blamed for poor teaching and being perceived as trouble-making. Goldenberg and Waddell (1990) noted that retaining ‘failing’ students was a stressor for around half of the academics in their study. Hawe reported that not failing students involved elements of ‘bending the rules’ to produce a result with which the assessor was comfortable13 (for instance, by ignoring regulations in order to give students additional time to fulfil the assessment requirements, or letting students off handing in assignments on the grounds that they were good students anyway). She also reported institutional managers as stressing the importance of following procedures carefully in order to avoid litigious consequences, and also the implications for institutional funding if a student failed. Some assessments were apparently eased because of the ethnic grouping from which the students came – a kind of affirmative action – and some passes were awarded as an encouragement, which muddles the formative and summative aspects of assessment. There is also the fact that some students see grades as reflecting their personal worth (Becker et al., 1968), which adds to the psychological stress of assessing performances as weak. It is perhaps not surprising that hard-pressed assessors take the easy way out and pass students, or avoid awarding low grades, because of the consequent possibility of having to justify (perhaps at length) failure or weak performances.14 They may, for similar reasons, 13 There are some similarities with the findings of Baume et al. (2004) relating to the assessment of portfolios. 14 Johnson (2003: 235) makes the point that a professor cannot assign grades of C or D without giving a justification (perhaps in response to a complaint) – which requires time and effort.
turn a blind eye towards possible cheating: Newstead (2002: 72) observes 'As anyone who has tried to pursue an accusation of cheating through the system knows, it can be very nearly a full-time job'. In professional environments, such as those experienced by students on various forms of placement or internship, a mentor/assessor is (in theory) in a much better position to comment on the mentee's actual performance than an occasional visiting assessor from an educational institution. However, the socialization implicit in the mentor/mentee relationship has considerable potential for biasing summative assessment (see, for example, Jones 2001; Watson et al. 2002). Hounsell et al. (1996: 69) note the risk that untrained assessors may, because they are delighted to have the assistance of someone on placement, over-rate students' performances. Where a student succeeds on placement, the inherent role ambiguity may not be a problem. In an unpublished study of students on foundation degree programmes in England, Yorke found that 13 of 87 who had had a mentor said that the mentor combined mentorship with the role of assessor, but that in only one instance this had caused a problem. Giving students on placement the benefit of the doubt is a persistent theme in the literature.15 This is done for a variety of understandable and perhaps sometimes interlinked reasons:
• a desire to encourage students' growth by awarding a pass;
• a nurturing rather than a judgemental academic climate, perhaps in part reflecting the point made by Becker et al. (1968) that students often perceive grades as indices of their personal worth;
• providing students with a second chance;
• affirmative action in respect of disadvantaged groups;
• a culture of strong support for colleagues when the assessor comes from the same workplace as the students;16
• avoidance of the hassle associated with the failing of students;
• to reduce the possibility of litigation.
A number of these points are not confined to assessment in a placement context. Where there is more than one placement, giving the benefit of the doubt tends to arise at first placement, where the assessor has to decide between an early fail and leaving the matter to be resolved at a subsequent placement. In a caring profession, where support for development is strongly embedded in the pedagogic approach, it is particularly difficult for an assessor not to ‘give the student a second chance’. The consequence is, though, that the student can reach a similar point at the next placement with the weaknesses unresolved, where the decision to fail becomes even more difficult because the student has already completed a substantial part of the programme. 15 See for instance Lankshear (1990); Baird (1991); Ilott and Murphy (1997); Hawe (2003); Furness and Gilligan (2004). 16 Lang and Woolston (2005) note that ‘mateship’ is (understandably) a strong part of police culture, which makes negative assessment difficult.
122 How real is grade inflation? Brandon and Davies (1979) sought to investigate the assessment of marginally performing students in social work programmes, and explained how difficult it was to address such a sensitive issue.17 Although their eventual sample of 35 such students exhibited a range of weaknesses in their performance (the modal number of weaknesses being four), 30 passed the fieldwork component of their programme, and a further two were awarded delayed passes.18 The default position seems to have generally been that, if there was no evidence of actual incompetence, the student should pass:19 an issue left hanging was whether there was sufficient time in the placements being studied for the students to demonstrate actual incompetence. Failing students is stressful for assessors: Ilott and Murphy (1997) refer to the mixture of emotions that can arise, including anxiety, guilt and relief. Failing a student is especially stressful when it has to be done face-to-face. As Ilott and Murphy note, therapists ‘look for the positive’ (p. 311). The result from a written examination is more distanced, psychologically, and the failing candidate can blame a variety of extraneous factors for the outcome. In addition to the possibility of challenges of various sorts, failing a student results in a loss of income to the institution. Giving students a helping hand The perceived significance of the Vietnam war for grading was noted earlier in this chapter. Some grade inflation in the 1960s was quite plausibly the consequence of acts of kindness to students in that they could avoid the draft if they obtained good grades. Giving good grades was also a way in which academics could quietly express their opposition to the war and also maintain their principles. McSpirit and Jones (1999) found that grade inflation at an open-access university in the US was greater for students with low entry qualifications (ACT scores; they were aware of the potential, noted also by Adelman, forthcoming, for bias in using measures of this sort as baseline data), and suggested that faculty were using grading to encourage a positive attitude to learning and/or students’ life-chances through the possession of the degree as an entry-ticket to employment. Teachers often evidence a concern for students’ life-chances, and with the recent slowdown in the economies of the western nations a concern not to disadvantage students could be manifested in generosity of grading. For example, Elias (2003) reported that the then economic crisis in Germany was generating pressure on professors to award inflated grades (Kuschelnoten, or ‘cuddle grades’) because 17 Their paper is essential reading for anyone seeking to study the assessment of marginal performance, irrespective of the subject area. 18 The assessments of the academic component of their programmes produced a higher proportion of failures. 19 Lankshear (1990), writing about nursing, suggests assessors do not fail incompetent learners unless there is very clear evidence of unsafe practice. However, Goldenberg and Waddell (1990), who surveyed a convenience sample of 70 nurse educators on baccalaureate programmes, found that more than half of their respondents failed students whom they deemed to be unsafe in practice.
they knew that students needed good grades if they were to compete successfully in the labour market. In a number of subject areas (Biology, Physics, Psychology, Mathematics, Philosophy and Chemistry) the mean grade was 'very good' on a scale ranging from 1 (excellent) to 5. However, in Law and Medicine, subjects in which students took written examinations, the grades awarded were lower, presumably because there was less scope in the assessment process for awarding Kuschelnoten. In the US, enrolment of students from minority groups increased markedly around the year 1980. Grades seem to have held relatively steady at around that time (Adelman, 2004). Rojstaczer says that this undercuts the argument that grade inflation arose from the operation of policies for affirmative action. Although this might be so, the possibility of inflation deriving from steady overall grades from a more disadvantaged enrolment cannot be discounted on this evidence.
Student opinion
It is in the US that students' evaluation of academics' teaching performance has particularly strong implications for tenure and promotion. Adjunct teachers may not gain tenure if their students give them poor evaluations. Johnson (2003) drew attention to the symbiotic 'grade leniency theory' in which students reward instructors with good evaluations in exchange for good grades (though not in such a crude way as might be inferred from the bald statement), and the interests of both parties are thereby served. Other mechanisms might also be in play. According to attribution theory, students attribute success to their own efforts, but poor performance to others such as their teachers; this could explain a positive correlation between students' grades and ratings given to teachers. However, an appeal to attribution theory probably has to be more subtle than this. Johnson (2003) suggests variants in which student characteristics can play a part:
• Students with high prior interest or motivation might devote more time to their studies and be more appreciative of their teachers' efforts.
• Teachers who deal with difficult content would be more likely to be rated highly by the stronger students.
Another theory discussed by Johnson (2003) is that students learn more from effective teachers and obtain higher grades, and that the consequential positive correlation between grades and evaluations is desirable. However, his discussion of the evidence indicates either that there is such potential for mediation, or that students in the quoted studies were subjected to the same teaching, that the ‘effective teacher theory’ is difficult to substantiate. After reviewing a series of experimental studies20 and acknowledging the difficulty in inferring causality from correlational studies, Johnson (2003: 82) concludes that teachers’ grading behaviour is causally related to student ratings 20 Questions might be raised about ethics in some of these.
124 How real is grade inflation? of their teaching. A report by the Education Policy Committee (2000) at the University of North Carolina – Chapel Hill pointed to a similar relationship following an analysis of extant institutional data. However, Marsh (1983) had earlier concluded, from a path analytic study using his SEEQ (Student Evaluation of Educational Quality) instrument, that an observed correlation between grades and evaluations of teacher-courses was not a consequence of grading bias. In other national systems, although good teaching is incorporated into institutional policies and procedures for recognition and reward, student opinions tend to be used in a more indirect way: in some institutions, for example, teachers can opt to supply evidence from feedback surveys as part of a case for promotion. Politics and economics Higher education has been marketized to a greater extent in the US than it has in Australia or the UK, though the trend in the last two has been towards students, rather than the state, funding their studies (save, as happens in the US, where the state is intervening to offset disadvantage). The marketization is reflected in the production of rankings of institutions which purport to offer guidance to prospective students: rankings have been produced for some two decades by US News and World Report, and have spread to a range of countries and even international comparisons, as a report by Usher and Savino (2006) demonstrates. It is highly debatable whether such rankings or ‘league tables’ are actually of much value to student choice (Yorke and Longden, 2005) even though it has been claimed that students in the UK make some use of them (Lipsett, 2006).21 One can imagine ‘league tables’ exerting some inflationary pressure on grades if they include measures such as the proportion of ‘good’ honours degrees (as do the tables in The Times and The Sunday Times in the UK), since these are variables under the direct control of institutions. The following example, whose origins are obscure, illustrates one institution’s concern to indicate that its students’ performances were of the highest standard, though it is inflected with a mystique that is all but impenetrable to outsiders. The chairman of examiners wrote to markers that ‘this year beta/alpha, alpha/beta and alpha/alpha/beta are functioning, respectively, almost exactly as beta/beta/alpha, beta/alpha and alpha/beta did last year.’ Markers were asked to adapt to the new expressions in the interests of making ‘Firsts look more like Firsts to outsiders’, and not to stint their ‘leading alphas’. Understandably, institutions look at statistics of the awards made at other institutions and take a view on whether theirs are roughly on a par with those that they see as peers. If not, they may seek to come into closer alignment (there is some 21 However, Lipsett’s comments appear to relate to a presentation by Kate Purcell (available at www. hecsu.ac.uk/cms/ShowPage/Home_page/Conferences_events/Changing_Student_Choices/ Keynote_speakers__sessions/p!eXepmck, accessed 24 November 2006) in which, although 30 per cent of students claim to pay attention to league tables, this is a much lower percentage than those for actually visiting the institution (approximately 55 per cent), institutional reputation and the availability of the desired course at the institution (both just under 50 per cent).
How real is grade inflation? 125 anecdotal evidence to that effect – see earlier in this chapter). By and large, when the ‘parity argument’ is invoked, it tends towards the raising of grades, rather than their lowering. Rojstaczer, on the www.gradeinflation.com website, suggests that the resurgence of grade inflation in the 1980s can be attributed mainly to the emergence of a consumer-based culture in higher education. As students’ costs of studying have risen,22 the notion of ‘purchasing a product’ has strengthened, with the consequence that expectations of value for money (here seen in terms of the grades awarded) have been raised. Further consequences are increased pressure on professors to be more generous in grading and to make curricula less demanding. Rojstaczer bases his conjectures on personal experience and anecdotal evidence, and acknowledges that it would be difficult, if not impossible, to find evidence that would confirm them. Drawing on a number of commentators, Manhire (2005) observes that the provenance of grade inflation has shifted from the political left (a consequence of the Vietnam war, the rise of postmodernist critiques, and so on) to the political right (the influence of consumerism, free markets and branding). He takes the view that elite institutions are relatively little concerned by bachelor’s-level study, with this being seen as a mere stepping stone to graduate school. He questions the ultimate value of inflated grades to professional engineering practice (his particular subject area) and to society in general: to what extent do inflated grades undermine the ‘candid evaluation of academic and professional performance’ that integrity demands (Manhire, 2005: 5)? A rather different suggestion was made by the Education Policy Committee (2000) at UNC–CH. The Committee suggested that internal institutional politics, in the form of attempts to maximize enrolments and resources, could be a contributory factor in the rise of grades at the University. However, it was unable to provide evidence to support the suggestion, which therefore must be treated as speculative rather than substantive. Relativism Johnson (2003) suggests that elite institutions might benchmark themselves against an assumed ‘average university’ and argue that, because of their elite status and enrolment profile, their students ought to obtain better grades. This represents an asymmetric twist on the ‘parity argument’ noted earlier.
22 Table 5a of Trends in College Pricing, 2004 (College Board, 2004) shows strong rises in tuition and fees for four-year private and public institutions: in raw dollars costs in private institutions rose by roughly 10 per cent year on year during the 1980s, but the rate of increase dropped back to about half this figure thereafter, whereas public two-year and four-year institutions had bursts of cost inflation above 10 per cent in the early 1980s, early 1990s and in the two most recent years (2003–04 and 2004–05). In dollars adjusted for the consumer price index (to a baseline of 2004), the inflation is high for all types of institution during the 1980s, with high increases also for the public institutions in the early 1990s and in the first half of the current decade.
Non-inflationary increases in grades

It is important to distinguish between grade inflation and grade increase. There have been hints earlier in this chapter that grades may increase for reasons that do not justify the label of inflation. Whether a particular increase is labelled inflationary or non-inflationary depends on the value system that is brought to bear. Possible contributors to what some would see as a non-inflationary increase in grades include:

• curriculum design;
• improved teaching;
• improved motivation and/or learning;
• changes in participation profile;
• students behaving 'strategically'.
Some factors may be attributable to particular institutions, and are given no more than a passing mention here. Relatively elite institutions may be able to recruit students whose preparation at high school for higher education has improved over time; they may be able to be more selective in admission and enrolment, and they might want to argue that any rise in grades is testimony to an increasing richness of the intellectual environment that they provide. Clearly, these factors vary in the extent to which they are amenable to empirical investigation. Curriculum design Where the curriculum is based on learning objectives, and the pedagogy and assessment are aligned (Biggs, 2003), students have a clear indication of what is expected of them and can consequently focus their efforts. This avoids students, when being summatively assessed, trying to guess what the assessor really wants from them, and is therefore fair to them. A very likely effect is that grade distributions will be shifted upwards, and approach the kinds of skewed profile that is associated with mastery learning, and there is a hint of ‘teaching to the test’ in a tight alignment of pedagogy and assessment. It is surprising that Rosovsky and Hartley (2002) ignored the implications of shifts in the direction of a mastery learning pedagogy in their critical analysis of grading in higher education, particularly since the possible effect of these had been noted in vocational programmes by Goldman (1985), Sabot and Wakeman-Linn (1991) and Lanning and Perkins (1995).23 The possible deleterious effect on student learning of having a narrow focus on outcomes was noted earlier (p. 120). The assessment regime may also give rise to higher grades, as was noted earlier for a shift towards coursework at the expense of traditional examination papers. Most students will produce better work when not under the pressure of the exami 23 These sources also cite other possible influences on grades.
How real is grade inflation? 127 nation room, and so there should be little surprise at the generally higher level of performance which may be a better index of achievement than is generated by traditional examinations, and one that may be more relevant for activities outside the cloistered environment of the academy. However, a large shadow looms over the use of coursework, in that it is more vulnerable to plagiarism and other forms of cheating than is a traditional examination, which is why it has been proposed in the UK that coursework at high school level should be undertaken on the school premises rather than at home. A partial – and, for academics, practically demanding – response to the ‘plagiarism problem’ is to design coursework tasks that cannot be completed by recourse to the World Wide Web’s almost limitless bank of material that can be ‘cut and pasted’ or purchased. The ‘honor codes’ in operation at some US institutions may not always be sufficient to counter the lure of the Web. Improved teaching If there is a focus on improving teaching, then better teaching can be expected to lead to higher grades, albeit with a mediated causality (the connection is difficult to prove; see, for example, discussion of Johnson’s, 2003, effective teacher theory, above). The UK has seen a policy emphasis on teaching and learning during recent years, with institutions being expected by their main funding bodies to produce learning and teaching strategies, and to give greater recognition and reward to teaching. There has been a recent expansion in the provision of programmes for academics new to teaching, via postgraduate certificates in higher education, which focus on a range of aspects of curriculum design, learning and teaching, and assessment. These have been given a ‘mixed press’ by programme participants, some of whom question whether their first year as a teacher is the right time to be engaging in them since they have a lot of other new challenges to face. More significantly, perhaps, the research assessment exercise (RAE), run roughly quinquennially by the funding councils, draws academics’ attention towards ensuring that they have enough research output on their record to ‘count’ in the RAE; hence there is a tension between two policy objectives in which learning and teaching tend to come off second best. Improved motivation and/or learning If students are more motivated towards study than they were (for extrinsic or intrinsic reasons), then some rise in grade levels would be expected. If the curricular arrangements offer greater clarity regarding what is expected of them (as suggested above), then a multiplier operates. As higher education in the UK has been transformed from an elite to a mass system, so the value of a first degree in the labour market has changed. Half a century ago, when around 1 in 20 18-year-old students went on to university, a degree automatically put the holder in line for a ‘good job’ and the rewards that went with it. Nowadays there is an element of ‘defensiveness’ about the gaining
128 How real is grade inflation? of a degree – to get one ‘keeps the holder in the game’ of getting an appropriate job but no longer virtually guarantees it. Not to have a degree tends to eliminate the person from the graduate-level job market. In the UK, to have emerged from higher education without a degree is disadvantageous in economic terms (Blundell et al., 1997), representing a ‘deficit’ approach to achievement, in which partial achievement is regarded more as a failure than as a success. According to Weko (2004) the situation in the US is rather different, and reflects both the strength of the credit system and also the more relaxed way in which programme completion is regarded. The credit system is understood to the extent that employers and others see the gaining of credit towards a degree as a positive matter, and this is interpreted against a backdrop in which many students (especially the less advantaged) may need to build up their credits relatively slowly as they undertake employment in order to fund their studies. As Weko puts it: [Stakeholders] tend to think about a degree as something that consist[s] of discrete skills and capabilities, and they believe that there is some benefit to acquiring part of a degree. In the US view, completing a degree is better than not, but something is better than nothing. (Weko, 2004: 57, emphases in the original) In the UK, with the strong attachment to the (typically three-year) full-time degree and a generally less developed approach to credit, credits towards a degree are given a lesser weight than they merit. Reformatting Weko’s words, in the UK it is often the case that all or nothing is better than something. Macfarlane (1992, 1993) offered the view that in the UK the conditions had changed with the rise of ‘Thatcherism’, with students choosing to work harder to gain the best result they could, in preference to engaging in other kinds of activity (such as radical politics). In his words, the emphasis was on ‘militant self-interest rather than political causes’ (1993: 2–3). Following a pejorative reference to ‘grade creep’, he attributes the rise in upper second class honours degrees to the ‘persistent grafter’ who was not sufficiently imaginative to merit a first (1993: 4). Changes in participation profile In both the US and the UK some fifty years ago, males outnumbered females in higher education by roughly three to two. In contemporary higher education in these countries, the ratio is more or less reversed. With females on average tending to gain higher grades than males, this demographic shift is likely to influence general patterns of grading. (This may not hold at the level of individual students, where there is some evidence to suggest that males tend to be greater risk-takers in their assessments, leading them to be represented to a greater extent than females in the extremes of grading scales.) At a different level, some students may opt to take a lighter course load per semester with the consequences that they can devote greater attention to the par-
How real is grade inflation? 129 ticular courses that they are taking (and hence achieve higher standards), and that the time they take to complete the whole programme will be extended. The system in the US is amenable to choices of this sort, but that in the UK (historically having a different approach to the funding of studies) is steadily evolving towards the greater flexibility of the US. ‘Strategic’ students Students can be ‘strategic’ in various ways, though the word ‘strategic’ may be the wrong adjective for behaviour when it is contingent and relatively short-term in character. In many curricula students are given the opportunity to exercise choice. The exercise of choice can be formal, as in the case of choosing modules for study. Students may choose an easier course in order to enable them to concentrate on courses on which it is less easy to obtain high grades. As Rothblatt (1991: 136) put it: some of the modules contributing to the ‘A’ record may have been breathers, inserted into the game to allow even star players a chance to gain a second wind. Johnson (2003, Chapter 6) presents empirical evidence that students tend to opt for modules that are likely to give them a better grade profile. He concludes his analysis as follows: when choosing between two courses within the same academic field, students are about twice as likely to select a course with an A– mean course grade as they are to select a course with a B mean course grade, or to select a course with a B+ mean course grade over a course with a B– mean course grade. This fact forces instructors who wish to attract students to grade competitively, but not in the traditional sense of competitive grading. [. . .] At the institutional level, differences in grading policies among academic divisions result in substantial decreases in natural science and mathematics enrolments, and artificially high enrolments in humanities courses. (Johnson, 2003: 193–194) Johnson (ibid.: 239) goes on to observe that ‘Knowing the grading practices of the instructors from whom students took courses is as important as knowing the grades they got.’ The exercise of choice by students may be informal – for example, as in respect of their approach to study, which may be superficial, deep or ‘strategic’ (also termed ‘achieving’). A so-called ‘strategic’ approach to learning involves
the student in using deep and surface approaches to learning as they deem appropriate. Hence a deep approach might be adopted, say, to parts of the study programme seen as vital to future employment, whereas a surface approach might be used for a more peripheral component.24 The student with a strategic or achieving approach seeks an optimal outcome in which the ratio of achievement to effort (if ever it could be computed) is high. One might characterize being strategic in this sense as 'satisficing' (Simon, 1957) or 'playing the game' to best personal advantage, using 'cue-seeking' (Miller and Parlett, 1974), taking short-cuts to the completion of assignments,25 and so on. Such behaviour, of course, may not be optimal (or 'strategic' in a broader sense) for the longer term, but may produce 'good enough' outcomes for immediate purposes – perhaps as a consequence of the instrumentalism implicit in political linkages between higher education and obtaining an appropriate graduate-level job, and in curricula in which short-term achievements (in separate study units) are implicitly encouraged.
Grades do not inevitably rise

Adelman's (2004) analyses of transcripted grades showed that, even if grades have been increasing in some segments of higher education in the US, this is not true across the full spectrum of institutions. His work invites further study of the relationship between grading and achievement. In the UK, students' 'strategic' behaviour does not always work to increase grades: in some circumstances grades can be adversely affected. Kneale (1997: 123–4), in a pilot study in selected subject areas, collected staff comments about students being 'strategic' in their commitment to academic study and other activities. The aspects of behaviour directly relevant to grading include:

• late submission despite penalties;
• attendance at modules only up to the point at which topics chosen for assessment are covered;
• non-attendance at exams when continuous assessment had already ensured a pass;
• non-submission of assignments when the pass mark had already been attained;
• ignoring modules in which failure would not damage the final degree result.
Such students are ‘satisficing’ in that they are opting for ‘good enough’ outcomes from the assessment, balancing them against other desired outcomes such as the earning of money in order to fund their studies and/or preferred lifestyle (see, for example, McInnis, 2001). An analysis of student attainment (end of term marks and degree outcomes both being used as measures) in seven varied uni 24 In my own case, learning the elements of ‘double-entry book-keeping’ which formed a (supposedly broadening) component of my degree studies in Metallurgy. 25 With plagiarism being one possible consequence.
How real is grade inflation? 131 versities in the UK showed that attainment was negatively related to the amount of part-time work undertaken by the student (Brennan et al., 2005). The analysis conducted by Brennan et al. gave substance to hints that had earlier appeared in a single-institution study by Barke et al. (2000).
Dealing with grade inflation

The affixing of + and – to grades awarded in US institutions improved the reliability of grading in a study of data from the University of California, Riverside (Singleton and Smith, 1978). This supported the proposition that reliability would be improved if the scale were longer, and that, even if inflation had in effect reduced the scale to A–C, the affixes would provide sufficient scale length to enhance reliability. Where comparisons have been made between grading scales, the results have not shown a consistent pattern.26

• A study at the University of North Carolina Asheville showed a negligible change in overall GPA between the academic years 2002–03 and 2003–04 (Academic Policies Committee, 2005).
• A study of undergraduate grades over five semesters at North Carolina State University showed that, after Fall 1994, +/– grading tended to lower the overall GPA, and that the effect was fairly consistent across subject areas (Table 5.3). Since the number of students eligible for +/– grading grew from Fall 1994, the pattern of findings from Spring 1995 onwards would seem to represent a 'steady state' (though the ratio of 'losers' to 'gainers' in Fall 1994 is much the same as for the succeeding semesters).
• However, an analysis of grades over four semesters at Clemson University showed that, as a consequence of +/– grading, just under 20 per cent of grades rose, around 25 per cent stayed the same and just over 55 per cent fell (Gibson and Senter, n.d.).
Table 5.3 The effect of +/– grading on undergraduate grades at North Carolina State University

Semester       Lower with +/– (%)   No change (%)   Higher with +/– (%)
Fall 1994            10.6                81.7               7.7
Spring 1995          32.7                44.5              22.9
Fall 1995            33.2                44.8              22.0
Spring 1996          34.4                43.0              22.6
Fall 1996            34.1                43.1              22.8

Source: Gosselin (1997).

26 Methodologically, this presents problems. The normal procedure seems to be to take grades on the A–F scale with + and – affixes, and to reduce the grades to a simple five-letter scale. It is open to question whether, if assessments had been made to a simple A–F scale in the first place, a similar outcome would have been obtained.
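To make the mechanics concrete, the following sketch converts one hypothetical set of percentage marks into grade-points with and without +/– affixes. The band boundaries, the 0.3 increments and the marks themselves are illustrative assumptions rather than the scales used at any of the institutions cited; the point is simply that the same raw performances can yield a different GPA once affixes are introduced.

```python
# Illustrative only: the grade-point values, band boundaries and marks below are
# assumptions for the purpose of demonstration, not any institution's actual scale.
PLAIN = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def letter(mark: float) -> str:
    """Map a percentage mark to a plain letter grade (assumed 10-point bands)."""
    if mark >= 90: return "A"
    if mark >= 80: return "B"
    if mark >= 70: return "C"
    if mark >= 60: return "D"
    return "F"

def plain_points(mark: float) -> float:
    return PLAIN[letter(mark)]

def plus_minus_points(mark: float) -> float:
    """Add/subtract 0.3 in the top/bottom third of each band; A+ capped at 4.0; F unaffixed."""
    base = PLAIN[letter(mark)]
    if base == 0.0:
        return 0.0
    position = (mark % 10) if mark < 100 else 9
    if position >= 7:
        return min(base + 0.3, 4.0)
    if position < 3:
        return base - 0.3
    return base

marks = [92, 81, 80, 74, 70, 88]  # one hypothetical student's course marks
gpa_plain = sum(map(plain_points, marks)) / len(marks)
gpa_affix = sum(map(plus_minus_points, marks)) / len(marks)
print(f"GPA without affixes: {gpa_plain:.2f}")   # 2.83
print(f"GPA with +/- affixes: {gpa_affix:.2f}")  # 2.68 -- lower for this profile
```

With these particular marks the affixed scale produces the lower GPA, consistent with the tendency reported for North Carolina State University; a profile whose marks sat near the top of each band would move the GPA the other way.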
132 How real is grade inflation? If anything substantial can be constructed from these straws in the wind, it may be that +/– grading tends on balance to lower grades but that some students will gain by its implementation. Another approach to mitigating grade inflation is to constrain the distribution of grades. Johnson (2003) reports that, when this was tried at Duke University, it was only partially successful. The median was constrained for courses with 30 or more students. Johnson noted that clever academics worked out how to play the system by assigning just over half of the students’ grades to the median or just below, with the remaining grades being much higher. With smaller numbers in classes, the method had insufficient statistical robustness and could not be applied. The alert (and, by inference, statistically aware) student could opt for small or independent study classes where grades would not be constrained according to a mathematical formula. At the level of the bachelor’s degree, Harvard and Princeton Universities have made moves to reclaim the value of the A grade in the light of a steady increase in the proportion of such grades being awarded. Harvard made some reduction in the number of A grades awarded (Marcus, 2003). As reported in The New York Times, the intention at Princeton is that interested parties should become aware that graduating with an A from Princeton is a performance of real worth. Nancy Weiss Malkiel, Dean at Princeton, was quoted as claiming that admissions deans at graduate and professional schools, employers, and administrators of national fellowship competitions said that: if we made plain what we were doing they would understand Princeton grades in the context of our new grading policy. They would know that a Princeton A was a real A in contrast to more inflated A’s at some of our peer institutions. (Hedges, 2004: B2) Although there is undoubtedly a concern to make grading at Princeton more meaningful, there is also a presentational aspect to the shift in approach.
Back to values This chapter has shown that the issue of grade inflation is not simple. Grades can rise for a host of reasons. Some might be labelled ‘inflation’, whereas others reflect curricular developments whose by-products include an upward drift in grades. Further, the evidence in this chapter – and that of Chapter 4 – points to a variation between institutional types as far as trends in grades are concerned. Indeed, the trend seems to be essentially flat in many institutions that do not tend to attract national headlines. One aspect of the evidence – that relating to the effects of ‘mastery learning’ and its congeners – is worthy of further comment, since it presses debate back to the values undergirding higher education. Goldman’s (1985) article makes clear the tension between a normative approach to grading, in which students are ranked
How real is grade inflation? 133 against each other, and a mastery approach after which, in theory, it can be said, as the Dodo did at the end of the caucus race in Alice’s Adventures in Wonderland, that ‘Everybody has won, and all must have prizes’ (Carroll, 1999: 46, original emphasis). One’s approach to allegations of grade inflation will be substantially coloured by the values to which one subscribes.
Chapter 6
The cumulation of grades
Overview Tom Angelo, in a foreword to Effective grading: a tool for learning and assessment (Walvoord and Anderson, 1998), criticizes grade-point averages in the following terms: grade point averages – despite the patina of objectivity that quantification lends them – too often represent a meaningless averaging of unclear assumptions and unstated standards. (Walvoord and Anderson, 1998: xi) Thyne (1974) had earlier made a related point about meaning: the fact that in examinations measures of all kinds are commonly labelled ‘marks’ does not give us licence to add them without regard to what they mean. (Thyne, 1974: 157) This chapter, drawing on the discussion of grading in Chapter 2, shows that honours degree classifications and grade-point averages are vulnerable to challenge on various technical grounds, not least their reliance on mathematical manipulations that treat the raw data as having qualities that they manifestly do not possess. It also demonstrates that the distinction in the UK between upper and lower second class honours degrees (which has important implications for students) is made where the statistical distribution of student results is most vulnerable to fuzziness in marking.
A desire for accuracy Following the publication of the report of the MRSA Scoping Group (UUK and SCoP, 2004), some students were asked by the RISE Supplement of The Guardian newspaper in the UK to respond to the question: ‘Do you think that the degree
The cumulation of grades 135 classification system needs updating?’ The issue of the Supplement that was published on 18 December 2004 recorded amongst others the following responses (p. 4): the system we have now goes back centuries and seem to mean very little to employers who use their own tests to discover true academic level. The classifications are too general, there’s no way of showing if you were close to the grade above. A straight percentage mark would be fairer. Maria, Kent It needs a complete overhaul. I got a 2.2 [lower second class honours degree] and am finding that means few employment opportunities. My average marks were 59%, to get a 2.1 [upper second class honours degree] you needed 60%. Other courses and other institutions award those with over 58% a 2.1. It would be fairer to look at actual marks. Abi, London Some employers have expressed a similar view regarding the provision of actual marks. In the light of the discussion in Chapter 2, the desire for raw percentages to be recorded is naïve, whether a transcript containing marks for study units or some averaged mark is envisaged.
Mark distributions and awards Figure 6.1 is based on outcomes from a post-1992 university in the UK, and shows that the mean percentages gained by students going forward to the examining board charged with determining honours degree classifications form are roughly normally distributed but with a longer ‘tail’ towards the upper end of the mark range. The figure implicitly illustrates the low chance of obtaining a mean in the low 40s. This is because such a mean would probably have to incorporate some contributory marks below 40 per cent and hence the student would not have gained the number of credit points needed to qualify for honours unless some form of compensation were invoked. In many UK institutions, however, there is available to assessment boards a second approach to classification that is triggered where a student’s mean score is just below a classification boundary, though the magnitude of the gap varies between institutions (Chapter 3). This ‘second bite at the cherry’ approach typically involves looking at the profile of the grades gained in the different modules and using that as the basis of judgement. An assessment board might judge that a student whose better performances were obtained in the second year of full-time study (rather than the final year) had not demonstrated sufficient ‘exit velocity’ to merit the award of honours. The data underlying Figure 6.1 are derived from percentages awarded in respect of the equivalent of 24 modules each attracting ten credit points. If the grading
Figure 6.1 The distribution of mean percentages for candidates’ honours degree classifications. Source: data from one year in one university for successful final year students (N = 832).
per module is accurate only to a few percentage points either way (perhaps a bolder assumption than can really be justified), then a fair number of students may have been misclassified on the basis of their mean percentages (disregarding any subsequent considerations in the assessment board). Statisticians might wish to argue that the magnitude of the measurement error is in inverse relationship to the number of independent measurements that are made, but this is based, inter alia, on the assumption that the measurements are all related to the same aspects of performance. The variation between modules in respect of their expected learning outcomes means that the measurements may exhibit qualitative differences sufficient to undercut any such argument. A student who performs well in the writing of essays may not perform well where the focus of attention is on some form of practical activity or vice versa.
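The scale of the problem can be illustrated with a small simulation. In the sketch below, the 24 'true' module marks, the assumed marking error of about three percentage points per module, and the 59.50 per cent rounding rule are all invented for the purpose of illustration; the simulation merely shows that, under such assumptions, a genuinely borderline student would be classified differently on different 'replays' of the same assessments.

```python
import random

random.seed(1)

# 24 hypothetical module marks for a 'true' borderline student (mean roughly 59.8 per cent).
TRUE_MARKS = [59, 62, 58, 61, 60, 57, 63, 59, 60, 58, 61, 60,
              59, 62, 58, 61, 60, 57, 63, 59, 60, 58, 61, 60]

def observed_mean(error_sd: float = 3.0) -> float:
    """Mean of the 24 marks after adding random marking error to each module."""
    return sum(m + random.gauss(0, error_sd) for m in TRUE_MARKS) / len(TRUE_MARKS)

trials = [observed_mean() for _ in range(10_000)]
# Assume, as at some institutions, that a mean of 59.50 or above rounds to 60 (a 2.1).
share_upper_second = sum(t >= 59.5 for t in trials) / len(trials)
print(f"Classified 2.1 in {share_upper_second:.0%} of replays")  # roughly 70% under these assumptions
```

Under these illustrative assumptions the same underlying performance is classified 2.1 in roughly seven replays out of ten and 2.2 in the remainder, which is the point at issue: the classification of a borderline candidate owes something to the noise in the marking as well as to the achievement being marked.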
The combination of grades The reliability of the mean Like others, Simonite (2000) does not question the need for a single index of student attainment. She makes the point that the mean grade for a set of student performances is a more reliable measure than the grades given in respect of individual modules. Cresswell (1988) notes that the reliability of an overall grade depends on the number of individual grades that are being combined.1 Statistically, 1 Simonite (2000: 198) made a drafting slip when she wrote: ‘the reliability of an aggregate measure is inversely related to the number of items combined’.
The cumulation of grades 137 the mean of a set of measures is more reliable than any individual measure – but this assumes that the measuring instrument and its use are consistent and that the phenomenon being measured suffers from no more than random variation (for example, the dimensions of a widget). None of this applies in respect of grading, though the variability in the measuring process is less in the case of a subject being assessed by the same general methodology (say, through the writing of essays) than it is in a programme such as nursing in which there are discursive, quantitative and practical professional components. The point can be demonstrated through an analysis of real data from a post-1992 university by looking at the effect of the number of observations (here, module percentages) on the standard error of the mean score. Each student completed 24 modules, and the mean of the standard error of the 832 students’ mean percentage scores was 1.39 (range 0.39 to 2.65). When only 12 module results with a similar scatter and mean were taken into account, the mean of their standard errors of the mean rose to 2.00 (range 0.55 to 3.99). The greater the number of measures, the narrower becomes the band in which there is a probability of 95 per cent of finding the ‘true’ percentage.2 Cresswell (1988), implicitly acknowledging the unreliability of grading, notes that combining grades without some form of compensation for fail grades leads to asymmetric movement across the pass/fail boundary. Without compensation, an unjust fail grade would lead to failure of the programme, whereas an unjust pass obviously has no adverse consequences for the student. The issue of compensation is of course particularly important where the consequence of wrongly passing a student could be disastrous, as is the case wherever public safety is at stake. If compensation is permitted, then according to Cresswell it is necessary to have at least as many grades available for reporting on component performances as for overall performance (and preferably one or two more for components) in order to limit the chances of misgrading to one grade. Unlabelled grades? Foreshadowing the empirical observations of Wolf (1995) and others, Cresswell (1988) notes the imprecision of descriptors and criteria. He suggests that there is no need to describe every grade since, if one can define grades A and C, then grade B will necessarily lie somewhere in between A and C (in other words, B would be left fuzzy). Hence the number of grades used could be greater than the number of actual grade descriptors. An example of this can be found in Reznick et al. (1997), where only the first, third and fifth points on five-point rating scales are anchored by descriptors. This calls to mind a suggestion made by Nuttall (1982), who conjectured that Miller’s (1956) ‘magic number, seven plus or minus two’ might be applied to educational assessments – giving rise to, say, seven grade levels but requiring only four descriptors. Cresswell (1988: 371–372) notes the difficulty 2 This is only a demonstration of the value of having as large a number of measures as is practicable. It takes no account of a host of other variables which could influence the raw data.
138 The cumulation of grades experienced by the Scottish Education Department (SED) in identifying descriptors for more than three or four levels in courses at Standard Grade, and that the SED eventually decided on three.3 What should be counted? Simonite (2000) asks whether an overall grade (in her article, she refers to the honours degree classification) should summarize all of a student’s work, or simply their best work. This parallels the debate regarding what should be included in a portfolio (see, for contrasting views, Stecher, 1998 and Simon and ForgetteGiroux, 2000). The similar distinction made by some between ‘competence’ and ‘performance’ was noted in Chapter 1. The underlying point is that the person’s performance on test is not a perfect predictor of how they perform in real life. A case can be made for taking either course of action in assessing a student’s performance, depending on whether the focus of interest is on the overall performance or on the student’s performance on those areas of particular success. It is, however, a mistake to attempt to combine both types of performance in a single index. A system for unifying marks Morrison et al. (1997), basing their article on a study of GCSE (General Certificate of Secondary Education) grading (Thomson, 1992), argued that information was lost when grade combination was undertaken with a small number of grades, in comparison to the aggregation of marks such as percentages. They work through the aggregation of students’ modular performances on an indicative and hypothetical six-module course, demonstrating that, when the distribution of a student’s grades or marks is extreme, the overall outcome can vary considerably. (In practice, it seems rare for students to have performance profiles as dispersed as 90, 90, 98, 46, 53 and 40 per cent, the figures used to illustrate the argument.) Morrison et al. explore the use of the ‘Unified Marks System’ in which each raw mark is located at a point with reference to the threshold of a defined band. For example, if the threshold of first class performance is taken as 72 per cent and the student achieves 90 per cent, the student’s mark is treated as first class plus 18/28 of the distance between the threshold level for first class and the top of the class band. The score of 90 per cent is 18 percentage points above the threshold level of 72 per cent; the top of the first class band is 100 percentage points, i.e. 28 percentage points above the threshold. It is open to assessors to determine different scores as threshold levels in different course components, reflecting the relative difficulty of the assessment demand. The adoption of a particular set of grades and boundaries, such as in the European Credit Transfer and Accumulation System, allows marks or grades awarded in one system to be translated, on the basis of what might be termed ‘threshold-plus scores’, into a common reference scale. 3 Cresswell cites SED (1986) as the source.
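The arithmetic of the 'threshold-plus' approach can be sketched as follows. The threshold of 72 per cent for a first follows the worked example above; the other class thresholds are assumptions for the purpose of illustration, and institutions adopting such a scheme would set their own.

```python
# Class thresholds (percent): 72 for a first follows the worked example in the text;
# the lower thresholds are illustrative assumptions only.
BANDS = [("first", 72, 100), ("upper second", 60, 72),
         ("lower second", 50, 60), ("third", 40, 50)]

def threshold_plus(mark: float):
    """Locate a raw mark as 'class + fraction of the distance through that class band'."""
    for name, lower, upper in BANDS:
        if mark >= lower:
            return name, (mark - lower) / (upper - lower)
    return "fail", 0.0

cls, fraction = threshold_plus(90)
print(cls, round(fraction, 3))   # first 0.643 -- i.e. 18/28 of the way through the band
```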
The cumulation of grades 139 The trouble with what seems at first sight to be a plausible and useful methodology is that it makes implicit assumptions about the validity of the original marks or grades – i.e. that they are accurate signals of student performances. In fact, marks or grades are dependent on the way in which the assessment methods sample the curricular objectives, and ‘threshold-plus scores’ depend on the way in which the threshold levels are determined. In terms of a distinction well understood by postmodernists, the methodology discussed by Morrison et al. operates on signifiers (marks or grades), making the assumption of a close relationship between signifier and what is being signified (student performance). A further difficulty is that the methodology favoured by Morrison et al. assumes that all aspects of the curricular expectations can be assessed validly, reliably and with reasonable economy through the award of marks or grades. Knight (2002) and Knight and Yorke (2003) suggest strongly that this is not the case. External examiners surveyed by Warren Piper (1994) were reported as finding aggregation to be ‘a complex issue presenting some confusing problems’ (ibid.: 167). Issues with which they had engaged – by inference, not entirely satisfactorily – included • • • •
• the exercise of student choice in curricula, with the consequence that comparability was difficult;
• programmes with joint and major/minor combinations of subjects, where a student might perform better in one component than in another;
• the combination of outcomes from different kinds of assessment (e.g. examination and coursework marks);
• students' performance profiles – steady as opposed to fluctuating, and trending up or down over time.
There is a generally unexamined issue, noted earlier, regarding whether the idea in assessment is to signal something of the student’s best performance or some kind of average. This is an issue that rears its head particularly high in the development of professional practice – should the record be that of the best performance of which the person is capable (perhaps produced under atypical conditions) or that of the more typical day-to-day performance? Parlour (1996) pulls no punches when he opens his critique of degree classification in the UK by asserting: the degree classification procedures followed by most British institutions of higher education fail to conform to the basic principles of comparative justice, waste valuable resources and serve no useful purpose. (Parlour, 1996: 25) Parlour says, in effect, that student performance is multidimensional and hence too complex to be reduced to a single linear scale. Following HEQC (1994), he acknowledges the problems posed in respect of classification by combined, joint and particularly modular programmes (and, by implication, those posed in respect
140 The cumulation of grades of grade-point averages as well). Although his critique is forceful, it is undermined by an implicit assumption of unidimensionality which breaks through at various points, including his treatment of decision rules by which letter grades can be combined into an overall grade. An artificial example used by Parlour is that, if a student’s grade profile is A A A A B, the resulting overall grade would be ‘first class’. Whilst that example might occasion little dissent, difficulties are more apparent when the grade profile is markedly heterogeneous, such as A B C C D, since it becomes difficult to construct ‘mapping rules’ to cater adequately for every possibility. This issue is discussed further in Chapter 8. Parlour sees examination boards as an inefficient use of resources, criticizing the amount of attention given to borderline performance profiles and what he sees as the tendency for subjectivity in allowing examiners to change or ignore marks as they think fit, seeing in such practice an unfairness to those who are not given such consideration. His preferred solution, to automate the procedures for computing the arithmetic mean and the honours degree classification, would – unless it were very sophisticated – be unable to accommodate the variation that exists between mark distributions for different kinds of activity and/or in different subject areas. It is a matter for conjecture whether Warren Piper’s finding from his study of external examining, that the majority of interviewees showed little awareness that the nature of untreated examination marks renders them unsuitable for treatment by simple arithmetic (Warren Piper, 1994: 169) would have altered his view. A further weakness in his argument is his failure to deal adequately with issues such as the claims that students make regarding ‘personal mitigating circumstances’ which might have led to a level of performance below that of which they were truly capable.
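The difficulty glossed over here can be seen by trying to write such mapping rules down. The sketch below implements one possible rule (a clear majority of grades determines the class) as an assumption for illustration, not Parlour's own scheme or any institution's regulations, and shows how quickly a heterogeneous profile escapes it.

```python
from collections import Counter

def classify(profile: list[str]) -> str:
    """One possible mapping rule (illustrative only): award the class whose grade
    accounts for a clear majority of the profile."""
    grade, n = Counter(profile).most_common(1)[0]
    if n > len(profile) / 2:
        return {"A": "first", "B": "upper second",
                "C": "lower second", "D": "third"}[grade]
    return "no clear majority -- further rules (or judgement) needed"

print(classify(["A", "A", "A", "A", "B"]))   # first
print(classify(["A", "B", "C", "C", "D"]))   # no clear majority -- ...
```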
The challenge of multiple assessment methods The combination of assessments, at both module and programme level, becomes particularly challenging when the assessment methods vary. The Higher Education Quality Council in the UK, drawing together findings from institutional audits, noted as an issue (and implicitly as a problem) variation of assessment methods between modules sometimes within the same programme and the same department, and the extent to which institutions understood the effect of such variation in the balance of assessment methods on overall student performance. (HEQC, 1996b: 74) At the module level, one could be faced with determining an overall grade
from – in a science-based module, say – a short review of an innovation, a test involving numerical calculations, and laboratory reports. Assessments of a similar heterogeneity can be envisaged for any subject area. At the programme level, there may be a need to combine grades from modules that are primarily discursive, primarily quantitative and oriented to professional practice. Producing a single index to reflect the individual's profile of achievements is a daunting task.
Shumway and Harden (2003), drawing on a hierarchical conceptualization suggested by Miller (1990), demonstrate that assessment methods should be varied in respect of different kinds of performance. Although they had medical education in mind, their contention – summarized in Figure 6.2 – is more widely generalizable. McLachlan and Whiten (2000) demonstrate empirically that very different distributions of scores can be obtained from differing types of assessment, including essays, multiple choice tests, short answer questions, 'mastery tests', continuous assessment, and objective structured clinical examinations (OSCEs). They recommend converting all raw scores into grades before any attempt is made to aggregate performances, though squeezing raw scores into a small number of bands not only introduces error, through awarding the same grade to performances of differing merit, but also risks introducing a spurious precision once the GPA is calculated. McLachlan and Whiten note that mastery tests (in which it is theoretically possible for all students to obtain very high scores) could compress the range of scores to such an extent that differentiation becomes meaningless. In such circumstances, they suggest faute de mieux that the conversion of scores from such a top-ended distribution should not unwarrantedly prejudice an aggregated grade.
Shumway and Harden (2003) focus on the assessment of professional competence in medicine, and Figure 6.2 implies that this requires a cluster of assessment methods whose outcomes may be expressed in a variety of ways (not necessarily in numerical terms, or even in gradings), whose combination may therefore not be straightforward.
Figure 6.2 A hierarchy of professional practice, based on Miller (1990) and Shumway and Harden (2003), and related to possible assessment methods

Level 1: Knows – written assessment
Level 2: Knows how – written assessment
Level 3: Shows how – clinical and practical assessment (e.g. OSCE)
Level 4: Does – observation; portfolio; log; peer assessment

Note: OSCE is the objective structured clinical examination used in medical education.
The same combinatorial challenge is evident in reports of assessment in workplace environments, including the following three examples, extracted from a longer list in Yorke (2005) in which there is an extended discussion of assessment in the specific context of practice-based professional learning.

• Hager and Beckett (1995) describe an assessment approach used by the Law Society of New South Wales in which the student is videotaped whilst undertaking an interview with a person adopting the role of client:4 the videotape is assessed by examiners. Other parts of the assessment include dealing with tasks in a mock file, referees' reports, and examinations testing legal knowledge.
• Rickard (2002), writing about a short (six-week) work-based learning placement in the field of health that was interposed between two blocks of formal teaching, describes assessment in terms of a short reflective piece on the placement, accompanied by a portfolio; and a longer critical discussion of health issues in East London, drawing on theory and the placement.
• McCulloch (2005) describes Kajulu Communications, a student-operated agency in which students of advertising can spend their final year in what amounts to an 'internship on campus'. Kajulu is a standalone agency on the campus which fulfils commissions from external organizations. Assessment of students' performance consists of a combination of group, peer and client assessments.
Using assessment statistics McLachlan and Whiten (2000) advocate the use of the nonparametric median and the interquartile range as measures of central tendency and dispersion in preference to the mean and standard deviation, both of which make more assumptions regarding the data and in any case are sensitive to outliers. They prefer the notion of profiling, since this can better display the spectrum of student performances. In a profile, grades both from subjects within the programme and from similar kinds of exercise (laboratory report, essay, etc.) can be aggregated, though to do this requires the parametric assumptions with which the authors are clearly unhappy. Aggregation by subject discipline is arguably the more problematic, since the mark distributions may vary more within a subject discipline than within an assessment mode. Essays are often marked with an implicit assumption that this is to an interval scale, though in reality it is ordinal. Over short ranges of marks, the ordinal scores may approximate interval scores, and taking the arithmetic mean may not do very much violence. However, if the student has performed inconsistently, then the approximation is weaker. 4 This is similar in some respects to the taking of a medical history. The videorecording of practice is unlikely to be feasible save where the professional is practising in a single location, such as the doctors’ surgeries covered in a study by Ram et al. (1999).
The cumulation of grades 143 McLachlan and Whiten are pointing to many assessors’ ‘theory in use’ regarding assessment – that numbers awarded in respect of student performances are genuinely quantitative and hence can be treated arithmetically: what is required is that there is, in respect of the assessment task, an underlying variable that conforms to the requirements of order (larger numbers are given to greater amounts of the entity in question) and additivity (for example, that a score of 8 represents twice the achievement of a score of 4). The problem is that few – if any – assessments in education can meet the requirements. Assignments or examination questions often demand discursive responses which are multidimensional (as an examination of marking criteria – see for example Table 8.1 – quickly reveals). Tests in which answers can be scored unambiguously right or wrong would seem at first sight to escape the criticism – but these are influenced by the sampling of questions and by the degree of difficulty of the items (which, unless the test has been thoroughly piloted, is often subjectively determined by the assessor). As Dalziel (1998) puts it: If [the variable being measured] is not quantitative, then numerical scores are misleading, because they imply that operations such as addition, averages and scaling can be used with the scores when there is not sufficient evidence to justify these procedures. These problems are most evident in the practice of aggregating marks to determine final scores. (Dalziel, 1998: 356)
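The contrast that McLachlan and Whiten draw can be illustrated with invented marks. In the sketch below, the two mark sets (a compressed, 'mastery-style' distribution containing one very low outlier, and a more dispersed set of essay marks) are hypothetical; the point is that the mean and standard deviation respond to the outlier and to the shape of the distribution in ways that the median and interquartile range do not, and that averaging the two raw means treats ordinal marks as though they were comparable interval measures.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical marks from two assessment types for the same cohort: a 'mastery-style'
# test compressed near the top (with one very low outlier), and an essay with a wider spread.
mastery = [92, 90, 89, 88, 88, 87, 86, 85, 84, 35]
essay   = [72, 68, 66, 65, 63, 62, 60, 58, 55, 50]

for name, marks in [("mastery", mastery), ("essay", essay)]:
    q1, _, q3 = quantiles(marks, n=4)
    print(f"{name:8s} mean={mean(marks):5.1f} sd={stdev(marks):4.1f} "
          f"median={median(marks):5.1f} IQR={q3 - q1:4.1f}")

# Averaging the two raw means would treat the scales as comparable interval measures --
# precisely the assumption the authors question.
```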
Determining the honours degree classification: an illustration As was described in Chapter 3, honours degrees in the UK are often classified according to a two-stage process with the first being an averaging of marks according to a particular protocol, and the second (based on a profile of results) being invoked for borderline cases. In order to illustrate some of the issues that can arise in a two-stage process of this sort, a set of hypothetical percentage marks was constructed such that their mean over the final two years (full-time equivalent) of study fell a shade below 59.50 per cent, assuming that the two years’ marks are weighted equally. This figure was chosen because, for some institutions (and as adopted for this illustration), ‘rounding’ of marks would automatically treat 59.50 as equivalent to 60.00 per cent without invoking the second stage process, and the student would be awarded an upper second class honours degree. The mean percentage of 59.44 per cent would, however, trigger the second stage process under the regulations in place at some institutions regarding ‘borderline’ performances. (Note that one more percentage point somewhere in the mark sets would have led to rounding and automatically to upper second class honours.) The mean for the penultimate year (coded 2 below) is 58.875 per cent, and for the final year (coded 3) is 60.00 per cent. The ‘original’ sets of marks were adjusted to reflect different possible distributions whilst remaining consistent with
144 The cumulation of grades the means for the two years: a narrow spread of marks; a skew with the majority of marks below 60 per cent (‘low skew’); a skew with the majority of marks above 60 per cent (‘high skew’); and a wide spread of marks (Table 6.1). The further second stage assumption is made that more than half of the marks have to be at or above the ‘cut point’ of 60 per cent for the higher classification. In practice, the rules for applying the profiling method to borderline candidates are often more complex than those adopted for this illustration. For example, there may be a requirement that no mark should fall more than two classification bands below the higher of the two levels under consideration – the example avoids this kind of complication. The original distribution produces an upper second class (2.1) classification, as does the ‘high skew’ set. In contrast, the ‘low skew’ set produces a lower second class (2.2) classification. The narrow and wide spreads of marks produce situations that do not meet the criterion for a 2.1 classification, yet it is likely that assessment boards would take note of the better performance in terms of both mean and profile of percentages for the final year and consider whether, on balance, a 2.1 should be awarded because of the student’s ‘exit velocity’ (see below). Weighting the means according to the year of study5 (here, of course, the actual distribution is immaterial) such that the ratio of weights is 1:2 or 1:3 in favour of the final year marks would, with this particular set, tip the mean above 59.50 per cent and hence trigger automatic upgrading of class under the classification rules adopted for this example. If, for each distribution type, the penultimate and final year sets of marks are transposed, then the awards would be ‘original’, 2.1; ‘narrow’, 2.2; ‘low skew’, 2.2; ‘high skew’, 2.1 and ‘wide’, 2.2. This artificial example was constructed to highlight the problem of cumulation where its effects are most sharply felt, at points just below classification boundaries. What it does not demonstrate is the effects of the sources from which percentage marks are drawn. If, for instance, a module produces a high mean percentage mark and the student obtains 60 per cent, one cannot say that the performance is equivalent to the same percentage obtained from a module for which the mean mark is low. There is not enough information to determine the relative merits Table 6.1 Variations on an ‘original’ set of marks, and their implications for the honours degree classification
(Marks are for eight modules in each of the penultimate year, coded 2, and the final year, coded 3; the possible award is shown against the final-year row.)

Original (2):    64  56  60  55  56  61  63  56   mean 58.875
Original (3):    66  61  57  55  63  60  60  58   mean 60.00   possible award 2.1
Narrow (2):      61  57  58  59  54  61  63  58   mean 58.875
Narrow (3):      64  61  57  57  63  60  60  58   mean 60.00   possible award 2.1/2.2
Low skew (2):    63  57  55  57  54  64  66  55   mean 58.875
Low skew (3):    66  64  57  57  61  59  58  58   mean 60.00   possible award 2.2
High skew (2):   61  57  61  55  53  61  63  60   mean 58.875
High skew (3):   64  61  61  56  63  60  60  55   mean 60.00   possible award 2.1
Wide (2):        65  54  56  54  51  61  73  57   mean 58.875
Wide (3):        69  61  54  55  63  66  60  52   mean 60.00   possible award 2.1/2.2
The cumulation of grades 145 of the performances (even if the assessment board is provided with the relevant statistics). Table 2.5 illustrated some kinds of variation in mark distributions that are possible, and in Chapter 2 it was argued that standardizing marks would not solve the problem. When a student takes a joint programme involving two subjects, the respective marking patterns in the two subjects may influence the overall classification of their honours degree (Chapter 5). Subjects in which spreads of marks are wide exert more ‘leverage’ on the final result than those with narrow spreads. A student taking, say, Computing (which typically produces wide spreads of marks) with History (narrow spreads) would be more likely to gain a higher classification if they were strong in Computing and produced moderate achievements in History, rather than if the levels of performance were reversed. This is a matter whose implications are followed up in Chapter 9. The second arena in which the actual mark is of importance is the marketplace in which graduates present themselves. Where the subject studied is of little importance to an employer (as has been shown to be the case in more than 40 per cent of advertisements for ‘graduate jobs’ in the UK (Purcell and Pitcher, 1996), and the employer uses the degree classification as an initial sifting device, the norms against which student performances are graded could have a critical influence on the classification that the student obtains. Comparability At the beginning of the twentieth century, a committee at Harvard summed up the problem of comparing one student with another when students were taking dissimilar sets of courses: The men [sic] are not running side by side over the same road, but over different roads of different kinds, out of sight of one another. (quoted in Lewis, 2006: 136) The more that students are able to exercise choice in respect of their studies, as is particularly the case with modularized curricula, the more problematic comparisons between them become. Discarding weak performances A further complexity in the determination of the honours degree classification is that some institutional classification algorithms allow a student to discard from the computation a small number of modules. A survey by the Northern Universities Consortium for Credit Accumulation and Transfer (NUCCAT) (Armstrong et al., 1998) showed that, at that time, there was considerable variation between institutions in the number of module performances that could be discarded from the honours computation (though of course the required number of credits had to be gained). The largest proportion of discards amounted to 22 per cent of the assessment for the honours part of the programme. One effect of the survey (implicitly
146 The cumulation of grades given some support by Johnson, 2004) may have been a reduction in variability as institutions that were some distance from the norm as regards their rules for discarding module results reviewed their assessment regulations. Discarding modules from the computation of honours classification has the effect of raising the mean percentage: a study by the Student Assessment and Classification Working Group (SACWG) using institutional data showed that, in determining the honours classification band, dropping 15 or 206 out of the 240 credits required at honours level (a further 120 credits are needed at the pre-honours level) would be likely to result in roughly 1 in 10 classifications being raised. If 30 credits were dropped from the classification computation, the proportion of awards that would increase would be about one in six (Yorke et al., 2004). ‘Exit velocity’ The data in Table 6.1 point to a more general problem in combining grades: what are the relative merits of the student who has worked consistently at, say, upper second class standard and the student who performed at lower second class standard in the penultimate year but in the final year made a leap to first class standard?7 The difference is rarely as marked as this, but the question nevertheless contrasts two underlying models of learning and achievement whose relative influence on the classification process is rarely taken into account. Hence the limited information content of an honours degree classification (and, by extension, a GPA) is further constrained. The GPA: precision without accuracy? In contrast with the broad categories of the honours degree classification, the US system of grade-points involves the conversion of marks (typically percentages) into grades on the five-point A–F scale or, more often nowadays, on the five-point scale extended by affixes of + and –. The letter grades are converted into gradepoints which are then cumulated progressively into the grade-point average. The grade-point is doubly distant from the raw score, which itself is a coded index of the standard of the actual performance. Lewis (2006) is rightly critical of a system that takes finely divided measures (however, he ignores their inherent inaccuracies), converts them into a small number of bands, and then averages the band-scores to two or more decimal places – thus apparently reintroducing a level of precision that was lost when the A–F banding was implemented. As he writes: ‘The result is precision without accuracy’ (ibid.: 136). Noting that GPAs are calculated to five decimal places at Harvard in the determination of honors (ibid.: 136), he remarks that the introduction of an A–E scale at Harvard in 1886 was intended to eliminate the unwarranted fineness of distinctions that were used when 6 In one university, modules attracted 15 credits; in another, 10 credits – hence the citing of two figures. In both institutions it would in theory be possible to ‘drop’ 30 credits. 7 In this illustration the complexity in declaring achievements to be at a particular standard is sidelined.
the ‘Scale of Rank’ – representing the maximum number of points that could be gained by a student – ran to 27,493 points.
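Lewis’s ‘precision without accuracy’ point can be made concrete with a small sketch. The Python fragment below is purely illustrative: the percentage-to-letter cut-offs and the grade-point values are assumptions made for the example, not a description of any particular institution’s scheme. It shows how two quite different sets of percentage marks can band to identical letter grades and hence to identical grade-point averages, the within-band detail having been discarded before the averaging begins.

```python
# Illustrative only: cut-offs and point values are assumed for this sketch.
GRADE_BANDS = [          # (lower bound of band, letter grade, grade-points)
    (90, "A", 4.0),
    (80, "B", 3.0),
    (70, "C", 2.0),
    (60, "D", 1.0),
    (0,  "F", 0.0),
]

def grade_points(percentage):
    """Band a percentage mark and return the grade-points for that band."""
    for lower, _letter, points in GRADE_BANDS:
        if percentage >= lower:
            return points
    return 0.0

def gpa(percentages):
    """Average the banded grade-points, reported to two decimal places."""
    return round(sum(grade_points(p) for p in percentages) / len(percentages), 2)

student_x = [90, 80, 80, 70, 70, 60]   # marks at the bottom of each band
student_y = [99, 89, 89, 79, 79, 69]   # markedly stronger marks in the same bands

print(gpa(student_x), gpa(student_y))  # both 2.5: the banding has erased the difference
```

A mean of the raw percentages would have separated the two students by nine percentage points; the GPA, reported to two decimal places, presents them as indistinguishable.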
Cut-points between levels of achievement In most higher education systems there are cut-points intended to differentiate between levels of achievement. The side of a cut-point on which they end up can matter a great deal to a student. However, the distinctions between levels of performance, as measured by grades, are fuzzier than is often acknowledged. The fuzziness is illustrated here by using percentage marks from real data, and making some statistical assumptions,8 but is in principle replicable using other grading systems. The only variation being considered here is that of the ‘scatter’ of an individual student’s actual grades: other influences on performance, such as the grading norms in different subject areas, are ‘bracketed out’. The data in this illustration are from an institution in the UK in which, for the honours degree classification, 240 credits are taken into account, with the student also having had to have gained an earlier 120 credits which are excluded from the classification process. The honours degree classification depended, in the first instance, on the mean percentage obtained by the student (Figure 6.3 summarizes the distribution of mean percentages). For present purposes, auxiliary considerations open to examination boards are not considered. All of the 832 students whose results are represented in Figure 6.3 achieved sufficient credit to enter the honours degree classification process: there are no results for students who failed to attain the honours threshold. For each of the students’ sets of percentage grades, the observed mean and standard error of the mean (SE(M)) were calculated. The magnitude of the SE(M) reflected the amount of dispersion in the individual student’s percentages for the modules: the percentages of some students were more dispersed than those of others. The means were plotted in ascending order, and bands of twice the SE(M) above and below the mean were superimposed on the graph (Figure 6.3), representing marginally greater than 95 per cent confidence limits. In other words, if the true mean could be computed from a very large number of assessments (impossible in practice, of course), there is a touch more than a 95 per cent chance of finding it between the plotted upper and lower confidence limits. The statistical procedure makes an allowance for the possibility that the sampled observed grades are atypically high or low compared with the universe of all possible grades obtainable by the 8 Three assumptions are made here: first, that each individual mark is a sample from the population of possible marks for assessments on the particular module’s specified outcomes; second, that the underlying distributions of marks for the separate modules (from which the observed marks have been drawn) are similar; and, third, that the observed marks are randomly drawn from the population. The extent to which each of these assumptions holds is a matter of judgement: the weaker assumption of a lack of systematic bias in the observed marks may be more tenable and sufficient to justify the illustration of uncertainty inherent in each of the mean marks that is depicted in Figure 6.3. I am grateful to Bill Brakes of the University of Northampton for discussion on this and other related points.
Figure 6.3 Mean percentages for 832 students, with confidence limits set at 2 times the SE(M) either side of each observed mean.
student – in other words, for the possibility that the student has ‘struck lucky’ or the opposite in their recorded grades. As noted earlier, in the UK system an important cut is made between upper and lower second class honours degrees, the boundary normally being at 60 per cent. Figure 6.3 shows that, subject to the indicated 95 per cent confidence limits, some students with mean percentages as low as 56 per cent could have ‘true’ means of 60, and that some with means as high as 64 per cent could have ‘true’ means as low as 60. This fuzziness extends across a span of roughly 8 percentage points, which is nearly the width of an honours degree class band. If the confidence limits are placed at 1 SE(M) either side of the observed mean, rather than 2 SE(M), the band of fuzziness narrows, roughly to between 57 and 61 per cent. The cost of doing so is to reduce to roughly 68 per cent the chances of the ‘true’ mean lying within the confidence limits. This example is based on 24 ten-credit modules, but programmes in some institutions involve fewer, but larger, modules. Reducing the number of modules would increase the SE(M), and hence the fuzziness surrounding the observed means. The idea that the observed mark is subject to a host of influences, and hence may represent a student’s achievement only fuzzily, is acknowledged in the field of educational measurement but rarely escapes into the broader prairies of routine educational practice. The introduction of the National Student Survey in the UK was accompanied by arguments about whether ‘uncertainty intervals’ should be
included in the results. Two eminent statisticians on the steering committee for the work argued strongly for their inclusion, but in the end the decision was taken not to include them, on the grounds that they added a layer of complexity that would be confusing for the non-specialist readership (taken to be intending students and their advisers, particularly at the point when decisions were imminent regarding their applications to institutions).9
The ‘cut’ between upper and lower second class honours degrees is made at the point where the student performances tend to cluster (the mean for the whole cohort is 59.2 per cent), which in this representation is where the graph is at its flattest. Discrimination is, clearly, particularly difficult at around 60 per cent once the fuzziness surrounding the observed mean is taken into account. Taking the 95 per cent confidence interval as the criterion, and ignoring any other considerations, 389 of the students’ performances fall within one of the honours classification bands. The remaining 443 cannot be assigned unequivocally to a classification band, since their confidence intervals extend to either side of a classification boundary. Figure 6.3 is too condensed to illustrate the point clearly, and the data underlying the figure are summarized in Table 6.2.

Table 6.2 The distribution of overall percentages of 832 students between bands of the honours degree classification, assuming that each individual’s percentage lies within a confidence interval of approximately 95 per cent

Percentage range: N of students
Above 70 per cent (i.e. unambiguously first class honours): 13
Spanning the first class and upper second class bands: 59
Within the upper second class honours band: 139
Spanning the upper and lower second class honours bands: 300
Within the lower second class honours band: 235
Spanning the lower second class and third class bands: 83
Within the third class honours band: 3
Total: 832

If the grades attained by the students are taken as representative of their achievements (in other words, if other considerations regarding assessment are put to one side), then it is possible to argue that an overall index of individuals’ performances could be related to a scale in which some uncertainty is permitted:
• first class;
• borderline first and upper second class;
• upper second class;
• borderline upper and lower second class;
• lower second class;
• borderline lower second and third class;
• third class;
• non-honours degree (for the gaining of sufficient credit for a degree but insufficient for honours);
• fail.
9 Richardson (2004) showed that, in pilot work, only the most extreme institutional mean scores on the various items of the survey had uncertainty intervals that did not overlap with some others – for most of the scores, the amount of overlap was enough to cast doubt on whether the differences in the means were substantive. The argument about the inclusion of uncertainty intervals spilled into the educational press: see Baty (2005).
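The calculation that underlies Figure 6.3 and Table 6.2 can be sketched in a few lines of code. The fragment below is a minimal illustration resting on the statistical assumptions set out in note 8; the classification boundaries used (40, 50, 60 and 70 per cent) and the module marks are supplied purely for the example. It computes a student’s mean mark and standard error of the mean, forms an approximately 95 per cent confidence interval, and reports whether the student can be placed unambiguously within a band or must be treated as borderline.

```python
import statistics

# Conventional UK classification cut-points, assumed for this illustration.
BOUNDARIES = [40, 50, 60, 70]

def classify_with_uncertainty(module_marks, z=2.0):
    """Mean, SE(M) and a roughly 95 per cent confidence interval (z = 2),
    followed by a judgement of whether the interval straddles a cut-point."""
    mean = statistics.mean(module_marks)
    se = statistics.stdev(module_marks) / len(module_marks) ** 0.5
    lower, upper = mean - z * se, mean + z * se
    straddled = [b for b in BOUNDARIES if lower < b < upper]
    verdict = "unambiguous" if not straddled else f"borderline at {straddled}"
    return round(mean, 1), round(se, 2), (round(lower, 1), round(upper, 1)), verdict

# 24 ten-credit modules, as in the example in the text (marks invented for illustration).
marks = [58, 62, 57, 64, 60, 55, 66, 59, 61, 63, 56, 65,
         60, 58, 62, 64, 57, 61, 59, 63, 60, 66, 55, 62]

# Observed mean is just above 60, but the interval spans the upper/lower second boundary.
print(classify_with_uncertainty(marks))
```

Applied across a cohort, a rule of this kind yields the kind of split shown in Table 6.2, in which more students straddle a boundary than sit unambiguously within a band.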
In 1997 a survey was undertaken at Georgia State University regarding the preferences of staff in respect of approaches to grading.10 The intention was to ascertain whether there was a significant wish to switch from the simple five-point scale to other possibilities. Responses were received from 459 faculty (54 per cent of those surveyed), of which roughly two thirds were in favour of including + and – affixes to the letter grades. One in five favoured the status quo, and one in eight preferred a scale in which borderline performances were signalled by letter combinations in a scale whose levels were A, AB, B, BC, C, CD, D and F. Earlier, Please (1971) had examined the possibility of pupils being misgraded in the A-level (Advanced Level of the General Certificate of Education) examinations as they existed at that time. The specified grade distribution, in percentages of the examinees, was – rather curiously – as follows, with grades A to E being passing grades: A, 10; B, 15; C, 10; D, 15; E, 20; O (the award of a ‘fall-back’ ordinary level pass instead of an advanced level pass), 20; F (fail), 10. Please’s calculations led him to conclude that, if the reliability coefficient for the examination were less than 0.83, more than half of the examinees would be misgraded. He suggested that it would be more informative to give students multiple grades representing the uncertainty surrounding the grade to which their observed mark would be assigned. This would have produced the following, combinations of grades for each of the specified outcome bands: A&B, 10; A&B&C, 15; B&C&D, 10; C&D&E, 15; D&E&O, 20; E&O&F, 20; O&F, 10. Rather plaintively, he ends his article by surmising that the multiple grades would in practice be treated as if they were the single grades that they replaced. Had his proposals been enacted, he would probably have been proved correct. The same might be anticipated in respect of the use of borderline categories described in the preceding two paragraphs. Earlier still, Hartog and Rhodes (1936: 154) described a marking scale for university honours-level History which illustrates the probability of considerable ‘hedging’ over students’ examination performances. The scale11 ran as follows: α+ α?+ α α?– α– α–?– α= αβ βα β++ β+?+ β+ β?+ β β?– β– β–?– β= βγ γβ γ+ γ γ– δ. Could such finesse in the hedging of judgements be a hidden feature of contemporary scales? 10 See http://www2.gsu.edu/~wwwphl/adandst/plusmin.html (accessed 7 October 2006). 11 α= and β= should probably be read as alpha and beta double-minus respectively.
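Please’s estimate of the extent of misgrading can be explored with a small simulation. The sketch below is indicative only, not a reproduction of his calculation: it assumes a classical test-theory model in which observed scores are true scores plus normally distributed error, with the error variance set so that the test has the stated reliability, and it applies the quoted grade distribution (A 10, B 15, C 10, D 15, E 20, O 20, F 10 per cent) by ranking candidates.

```python
import random

random.seed(1)

# The specified A-level grade distribution quoted in the text (proportions of examinees).
SHARES = [("A", 0.10), ("B", 0.15), ("C", 0.10), ("D", 0.15),
          ("E", 0.20), ("O", 0.20), ("F", 0.10)]

def grade_by_rank(scores):
    """Grade candidates by rank, so each grade takes exactly its specified share."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades, cumulative, start = [None] * len(scores), 0.0, 0
    for letter, share in SHARES:
        cumulative += share
        end = round(cumulative * len(scores))
        for i in order[start:end]:
            grades[i] = letter
        start = end
    return grades

def misgrading_rate(reliability, n=20_000):
    """Share of candidates whose observed grade differs from their true-score grade."""
    true = [random.gauss(0, reliability ** 0.5) for _ in range(n)]
    observed = [t + random.gauss(0, (1 - reliability) ** 0.5) for t in true]
    mismatches = sum(t != o for t, o in zip(grade_by_rank(true), grade_by_rank(observed)))
    return mismatches / n

# At a reliability of 0.83, the proportion misgraded in this model comes out at roughly one half.
print(round(misgrading_rate(0.83), 2))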
This issue of representing the uncertainty inherent in grading is discussed further in Chapter 9.
The boundary of ‘pass’ performances The cut-off point between passing and not passing in whole programmes of study in the UK is typically 40 per cent.12 In practice, the figure is somewhat higher than that (as is implicit in Figure 6.3) because compensation in respect of module marks below 40 may not be permitted, or may be permitted to only a very limited extent. An overall classification of performance may, however, obscure the significance of a vital component for, say, a graduate’s employment. It may be necessary that the student attain a higher standard in some curriculum components in order to be considered safe to practise (see Chapter 2). The threshold for a pass in such a component might need to be higher than that in other components. Knowing in some detail the curriculum followed by a student and their performance on curricular components may be highly significant for an employer. A while ago, a higher education institution appointed to a lectureship a graduate in Business Studies. At the time, it was normal for Business Studies programmes to include quantitative methods, and the appointing institution assumed that the graduate had a working knowledge of this area. When the lecturer took up his post, the institution was dismayed to discover that he had not done any quantitative work and it had to make other arrangements to cover for the appointee’s lack of expertise. The primary fault lies, of course with the appointing institution which, by relying on normative expectations, failed to check that the applicant did indeed possess the expertise that it wanted. The situation would have been almost as bad had the applicant only achieved a bare pass in quantitative methods, since this would have been inadequate as a basis for teaching. This anecdote foreshadows the more extensive discussion in Chapter 9 of the need for the spectrum of a student’s performances to be available to interested parties. In some professional programmes, such as teacher education, the student has to demonstrate that they are competent in the classroom. If their lesson planning, organization of the classroom, teaching ability and/or capacity to maintain discipline do not meet threshold criteria, then excellence in academic study cuts no ice and they fail. No compensation between the practicalities of teaching and academic performance can be entertained. A student who is academically gifted, but inadequate as a teacher, would obviously be better off looking for a different career – and should have been given early advice to switch to a more suitable programme. For the relatively low achiever, consistency in performance level may be advantageous, whereas for the relatively high achiever, variability in performance is not a problem and could be to their benefit. Simonite (2003) observes that consistency in grades is important for students who obtain low average marks, since if 12 In some areas, such as Medicine, the bar is set somewhat higher.
152 The cumulation of grades their performances vary markedly the chances are that they will have failed one or more modules, with all the penalties that stem from failure. For students who obtain a high average, the likelihood of failure in any individual module is low. If the award algorithm allows the student to ‘drop’ weaker module performances (assuming that all the modules have been passed or that a narrow fail can be compensated by stronger performances elsewhere), then variation works to the student’s advantage since it cuts off the weak ‘tail’ of the distribution of their grades (Yorke et al. 2004). Lewis (2006) argues that the calculation of GPA favours the ‘drudge’ who does well in all areas but has performed exceptionally well in none. His view reflects that of Harvard’s President Quincy who wrote in 1831 that ‘the estimate of scholastic rank must depend, not upon occasional brilliant success, but on the steady, uniform and satisfactory performance of each exercise’ (quoted in Lewis, 2006: 127–128). Whilst it is appropriate to assess students as having failed a module when their performances are inadequate, the same is not generally true when a student’s overall performance is considered. Students will normally have some credit on their academic record, which should be valued and acknowledged. They may leave higher education with a set of credits, an intermediate award such as (in the UK) a Certificate or Diploma, or an unclassified degree. In the UK system, then, the contrast in other than exceptional circumstances13 is not a Manichaean cut between pass and fail but a more graded distinction spanning ‘pass with honours’ and the gaining of an alternative award.14
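The interaction between variability and ‘discard’ rules noted above can be shown with a small worked example. The sketch below is illustrative only: it assumes a crude classification rule based on the mean of the best ten of twelve equally weighted module marks, which is simpler than any real institutional algorithm.

```python
def mean(marks):
    return sum(marks) / len(marks)

def mean_of_best(marks, keep):
    """Mean after the weakest module performances have been discarded."""
    return mean(sorted(marks, reverse=True)[:keep])

consistent = [62] * 12                                          # steady performance
variable = [75, 73, 71, 70, 68, 66, 60, 58, 55, 53, 48, 47]     # same mean, wide spread

print(round(mean(consistent), 1), round(mean(variable), 1))     # 62.0 and 62.0
print(round(mean_of_best(consistent, 10), 1),                   # 62.0: discards change nothing
      round(mean_of_best(variable, 10), 1))                     # 64.9: the weak tail is cut off
```

The discard rule leaves the consistent student where they were but lifts the variable student by nearly three percentage points – enough, near a classification boundary, to change the award. Conversely, had the two profiles been centred nearer 40 per cent, the variable student would probably have failed one or more modules outright, which is Simonite’s point.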
Loss of information The reliance on a single index of performance, such as an honours degree classification or a grade-point average, pushes into the background – limbo, in some cases – the variation in a student’s performance. An analogy might be made with statistics, in which one might consider (albeit a little loosely) that the single index is to the mean as a disaggregated set of achievements is to the variance. The loss of information begins further back in the assessment system, when an overall grade is awarded for achievements spanning a set of intended learning outcomes. The assessment specification for the relevant assignment (say) will weight certain aspects of achievement more than others. However, the student’s profile of achievements related to the specified learning outcomes may be very different from the weightings in the assessment design. For example, the assessment specification for an essay may value the logic of the writer’s argument more highly than the hunting down and inclusion of unanticipated, but highly relevant, material. The student might be extremely good at discovering material but less good at incorporating it into an argument, and hence suffer under the prescribed assessment specification. The award of, say, 63 per cent for the essay cannot on 13 These might include serious academic malpractice. 14 This point was made by Claire Donovan at one of the consultation meetings on the future of the UK honours degree classification which was run by the Burgess Group in the autumn of 2006.
The cumulation of grades 153 its own convey where the strengths and weaknesses of the student’s performance lie. This is schematically represented at the left hand side of Figure 6.4, in which each module of the programme is, for the purposes of illustration, assumed to be assessed in two ways, through coursework and/or formal examination. The student’s overall grade for the module is a combination (not necessarily equally weighted) of overall grades for components such as coursework and examination, which themselves are internally weighted by the designer of the assessment task.
Rounding up Cumulation, whether via grade-point average or honours degree classification, obscures all but the grossest differences between students’ achievements. Any tolerably satisfactory method of assessment will differentiate the outstanding from the barely passing performance. Difficulties appear once the inevitable errors of measurement are acknowledged. Figure 6.3 showed clearly how difficult it is to
[Figure 6.4 (‘Progressive loss of information’) appears here: a schematic in which, for each module (P, Q, R, and so on), the specified learning outcomes and the student’s performances feed, via weightings, into coursework and examination grades; these combine into module grades, which are in turn weighted (Level 2 v Level 3) into the honours degree classification.]
Figure 6.4 An illustration of the loss of information as a consequence of cumulating assessment outcomes. Key: LO, learning outcomes specified for the module (the thickness of the lines symbolizes emphasis). Perf, the achievements of the student in relation to the specified learning outcomes (the thickness of the lines symbolizes strength of achievement). CW, coursework. Ex, formal examination. Levels 2 and 3 represent the penultimate and final years of study for full-time students.
154 The cumulation of grades differentiate between performances in the middle of a distribution – the region at which, statistically, small shifts or errors in marks are most likely to have a large effect. The finer distinctions of the grade-point average give the GPA an advantage over the honours degree classification, since errors of measurement produce less dramatic effects than they can do for the honours degree classification. If this were the only criterion for acceptability, the GPA would win a competition with the honours degree classification hands down. However, the GPA, like the honours degree classification, fails the multidimensionality test. The GPA does not take account of the trajectory of a student’s performance in a way that the honours degree classification can – though, as applied by institutions, the latter is often ambiguous about the kind of trajectory that is valued.
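How much the valued trajectory matters can be illustrated numerically. The sketch below is hypothetical: the year means and the 50:50, 40:60 and 25:75 weightings of the penultimate and final years are assumptions for the purpose of the illustration, since institutions’ actual rules vary widely.

```python
def overall_mean(level2_mean, level3_mean, w2, w3):
    """Combine the two year means using the institution's chosen level weighting."""
    return (w2 * level2_mean + w3 * level3_mean) / (w2 + w3)

consistent = (65, 65)   # steady upper second class performance in both years
riser = (55, 72)        # lower second class penultimate year, first class final year

for w2, w3 in [(50, 50), (40, 60), (25, 75)]:
    print(f"weighting {w2}:{w3}  consistent {overall_mean(*consistent, w2, w3):.1f}"
          f"  riser {overall_mean(*riser, w2, w3):.1f}")
```

Under equal weighting the consistent student comes out ahead; under a weighting that privileges the final year, the late improver does. Neither answer is wrong, but the choice between them is an institutional one that the single reported classification does not disclose.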
Chapter 7
Value added
A tempting concept
Students make gains as a consequence of their engagement in higher education. Policy-makers and others want to know how much that gain actually is, and the challenge has been to find an appropriate measure with acceptable technical quality. However, stakeholders in higher education have differing perspectives on gain, or value added, thereby increasing the challenge of finding a way – or ways – of measuring it. This chapter discusses the concept of educational gain and the multiplicity of perspectives that colour it. The greatest amount of attention has been devoted to finding a measure that can be used to compare institutions, with potential implications for funding. However, although value added calculations at school level in England continue to develop in sophistication, they benefit from a considerable commonality of measures at both entry and exit that higher education lacks. At the current state of development of the topic, there looks to be a long slog ahead. As hill walkers often find, the cresting of one ridge is accompanied by the sight of another ridge further on.
The importance of educational ‘gain’ A focus of interest in the US has been the gains made by students in higher education, often driven by concerns regarding the ratio of benefits to costs. Ewell (2002) notes projects going back to 1928, though the attention paid to the issue has ebbed and flowed over the years. A significant difficulty has been the operationalization of student ‘gain’. The Measuring Up reports produced every two years by the National Center for Public Policy and Higher Education (NCPPHE) have shown only a limited improvement in the recording by states of student learning over the period 2000–06. Data on student learning were recorded as ‘incomplete’ in report cards of 50 states in both 2000 and 2002, but five states in 2004 and nine in 2006 were recognized as having provided sufficient data to have merited a ‘plus’ on their report rather than an ‘incomplete’ (NCPPHE, 2000; 2002; 2004;
156 Value added 2006).1 The most recent assessments indicated that, in the considerable majority of states, formalized evidence of student gain was lacking, even though gains will certainly have occurred. It is perhaps not surprising that there have been renewed efforts to find measures of student achievement that are relevant across the whole of higher education and that can be relatively easily interpreted by policy-makers and intending students. The term ‘value added’ has been in use in UK higher education since at least 1987 when it was proposed as an indicator by the then National Advisory Body for Public Sector Higher Education (NAB). The proposal was designed to illustrate that students in the then polytechnics and colleges sector (in respect of which NAB had responsibilities) were being successful even though the academic backgrounds of many of them were, judging by entry qualifications, not as auspicious as those of students in the then universities. Something of the same thinking probably lies behind the call by the Campaign for Mainstream Universities (CMU, now renamed as the Coalition of Modern Universities) (2004) for ‘a government endorsed index of value added in higher education’. The political colouring given to value added in the UK is different from that in the US: in the UK it involves institutional status, whereas in the US there is a strong element of seeking the best value for money. The difference points to the need to consider the various meanings that have been attached to the concept of value added. Interested parties will understandably vary regarding the approach they take regarding value added. Students are likely to focus on the potential of the combination of programme and institution to achieve their aims (often, but not necessarily, those relating to the labour market). For them, past performances of students within the sector may have some predictive value. The potential economic return may also have some influence on the choices they make. An institution, on the other hand, may well be more interested in how the achievements of its students compare with those of its previous cohorts, and/or with those of cohorts from broadly similar institutions.
Measuring ‘gain’ is not simple The concept of ‘value added’ in education is deceptively simple. It is typically taken to refer to the gain accruing to students as a result of the time they have spent in their educational institution. Hersh and Benjamin put the case for value added in higher education in the following terms: Virtually everyone who has thought carefully about the question of assessing quality in higher education agrees that “value added” is the only valid approach. Excellence and quality should be determined by the degree to which an institution develops the abilities of its students. By “value added” we mean the value that is added to students’ capabilities and knowledge as a consequence of their education at a particular 1 All of these reports can be obtained via www.highereducation.org/catreports/evaluating_state_ performance.shtml (accessed 9 October 2006).
Value added 157 college or university. Measuring such value requires assessing what students know and can do as they begin college, assessing them again during college, and assessing them in the years after graduation when they experience the full benefit of their college education. Value added is thus the difference between the measures of students’ attainments as they enter college and measures of their attainments as they complete college. Value added is the difference a college makes in their education. (Hersh and Benjamin, 2001: 4) The difference between students’ attainment prior to and after they complete a course of study is, however, not the same as the ‘difference a college makes in their education’. Had the students not undertaken the course of study, they might have developed their capabilities and knowledge to a similar extent. Not all ‘growth’ is necessarily attributable to the time spent under institutional aegis: natural maturation and engagement in a variety of extra-curricular activities will also contribute. If the aim is to measure the relative ‘value added’ (say, between courses), there may be no need to consider these issues, but if the aim is to assess the value added by the higher education experience, it is essential. Hersh and Benjamin’s view represents one of a number of possible approaches to value added. It is supported by the National Governors Association (NGA) in the US, although the NGA acknowledged that progress on this front would not be rapid (see Shulock and Moore, 2002: 55–56). Shulock and Moore note that efforts in the US to assess student learning are years away from creating instruments that could reliably assess college-level learning across institutions or states. At present, assessment data are site-specific and qualitative. Policy makers will have to be patient. (Shulock and Moore, 2002: 59) Pascarella and Terenzini (1991), having reviewed a large number of studies, concluded that there may be too many problems in the reliability and validity of grade point average to consider it solely, or perhaps even primarily, as a measure of how much was learned during college. (Pascarella and Terenzini, 1991: 62) In other words, the GPA is insufficiently robust as an index of the gain in learning whilst in higher education. However, Astin (1993) took a more positive view: undergraduate GPA is positively related to nearly all measures of cognitive and academic growth . . . even after the effects of all other input,
environmental, and involvement measures have been controlled. What this tells us is that GPA, despite its limitations, appears to reflect the student’s actual learning and growth during the undergraduate years. (Astin, 1993: 241–242: emphasis in the original)
However, there is probably a considerable amount of unexplained variance lurking beneath Astin’s conclusion. Returning to the issue, Pascarella and Terenzini (2005: 66) took up a theme implicit, and easily overlooked, in Astin’s conclusion – that growth during the undergraduate years is not necessarily attributable to the institution – when they remarked:
we treat student grades not as an outcome of college that stands for how much is learned but rather as an indicator of the extent to which a student successfully complies with the academic norms or requirements of the institution. Thus grades are viewed as one among a number of dimensions of the college experience (both academic and non-academic) where the student may demonstrate different levels of involvement, competence, or achievement. (Pascarella and Terenzini, 2005: 66)
Grades, on this view, signal the degree of fit between institutional expectations and student attainment. They may be useful as indicators of performance (to the student, to the institution, and to other interested parties, including employers). Of significance to the argument developed in this book, grades do not tell the whole story of what a student gains from the higher education experience – and, indeed, from the course of general maturation and informal learning. Other, often ungraded and perhaps informal, aspects of development are also important, particularly to those who wish to assess the person’s suitability for a particular activity.
A word of caution is necessary against the temptation to administer a test on entry and on exit and to use the difference as a measure of value added. The educational measurement literature shows how the simplistic use of ‘gain scores’ can mislead (for methodological discussions see, for example, the classic contributions of Campbell and Stanley, 1963, and Lord, 1963).
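One classic pitfall with simple gain scores – regression to the mean induced by measurement error – can be seen in a minimal simulation. The sketch below is an illustration of that mechanism under deliberately simple assumed conditions (normally distributed ability, equal true gain for every student, independent test errors); it is not a summary of Lord’s or Campbell and Stanley’s analyses.

```python
import random
import statistics

random.seed(0)
TRUE_GAIN = 10  # every student genuinely improves by the same amount

ability = [random.gauss(50, 10) for _ in range(5_000)]             # true ability at entry
entry_test = [a + random.gauss(0, 8) for a in ability]             # noisy entry measure
exit_test = [a + TRUE_GAIN + random.gauss(0, 8) for a in ability]  # noisy exit measure
gain = [x - e for e, x in zip(entry_test, exit_test)]

# Although the true gain is identical for everyone, the observed gain is negatively
# correlated with the observed entry score: naive gain scores flatter those who
# happened to score badly at entry and understate those who happened to score well.
print(round(statistics.correlation(entry_test, gain), 2))
```

At institutional level the same mechanism can make a selective institution appear to add little value and a less selective one a great deal, quite apart from anything either actually does for its students.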
Multiple meanings The complexity of ‘value added’ is apparent once the plurality of interested parties is understood, since it can be approached from the point of view of the student, of the institution, or of society as a whole. ‘Value added’ can be used with at least seven meanings, five of which (1, 2, 3, 4 and 7) were listed by Raffe et al. (2001):
1 Comparative Learning Gain: students’ relative learning gains, estimated by comparing their qualification outcomes with those of students elsewhere with the same entry qualifications (or other measure of prior learning);
2 Comparative Institutional Effect: the relative amount of students’ learning gains that can be attributed to the college;
3 Distance Travelled: students’ learning gains, estimated by comparing their entry and exit qualifications in terms of a common scale;
4 Wider Benefits: the value of the college experience to the student, over and above the achievement of formal qualifications;
5 financial benefit to the student;
6 achievement of the institution;
7 Community Benefits: the value added by the college to the local community or wider society.
These concepts overlap to varying extents: the overlap between wider and community benefits (4 and 7 above) is reflected in their coalescence under the heading ‘Wider benefits: private and public’ in the following discussion.
Comparative learning gain
The existence of substantial sets of data from the higher education sector makes the comparison of student gain feasible, though only with considerable judiciousness. The performance of a cohort of students can, with due cautions relating to variables such as cohort size, entry qualifications and curricular comparability, be compared with that of a broader population of students. From an early analysis of outcomes from higher education in the UK,2 Bee and Dolton (1985) observed that
differences [between universities] do exist, that they can be large and that the pattern is consistent over time. Prospective students and their advisers would be wise to study carefully the relevant figures before making their selection [of universities to which they apply]. (Bee and Dolton, 1985: 49)
Judicious comparisons may offer helpful information to aspiring entrants to higher education. An empirical feasibility study of the preparation of profiles of degree classification and employment was conducted,3 using real data, with reference to performances in Engineering in six institutions. A sample is provided in Tables 7.1 and 7.2, in which the expected outcomes for an entrant with a particular
2 However, the analyses conducted by Bee and Dolton, whilst demonstrating persistent differences between universities, included some variables that might not have a large influence on student achievement, such as the university’s annual institutional insurance premium; the number of volumes in the library; and the total number of undergraduates in the university.
3 The analyses were conducted by Richard Puttock of HEFCE and are used here with permission, but should not be taken as implying that the approach reflects HEFCE policy.
Table 7.1 An excerpt from the feasibility study showing that, in the institution concerned, the performance of students as measured by their award classifications tends to be better than would be expected from an institution with similar characteristics

Mode of study: Full-time or sandwich(b)

Data type: Actual / Expected / Difference
First class honours: 275 / 260 / 15
Upper second class honours: 850 / 695 / 160
Lower second class honours(a): 540 / 670 / –130
Unclassified or Pass degree: 60 / 160 / –100
Other honours: 210 / 160 / 55
Note: Data are rounded to the nearest 5 according to HESA’s methodology, hence the minor discrepancies. a This category also includes ‘undivided’ second class honours degrees. b Sandwich (or cooperative) programmes incorporate one or more periods of placement, typically totalling one year.
Table 7.2 An excerpt from the feasibility study showing that, for the institution concerned, the employment profile of students in Engineering was less good than would be expected from an institution with similar characteristics

Data type: Actual / Expected
Graduate job: 1,000 / 1,180
Non-graduate job: 280 / 205
Employed, type unknown: 0 / 15
Further study: 265 / 230
Unemployed: 245 / 160
Note: Data are rounded to the nearest 5 according to HESA’s methodology, hence the minor discrepancies.
set of qualifications4 are set against what was actually observed for the institution in question. For the same institution, the performances of the graduates as regards employment were less good, with fewer than expected gaining a ‘graduate job’ and more than expected gaining a job below that which would be expected of a graduate, and also more being unemployed at the time of the destinations survey from which the data were drawn. Institutions with very high entry standards will, on average, appear weak in terms of degree classifications since the method of calculation assumes the equivalence of degrees. Their ‘expected’ awards will be higher than their actual awards since high entry qualifications would lead to an expectation of high classifications when the evidence from real outcome data shows that the distribution of classifications includes some that are relatively low. However, students from 4 These were limited to A-level grades. The approach could be developed to accommodate the tariff adopted by the Universities and Colleges Admissions Service (UCAS), which would broaden the range of entry qualifications accommodated. In the UK there is currently no equivalent of the SAT to provide a standardized baseline of student achievement prior to entry.
Value added 161 an institution enrolling those with high entry qualifications who end up with relatively low classifications may fare well in gaining employment and further study. The employment data would therefore go some way towards mitigating the bias inherent in the entry x classification matrix. The intention is that the prospective student could integrate this information into his or her decision-making process. Data such as the above implicitly require the student to make sensible comparisons. The Teaching Quality Information (TQi) website5 in the UK is intended to allow students to extract the information that they deem relevant to their situation – so in theory students could select the institutions in which they were interested and undertake a customized comparative analysis. It should be noted that examples in Tables 7.1 and 7.2 are of comparative value added, and reflect the performances of past cohorts. They are suggestive of the potential of the study programme in a particular institution for ‘adding value’ to intending students. The caution of the financial services industry (that past performance is no guarantee for the future) should, of course, be applied to any projected benefits from higher education (and not least because the student has to be active in generating success, and cannot afford to be passive in this regard). Comparative institutional effect As noted earlier, the difference between a student’s attainment prior to, and after their completion of, a course of study is not the same as the ‘difference a college makes in their education’. If the aim is to measure the relative ‘value added’ (say, between broadly similar courses in broadly similar institutions), it may be justifiable to assume that the extraneous variables balance out for cohorts of reasonable size. The problem is much greater if the aim is to make an assessment, in absolute terms, of the value added by higher education. The Council for National Academic Awards and the Polytechnics and Colleges Funding Council in the UK undertook a study of value added, in which ‘comparative value added’ (CVA) was prominent (CNAA/PCFC, 1990). The report itself disappeared from view relatively quickly, since the CVA approach was criticized for a number of weaknesses, including: • • •
its inability to deal adequately with combined and joint programmes; the fact that some studies in higher education are an extension of studies undertaken at school or further education college, whereas others are not; various methodological problems that arise when comparisons are made with normative data from subject areas.
Some years later, the general idea was resuscitated by Wagner (1998). Wagner’s argument was that developments in management information in the UK had enabled a substantial database of student performances to be constructed, and 5 http://www2.tqi.ac.uk/sites/tqi/home/index.cfm.
162 Value added hence it would be feasible to compute expected student performances for different A-level scores on entry (these could be disaggregated by subject area). It would thus be possible to see whether an institution performed above or below expectation. This calculation would presumably indicate which institutions, in which subject areas, were most effective in facilitating students’ academic achievements. The method would be applicable to other entry qualifications, provided that there was a sufficient bank of historic data. This kind of approach could be attractive to students, who might use the information to infer where they might gain most benefit from their chosen programme. It might also be attractive to the government, which might want to make use of it for accountability and perhaps funding purposes. A lot would hang on the robustness of the metrics used at entry and exit, both of which are problematic. Further, it is doubtful whether the output from this method would truly be a measure of (absolute) ‘value added’. Comparative departmental effect Chapman undertook a number of analyses of data from eight subject areas6 spanning the period 1973–93 (Chapman, 1996; HEQC, 1996a). Of relevance here is his attempt to relate entry qualifications and exit performances on a departmental level, using z-transformations of entry qualifications and the proportion of ‘good honours degrees’ to cater for general variation in cohorts over time. Chapman was well aware of the methodological weaknesses in his study, but argued that his analyses ought to provoke thought about why some departments seemed to achieve better student performances than others with similar entry profiles, though one possible explanation was an easing of standards in respect of degree classifications. The Collegiate Learning Assessment Project In the US, research sponsored by the RAND Corporation is exploring the use of broad cognitive tests which require students to demonstrate critical thinking, analytic reasoning and written communication. The tests reflect concerns expressed by the National Center for Public Policy and Higher Education and some major employers that graduating students have insufficiently developed these capacities. The new tests,7 developed under the Collegiate Learning Assessment (CLA) Project, involve free-response writing rather than completing the multiple-choice items typical of the Scholastic Aptitude Test (SAT), and marking the responses by computer has been shown to be feasible (Klein et al., 2005). The project is using a ‘matrix sampling’ approach in which different students take different but generically equivalent components of the new test. In 2005, roughly 19,000 students from 134 institutions took part in the CLA, with the numbers nearly dou 6 Accountancy, Biology, Civil Engineering, French, History, Mathematics, Physics and Politics. 7 Examples of the test tasks can be found in www.cae.org/content/pdf/CLA.Sample.Report.pdf: 7ff (accessed 10 October 2006) and in Hersh (2005).
Value added 163 bling in 2006 (Hersh, 2005). Hersh states that, after controlling for selectivity in admission, the tests have shown up differences between institutions. The tests evidence relativities (and not absolute standards) by regressing CLA test scores on SAT scores, enabling an institution to ascertain whether it has performed better or worse in relation to the regression line.8 This makes the CLA a variant of the ‘comparative value added’ category. A further step with the CLA is to identify where particular successes are being engendered with a view to locating and disseminating pedagogic strengths. There is no claim that the CLA is providing the answer to the question of student gain. The Council for Aid to Education (CAE) Board of Trustees’ Statement on the Role of Assessment in Higher Education9 indicates that the diversity of mission in higher education in the US requires that each campus should, in its assessment of student progress, draw on local, curriculum-embedded, test instruments as well as nationally normed instruments. Distance travelled At root, the ‘distance travelled’ is an ipsative measure, in that it applies to the educational gain made by an individual student, using some measure of entry as the baseline for judgement. This approach is based on the relationship between entry and exit scores, and depends on having a test or tests that can be given to a student on entry and exit (the problems with simple ‘gain scores’ were noted above), or on having a fully calibrated relationship between the entry test and the exit test. Whereas this might be feasible for some relatively generic aspects of curriculum, where calibration could be undertaken on a respectably sized sample of students, the more specific aspects of any curriculum are likely to defy calibration of this type. There are, for higher education, hints in this approach of a ‘common curriculum’ (or, perhaps, domain-specific common curricula) that would allow for some comparability at the exit end, if not the entry end. A suggestion along these lines was floated in the UK by Howarth and Croudace (1995) for the discipline of Psychology (it quickly sank) – but this would seem to relate more to comparisons at the level of the institution than at the level of the individual. The focus of attention where ‘distance travelled’ is an issue has however tended to be at the institutional level, with the individual indexes of gain being combined as a measure of institutional performance – thus shifting the category of value added to that of Comparable institutional effect (above). In higher education in the UK, the concept of value added has been important to the less prestigious institutions as they have sought to demonstrate that they have been particularly effective in helping students from relatively unpromising backgrounds to achieve success. This would be exemplified by an institution 8 Adelman’s (forthcoming) reservations about the use of SAT scores may be less applicable in these circumstances, since the CLA focuses on general capabilities rather than on subject specifics. 9 See the undated Statement at www.ed.gov/about/bdscomm/list/hiedfuture/4th-meeting/benjamin. pdf (accessed 10 October 2006).
164 Value added showing that students with entry qualifications that many would consider weak actually achieve a respectable honours degree, perhaps using a matrix of entry qualification categories against honours degree classification. The performance tables published by the Department for Education and Skills (DfES) for secondary schools in England are based on the relationship between, on one hand, student achievement at GCSE and equivalent level (typically at age 16) and, on the other, achievement at earlier stages of testing (‘Key Stages’ 2 or 3, as appropriate).10 The method involves comparing each pupil’s best eight results at GCSE and equivalent with the median performance of other pupils with the same or similar results at the relevant Key Stage. A school measure of value added is based on the mean individual value added score of its pupils, and is set such that a score of 1,000 is the level above which a school is exceeding expectations, or below which it is performing below expectation. Although pupil scores at Key Stages 2 and 3 are the best predictors of future achievement, other factors which are outside the control of the school (such as gender and level of deprivation) are known to exert an influence. The DfES is developing a more complex method of computing value added (what it terms ‘Contextual Value Added’) in order to take account of these exogenous factors. The Curriculum, Evaluation and Management (CEM) Centre at Durham University provides confidential value added data for schools and colleges in the UK, and has a suite of systems covering differing age-ranges.11 Subscribing institutions receive feedback on their performance against the performances of other subscribing institutions. The purpose of the CEM Centre analyses is to help institutions in their self-development rather than to place comparative institutional performances in the public domain. The aims of higher education are wide-ranging and complex, and students learn many things during their time in higher education that are of value in their careers, and which are not necessarily captured in formal assessments. For example, employability is prominent in higher education policy in the UK (as is workforce development in the US), and is being accentuated in Europe as the Bologna Process for harmonizing programmes has developed (Haug and Tauch, 2001). Work undertaken by Higher Education Quality Council (HEQC, 1997a: see this volume, Table 9.2) and more recently by the Enhancing Student Employability Co-ordination Team (ESECT) showed how complex the construct of employability is, and implicitly illustrated some of the problems associated with measuring value added across the whole of any student’s achievements. Many aspects of employability resist measurement even if achievements can be more broadly assessed (Knight and Yorke, 2003). Tests do not necessarily assess the span of desirable proficiencies. Berlins expressed concern over the proposal by eight top law schools in England to introduce a common entry examination as an addition to A-level: 10 See for details www.dfes.gov.uk/performancetables/schools_05/sec4.shtml and the associated web link. 11 For further information, see www.cemcentre.org.
Value added 165 Very many successful lawyers – I say this in praise, not criticism – are not all that bright. Some of our best judges do not shine intellectually. Becoming a good lawyer requires a mixture of talents, of which the intelligence revealed by the proposed tests is only one. Equally, many bright people have proved to be rubbish lawyers. Put bluntly, many of our best lawyers and judges would have failed these tests, and we would have been the poorer without them. And many potentially excellent lawyers will be lost to the law because of these tests. (Berlins, 2004: 17) A different slant on Berlins’ point invites the question of the extent to which exit tests might represent success in the legal profession. Berlins is implicitly drawing on Sternberg’s (1997) distinction between academic and practical intelligence. Briefly, to be academically talented alone is generally insufficient for success in a career in law – it needs to be supplemented by nous and practical capability, which fall outside the ambit of the kind of formalized testing that would be necessary for a robust index of value added. The argument has been made (Knight and Yorke, 2003) that summative assessment (let alone measurement in the strict sense) of a number of desired achievements cannot meet technical expectations such as validity and reliability without the commitment of a level of resourcing that is prohibitive. Portfolio-type assessment simply does not lend itself to the computation of value added scores. Selfreporting of gain, such as is used in Graduate Careers Australia’s questionnaire to graduates and in the National Student Survey that is implemented in the UK, is a problem since valid and reliable anchor-points are very difficult to devise. Wider benefits: private and public Students gain more than what are recorded as their achievements on a programme of study in higher education. Society also gains as a result of students’ engagement in higher education. The Institute for Higher Education Policy (IHEP, 1998) summarized the benefits arising from higher education in a matrix involving two dichotomies – public/private and economic/social (Table 7.3).12 This summary finds strong echoes in sources such as Astin (1993) and Pascarella and Terenzini (1991; 2005) from the United States, and in Bynner and Egerton (2001) and Bynner et al. (2003) from the UK. In addition, over the years there has been a slew of economic reports in the UK that attest to the value of higher education to regional and national economic prosperity.13 However, the Lambert review of business–university collaboration (Lambert, 2003) indicated that there was considerable scope for improvement in the exploitation of innovations that 12 However, some would question the stress placed on greater productivity and increased consumption at a time when the world is becoming increasingly aware of the side-effects of increasing economic activity. 13 See, for example, www.universitiesinlondon.co.uk/about/docs/knowledge_capital.pdf (accessed 10 October 2006).
Table 7.3 Broad benefits accruing from participation in higher education

Economic
Public: Increased tax revenues; Greater productivity; Increased consumption; Increased workforce flexibility; Decreased reliance on government financial support.
Private: Higher salaries and benefits; Employment; Higher savings levels; Improved working conditions; Personal/professional mobility.

Social
Public: Reduced crime rates; Increased charitable giving/community service; Increased quality of civic life; Social cohesion/appreciation of diversity; Improved ability to adapt to and use technology.
Private: Improved health/life expectancy; Improved quality of life for offspring; Better consumer decision-making; Increased personal status; More hobbies, leisure activities.
Source: After IHEP (1998: 20) and reproduced with permission.
had taken place in higher education in the UK, so there is no ground for complacency on the part of higher education institutions. Financial benefit14 This is likely to be attractive as a concept to two main groups – government, since in the UK it has clear policy implications for the funding of students (especially in the light of the decision to charge, but defer payment on, ‘top-up fees’ of up to £3,000 per year),15 and students themselves16 (since it might help them to make judgments about whether they enter HE, what types of programme they might choose, and what institution they might choose). The financial return to a student from higher education has become a matter of considerable policy interest in the UK as higher education has been construed increasingly in terms of the personal benefit to be gained by individuals. On this issue the UK has followed in the steps of the US and Australia. Higher education is seen as offering a return on the investment that students make (even if some of the costs of that investment are deferred beyond graduation). During the debate regarding the merits of ‘top-up fees’ in the UK, much was made of the claim, apparently emanating from the Department for Education and Skills (Aston, 2003), that a graduate could expect to benefit to the tune of some £400,000 over a working life,17 though, when computed as a rate of return, this could be expected to vary considerably with subject – degrees in the arts producing the 14 A number of points in this section are due to John Thompson of HEFCE, and are gratefully acknowledged. 15 The tuition fee regulations are different in the countries of the UK. For a convenient summary, see www.universitiesuk.ac.uk/paymentbydegrees/tuition_fees.asp (accessed 6 June 2007). 16 And, where appropriate, their sponsors. 17 In her article, Aston (2003) adopts a sceptical stance towards this figure.
Value added 167 weakest rate of return to higher education (Walker and Zhu, 2001).18 More recent work by Chevalier et al. (2004), O’Leary and Sloane (2005a,b) and PricewaterhouseCoopers (2007) suggest that, disregarding subject disciplines, the lifetime financial return compared with those entering the labour market with two or more A-levels is now estimated to be considerably lower. The last of these studies suggests that the lifetime gain over a school leaver with two passes at A-level, in current money terms, averages approximately £160,000, but with a wide variance with subject discipline. Whereas graduates in Medicine and Dentistry can expect a gain of some £340,000, the lifetime premium for graduates in the humanities is estimated at just over £50,000 and, for arts graduates, approximately £35,000 (data for other subject disciplines can be found in PricewaterhouseCoopers, 2007: 5). The financial benefit accruing to a graduate during a working lifetime depends on a number of variables, including the demand/supply ratio for graduates in the labour market, the effect of the numbers of graduates in the labour market on nongraduates’ earnings – and, of course, the qualities and capabilities of graduates themselves. Mayhew et al. (2004) indicate some of the uncertainties inherent in economic analysis in this area. Robertson (2002: para 194ff) showed that, in the US, the financial benefit from education correlated positively with level up to that of the bachelor’s degree (he did not consider postgraduate degrees), with the salary level for graduates at bachelor’s level opening an ever wider gap with levels for other qualifications between 1975 and 1999. The salary level for those with the associate degree also increased but not to the same extent as for bachelor’s degrees. Parental background accentuated the differentials in gaining employment and salary. A methodological comment The ideal measure compares the earnings of those who have spent time in higher education (normally graduates) with those of people who, though having equivalent entry qualifications, opted not to enter higher education. The difficulty is that nowadays a substantial proportion of all those who are qualified to enter higher education in the UK do so, thereby attenuating the comparator group to the point at which an absolute measure is seriously compromised. If the intention is to compare institutions with reference to the financial benefit gained by their students, then finding comparable student cohorts requires some care. The entry profiles of, say, an elite university and a new university may be too different for comparisons to be valid.19 However, comparisons within broadly cognate groups of institutions may be feasible. 18 Chevalier et al. (2002: 5–6, n3) point out that the opportunity costs of the educational investment are often omitted from calculations, limiting the rate of return to that from education alone. 19 Where the aim is to compare the value added by institutions, the difficulty is in finding students who are sufficiently similar to be compared. This may be possible for institutions that are similar to each other but, when the analysis has been extended to compare elite Russell Group institutions with post-1992 universities in the UK, the comparison has been based on very small numbers of students (see Chevalier and Conlon 2003).
168 Value added Other considerations are the effects of increased numbers of graduates in the workforce on both graduate and non-graduate remuneration, and the time-scale over which studies of financial benefit should be conducted. When longitudinal studies report, the data are of retrospective interest and may not reflect current circumstances. It is unclear whether a ‘synthetic cohort’ approach could be designed that would enable value added information to be more relevant to contemporary circumstances; this would involve the mapping of cross-sectional data into a longitudinal framework, with appropriate adjustments.
Awards for co-curricular activity in the UK
Students in the UK are often expected, as part of their assessment requirements, to document and reflect upon a formal period of placement in work or service organizations. In some institutions, this may derive from a module devoted to co-curricular activity, and attract credits. In others, the documenting may make only a marginal contribution to the student's overall performance, and hence may attract a limited amount of effort. Some undergraduate students in the UK have the opportunity to complement their academic award with an award testifying to achievement in areas such as work placement, employment and voluntary service (see Lang and Millar, 2003, for a listing). Some awards are potentially available to all students (for example, via City & Guilds and CRAC InsightPlus), whereas others have been established by particular institutions for their own students (e.g. the York Award and the Warwick Skills Certificate). However, this landscape would, in geological terms, be described as 'active'.
Higher education institutions in the UK are increasingly expected to bring work-based learning into their curricula in order to satisfy policy expectations regarding the development of a highly qualified and effective labour force. They are developing new approaches to the accreditation of co-curricular activity (where work-based learning falls outside the formal curriculum structure) and to the accreditation of such activity as part of the formal curriculum. This has had differential effects on, for example, the York Award, the Warwick Skills Certificate, the Bristol Personal Development Award and the Essex Skills Award. Whereas the first three are successful in attracting students (and demand for the BPDA is reported as increasing), the Essex Skills Award was discontinued in 2005.
The developments within institutions seem to be putting pressure on the other bodies that have hitherto offered services in this area. The funding originally made available for the CRAC InsightPlus initiative meant that students could gain an award accredited by the Institute of Leadership and Management at a very modest cost. The ending of the original funding has tipped the 'terms of trade' away from this initiative, which has been largely supplanted by a scheme funded under the Leonardo da Vinci programme in Europe, which has attracted some 100 students across five countries. Although the City & Guilds Licentiateship has grown slowly in the past few years, its growth has been held back by a decline in student demand for 'sandwich' placements. The Professional Development Award, which
can be gained from relatively short periods of work placement, has however seen stronger growth. This shift in emphasis is consistent with students wanting to obtain degrees in the shortest possible time (due to the implications of national policy on tuition fees and student support) and with the inclusion of short work placements in academic programmes in the light of government policy towards workforce development in higher education.
The existence of awards for co-curricular and, in some cases, extra-curricular achievements is a tacit acknowledgement that formal assessments for programmes of study tell only part of the story. Students achieve more than formal assessments can ever tell, and it is sometimes the achievements that seem to be oblique to the educative enterprise that are the more telling in the longer term. Hence – as has often been said in the schools sector – calculations of value added are of limited use in conveying what students have gained and the extent to which institutions have succeeded.
Value added as an institutional performance indicator The introduction of a measure or performance indicator affects institutional behaviour. The Research Assessment Exercise in the UK has provided a strong case in point, with institutions adjusting their activities in order to gain as much as they can from the funding available: the focusing of research effort on potential ‘winners’, the transfer market in academics with successful research performances, and selectivity in putting staff into the RAE are three examples. Indicators with less immediate effects on institutional funding can nevertheless influence institutions’ behaviour. Hence serious consideration has to be given to the possibility of unintended consequences from the introduction of value added as an indicator. Indicators based on entry qualifications and exit performances can be affected by institutions, since it is possible for them to lower the former (though this could have an adverse effect on retention and completion) and raise the latter. In Chapters 4 and 5 the concern was noted that in both the US and the UK the grades gained by students in higher education have increased over time. The introduction of formal test instruments, such as those currently being piloted in the US, is likely to induce ‘teaching to the test’ (because institutions will desire to do well in ‘league tables’ or rankings), to the possible detriment of broader learning. An indicator based on the financial benefit to students or graduates is unlikely to affect institutional behaviour to any great extent because the benefit is independent of the institution, once the ‘positional values’ of the institution and qualification are taken into account. It should also be noted that the introduction of a measure or indicator will only be successful if it gains the consent of the institutions: there are examples of state-level activities in the US in which disagreements with institutions led to the intended effectiveness not being achieved in practice (see Shulock and Moore, 2002: 59ff). It is where institutional performance is concerned that problems with measuring value added are most apparent. In the UK, the indicators at entry to and exit
from higher education are both tricky, as analyses by Chapman (1996) and Yorke (1998b) amongst others indicate. Entry to higher education used to be treated in terms of points awarded for performance in the A-level examinations, but this left out of consideration all those who entered higher education with alternative qualifications. The replacement of A-level points by a tariff capable of subsuming a variety of qualifications20 is an improvement, but not a complete solution to the problem of the entry-side metric.
20 See www.ucas.com/candq/tariff/index.html for details of the tariff used by the Universities and Colleges Admissions Service.
The typical exit qualification, the honours degree, is influenced inter alia by the subject(s) of study, and the parity of awards from different institutions is questionable. Further, awards from higher education institutions are under the institutions' control, albeit with some safeguarding from the external examiner system. If value added were to be used as a performance indicator (with all that might imply for 'league tables' or rankings), then the inherent weaknesses of the measure become significant. Although the entry and exit measures used in the US are different from those used in the UK, the same problems with value added broadly obtain. The use of value added as a performance measure is critically discussed by Cave et al. (1997: 127ff) and Yorke (1998b). Particularly when benefits are to be attached to performances, it becomes important to have measures of value added that are sufficiently robust to withstand challenge. The method for calculating value added at secondary school level in the UK has evolved as statisticians have sought to take better account of impinging variables.
Illustrative of the problem in higher education, but making only a partial connection with value added, is the methodology for the determination of awards to Australian institutions from the Learning and Teaching Performance Fund (LTPF). The partial connection with value added is because it is institutional success, as measured by various aspects of students' performance (and particularly with students' self-reporting of their development of 'generic' skills as a consequence of their study programmes), that is at issue. Access Economics Pty Limited (2005), which reviewed the way in which the Department of Education, Science and Training (DEST) was operationalizing its performance indicators for this purpose, echoed DEST practice in making the fair point that exogenous variables (such as gender and age of students) would need to be controlled in order to establish, as the cliché has it, 'a level playing field'. Eliminating such variables from the assessment leaves in consideration only the performance measures:
• student progress rates;
• student attrition rates;
• graduate full-time employment;
• graduate starting salaries;
• graduate full-time study;
• CEQ overall satisfaction;
• CEQ good teaching;
• CEQ generic skills;
and an unidentified ‘residual’ in which is found all the variance that has not otherwise been taken into account. The review found a number of methodological problems in respect of the CEQ (ibid.: 9ff) when used for this purpose. Ramsden (2003), the originator of the CEQ, noted that its use as an indicator at the level of the institution was inappropriate (ibid.: para 2.2). There are also a number of technical problems with the instrument which remain in its variants, despite validation in statistical terms (for comment on the CEQ, see Yorke, 1996; 2006). As the Linke Report (Linke, 1991) made clear, employment data are influenced by local and national labour market conditions, and in some subject areas a graduate may take some time before gaining what would be regarded as a ‘graduate-level job’.21 The most robust indicators in the above list are probably those relating to progression and attrition. Doubts arise when consideration is given to the construct validity (see Chapter 1) of the outcomes of analyses based on these indicators for the rewarding of institutional performance in learning and teaching. When some of the indicators are problematic, the doubts are increased since there is a potential for error that is hidden by virtue of assuming that the data are reliable measures (as well as the ‘error’ from unaccounted variance). There must be some doubt about the proportion of the final compilation of measures that can be attributed to ‘error’. As was found in pilot studies of the National Student Survey in England (Richardson, 2004), it is perhaps only at the extremes of the distribution of institutional performance that differences between institutions can be inferred with an adequate level of confidence for the purposes of funding allocation.
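The general logic of 'levelling the playing field', and of the residual in which unexplained variance collects, can be illustrated with a small sketch. This is not the DEST/LTPF methodology: the variables, the data and the simple least-squares adjustment are invented for the purpose of illustration.

```python
# A minimal sketch, not DEST's actual method: adjusting an institutional
# performance measure for exogenous variables (invented columns for student
# age and gender mix) and treating what is left over as the 'residual' in
# which unexplained variance collects.

import numpy as np

rng = np.random.default_rng(0)
n_institutions = 30

# Invented exogenous variables and outcome (e.g. a progression rate).
age_mix = rng.uniform(0.1, 0.6, n_institutions)      # share of mature students
gender_mix = rng.uniform(0.4, 0.6, n_institutions)   # share of female students
progression = 0.85 - 0.1 * age_mix + rng.normal(0, 0.02, n_institutions)

X = np.column_stack([np.ones(n_institutions), age_mix, gender_mix])
coef, *_ = np.linalg.lstsq(X, progression, rcond=None)

adjusted = progression - X @ coef     # residual: performance net of exogenous factors
ranking = np.argsort(adjusted)[::-1]  # institutions ranked on the adjusted measure
print(ranking[:5])
```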
The grail of value added still tantalizes The identification of successful (and unsuccessful) teachers and schools has for a considerable time been of interest to policy-makers in the US. Value added modelling (VAM), with its capacity to control for background variables, has been seen as offering the prospect of fulfilling this aim. McCaffrey et al. (2003) evaluated the research evidence bearing on the identification of a ‘teacher effect’ in schools and concluded inter alia ‘the existing research base on VAM suggests that more work is needed before the techniques can be used to support important decisions about teachers or schools’ (ibid.: 111). If this is the case for schools, across which is a fair amount of commonality, then their conclusion surely stands a fortiori for higher education, in which institutional autonomy is a powerful determinant of educational experiences. A valid and reliable index of value added in higher education, whilst continually tantalizing policy-makers, remains some distance beyond their reach. 21 Ramsden (2003: para 2.2) also refers to the influence of field of study on graduate employment data.
Chapter 8
Fuzziness in assessment
Stepping back from presumptions of precision As has already been argued, most achievements in higher education are too complex to be accorded a single unambiguous grade, especially when the grading scale is fine – and there is always the potential for skirmishes at the borders of grades. Even when the grading is unambiguous, it will reflect preceding judgements of what should be assessed. How does one differentiate between, say, 64 and 65 per cent (or between GPAs of 3.4 and 3.5) and, even if one could, what would the meaning-value of such a distinction be? It is appropriate in this short chapter to take a step back from the precision that is so often used (implicitly, explicitly and inappropriately) to signal students’ achievements, and to consider the fuzziness that is inherent in most assessments in higher education. The implications of the analysis presented here lead to a further question: can assessment be ‘good enough’ for its purposes without aspiring to exactness?
Fuzzy sets Assessments are quite often fuzzy measures relating to fuzzy constructs (or fuzzily shared constructs, as Webster et al., 2000, Tan and Prosser, 2004, and Woolf, 2004, have demonstrated). As Sadler (1987: 202ff) pointed out, the fuzziness inheres in both criteria and standards (regarding the latter, he identified the problem of identifying unambiguous descriptors of levels of achievement). One cannot resolve the problem of standards through attempts to define them with precision if they are inherently fuzzy. Lotfi Zadeh is generally credited with bringing fuzzy set theory into prominence. In a paper published in 1973 he wrote: The traditional techniques of system analysis are not well suited for dealing with humanistic systems because they fail to come to grips with the reality of the fuzziness of human thinking and behavior. . . . we need approaches which do not make a fetish of precision, rigor and
mathematical formalism, and which employ instead a methodological framework which is tolerant of imprecision and partial truths. (Zadeh, 1973: 29)
and
[A]s the complexity of a system increases, our ability to make precise and yet significant statements about its behavior diminish until a threshold is reached beyond which precision and significance (or relevance) become almost mutually exclusive characteristics. (Zadeh, 1973: 28)
Zadeh's words bring to mind Miller's (1956) observation that people are less accurate if they have to judge more than one attribute simultaneously. Discursive and artistic achievements are particularly vulnerable to Miller's stricture.
A few researchers have seen Zadeh's view of fuzziness as highly relevant to the challenge of grading student work, and have sought to apply it in practical contexts. Biswas (1995) tested a fuzzy set approach to grading. Echauz and Vachtsevanos (1995) used the approach when they tried to compensate for variations in assessors' grading by incorporating adjustments for variation in the environment of teaching and learning. However, their method was only partially able to offer compensation because a limited number of sources of possible variation could be taken into account, and because of uncertainty regarding the extent to which assessors' actual grades were biased. Ma and Zhou (2000) argued that there would be benefits if students and faculty agreed on assessment criteria and then the criteria were used with respect to a fuzzy grading scale. Although there are other ways of getting faculty and students to engage with criteria (see, for example, Gibbs, 1999, who describes how engineering students were encouraged to appreciate what was being expected of them by making assessments of peers' work), the assessment criteria may well have had to be specified beforehand, when the module or programme was being established and approved.
Saliu (2005) sought to apply fuzzy set methodology to criterion-referenced assessment of students' learning. For example, the four positive levels (unistructural; multistructural; relational; and extended abstract) of the SOLO Taxonomy (Biggs and Collis, 1982) for judging the quality of essay-type work were seen as overlapping constructs rather than discrete categories. Any piece of work could be seen as possessing a 'degree of membership' of each of the four terms, ranging from zero to unity. Markers of essays will recognize the general point qualitatively whenever they have to make decisions on how many marks to award from the stated tariff or mark scheme for, say, the structure of the argument or the coverage of the topic. Some aspects of the essay may be quite good, whereas others may be demonstrably weak. Saliu constructed a rule system based on fuzzy set theory and applied it to six component grades for a course on microcomputer system design. He also applied the standard (non-fuzzy) system used by the department for combining grades,
and a modification of the standard method with revised weightings for the components. In 19 of the 33 instances Saliu claimed that the grading obtained was identical for all three computational methods, though this proportion seems to have depended upon how 'rounding' was applied. What Saliu's work does suggest (and what is widely known from experience) is that how marks are combined will have an impact on the overall grade. It is open to question whether the use of fuzzy set methodology materially improves the sharpness with which overall outcomes can be expressed.
An examination of these studies suggests that it is difficult to see how the logic of fuzzy sets can be translated into action at the practical level of assessing students' work, since the methods that have been used to date are complex and difficult to implement. Further, they lack transparency, and so there are potentially fatal problems with the transmission and reception of meanings. However, though the methodology may be problematic, the principle of fuzzy grading is not necessarily compromised.
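A minimal sketch may help to show what fractional 'degrees of membership' look like in practice. It is not the formulation of Biswas, of Echauz and Vachtsevanos, or of Saliu: the grade bands, their overlaps and the triangular membership functions are invented for the illustration.

```python
# A minimal sketch of the fuzzy-grading idea discussed above: a percentage
# mark is given fractional degrees of membership of adjacent grade categories
# rather than being forced into a single one. Bands and shapes are invented.

def triangular(x, left, peak, right):
    """Degree of membership (0..1) of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Invented, overlapping mark bands.
bands = {
    "basic answer": (40, 47, 55),
    "fair answer": (47, 55, 63),
    "good answer": (55, 64, 72),
}

mark = 58
memberships = {label: round(triangular(mark, *abc), 2) for label, abc in bands.items()}
print(memberships)  # partial membership of both 'fair' and 'good'
```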
Fuzzy grading The Higher Education Funding Council for Wales (HEFCW) acknowledged the difficulty in using a simple scale of grades when making judgements regarding the quality of subject provision in higher education institutions in Wales, and opted for a four-component profile based on programmes and curricula; teaching and learning; assessment and academic support; and student achievement. Its counterpart in England (the Higher Education Funding Council for England, HEFCE) opted for a six-component profile. Apart from the detail of the profile categories, the Welsh quality assessment process differed from that operated by HEFCE in a significant way, in that it involved the coding of judgements regarding provision according to the visible spectrum of colours, with red indicating ‘major strengths and few if any shortcomings’ and violet indicating ‘major shortcomings and few if any strengths’ (HEFCW, 1996). The shading of one colour into the next was intended to convey that, in dealing with such a complex topic, ‘hard’ category boundaries would be inappropriate, and that panels would be expected to use the indications conveyed by the colours in coming to their judgements regarding the quality of provision in the institution under scrutiny. The colour spectrum provides a useful and readily understandable way of representing a continuum of valid, yet non-categorical, judgements about quality across the four key elements of the quality framework. This reinforces the view that there are no precise or absolute boundaries between the key defining characteristics used to describe quality. (HEFCW, 1996, para 21) Though the connection was not explicitly made, the HEFCW’s colour coding is a form of fuzzy grading, with some ambiguity in categorization being acknowledged. A similar point can be made about many of the sets of criteria used to
determine the mark awarded in respect of students' work, though the ambiguity is kept at the level of the criteria rather than at the level of the awarded mark. In some situations, the criteria are so vague as to offer virtually no guidance to anyone with an interest in the outcome (see Cope et al., 2003: 678 for an example). In others, criteria are more elaborate. Table 8.1 is based on an excerpt from a set of assessment criteria produced by an institution.1 Similar expressions of criteria can be found across higher education in the UK.
Table 8.1 An adapted excerpt from a document informing students of assessment issues
60–69% Good answer: Possesses many of the characteristics of an excellent answer but is lacking in some respects. Displays good, but not necessarily comprehensive, coverage of the material. Is well laid out and argued. May not draw upon all key references. Is presented to a high standard.
50–59% Fair answer: Demonstrates understanding of the question and provides a reasonably structured piece of work. Offers fair coverage, picking out the key issues, but lacks development and elaboration beyond these. Omits some key references. Is presented to a good standard.
45–49% Basic answer: Provides a fair but incomplete coverage of the key issues. Has not developed or elaborated the argument sufficiently. Is likely to have some structural deficiencies. Is presented to a fair standard.
1 It would have been unfair to single out one particular institution for comment, hence the adaptation. Similar examples appear as appendices in HEQC (1997b).
The point of the illustration is that probably few submitted pieces of work will conform neatly to the criteria as stated. How is an overall judgement reached when the piece of work exceeds some expectations regarding a grade-level (for example, 'a fair answer' in Table 8.1) yet falls short in others? Or when the student has submitted an elegantly presented assignment whose content is seriously deficient? A fuzzy set analyst would probably say, in respect of a particular piece, that it exhibited fractional degrees of membership of each of the categories 'a good answer', 'a fair answer' and 'a basic answer'. In other words, the category best fitting the piece of work may be 'a fair answer' even though the piece exhibits qualities associated with other categories. Assessors have to judge the 'goodness of fit' between a piece of work and the stated assessment criteria, and their judgement will hinge on the interpretation given to the terms in the criteriology. What, for instance, might differentiate 'fair' from 'good' presentation, or reasonable structuring from a structure with some deficiencies?
Sadler (2005: 180) suggests that one might grade an item of work according to a tabulation of the stated objectives (or expected learning outcomes) according to the extent to which the submission provides evidence that it meets them. Drawing on Sadler's approach, the presumption is that the learning outcomes can
be categorized as being of primary or secondary importance, and that the grade awarded is qualitatively weighted in favour of the former (Table 8.2).
Sadler's tabulation is a useful starting point for reflecting on the assessment process. However, there are a number of difficulties with it. Sadler himself points out that many learning outcomes do not lend themselves to a dichotomous, low-inference (yes/no) judgement of whether they have been achieved,2 unless the dichotomy is seen as a higher-inference cut between satisfactory and unsatisfactory on some underlying continuum of performance. There may be some expected outcomes whose achievement could be anticipated as a matter of course, and only if they appeared not to have been achieved would the assessment process take note, probably by a penalty such as shifting the grade downward. For example, a final year student ought by that stage to be fully conversant with the citing of sources, and could be expected to be marked down if the work were deficient in this respect. Such an expected learning outcome could, of course, be subsumed under the category of secondary outcomes. More awkwardly for the tabulation, it does not cater for achievements that do not fit neatly into the gradation – the pieces of work in which all of the primary, but few of the secondary, outcomes have been achieved, and those in which the primary outcomes have been poorly achieved but the secondary outcomes have been better achieved. As anyone who has marked student work well knows, it does not always fit into neat categories of grades and hence judgement has to be exercised about the most appropriate grade to award.
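The tabulation, and the awkward combinations it does not cater for, can be expressed as a simple lookup. The sketch below follows Table 8.2 (reproduced below); the fallback to 'judgement required' is an addition for the purpose of illustration rather than part of Sadler's scheme.

```python
# A minimal sketch of the mapping set out in Table 8.2: the proportion of
# primary and secondary outcomes achieved is mapped to a grade, with the
# awkward combinations the text describes falling outside the mapping.

SADLER_GRADES = {
    ("all", "all"): "A",
    ("all", "most"): "B",
    ("most", "most"): "C",
    ("most", "some"): "D",
    ("some", "some"): "E",
}

def grade(primary_achieved, secondary_achieved):
    key = (primary_achieved, secondary_achieved)
    if key in SADLER_GRADES:
        return SADLER_GRADES[key]
    # e.g. ("all", "some") or ("some", "most"): not catered for by the table,
    # so the assessor's judgement is needed.
    return "judgement required"

print(grade("all", "most"))   # B
print(grade("all", "some"))   # judgement required
```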
Table 8.2 The grading of work according to the extent to which expected learning outcomes have been achieved
Primary outcomes achieved    Secondary outcomes achieved    Resultant grade
All                          All                            A
All                          Most                           B
Most                         Most                           C
Most                         Some                           D
Some                         Some                           E
Source: After Sadler (2005: 180).
2 The competence-based approach seeks to do this, however – though not always successfully.
Fuzziness in the honours classification
At a broader level, Warren Piper (1994) obtained from 434 course leaders their understanding of the qualities that related to the different classes of the honours degree. The three most frequently mentioned qualities were originality, understanding and amount of knowledge, which probably occasion little surprise and whose relative frequency of mention varied with honours degree outcome (Figure 8.1).
Figure 8.1 Frequency of mention of three aspects of achievement related to categories of honours degree outcome. Note: Data reconstructed from Warren Piper (1994: 176).
Originality was most strongly present at first class level, with its frequency of mention dropping sharply to upper second class level and steadily declining
thereafter. Understanding tended to drop away below the upper second class level, and amount of knowledge peaked in number of mentions at the third class honours level. Although Warren Piper is justifiably cautious about his data, the pattern of mentions he received suggests that respondents found it easier to indicate qualities at the upper end of the classification, since first class received 1,153 mentions of qualities, with the number tailing off to some 450 at the lower end. Hence not much can be read into the peak in ‘amount of knowledge’, which may merely be an artefact of respondents not having singled out many striking qualities at the lower classification levels. Implicitly, the pattern reflects an ‘excellence minus’ approach to the referencing of achievement. Warren Piper (1994) illustrated (both implicitly and explicitly) the difficulty that external examiners found in determining the level of a student’s performance: First-class candidates . . . were described as those who not only showed originality but who could relate their answers in a broad methodological [he might have added ‘and theoretical’] framework; these candidates say, . . . ‘I know what I am saying and I know why I am saying it’. Typically, respondents felt confident about recognising a first-class performance when they came across one. A characteristic of the upper second class answer, it was typically suggested, is that the candidate sticks scrupulously to the question and answers it competently. The borderline
between upper and lower second-class degrees was difficult to negotiate for some interviewees. . . It seemed as though at the 2(i)/2(ii) borderline there was a clash of criteria. As one interviewee put it: 'at what point does competence get rewarded with an upper division and liveliness get brought down because of some technical hash?' Another interviewee said that he was often shown third-class and fail papers and asked to look at them to see if he 'could see anything in them'. (Warren Piper, 1994: 177)
When assessors are judging work that does not, as is the case in some quantitative and science-based areas, possess the quality of being unambiguously right or wrong, they are in effect operating roughly according to fuzzy set principles. If there are, say, 5 marks available for the structure of an essay, then the mark awarded will be a signifier of the extent to which the submitted work approaches the level for which 5 represents the highest standard expected at that particular level of study. A mark of 3, therefore, signifies that the piece of work is of a 'middling' standard. It might contain elements that would on their own merit a 4 or even a 5, but these would be counterbalanced by elements rated less than 3. The danger lies in seeing the number as possessing cardinality, when it is at best ordinal and probably is no more than a fuzzy signifier – the fuzziness encompassing both the imprecision in the judgement and that in the interpretation likely to be placed on the judgement by its recipient.
Ordinality and ‘mapping rules’ Since the application of fuzzy set principles to educational assessment is a formidable undertaking, an alternative, following Dalziel (1998), is to develop a set of qualitative grades which involve order but not numerals (letter grades would avoid the temptation to treat numerals as numbers). Rules can be generated for the conversion of letter grades into a single overall grade. This is not new, since some institutions in the UK use a mechanism of this sort when they convert a student’s profile of grades into an honours degree classification (but not when they use the addition or averaging of numbers to determine the classification, of course). However, the way the classification mechanism works does not cater for variation between subject areas (and, in some instances, within subject areas) in the distribution of grades. Unless grading takes the form of a tightly defined ‘menu marking’, it is likely to be related to a set of more general assessment criteria. If the number of criteria is large, then the grading process is particularly vulnerable to error from the ‘halo effect’, overlap between perceptions of criteria, marker fatigue, and so on. Hence it is probably better to have a relatively small number of criteria; this will not eliminate such sources of error, but may mitigate them. At the level of the particular assessed piece of work, the often-heard criticism of grading components in terms of letters is expressed as something like ‘How can you add together an A, a B, two Cs and a D for the various components?’
This is where 'mapping rules' come in. Components of an assignment are often given numerical weightings to indicate their importance. In using letter grades, the mapping rules can embody priorities. If, in the hypothetical example, the A were awarded for knowledge and understanding of the topic, and the D for presentation, one would probably value the achievement more highly than if the 'content' merited a D and presentation merited an A. One might want to specify, for that particular piece of work, that the overall grade would be influenced most strongly by the knowledge and understanding displayed – perhaps by awarding a provisional grade for this aspect of the assignment and adjusting it upward or downward according to the performances against other criteria (there could conceivably be a subsidiary rule that did not permit such variance to exceed one grade category). When numeric marks are assigned to components of a piece of work, the resolution of priorities is typically achieved by awarding more marks for the aspects deemed of major importance, and fewer for aspects of minor importance.
In combining assessment outcomes of multiple assignments within a module of study, the method might be, in respect of a grading scale running from A (high) to E, something like:
•
determine whether any assignment should have priority akin to ‘knowledge and understanding’ in the preceding paragraph, or reach a particular grade level in order to merit a pass overall; construct ‘mapping rules’ to determine the overall grade (this could be done automatically), for example, requiring two of three assessments to be graded B or above for the overall grade to be B, unless the worst grade were E (when the overall grade would drop to C); determine what rectification would be needed on the part of the student in order to gain an overall pass.
However, Dalziel (1998) cautions that, as the number of categories that have to be combined rises, the number of mapping rules quickly becomes extremely (he might have said ‘prohibitively’) large. This suggests that mapping rules need to be fairly simple and that they might form a guide to a judgement of the merits of the work rather than an algorithm that would be automatically followed in all cases.
‘Good enough’ assessment? It is reasonable to assume that the assessment of student work is undertaken with the best of intentions. Weaknesses in the assessment process are primarily of methodology rather than of the assessors. This book has shown that marking and grading are troublesome activities, and probably more troublesome than many – perhaps the considerable majority of – assessors appreciate. A further factor to be borne in mind is the pressure on assessors that stems from the need to assess a considerable volume of student work, often to tight deadlines. The assessment system is, as a consequence, under considerable strain. How might the strain be lessened?
First, it needs to be acknowledged that marking to a high level of precision is, in most circumstances, infeasible even where there are specification 'menus' against which work can be marked. However, for some aspects of achievement, such as in Medicine, Nursing and Teacher Education, there are certain curricular components for which a student must attain a specified standard in order to pass – or, perhaps more accurately, there is a threshold of performance below which a pass cannot be awarded. Elsewhere, clear-cut decisions are more of a problem, which is not to suggest that relative judgements of worth cannot be made: rather it is to suggest that worth is often better judged in terms of broad categories rather than fine distinctions. This, of course, goes against the tenets of traditional measurement theory as presented by writers such as Ebel (1972). The reasons for taking such an oppositional line are to be found in various places in this book, but a reprise of some of the key points is probably appropriate.
• It is difficult to satisfy the technical criteria that 'high stakes' summative assessment demands.
• There are many influences on assessment that can affect the mark awarded.
• Criteria are often fuzzy and fuzzily shared.
• It is difficult to mark with precision against even a tightly specified 'menu'.
• Interested parties may be more interested in the generality of performance across a range of fields, and/or the particularity of performance in specific fields, neither of which is ideally served by the marking and grading approaches typically used.
To these must be added two further points.
• Some aspects of achievement are not defensibly assessable in terms of marks or grades.
• The cumulation of marks presents a number of technical problems and in any case may not be particularly helpful to interested parties.
These prompt a consideration of the utility of aiming for marking practices that are ‘good enough’ for the purposes to which marks and grades are put, rather than aiming for an unachievable perfection in the calibration of students’ achievements. Some will probably recoil at the suggestion of ‘satisficing’, as Simon (1957) termed such an approach, in an aspect of the educative process as sensitive as assessment. However, much that goes on under an espoused model of precision may not stand up to scrutiny, and the model-in-use may be closer to satisficing than is appreciated. What might a satisficing approach have to offer? In circumstances in which there is no right, wrong or clearly gradable answer to the assessment challenge, assessors would openly act to judge the student’s response, perhaps according to broad categories applied to parts or the whole of the challenge, and using ‘mapping rules’ where appropriate. The grade awarded would be a broad signifier of the level reached by the student, but could be amplified by comment on strengths
and weaknesses which could be used as supporting evidence in a portfolio of achievements and perhaps in the construction of a formal institutional transcript. The important issue for interested parties is what the student has achieved, rather than the number or letter that is attached to the performance. As will be recalled from earlier, Alverno College takes the further step of not grading achievements according to a scale.
Under a satisficing approach, the distribution of assessors' effort might be different. Instead of trying to make precise connections with an assessment specification menu, the work could be assessed in more general terms (and hence more quickly), but with the saving of time on this part of the activity being repaid by more time being spent on comment that would be useful to the student, as feedback on the standard achieved, 'feed forward' in respect of any subsequent work, or as a contribution to the student's portfolio of achievements (and hence with the potential to be drawn upon for purposes beyond the academy). This would give summative assessment more of a formative slant than it typically has, which would be consistent with the desires of those who emphasize the importance of assessment for student learning (such as Alverno College Faculty, 1994, and contributors to Knight, 1995).
Chapter 9
Judgement, rather than measurement?
‘There’s always another way’
Considerable exposure has been given in the preceding chapters to the weaknesses inherent in marking and grading. Thus far, it has been argued that:
• assessments vary with the purpose and challenge of the assessment task, the interpretation of the assessment criteria, and the assessor;
• most assessments cannot, for a variety of reasons, meaningfully be made with the fine discrimination of the percentage scale;
• numerical grades do not normally possess the properties that would allow mathematical manipulations to be valid;
• the methodological zigzag between fineness and coarseness in the grading system that ends up with a grade-point average is troublesome;
• the combination of grades may inappropriately conflate grades awarded for a variety of kinds of achievement;
• some kinds of achievement cannot be graded robustly without the deployment of far more resources than are typically available in higher education, and some may not sensibly be graded at all.
If there has been an over-investment in marks and grades, and in their combination, is there a viable alternative? There is a cartoon in which a mother terrapin swimming on the surface of water is followed by a line of young terrapins doing likewise; however the last terrapin has inverted its shell and is happily padding as if in a canoe. The caption ‘There’s always another way’ points to the virtues of rethinking established practice. As the following quotations indicate, some argue strongly – and the argument is by no means new – that an over-reliance on measurement or quasi-measurement is a mistake: We must remember that grading is not measurement and can never be an exact science. (Lewis, 2006: 146)
The obsession with measurement is the problem. There’s something we can use instead of measurement – judgement. Some of the most important things in the world can’t be measured . . .
Henry Mintzberg, quoted in The Financial Times, 16 September 2003
Following the discussion of ‘good enough’ assessment in the preceding chapter, consideration is given to an increased use of judgement in assessment, and to its implications, amongst which are the dispensing with overall grades (but not grades for curricular components) and the adoption of a compromise position in which the fuzziness of overall grades is acknowledged. The latter may be of significance where it might be necessary to have some indication of the ‘transfer-value’ of a student’s achievement, especially across national boundaries.
Learning outcomes and scientific measurement The use of a ‘learning outcomes’ approach to curricula tends to sustain the scientific measurement approach to assessment that was strongly implicated in the ‘instructional objectives’ movement in the middle of the twentieth century (of which Tyler, 1949, was an early proponent, and which Mager, 1962, subsequently technicized), and whose roots lie in psychological theory stressing stimulus and response. Shepard (2000) noted that the residual effects of approaches based on scientific measurement were out of kilter with educational approaches owing much more to constructivism, and that work remained to be done in order to bring assessment into better alignment with constructivist educational methodology. If ‘instructional objectives’, latterly reformulated as ‘expected learning outcomes’, can be specified tightly at the outset then, by the principle of ‘curriculum alignment’ (Biggs, 2003), the assessment methods should be fully coherent. An issue of importance is how tightly closed the curriculum should be. As was noted in Chapter 5, if the learning outcomes are narrowly drawn, then students are likely to restrict themselves to the stated expectations, perhaps performing well on a relatively narrow agenda. If the outcomes are drawn more broadly, then the scope for interpretation on the parts of both student and assessor is widened, involving both opportunities and risks for students. Eisner advocated ‘expressive objectives’ as a counterbalance to the closedness of instructional objectives (Eisner, 1969). ‘Expressive objectives’ are applicable in learning situations in which a person first identifies the problem and then goes on to produce a solution to it – which is the kind of innovative behaviour that Reich (1991, 2002) and others have seen as critical to the success of advanced economies. Eisner subsequently found it necessary to propose a third category that covered the kinds of demand typically made of engineers and designers – where the problem is established, but the choice of solution is open (Eisner, 1985: 77f). This category, which is intermediate between ‘instructional objectives’ and ‘expressive objectives’, could be labelled ‘problem-solving objectives’. Table 9.1 is offered as a heuristic, in that it illustrates in general terms the differences between types of learning outcome, and hence by extension between
broad types of educational task.
Table 9.1 Categories of educational objectives
Type of objective    Problem      Solution
Instructional        Specified    Specified
Problem-solving      Specified    Open
Expressive           Open         Open
Grading can be predetermined in respect of instructional outcomes, and can be utterly precise in some circumstances – for example, when computer marking of multiple-choice questions is employed. In both of the other two types, the success of the proffered solution can be judged only against general criteria whose salience is likely to vary with the judge. In other words, any grade that might be given to the solution has to be hedged with an uncertainty that reflects the elasticity of the assessment criteria and the preferences of the assessor. To some extent echoing Eisner’s position on ‘problem-solving’ objectives, Knight and Yorke remark:
Insofar as employability involves the promotion of achievements that cannot be specified completely and unambiguously, it cannot be measured, although local judgements can be made. (Knight and Yorke, 2004/06: 5)
In other words, employment throws up problems of various kinds, to which there is no single ‘right answer’ that can be applied uncritically, since the solution has to take into account the contingencies of the situation. The same is true of life in general. In practice, the boundaries between the three types of outcome are often not as sharp as presented in Table 9.1. The consequence is that the inherent fuzziness of assessment spreads across most kinds of intended learning outcome. It is readily apparent that, the further one moves away from the boundedness of ‘instructional objectives’, and towards the openness of creative behaviour, the more difficult it becomes to assign grades, validly and reliably, for achievements (see the discussion in Chapter 2). The prediction of future achievement adds a further dimension to the challenge.
Competence-based programmes and the issue of context
Competence-based programmes, which came to prominence in the US around 1970, are at root an extension of the behavioural objectives approach with the focus of attention clearly on what the learner is expected to be able to do as a result. Cope et al. (2003) discuss, with reference to teacher education in Scotland, the challenge of combining assessments of academic work and professional practice. Of importance for this chapter are their observations on the importance of the context within which the performance is carried out. Where a competence-based approach is in operation, the context is a variable that is accommodated by the
use of the slippery word ‘appropriate’, which deals with the problem of variation to some extent but inevitably at the cost of precision. Cope et al. argue that ‘it is the very context and uncertainty that define what is difficult about the practice of teaching’ (ibid.: 675). Martin and Cloke (2002) make much the same point about the standards for qualified teacher status currently applied in England and Wales, arguing that assessment is based on a scientific model that assumes (implicitly) that contextual influences are minimal. The rhetoric of the then Department for Education and Employment concerning specificity, explicitness, reliability, consistency and accessibility was borrowed from assessment systems in which context could be ignored and reliability could be more or less guaranteed. Their argument is that the scientific model is inappropriate for this type of assessment, and that the assessment of practice teaching should rely on a judgemental model in which the assessor makes professional judgements about the nature of the practice observed and its appropriateness to the particular context.
Prediction
Adelman (forthcoming) noted that using general measures of achievement such as SAT and ACT scores as an ‘input’ measure is inappropriate when the ‘output’ measure relates to a subject specialism. The point can be extended to the relationship between the achievements of a graduate at bachelor’s level and the capabilities required in postgraduate-level study, employment, or elsewhere outside higher education. After graduating, some of the components of the undergraduate achievement may be of little or no practical relevance, and hence it may be only a sub-set of the profile of achievements that needs to be taken into account. An overall index, such as a cumulated GPA or degree classification, blurs the picture. The analogy might be made with photographs of celebrities that have been coarsely pixellated on websites in order to make it challenging to identify them in competitions run by a radio station – they give some information, but not enough without the competition’s spoken clues. For future purposes, the overall index on its own does not allow discrimination of the relevant (for some particular purpose) from the irrelevant.
Skills and skilful practices The development in students of vocationally relevant skills is present in national systems for post-compulsory education, but the location of this activity is influenced by the structure of this provision. In Australia there is a specific national vocational education and training sector; in England vocational programmes at higher education but sub-degree level can be found in both further and higher education institutions; in Scotland the further education colleges dominate sub-degree work; and in the US much provision is specifically made through the vocationally relevant associate degree and less directly through liberal arts programmes at the same level. Hence the emphasis on such skills at the bachelor’s level varies
somewhat between systems,1 and the extent to which they figure in assessment schemes will reflect this.
1 It also varies within systems, between subject disciplines.
The UK government has for many years promoted the development of ‘skills’ with various adjectival prefixes – for example, the literature gives as qualifiers ‘core’, ‘key’, ‘generic’ and ‘transferable’. Wolf (2002: 117ff) traces the policy on ‘core skills’ back to a speech in 1989 by the then Minister of Education. The general suggestion of a ‘skills agenda’ was readily taken up in terms of policy, even though the minister’s engagement with ‘core skills’ seems to have been ephemeral (ibid.: 119). However, ‘skills’ have never found much favour in higher education, largely because of their narrow instrumentalism and the lack of a convincing rationale. The USEM approach (Knight and Yorke, 2004; Yorke and Knight, 2004/06; and see Chapter 1) focuses on ‘skilful practices in context’ with its implication of involving professional judgement (in contrast to behaviour that is rather mechanical in character). Adopting the USEM approach necessarily moves assessment away from the testing of the implementation of skills (fairly ‘hard’ measurement) and towards (‘softer’) judgement regarding professional performance.
A broadly similar policy interest in ‘skills’ has been expressed elsewhere, with the promotion of ‘generic skills’ being seen as an appropriate response to the need to develop a workforce that was capable of adapting to the necessities of contemporary economic developments. The Australian Council for Educational Research (ACER) developed a Graduate Skills Assessment (GSA), which has been available to Australian universities since 1999. According to Armitage (2006), uptake of the GSA to date appears not to have been strong, an unresolved issue being the location of responsibility for funding its use. The reported support of the federal Minister for Education, Julie Bishop, for a scoping study for an extension to higher education of the schools-oriented Programme for International Student Assessment (PISA) sponsored by the Organisation for Economic Co-operation and Development (OECD) may give a boost to the GSA or similar test. The problem would then be one familiar to those who appraise the effect of performance indicators on performance, that of ‘teaching to the test’ at the expense of other, perhaps more desirable, educational objectives.
The problem with instruments such as the GSA and the self-reporting of personal development of ‘generic skills’ that is a feature of the Course Experience Questionnaire is that they do not adequately address the capability of the person in a ‘live’ situation. People may be able to ‘talk the talk’, but this is no reliable indicator that they are effective in, say, a workplace in which they have to make a judicious combination of theoretical and practical understanding in order to achieve satisfactory outcomes. The only valid way of determining such capability is through actual practice. In most circumstances, this cannot be achieved by assessing according to the grading approaches typical of the academy, since the generally accepted technical requirements of such assessments are difficult to fulfil. A more valid and practical
approach is to require the student to make a claim for their achievements and to buttress this with evidence. This approach is consistent with the idea of personal development planning and the construction of a progress file and a portfolio of achievements. It also involves a reliance on qualitative data which could be supplemented by appropriate quantitative data. The use of qualitative data, though, requires a different approach to evidence from that offered by some of the characteristics of educational measurement depicted in Table 1.2.
Using a qualitative approach The contrast between the scientific measurement and judgemental models of assessment was illustrated, following Hager and Butler (1996), in Chapter 1. In the judgemental category falls the constructivist approach to educational evaluation described by Guba and Lincoln (1989). The approach favoured by these authors replaces some of the concepts of (positivistic) educational measurement by counterparts appropriate to qualitative evidence. Five possible replacements are outlined below: an extended treatment is given in Guba and Lincoln (1989: 233ff.). ‘Validity’ in the traditional educational measurement literature is, as was shown in Chapter 1, multifaceted. Concurrent, content and construct validity all refer in some way to the way in which a test can capture reality. Guba and Lincoln refer to the ‘truth value’ of the evidence which, in their qualitative, constructivist view, has to be seen in terms of ‘credibility’ which entails, inter alia, sustained engagement with what is being observed and a critical appraisal of the evidence. Predictive validity (‘external validity’ in Guba and Lincoln’s terms) relates to the generalizability of test outcomes, which, in constructivist terms, becomes transmuted into ‘transferability’. ‘Reliability’ becomes ‘dependability’, which is more tolerant of the shifts in experience and practice that occur in naturalistic settings. Guba and Lincoln offer ‘confirmability’ as the constructivist parallel to the (presumed) objectivity of positivistic approaches to measurement. Confirmability refers to the tracking of data to its sources and to the checking of the coherence of evidence and interpretation. Educational measurement is expected, under traditional approaches, to be low in ‘reactivity’ – i.e. the measurement process is supposed not to have more than a small influence on the behaviour of the person being tested. In situations such as workplaces, where students are interacting from time to time (and in some cases continuously) with those who are their assessors, the ‘low reactivity’ criterion is unlikely to be met. This is an issue with complex ramifications, such as those relating to the role conflict of a person acting both as mentor and assessor, that will not be explored here – although some of the consequences in respect of assessors not wanting to fail students were discussed in Chapter 5. The significance of incorporating qualitative data into summative assessment is that, whether or not one goes as far as Guba and Lincoln in adopting a constructivist position, one is faced with the likelihood of needing to bring together different kinds of data. Addition and similar mathematical operations are not valid options, and some form of profiling is preferable.
‘Top down’ assessment?
Overall summative assessment is typically built up from the assessments given in respect of programme components – a ‘bottom up’ approach. What might a ‘top down’ approach look like? The key question that such an approach would pose is something like ‘How have you satisfied, through your work, the aims stated for your particular programme of study?’ (Yorke, 1998c: 181). A question of this sort opens up the possibility of the student making a case that they merit the award in question, a case that can be made by stressing the profile of achievements particular to the individual. The matrix of ‘graduateness’ produced as an outcome of the Graduate Standards Programme (HEQC, 1997a) indicates that there are many ways in which a claim for graduateness could validly be made (Table 9.2). A student who had concentrated on, say, academic research in History would produce a different kind of claim from another who had concentrated on a service-related programme such as human resource management. If the graduateness matrix were imagined as the numbered cloth beneath a roulette wheel, the two students would place their achievement chips in patterns that were quite different from each other and yet both would win. It is important to stress that the decision on whether the student’s work met the requirements for the award of an honours degree would lie with the academics concerned.
Table 9.2 Aspects of ‘graduateness’
Subject mastery: development of knowledge and understanding of content and range; paradigms; methodology/ies; conceptual basis; limitations and boundaries; relation to other frameworks; context of use.
Intellectual/cognitive: development of critical reasoning; analysis; conceptualization; reflection/evaluation; flexibility; imagination; originality; synthesis.
Practical: development of investigative skills/methods of inquiry; laboratory skills/fieldcraft; data/information processing; context/textual analysis; performance skills; creating products; professional skills; spatial awareness.
Self/individual: development of independence/autonomy; emotional resilience; time management; ethical principles and value base; enterprise; self-presentation; self-criticism.
Social/people: development of teamwork; client focus; communication; negotiation/micropolitics; empathy; social/environmental impact; networking; ethical practice.
Source: HEQC (1997a); reproduced with permission.
Although the ‘top down’ assessment question might at first sight be thought to be holistic, it is not. The question asks for evidence of achievements which could be a mixture of marks or grades for modules of study (or parts of modules) which might be referenced in some way against the relevant cohort; qualitative assessments of performances in naturalistic settings (such as work placements); and claims of achievements that are not formally assessable by the higher education institution but are nevertheless supported by evidence. The making of claims of this sort implies that the student has the relevant information to hand, which would require the collation of a portfolio of achievements – the kind of activity envisaged in the Dearing Report (NCIHE, 1997) in the UK, and discussed at some length by Klenowski (2002). Driessen et al. (2005) describe an approach to the assessment of portfolio evidence which can be summarized in a flowchart (Figure 9.1).
Maclellan’s (2004) proposal of ‘alternative assessment’ hints that the following might (with some adjustment) fit into a ‘top down’ approach:
• Student involvement in setting goals and criteria for assessment
• Performing at a task, creating an artefact/product
• Use of higher level thinking and/or problem solving skills
• Measuring metacognitive, collaborative and intrapersonal skills as well as intellectual products
• Measuring meaningful instructional activities
• Contextualization in real world applications
• Use of specified criteria, known in advance, which define standards for good performance.
(Maclellan, 2004: 312)
Figure 9.1 An approach to the assessment of portfolios (in outline: portfolio submission; assessment by the mentor; where agreement on a pass is not reached, assessment by a first and then a second rater; committee review; final decision). After Driessen et al., 2005: 216.
190 Judgement, rather than measurement? ‘Up to a point’ might be an appropriate reaction. The involvement of students in setting goals and criteria is an important contribution to both motivation and metacognition. Maclellan cites in her support a number of writers who would be seen broadly as subscribing to what has been termed ‘authentic assessment’ (though ‘the assessment of authentic tasks’ might be a preferable term), i.e. the assessment of activities that relate closely to situations beyond the borders of the academy.2 So far, so good. Where the argument falters is where it refers to measurement – a matter that in this book has been argued to be distinctly problematic – and on the reliance on pre-specified criteria which, unless they are expressed in broadly encompassing terms, would cause difficulties for an assessor such as Eisner (1985). A corollary of adopting a ‘top down’ approach to overall summative assessment is that the variety of achievements is unlikely to be scalable to a common format. There is an irony in the fact that, as higher education in the UK has diversified in terms of both the student body and the programmes available to it, many feel the need to bring into a common reporting format the variety of assessments that are made – in effect, to make commensurate the incommensurable. Table 9.2 showed the multi-faceted nature of the concept of ‘graduateness’, and implicitly pointed to the need for a similar multi-faceted assessment regimen in which students could demonstrate where their strengths lie. An overall honours degree classification or GPA cannot meaningfully be computed when achievements are very diverse. Note, though, that the point applies only to the overall summative assessment but not necessarily to individual assessments made in respect of programme components. Professional practice is integrative Professionals, as Eraut (2004b) observes, have to integrate their understanding and practical capabilities. Writing of social work students whose success in assessment was under threat, Bogo et al. (2006) noted that some failed to see how the various aspects of their work fitted together – in terms of the stages of professional development put forward by Dreyfus and Dreyfus (2005), they were acting more like novices than anything else. In this study the work of one student with clients was described as ‘unidimensional’, addressing only concrete issues. In terms of the SOLO taxonomy (though that was developed by Biggs and Collis, 1982, with written material as the focus) the performance would be characterized at best as ‘multistructural’ and may even have been as limited as ‘unistructural’. The article by Bogo et al. prompts a mapping of the SOLO taxonomy categories against those of Dreyfus and Dreyfus, in which a rough correlation can be discerned (Table 9.3). The correlation is at its weakest where ‘proficiency’ is concerned, since it seems to span the ‘relational’ and ‘extended abstract’ categories of SOLO. 2 The term ‘real world’ is often used in this context – indeed, Maclellan uses it herself in the cited extract. The use of the term implicitly regards education institutions as not belonging to the ‘real world’, which is an unwarranted aspersion.
Table 9.3 The categories of the SOLO taxonomy mapped against those of professional development according to Dreyfus and Dreyfus
SOLO level | Brief description of SOLO level | Dreyfus and Dreyfus category
Prestructural | Irrelevant or unmeaningful | –
Unistructural | Focus on one relevant aspect only | Novice
Multistructural | Covers several relevant aspects, but in an uncoordinated way | Advanced beginner
Relational | Aspects are related into a coherent whole | Competence/Proficiency
Extended abstract | As ‘relational’ but developing a broader perspective involving higher level principles | Proficiency/Expertise
The point of including Table 9.3 is that it might have some heuristic value in the identification of the level of performance achieved by a student in the complexity of a professional situation, such as when they are undertaking a period of professional placement. The determination of the overall level reached by the student would necessarily be a qualitative judgement built up from observations over a period of time.
Transcripts and progress files Recommendation 20 of the Dearing Report (NCIHE, 1997: 141) stated: We recommend that institutions of higher education, over the medium term, develop a Progress File. The File should consist of two elements: • a transcript recording student achievement which should follow a common format devised by institutions collectively through their representative bodies; • a means by which students can monitor, build and reflect upon their personal development. This recommendation has subsequently become part of policy in respect of higher education across the whole of the UK.3 The construction of a progress file implies the collation of a portfolio of achievements which could, if desired, be included in the summative assessment process (see the example from Driessen et al., 2005, above). Transcripts of performance are not new: they have been in use in higher education in the US for very many years and have provided the basis for major research studies spanning a number of decades (Adelman, 2004). They go beyond the crudeness of the GPA to provide the user with an indication of where the student had been particularly successful and where relatively weak. The steady movement 3 See the policy statement set out by the Quality Assurance Agency for Higher Education at http://www.qaa.ac.uk/academicinfrastructure/progressFiles/archive/policystatement/default.asp (accessed 9 November 2006).
192 Judgement, rather than measurement? in the UK towards the provision of transcripts for students has been given a push by the introduction of a Diploma Supplement in European countries. Under the Joint Declaration of the European Ministers of Education (‘The Bologna Declaration’), agreed in 1999 by 29 ministers in the European Union, students are entitled to receive a Diploma Supplement which sets out what they have achieved. The Diploma Supplement also shares features with the transcripts used in the US. In the UK the Progress File builds to some extent on considerable – if varied – experience with records of achievement in schools and further education colleges (e.g Broadfoot, 1986; Bullock and Jamieson, 1998) and in higher education in the UK (Assiter and Shaw, 1993). Politically, the idea of a record of achievement in post-compulsory education has been supported by the Review of qualifications for 16–19 year olds (Dearing, 1996: 41ff) and the National Record of Achievement review (Goodison, 1997). Personal development planning (PDP), part of the Progress File envisaged in the Dearing Report but also used in the US and Australia, was the subject of an extensive literature review by Gough et al. (2003) which revealed considerable variation both in the conceptualization of PDP4 and in the way that PDP had been implemented in higher education institutions (see also Jackson and Ward, 2004: 434ff for a schematization of the variation in implementation; and Clegg and Bradley, 2006, for evidence from a single institution showing that PDP is likely to vary with the subject discipline and any related professional body). In some cases, PDP was part of summative assessment activity, in others, it focused more on the student’s broader development. The review team found, inter alia, that most studies reported a positive effect of PDP on student learning. If the effect of PDP is broadly positive, then there are grounds for believing that the learning gain can be carried over into the area of evidencing achievement. PDP is now part of the landscape of higher education in the UK, and has been used in the education of engineers in Oman (Goodliffe, 2005).
Is a single index of achievement the wrong solution? To seek an answer to the problem of combining grades into a single index might be a little like the apocryphal drunk searching under a streetlight for dropped keys, on the grounds that that is where the illumination is. People are so used to single indices of performance in higher education that alternatives to cumulation or combination of grades are rarely considered. An editorial in the Times Higher Education Supplement on 30 September 2005 asserted that the main problem regarding the honours degree classification in the UK was ‘the lack of differentiation brought about by years of grade inflation’ (open to debate, in the light of the discussion in Chapters 4 and 5), and argued that, if full use were made of the categories of the classification, there would be no demand from employers for changing the classification system. Transcripts were not seen as a viable alternative, but could serve a useful purpose by elaborating the classification awarded to a student. Two problems with this proposal are that it does not acknowledge the 4 ‘[R]esearch on PDP and its analogues is still a young area of research with little coherence in terms used or research focus’ (Gough et al., 2003: 65).
uncertainty that inheres in marks and grades, and in any case it is pragmatically difficult to recalibrate classifications to replicate the kinds of distribution of yesteryear when circumstances were very different. Two other possible courses of action are:
• to accept the idea of a single index (with all the caveats that this would necessitate), but to invest that index with a fuzziness that the honours degree classification and the grade-point average currently lack, and to supplement the index with some form of transcript;
• to limit the assessment outcome to the gaining of the intended award or some form of subordinate award (extremely rarely will someone reach the end of a programme with nothing at all to show for it), but also to provide a profile of achievement via a transcript.
There are various approaches to the inclusion in a transcript of data that elaborate on the merits of a performance, which are outlined below. Any single index will not meet the kind of criticism made by Eisner (1985) regarding the valuing of the particular and the fusion of complex achievements. Eisner is concerned that the quality of the ‘outlier’ performance does not get lost in the simplism of the single index: The uniqueness of the particular is considered ‘noise’ in the search for general tendencies and main effects. This, in turn, leads to the oversimplification of the particular through a process of reductionism that aspires toward the characterization of complexity by a single set of scores. Quality becomes converted to quantity and then summed and averaged as a way of standing for the particular quality from which the quantities were originally derived. For the evaluation of educational practice and its consequences, the single numerical test score is used to symbolize a universe of particulars, in spite of the fact that the number symbol itself possesses no inherent quality that expresses the quality of the particular it is intended to represent. (Eisner, 1985: 88–89) Eisner is concerned inter alia with the identification of excellence, which is a recurring theme, whether it be in respect of entry to, or departure from, higher education. The Burgess Group consultation paper on the future of the honours degree classification (UUK and SCoP, 2005) proposed what amounted to a normreferenced category of excellence in the honours degree, awarded to around 5 per cent of those graduating from a programme. (Eisner, and others, might question whether 5 per cent of those graduating could be expected to meet the criterion of ‘outstandingness’ that is implied by ‘excellence’.) However, relatively few graduates achieve an excellent standard across the breadth of their studies, and the more pertinent question is ‘excellent in what respects?’ This suggests a role for transcripts of achievement, in which the areas of excellence can be flagged. Focusing
194 Judgement, rather than measurement? on the issue of excellence, however, distracts attention from the broader argument that all graduates should be able to present their achievements in summary form for employers and others. Transcripts ought to be useful to employers as they seek graduates to fill particular roles in their organizations. Elton (2004) pointed out that a profile of achievements (one can read ‘transcript’ for ‘profile’) might have to include both quantitative and qualitative material. Elton points out that there is nowadays such a variety of degrees that reporting student achievement in terms of a single scale, whether by classification or GPA, has very limited meaning and is open to misinterpretation. The needs of employers Employers desire to recruit graduates who will be successful in their jobs. To recruit a graduate who fails to come up to expectations represents a waste of resources – one that in more extreme circumstances can be redeemed to some extent by firing the employee. The greater problem, perhaps, is the employee who underperforms, but not to the extent that would invoke discontinuation of their employment. Such employees represent a hidden drain on organizational resources. Employers value grades because they see grades as providing broad indications of achievement and therefore as useful in making preliminary siftings of applicants for posts. Pascarella and Terenzini (2005) note that employers tend to the view that grades are multi-faceted indicators – not only of academic achievement, but also of personal qualities like motivation or conscientiousness. They estimate that undergraduate grades account for 2.6 per cent of the variance in job success in the US, the figure being higher where the link between education and employment is closer. A similar pattern obtains as regards the link between grades and starting salary, though the figures that they quote suggest that a strong link between education and employment can raise the variance explained to around 20 per cent. The data are consistent with the findings from their earlier review of the literature (Pascarella and Terenzini, 1991). In the UK, the class of a graduate’s degree, and sometimes the institution from which the degree was obtained, are taken as indicators of a graduate’s ability to perform in an organizational environment. Some employers have a list of preferred institutions from which they recruit (Hesketh, 2000), which may exclude worthy candidates who happen to have attended the ‘wrong’ institution. They may also have the resources to use assessment centres for a second screening of applicants, the first having been undertaken via paper or its electronic equivalent. A ‘good honours degree’ may be only the first consideration in an employer’s recruitment process – the issue is often one of ‘what else can you offer?’, which might be evaluated from the application form and possibly through attendance at an assessment centre. However, assessment centres are not necessarily equitable in their judgement of candidates (Brown et al., 2004). Smaller employers, save perhaps those focusing on narrow specialisms that are taught in particular institu-
Judgement, rather than measurement? 195 tions, are typically not in the position to apply the screening processes of the larger employers. For some employers, the class of a graduate’s degree may not be the primary criterion in recruitment. A more significant criterion might be the graduate’s capacity to engage with colleagues and clients through the application of ‘emotional intelligence’ (Salovey and Mayer, 1990; Goleman, 1996), as is indicated in comments from a couple of senior people in organizations who were responding to interviews regarding the employability of graduates: I’m looking for a balance between accomplishment at a technical level, in terms of a reasonable degree. I’m not after a first, or even a 2:1 when it comes to that. Basically someone who’s gone through the course and understands what they’ve done, but also somebody with a personality . . . to me the personality side is always 51% . . . because unless you can get on with the various departments . . . your job can be made 10 times harder, or if you can get on with them it’s made 10 times easier. (Knight and Yorke, 2004: 54) [Y]ou might read all the theories, say about attachment theory, you might be very clear what attachment theory is and how it should work, but it’s very different when you’re translating it into this family or family A, or family B or family C, they might respond in a totally different way and it’s crucial that you don’t over-interpret that response from a purely theoretical perspective, that you’ve got to put it in context as a whole this family’s functioning ’cos they don’t just function, you know you can’t just pigeon-hole families, you can’t just look and say ‘this is what’s happened to this child so this is how this child is going to respond’ because A or B doesn’t equal C, there are a whole host of other issues so you need to be very open-minded, flexible and interpret as a whole really and not just individual aspects of the same situation. (Knight and Yorke, 2004: 57–58) The honours degree classification fails to convey aspects of performance such as these. So does a GPA. So would any single index. The massification of higher education implies an extension of the range of expectations made regarding studentship, beyond predominantly academic study5 to the development of a variety of broader employment-related attributes and skills. This is where records of achievement and transcripts have something to offer to employers, in that student achievements are disaggregated in ways that could signal where a student had demonstrated particular strengths. Transcripts can show the marks or grades awarded, and can augment these with data that evidence the standing of the grade or mark against the performance of the cohort with the cohort median or the student’s percentile rank. 5 Of course, vocationally oriented undergraduate programmes have never been narrowly academic.
196 Judgement, rather than measurement? The use of transcript data and other evidence, such as graduates’ claims regarding achievements, would involve employers in more effort at the recruitment stage than the use of a single index as a first screening device. It would involve additional cost. However, this additional cost could be minuscule in comparison to the costs of not appointing the best person for the position in question. Though the Dogs Trust’s slogan, sometimes seen in the rear window of motor cars, ‘A dog is for life, not just for Christmas’ is an overstated analogy for the recruitment context, it nevertheless captures something of the need to ‘think long’ when recruiting.
The challenge of change On the evidence presented in this book, the case for change in summative assessment and the reporting of student achievement is strong. However, as Fullan (2001: 69) observes, ‘Educational change is technologically simple and socially complex.’ It is relatively easy to suggest changes in the way that summative assessment is undertaken and reported: gaining acceptance for the changes is vastly more difficult because of the variety of perspectives on grading, the vested interests, and so on. Four arguments against changing the present systems grading overall performance are as follows. 1 The practice is widespread. However, national systems vary considerably in the way in which they report overall grades (e.g. Karran, 2005) and the variation can extend into institutions and parts of institutions, as in Australia (AVCC, 2002). The argument would be that, provided some method of translation between performances in different grading systems were available, such as the European Credit Transfer and Accumulation System (ECTS) (which nevertheless presents problems in some respects, as was pointed out in Chapter 1), the need for change is not particularly pressing. 2 Employers like to have a single index. Employers know well enough that a single index cannot convey the subtleties of a graduate’s achievements but welcome the use of a single index (buttressed in some cases by knowledge of the institution at which the performance was achieved) as a criterion that they can use to conduct an initial sifting from a large number of applicants. 3 There is a lack of agreement on the way forward. Whereas the second report dealing with the classification of the UK honours degree (UUK and SCoP, 2005) proposed a three-point scale, the consultation that followed showed that there was no consensus regarding any scaling option. The consultation paper subsequently produced by the Burgess Group (UUK and GuildHE, 2006) put forward for consideration pass/fail grading of the honours degree, supplemented with a transcript: in working towards its final report the Group considered, as one possibility, the discontinuation of the classification whilst retaining the concept of honours, and giving emphasis to the use of transcripts to indicate the achievements of the students at honours degree level, or
Judgement, rather than measurement? 197 – where applicable – below this. It is likely that any such proposal would be debated vigorously, given the differing perspectives of stakeholders. 4 Change is not worth the effort. Although there might be agreement that a single overall grade is a crude index of a student’s achievement, there may be little impetus to change a system that is well known (if not as well understood as many think it is) and apparently functional. Stasis seems much more probable than change. What might cause a shift in the odds? In the UK, where students have been required to pay ‘top up fees’ since the autumn of 2006, the expenditure is likely to be accompanied by enhanced demands for value for money. This will manifest itself not only in relation to the quality of teaching and learning, but also in relation to assessment, where judgements will be expected to be transparent and procedures secure. Weakness in either could give grounds for student appeals against the decision of the examiners: the achievement of transparency may prove to be the greater challenge. In passing, there has been a marked rise in the number of assessment-related appeals received by the Office of the Independent Adjudicator (OIA),6 from 40 in 2004 (OIA 2005: 46) to 135 in 2005 (OIA 2006: 46), which, allowing for the fact that the OIA has been settling into its role, might be a harbinger of a continuing trend. The general rule in the UK is that students cannot challenge academic judgements, but can challenge various procedural aspects of assessment. However, the boundary is not quite as sharp as this implies. Students have, on occasion, appealed against perceived academic bias in assessment even though the issues have seemed to be about academic judgement. The looseness of grading methodology might provide more fertile ground for appeals. Individuals would probably – for various reasons – not choose to pursue their grievances beyond the institutional procedures, but an organization such as a student union might feel it worthwhile to take a collective concern into the public domain by pressing a particular case. If this were to become a cause célèbre, then it is likely that institutions would be provoked into taking action to ensure that their procedures were as close as possible to being watertight. Elsewhere the conditions for contesting the robustness of grading methodology may be less favourable. For example, although a shift in GPA from 2.99 to 3.01 might offer the awardee a step-change in academic standing, the grading system in the US is longstanding and there seems to be no driver of sufficient magnitude to trigger customer-like assertiveness (there are a number of issues that bear on student behaviour, but there is no shock sufficient, in Senge’s, 1992, analogy, to prompt the uncomfortable frog to jump out of the steadily heating water in the bucket).
6 The Office of the Independent Adjudicator for Higher Education was designated, under the provisions of the Higher Education Act 2004, as the independent reviewer of complaints against higher education institutions. It was given statutory powers from 1 January 2005 in respect of institutions in England and Wales.
Acknowledging the multidimensionality of student achievement
Provided that a number of impinging variables are ignored, an analysis such as that conducted in Chapter 6 can provide a fairly crude categorization of graduates’ achievements, perhaps sufficient for the initial requirements (but not necessarily the needs) of employers. The overt roughness of the categorization would signal that achievements cannot sensibly be pinned down to a single index, and would invite further exploration of where the graduates had performed relatively strongly and where relatively weakly. Using the list of classification levels suggested in Chapter 6 to illustrate the point, and extending it to encompass GPA, the general categorization system would look something like Table 9.4. In practice, summary grades lower than 1.00 would probably be designated an unambiguous fail, given current usage of the GPA in the US. Despite the criticisms of the GPA that were advanced earlier in this book, the scaling of the raw marks into grade-points probably deflates the variance due to differences in subject marking norms – something that the direct use of percentage marks in the UK fails to do. If the summary grade subsumed performances that were highly varied, it could be flagged – say with an asterisk – to indicate to an interested party that it might be worthwhile to undertake further inquiry regarding the location of strengths and weaknesses.7
Some might argue that the GPA, as currently used in the US, is informative enough and that there is no need to meddle: for example, a GPA of 3.24, say, would be interpreted as being a good B level of achievement. However, the apparent precision of the 3.24 does not invite a further exploration of where the person had performed most strongly. It also would implicitly discourage consideration of those aspects of a student’s profile that were simply graded on a pass/fail basis, and any other of the grade-variations that Adelman (2004 and forthcoming) showed were relevant to the interpretation of a student’s performance. A more sophisticated approach would include a confidence interval (whilst acknowledging the assumptions on which it was based) for the individual student’s set of graded achievements, along the lines illustrated en masse in Figure 6.3.
Table 9.4 A fuzzy approach to the honours degree classification and GPA
Honours degree classification | GPA range | Summary grade
First class | 3.70 and above | A
Borderline first and upper second class | 3.30–3.69 | A/B
Upper second class | 2.70–3.29 | B
Borderline upper and lower second class | 2.30–2.69 | B/C
Lower second class | 1.70–2.29 | C
Borderline lower second and third class | 1.30–1.69 | C/D
Third class | 1.00–1.29 | D
Non-honours degree | 0.70–0.99 | D/F
No degree awarded | Below 0.70 | F
Note: The GPA figures are based on US usage, and would need to be adjusted for other GPA systems.
7 The same could be done in respect of a module of study in which there were widely discrepant components.
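For readers who find an operational illustration helpful, the banding in Table 9.4 – together with the suggested asterisk for a summary grade that subsumes widely varied performances – can be sketched in a few lines of code. The sketch below is offered purely as an illustration, not as part of any existing grading system: the band boundaries are those of Table 9.4, the components are equally weighted for brevity (a real GPA would weight by credit hours), and the 1.0 grade-point dispersion threshold used to trigger the flag is an arbitrary assumption.

```python
# Illustrative sketch only: maps a US-style GPA onto the fuzzy summary
# grades of Table 9.4 and flags a profile whose components vary widely.
# The 1.0 grade-point dispersion threshold is an assumption made for the
# example, not a value proposed in this book.
from statistics import pstdev

# (lower bound of GPA range, summary grade), following Table 9.4
BANDS = [
    (3.70, "A"), (3.30, "A/B"), (2.70, "B"), (2.30, "B/C"),
    (1.70, "C"), (1.30, "C/D"), (1.00, "D"), (0.70, "D/F"), (0.00, "F"),
]

DISPERSION_THRESHOLD = 1.0  # assumed, in grade-points


def summary_grade(component_grade_points):
    """Return a fuzzy summary grade; asterisked if components vary widely."""
    gpa = sum(component_grade_points) / len(component_grade_points)
    grade = next(label for floor, label in BANDS if gpa >= floor)
    if pstdev(component_grade_points) > DISPERSION_THRESHOLD:
        grade += "*"  # invite further scrutiny of where strengths and weaknesses lie
    return grade


print(summary_grade([4.0, 3.7, 3.3, 3.0]))  # 'A/B'
print(summary_grade([4.0, 4.0, 1.0, 2.0]))  # 'B*' – a widely varied profile
```

Real usage would, of course, weight modules by credit and handle pass/fail components separately; the point of the sketch is simply that a deliberately coarse banding, plus a flag, conveys more honest information than a figure quoted to two decimal places.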
Invoking relativities
Some institutions and researchers in the US have sought ways of presenting GPAs that signal that the actual GPA has to be interpreted with reference to the contexts in which it was achieved. One such relativistic approach is to set the grade-point gained by a student against the mean for the students taking that class. A 3.00 gained when the class mean is 2.70 appears to be a stronger performance than when the class mean is 3.30. Another relativistic approach has been the inclusion in the student’s transcript of information placing the individual’s performance in context: Rosovsky and Hartley (2002: 15) indicate that Dartmouth College and the universities of Columbia, Indiana and Eastern Kentucky are amongst the institutions implementing this approach. However, the attachment of the median grade for an individual’s class at Dartmouth College seems to have proved unavailing in mitigating grade inflation (Felton and Koper, 2005: 562). Indiana University at Bloomington has since the late 1990s provided detailed contextual information in order that the performance represented by a student’s grade might be more readily understood. The university website8 lists the following:
• The number of students in the course section who received the same grade or higher.
• The number of students who receive any of the possible grades in the course section.
• The number of students who withdrew . . . or who otherwise did not receive a grade.
• The percentage of students who are majors in the given course department.
• The average grade of all students in the course section.
• The average grade point average of all the students in the course section.
• The instructor of the course section.
Perhaps the most sophisticated attempt to adjust for variations in the grading of courses is that of Johnson (1997).9 Using an approach combining Bayesian statistics with item-response modelling, Johnson determined an ‘achievement index’ 8 See www.indiana.edu/~registra/gradedist/bfc.html (accessed 4 October 2006). 9 A more accessible account can be found in Johnson (2003: 218ff).
200 Judgement, rather than measurement? for each student grade which took into account the observed grade and the grading behaviour of the assessor. Johnson’s tests of the robustness of the achievement index, using paired comparisons of student grades, suggested that it provided a considerable improvement over the raw GPA (ibid.: 260–261). Larkey (1997), commenting on Johnson’s (1997) article, noted that the proposal to adopt the achievement index at Duke University (which was, during the publication process, working its way through the university committee system) had, despite clearing earlier committee hurdles, ultimately been voted down 19–14, apparently because the opposition of faculty in the humanities and social sciences was stronger than the support from the science-oriented faculty. The former apparently had more to lose through grade adjustment than the latter, which demonstrates the power of institutional politics in arguments of this sort (and also represents a tacit confirmation of Fullan’s, 2001, point). A major difficulty with Johnson’s achievement index is the complexity of the statistical methodology which, for some (perhaps for many), would obscure the linkage between the raw and adjusted grades. With the imprecisions in grading that are ‘bracketed out’ from his methodology (and whose magnitudes are generally not susceptible to measurement or estimation) it may be more acceptable to use a cruder methodology rather than to press for complexity with its promise (perhaps mirage) of greater precision. The following comment, though made with reference to institutional performance indicators, seems apposite: Many promising indicator systems fail simply because they are too expensive, too complex, too time-consuming, or too politically costly to implement. Often the simplest is the best, even if it initially seems less technically attractive. (Ewell and Jones, 1994: 16) Felton and Koper (2005) attempted a simpler approach than Johnson’s in which they adjusted the observed or ‘nominal’ GPA by taking into account the mean GPA for the class to which the student belonged. Under their system, a grade of C (2.00 grade-points) would be average, and students with adjusted (or ‘real’) GPAs greater than 2.00 would have exceeded the class average. The adjustment would be computed for each class taken (including a weighting for credit hours, as is the normal practice), with the adjusted GPA being calculated in the usual way. Including the nominal and real GPAs on the student’s transcript, they argued, would assist interested parties to make an allowance for the ‘strategic’ student who chose easy courses and/or opted for professors who were perceived to be generous in grading.
Grades in globalized higher education In a world in which student mobility (physical or virtual) is increasing, the grade becomes increasingly significant as a unit of academic currency and exchange. Haug (1997) makes a number of useful points regarding grades and their conversion between different national systems.
Judgement, rather than measurement? 201 Noting that grading systems vary widely between countries (see the examples in Chapter 2), Haug makes observations along the following lines. • • • •
Grading systems are not linear, and the distribution of awarded grades is often skewed. Within a system, grading practices can vary between and within institutions, and between types of assessment demand within programmes of study. A small change in the marks awarded to a student can make a large difference in the grade of the award (see Figure 6.3 and associated discussion). The meanings attached to the original marks or grades may become lost after mathematical conversion algorithms are applied.
Hence Haug argues that the grade has to be understood in the context of the original assessment system and the ‘message’ has to be translated into the system into which the original grade is to be assimilated. This requires more than a simplistic ‘reading off’ of a grade awarded on a scale in one system against the scale of the receiving system: as he puts it, ‘Simple mathematical formulas with their claim to universality are nothing but a fallacious over-simplification of a reality they fail to capture.’ Haug warns those concerned with the interpretation of grades from different systems not to impute to them a precision that, given their inherent variability, they cannot possess. The interpretation of grades cannot be more objective than the grading process itself. In the light of Haug’s points, the operation of credit transfer across national boundaries is as much a matter of judgement as it is of measurement. The relatively simple inter-system conversions suggested by World Education Services10 are mute testimony to what is needed. A fuzzy approach to a summary index does not align with the European Credit Transfer and Accumulation System’s norm-referenced 5-point scale whose sharp boundaries fail to acknowledge the imprecision with which assessments are inevitably made. It also does not align with the typical categorizations of achievement used around the world. However, in a world in which achievements are increasingly expected to be portable, it is necessary to have a method of dealing with credit transfer that acknowledges imprecision. Much transfer, though, relates to performances in modules, and so the problems associated with the cumulation of module marks or grades are of little import.
Towards a longer view The more the weight placed on summative assessment, the more prominent the instrumentality of ‘getting the grade’ becomes. Performance goals dominate learning goals. Even if the adoption by a student of ‘approach’ performance goals (wanting to perform well in comparison to others) may produce outcomes of similar standard to the adoption of learning goals (Pintrich, 2000), the desirability of learning both for its own sake and for more instrumental purposes may be backgrounded. This book has concentrated its attention on undergraduate programmes. This 10 See the tables available by navigating from www.wes.org/gradeconversionguide/ (accessed 2 October 2006).
202 Judgement, rather than measurement? narrow focus was needed in order to keep the discussion of summative assessment within reasonable bounds. However, it is appropriate to take a wider perspective. An undergraduate programme, albeit of considerable importance for a person’s development and prospects, is far from the whole story. It has to be seen in a context of lifelong learning in which continuing professional development (CPD) is an expectation. Further, personal activity in the economy may shift in focus (perhaps considerably, as new challenges and opportunities present themselves), requiring rather more than CPD within an established specialism – for example, new learning is needed to cope with a switch from being an expert in polymer technology to being a manager of materials scientists, a lecturer in Materials Science, or a pro vice-chancellor with responsibilities for teaching and learning in a university. A necessary condition for success and personal satisfaction in a developed role, or in a stable role, is a desire to keep learning and to seek answers to the questions that life poses. A personal orientation towards learning goals is more likely to show long-term benefit than an orientation towards performance goals. Geisinger (1982: 1141) observed that, if studying were done purely to obtain the reinforcement of high grades, then once the programme was completed there would be little to encourage future learning – a point consistent with the subsequent meta-analysis by Deci et al. (1999). In brief, if grades act as the prime motivators, their influence may be to the detriment of lifelong learning. Of course, the situation is nothing like as cut and dried as this might be taken to suggest. The demands of the workplace, or of life more generally, provide plenty of stimuli for further learning in which grading may play only a limited role.
Assessment is challenging: discuss The title of this section mimics the kind of open-ended question that some favour on the grounds that it offers the respondent the opportunity to expound their thinking and, by extension, to demonstrate whether they can move beyond assimilating and reworking the views of others. A major purpose of this book has been to demonstrate, in the preceding chapters, what a challenging activity summative assessment is: readers will decide on the extent to which they agree with the assertion in the first part of this section’s title. The second part of the title hints at a way of approaching the challenge. There is a steady trickle in the literature that points to the need for more discussion in and between institutions about assessment. In Striving for Quality (DEST, 2002: 149) note was made of the paucity of opportunity for academics in Australia to link up with colleagues in similar subject disciplines, and the potential benefit that might accrue from an expansion of such contact. Noting the heterogeneity of summative assessment practice in Australia, James (2003: 196) remarked: ‘Establishing a common grading nomenclature and scale would be a helpful first step, probably a highly contentious one’. Despite doubts expressed at the time regarding the feasibility of extending the amount of professional discussion, the establishment of the Carrick Institute for Learning and Teaching in Higher Education offers a more optimistic future in this respect. Ekstrom and Villegas (1994), Rosovsky and Hartley (2002), James et al.
Judgement, rather than measurement? 203 (2002b) and James (2003) have all pointed to the need for greater institutional discussion about assessment. James et al. put the point succinctly: Setting standards is the business of the academy. The process we propose would involve establishing mechanisms to promote dialogue amongst academics at the level of field of study or discipline to clarify and share expectations of learning outcomes appropriate to higher education, to improve assessment methods and to provide a basis for the determination and reporting of levels of student achievement. (James et al., 2002b: 1–2) More generally, Brown and Duguid (2000) argue that engagement in relevant groupings supports professional development that in some circumstances can be difficult to pre-specify. Professional discussion of the kind envisaged by James et al., which acknowledges the socially constructed nature of assessment practice, has waxed and waned with structural adjustments to the higher education system in the UK. In the 1970s and 1980s, one of the strengths of the higher education system in the UK was the professional development of academics in all aspects of curriculum which stemmed from the activities of the erstwhile Council for National Academic Awards (CNAA). Under the CNAA, panels visited institutions that were outside the then university sector in order to ensure that curriculum proposals were soundly constructed and that standards were appropriate for the proposed award; this involved panellists and institutional representatives in discussions about matters such as why the proposed curriculum was as it was, and why particular choices had been made regarding assessment. The CNAA also convened various developmental activities in which colleagues could share knowledge and experience, on both disciplinary and transdisciplinary bases. The demise of the CNAA in 1992 led to a sharp reduction in developmental activity until the Learning and Teaching Support Network (LTSN) was established with 24 Subject Centres focusing on concerns particularly apposite to the subject disciplines involved. This aspect of the LTSN’s work is now part of the agenda of its replacement body, the Higher Education Academy. The work of the CNAA, the LTSN and the Higher Education Academy implicitly testifies to the value of peer discussion in developing shared understandings of various aspects of curriculum. The same is said to be true of the work of Her Majesty’s Inspectors when they had a role in assuring the quality and standards of higher education in the polytechnics and colleges of the UK whilst they remained answerable to their local education authorities.11 The benchmark statements12 pro 11 Sinkinson and Jones (2001), commenting on OFSTED (Office for Standards in Education) judgements of Mathematics postgraduate certificate courses in initial teacher education in England, suggest that, whereas the expertise built up by Her Majesty’s Inspectors enabled them to make effective and reliable judgements regarding provision, confidence might justifiably be lower in respect of judgements reached by the many ‘additional inspectors’ who had been employed and who had received only a few days of training. 12 Navigate as required from http://www.qaa.ac.uk/academicinfrastructure/benchmark/default. asp.
204 Judgement, rather than measurement? duced under the aegis of the Quality Assurance Agency (QAA) in the UK, and intended to act as reference points regarding the expected coverage in a range of broad subject areas, do not in themselves contain the depth of information needed for shared understanding. The same might apply to the reference documents relating to academic standards suggested by James et al. (2002b: 4). The literature contains suggestions for improving awareness of, and the need to develop expertise in, assessment. Amongst these are: •
•
the provision to teachers of a variety of information bearing on their own institutions’ assessment practices and outcomes, including comparisons of various kinds; the provision of comparative data from institutions regarded as equivalent in character.
Ekstrom and Villegas (1994: 13ff) found that, of a sample of 59 departmental chairpersons, none reported having specific departmental grading policies (this was broadly confirmed by responses from individual faculty), and ‘only about a third said that they ever had formal meetings of departmental faculty to discuss grades’ (ibid.: 13). However, roughly three quarters of the chairpersons had had informal discussions on grading, with discussion being more frequent in the less selective and the comprehensive institutions. Martin and Cloke (2000) noted the rapidity with which, in 1998, the then Department for Education and Employment in the UK introduced new standards and curricula for qualification as a teacher. Students were required to meet all of the standards without any compensation for weak performance being permissible on the grounds of their strength in other parts of the curriculum. The speed with which the changes were introduced left little time for trainers of teachers to develop a full appreciation of what the new standards implied for the assessment process. Echoing Hyland (1994) and Wolf (1995) they observed that prescriptions on their own did not carry sufficient meaning for consistency in their application across a range of different contexts, and pointed to the openness of performance to interpretation: A greater specification of content only, with no reference exemplification of Standards, simply clarifies what is to be measured whilst leaving the criteria for assessment open to interpretation by those assessing. (Martin and Cloke, 2000: 188) It is probable that relatively few institutions engage systematically in the discussion necessary to embed understanding of standards, criteria and levels of performance. In a small-scale study that included the identification of assessment criteria in History, Woolf (2005) found that, irrespective of where the formal responsibility for the construction of criteria existed, in practice the criteria were drawn up by a single person with relatively little input from colleagues. Woolf comments on the possible virtue of such a practice for consistency of language
Judgement, rather than measurement? 205 or style, and the concern this is likely to engender regarding the commonality of understanding regarding the meaning of the criteria. Regarding the latter point, he was echoing a point made by Ebel (1972), who wrote: [For criteria] to be generally meaningful they must not represent the interests, values, and standards of just one teacher. Making them generally acceptable calls for committees, meetings, and long struggles to reach at least a verbal consensus, which in some cases serves only to conceal unresolved disagreements over perceptions, values, and standards. (Ebel, 1972: 84–85) Alverno College in the US has made such discussion a feature of its development of a ‘community of judgment’ (Alverno College Faculty, 1994: 50–51). The following quotation signals the importance placed by the College on institutional dialogue in coming to terms with the complexities of assessing student performance. Just as we talk to each other about the latest development in our disciplines . . . so do we discuss our breakthroughs (and botches) in teaching and assessment on our own campus and advances in assessment throughout the country and abroad. The experience we have in making choices about what to teach, how to teach it, and how to assess the learning, along with our institutional study of how students learn and our interactions with other educators, all make irreplaceable contributions to our personal development. They especially contribute to our belief that we as faculty members are best prepared to design assessments for our students and to make judgments about the success of our students in developing the abilities they need for effective living now and in the future. (Alverno College Faculty, 1994: 51) Although discussion about assessment is likely to be beneficial, there are however limits to what analysis and dialogue can achieve. A sustained research project at Oxford Brookes University has been exploring the utility of providing a criterion-referenced assessment grid in order to get a greater degree of shared understanding about expectations (Price and Rust, 1999; Rust et al., 2003; O’Donovan et al., 2004). Although the grid proved to be useful in focusing attention on the assessment process, and helped to raise the consistency of marking, there remained problems in the articulation of criteria and standards, and in the understanding of these by the students (O’Donovan et al., 2004: 327). The researchers realized that the problem would not be solved by the production of ever more detailed specifications in the grid, and acknowledged the point made by Hussey and Smith (2002) that the apparent explicitness of written learning outcomes was phantasmal. Noting the power of tacit knowledge to colour the interpretation of formal statements, they mitigated the problem of students’
206 Judgement, rather than measurement? understanding by running activities that required students to apply the criteria in marking pieces of work. As Gibbs (1999) had found with Engineering students, this proved to be an effective approach. The researchers had spent a considerable amount of time in discussion in setting up and running the project and, even with this, precision in understanding proved to be an unrealizable ideal.
The need for more expertise This book has shown that educational assessment is complex. There are relatively few academics who have made assessment one of their specialisms, rather more who have grappled to a significant extent with it, and probably a majority who have undertaken assessment on the basis of any or all of the following: prior experience; ‘learning by doing’ (perhaps supported by colleagues who have more experience in assessing); and learning from courses aimed at developing pedagogic expertise. Whereas the Dearing Report (NCIHE, 1997, para 9.43) chided higher education institutions for not doing more to develop assessment methods and to train staff, it failed to acknowledge the magnitude of the deficit in expertise in the sector, with the possible exception of medical education, in which a number of interesting developments have taken place. To meet the challenges that this book has laid out, and to produce summative assessments that are defensible and that communicate meaningfully to interested parties, will require skill in navigating a course between the Scylla of unachievable precision and the Charybdis of swirling vagueness. Summative assessment has to be fit for its multiple purposes: it will never be perfect, it will suit some purposes better than others, and it will probably be necessary to settle for assessments that are ‘good enough’ for their purposes – and ‘good enough’ will vary with the context within which the assessment is carried out. The question which the academy has to address is ‘how good does “good enough” have to be?’
Epilogue Beyond the sunset
This book has scanned the terrain of summative assessment and the reporting of broad outcomes. From a relatively high altitude, three main features are apparent: • • •
the uncertainties inherent in marking and grading; the problems associated with the combination of marks and grades; the fundamental fuzziness of marks and grades, and of their combination.
The logic of the analyses presented in this book points towards the production of disaggregated achievements and away from the computation of single indexes such as the honours degree classification and the grade-point average. Such indexes, although indicating broadly and fuzzily a student’s level of achievement, provide insufficient information in isolation to be of much use to an ‘outsider’ such as an employer. The position taken in this book is that of Winter (1993) and Elton (2004) that the honours degree classification should be discontinued because it is inadequate as a ‘measure’ and as a signal. A similar view is taken regarding the grade-point average. Although the evidence that has been presented is sufficient to undergird the argument against the single index, others with an interest in student achievement may cling to their belief in the value of the honours degree classification or grade-point average. That belief, on the evidence presented in this book, needs to be challenged. However, it may be too much to expect established opinion to be overturned overnight, however compelling the argument. Acknowledging the realpolitik of the relationship between the academy and its external environment, a staging post in a transition between the retention of the status quo and complete abandonment of the single index is to signal, perhaps using some of the suggestions in Chapter 9, that the categories of the single index are fuzzier than they appear to be taken to be. An overt expression of this fuzziness ought then to be accompanied by a more detailed scrutiny of the student’s record in order to establish where their particular strengths lie. Any change in grading procedure is likely to provoke argument, for much may be felt to hang upon it (even if the actual effects are likely to be marginal). In the
208 Epilogue US, for example, the switch from straight letter grading to plus/minus grading has generated heated debate in some institutions, as sporadic reports on the World Wide Web illustrate.1 For change to gain broad acceptance, the ground has to be carefully prepared and a good case presented that what is proposed is broadly beneficial. This involves a two-stage process. First, the academy has to convince itself that proposals for change are soundly grounded and worthwhile. This requires a level of engagement with the evidence on assessment that until now has been accorded by few. Second, the academy has to demonstrate convincingly to its stakeholders that there is a compelling need for change, and that change is in the stakeholders’ interests. There is no quick fix. Although the change advocated in this book is in essence straightforward, its ramifications imply that its implementation would require time. There is evidence enough in this book to provide a basis for the academy to reflect on its approach to assessment and the reporting of student achievement. This book has not attempted to cover the vast literature that exists on assessment, and on that which bears upon assessment. Rather like a space probe surveying a planet from orbit, it has produced a general picture of the landform below. Some features seem more interesting than others, and may attract further probes designed to explore them from the ground. There is plenty of scope for further investigation. Assessment and the reporting of student achievement are a little like a boat that has been sailing for some time in the turbulent seas of the Southern Ocean, and whose structural integrity is suffering the consequences. There are three choices for the skipper: to carry on towards the ultimate destination and hope for the best; to pause in order to patch the boat up and then sail on; or to divert to the nearest port at which a proper refit can be undertaken. It is the refit that is needed.
1 See www.umbc.edu/ugc/Plus_Minus_Summary.htm for a Memorandum from the Undergraduate Council at the University of Maryland, Baltimore County, to the University Senate, dated 11 December 2001, and www.uta.edu/min/UA11-01-05.html for the Minutes of the meeting of the Undergraduate Assembly of the University of Texas, Arlington, held on 1 November 2005. Both URLs accessed 18 May 2006.
References
Academic Policies Committee (2005) Plus-minus grading: impacts on grading at UNCA [Report of the Academic Policies Committee, 25 March]. At www.unca.edu/facultysenate/y0405/Plus-Minus%20Report.htm (accessed 29 October 2006). Access Economics Pty Limited (2005) Review of higher education outcome performance indicators. Canberra: Department of Education, Science and Training. Adelman, C. (2001) Putting on the glitz. Connections 15 (3): 24–30. Adelman, C. (2004) Principal indicators of student academic histories in postsecondary education, 1972–2000. Washington, DC: Institute of Education Sciences, US Department of Education. Adelman, C. (forthcoming) Undergraduate grades: a more complex story than ‘inflation’. In L.H. Hunt (ed.), Grade Inflation and Academic Standards. Albany, NY: State University of New York Press. Alverno College Faculty (1994) Student assessment-as-learning at Alverno College. Milwaukee, WI: Alverno College. Amis, K. (1960) Lone voices. Encounter 15 (July): 6–11. Angoff, W.H. (1971) Scales, norms and equivalent scores. In R.L. Thorndike (ed.) Educational measurement. Washington, DC: American Council on Education, pp. 508–600. Armitage, C. (2006) OECD test to compare students. The Australian, 27 September. At www.theaustralian.news.com.au/story/0,20867,20481923‑12332,00.html (accessed 24 October 2006). Armstrong, M., Clarkson, P., and Noble, M. (1998) Modularity and credit frameworks: the NUCCAT survey and 1998 conference report. Newcastle upon Tyne: Northern Universities Consortium for Credit Accumulation and Transfer. Ashby, E. (1963) Introduction: Decision-making in the academic world. In P. Halmos (ed) Sociological studies in British university education [The Sociological Review Monograph No. 7]. Keele: University of Keele, pp. 5–13. Ashworth, P.D., Gerrish, K., Hargreaves, J. and McManus, M. (1999) ‘Levels’ of attainment in nursing practice: reality or illusion? Journal of Advanced Nursing 30 (1): 159–168. Assiter, A. and Shaw, E. (1993) Using records of achievement in higher education. London: Kogan Page. Astin, A.W. (1993) What matters in college: four critical years revisited. San Francisco: Jossey-Bass. Aston, L. (2003) The poor don’t buy flimsy money-back guarantees. The Times Higher Education Supplement, 10 January: 14. Atkins, M., Beatty, J. and Dockrell, W.B. (1993) Assessment issues in higher education. Sheffield: Employment Department. AVCC (2002) Grades for Honours Programs (concurrent with pass degree), 2002. At www.
210 References avcc.edu.au/documents/universities/key_survey_summaries/Grades_for_Degree_Subjects_Jun02.xls (accessed 22 November 2006). Baird, P. (1991) The proof of the pudding: a study of client views of student practice competence. Issues in Social Work Education 10 (1&2): 24–41. Balchin, T. (2005) Assessing students’ creativity: lessons from research. At www. heacademy.ac.uk/resources/detail/id569_assessing_students_creativity (accessed 12 July 2007). Barke, M., Braidford, P., Houston, M., Hunt, A., Lincoln, I., Morphet, C., Stone, I. and Walker, A. (2000) Students in the labour market: nature, extent and implications of term-time employment among University of Northumbria undergraduates [Research Report 215]. London: Department for Education and Skills. Baty, P. (2000) V-c’s ‘plea for firsts’ fuels quality fears. The Times Higher Education Supplement, 30 June: 1. Baty, P. (2005) Student poll is ‘not valid’. The Times Higher Education Supplement, 14 October: 7. Baty, P. (2006) Overseas students given grade top-ups at fashion college. The Times Higher Education Supplement, 21 July: 2–3. Baume, D. and Yorke, M. (2002) The reliability of assessment by portfolio on a course to develop and accredit teachers in higher education. Studies in Higher Education 27 (1): 7–25. Baume, D. and Yorke, M. with Coffey, M. (2004) What is happening when we assess, and how can we use our understanding of this to improve assessment? Assessment and Evaluation in Higher Education 29 (4): 451–477. Becker, H.S., Geer, B. and Hughes, E.C. (1968) Making the grade: the academic side of college life. New York: Wiley. Bee, M. and Dolton, P. (1985) Degree class and pass rates. Higher Education Review 17 (2): 45–52. Berenger, R.D. (2005) The Lake Wobegon effect and grade inflation: the American University in Cairo case study. Paper presented at the IAMCR Conference, Taipeh. At http://profed-iamcr.cci.ecu.edu.au/pdfs/Berenger_full_paper.pdf (accessed 22 October 2006). Berlins, M. (2004) Janet Jackson’s right breast could hurt millions – in the pocket, if nowhere else. The Guardian G2, 10 February: 17. Biggs, J.B. (2003) Teaching for quality learning at university: what the student does, second edn. Maidenhead: Society for Research into Higher Education and Open University Press. Biggs, J.B. and Collis, K.F. (1982) Evaluating the quality of learning: the SOLO Taxonomy (Structure of the Observed Learning Outcome). New York: Academic Press. Birnbaum, R. (1977) Factors related to grade inflation. Journal of Higher Education 48 (5): 519–539. Biswas, R. (1995) An application of fuzzy sets in students’ evaluation. Fuzzy Sets and Systems 74 (2): 187–194. Bloom, A. (1988) The closing of the American mind. London: Penguin. Bloom, B.S., ed., (1956) Taxonomy of educational objectives [Handbook 1: Cognitive domain]. London: Longmans. Blundell, R., Dearden, L., Goodman, A., and Reed, H. (1997) Higher education, employment and earnings in Britain. London: Institute for Fiscal Studies. Bogo, M., Regehr, C., Woodford, M., Hughes, J., Power, R. and Regehr, G. (2006) Beyond
References 211 competencies: field instructors’ descriptions of student performance. Journal of Social Work Education 42 (3): 579–593. Boud, D. (1990) Assessment and the promotion of academic values. Studies in Higher Education 15 (1): 101–111. Boud, D. (1995) Assessment and learning: contradictory or complementary? In P. Knight (ed) Assessment for learning in higher education. London: Kogan Page, pp. 35–48. Boud, D. (2000) Sustainable assessment: rethinking assessment for the learning society. Studies in Continuing Education 22 (2): 151–167. Bowden, R. (2000) Fantasy higher education: university and college league tables. Quality in Higher Education 6 (1): 41–60. Bowen, W.G. and Bok, D. (1998) The shape of the river. Princeton, NJ: Princeton University Press. Bowley, R.L. (1973) Teaching without tears, fourth edn. London: Centaur Press. Brandon, J. and Davies, M. (1979) The limits of competence in social work: the assessment of marginal work in social work education. British Journal of Social Work 9 (3): 295–347. Brennan, J., Duaso, A., Little, B., Callender, C. and van Dyck, R. (2005) Survey of higher education students’ attitudes to debt and term-time working and their impact on attainment. London: Universities UK. At http://bookshop.universitiesuk.ac.uk/downloads/ termtime_work.pdf (accessed 12 September 2006). Bressette, A. (2002) Arguments for plus/minus grading: a case study. Educational Research Quarterly 25 (3): 29–41. Bridges, D. (1993) Transferable skills: a philosophical perspective. Studies in Higher Education 18 (1): 43–51. Bridges, P., Bourdillon, B., Collymore, D., Cooper, A., Fox, W., Haines, C., Turner, D., Woolf, H. and Yorke, M. (1999) Discipline-related marking behaviour using percentages: a potential cause of inequity in assessment. Assessment and Evaluation in Higher Education 24 (3): 285–300. Bridges, P., Cooper, A., Evanson, P., Haines, C., Jenkins, D., Scurry, D., Woolf, H. and Yorke, M. (2002) Coursework marks high, examination marks low: discuss. Assessment and Evaluation in Higher Education 27 (1): 35–48. Bright, N., Hindmarsh, A. and Kingston, B. (2001) Measuring up the UK’s class system. The Times Higher Education Supplement, 22 June: 6–7. Britton, J.N., Martin, N.C. and Rosen, H. (1966) Multiple marking of English compositions: an account of an experiment (Schools Council Examination Bulletin No. 12). London: HMSO. Broadfoot, P. (1986) Profiling and records of achievement: a review of current practice. Eastbourne: Holt Educational. Broadfoot, P. (2002) Dynamic versus arbitrary standards: recognising the human factor in assessment [Editorial]. Assessment in Education: Principles, Policy and Practice 9 (2): 157–159. Brooks, V. (2004) Double marking revisited. British Journal of Educational Studies 52 (1): 29–46. Brown, G. with Bull, J. and Pendlebury, M. (1997) Assessing student learning in higher education. London: Routledge. Brown, J.S. and Duguid, P. (2000) The social life of information. Cambridge, MA: Harvard University Press. Brown, P. and Hesketh, A. with Williams, S. (2004) The mismanagement of talent: employability and jobs in the knowledge economy. Oxford: Oxford University Press.
212 References Brown, S. and Knight, P. (1994) Assessing learners in higher education. London: Kogan Page. Brumfield, C. (2004) Current trends in grades and grading practices in higher education: results of the 2004 AACRAO survey. Washington, DC: American Association of Collegiate Registrars and Admissions Officers. Bullock, K. and Jamieson, I. (1998) The effectiveness of personal development planning. The Curriculum Journal 9 (1): 63–77. Burton, N.W. and Ramist, L. (2001) Predicting success in college: SAT studies of classes graduating since 1980. New York: College Entrance Examination Board. Buscall, J. (2006) Partial grading system change. The Times Higher Education Supplement, 22 September: 10. Butler, R. (1987) Task-involving and ego-involving properties of evaluation: effects of different feedback conditions on motivational perceptions, interest and performance. Journal of Educational Psychology 79 (4): 474–482. Bynner, J. and Egerton, M. (2001) The wider benefits of higher education [Report 01/46]. Bristol: Higher Education Funding Council for England. At www.hefce.ac.uk/Pubs/hefce/2001/01_46.htm (accessed 9 October 2006). Bynner, J., Dolton, P., Feinstein, L., Makepeace, G., Malmberg, L. and Woods, L. (2003) Revisiting the benefits of higher education. Bristol: Higher Education Funding Council for England. At www.hefce.ac.uk/pubs/rdreports/2003/rd05%5F03/default.asp (accessed 9 October 2006). Cameron, D. (2004) Person, number, gender. Critical Quarterly 46 (4): 131–135. Cameron, H. and Tesoriero, F. (2004). Adjusting institutional practices to support diverse student groups. Paper presented at the Eighth Pacific Rim Conference on the First Year in Higher Education, Monash University, Australia, 14–16 July. Available via www. fyhe.qut.edu.au/FYHE_Previous/papers04.htm (accessed 17 August 2006). Campbell, D. and Russo, M.J. (2001) Social measurement. Thousand Oaks, CA: Sage. Campbell, D.T. and Stanley, J.C. (1963) Experimental and quasi-experimental designs for research on teaching. In N.L. Gage (ed.) Handbook of research on teaching. Chicago: Rand McNally, pp. 171–246. Cannings, R., Hawthorne, K., Hood, K. and Houston, H. (2005) Putting double marking to the test: a framework to assess if it is worth the trouble. Medical Education 39 (3): 299–308. Carr, P.G. (2005) Note from the National Center for Educational Statistics regarding the NAEP High School Transcript Study. At http://nces.ed.gov/programs/quarterly/vol_ 6/1_2/note.asp (accessed 12 June 2005). Carroll, J. and Appleton, J. (2001) Plagiarism: a good practice guide. Oxford: Oxford Brookes University and the Joint Information System Committee. At www.jisc.ac.uk/ uploaded_documents/brookes.pdf (accessed 29 October 2006). Carroll, L. (1999) Alice’s adventures in Wonderland. London: Walker (originally published 1865). Cave, M., Hanney, S., Henkel, M. and Kogan, M. (1997) The use of performance indicators in higher education, third edn. London: Jessica Kingsley. Chambers, M.A. (1998) Some issues in the assessment of clinical practice: a review of the literature. Journal of Clinical Nursing 7 (3): 201–208. Chapman, K. (1996) Entry qualifications, degree results and value-added in UK universities. Oxford Review of Education 22 (3): 251–264. Chevalier, A. and Conlon, G. (2003) Does it pay to attend a prestigious university? [Dis-
References 213 cussion Paper 848]. Bonn: Institute for the Study of Labor. At ftp://repec.iza.org/RePEc/Discussionpaper/dp848.pdf (accessed 2 November 2006). Chevalier, A., Conlon, G., Galindo-Rueda, F. and McNally, S. (2002) The returns to higher education teaching. London: Centre for the Economics of Education. Chevalier, A., Harmon, C., Walker, I. and Zhu, Y. (2004) Does education raise productivity, or just reflect it? The Economic Journal 114 (499): F499–F517. Clegg, S. and Bradley, S. (2006) Models of personal development planning: practice and processes. British Educational Research Journal 32 (1): 57–76. CMU (2004) Briefing for Members of Parliament, 19 January. CNAA/PCFC (1990) The measurement of value added in higher education. London: Council for National Academic Awards and the Polytechnics and Colleges Funding Council. Coll, R.K., Taylor, N. and Grainger, S. (2002) Assessment of work based learning: some lessons from the teaching profession. Asia-Pacific Journal of Co-operative Education 3 (2): 5–12. College Board (2004) Trends in college pricing 2004. Washington, DC: College Board. Connor, H., Dawson, S., Tyers, C., Eccles, J., Regan, J. and Aston, J. (2001) Social class and higher education: issues affecting decisions on participation by lower social class groups [Research Report 267]. London: Department of Education and Employment. At www.dfes.gov.uk/research/data/uploadfiles/RR267.pdf (accessed 12 September 2006). Cooke, R. (2000) Letter. The Times Higher Education Supplement, 14 July: 15. Cope, P., Bruce, A., McNally, J. and Wilson, G. (2003) Grading the practice of teaching: an unholy union of incompatibles. Assessment and Evaluation in Higher Education 28 (6): 673–684. Covington, M.V. (1997) A motivational analysis of academic life in college. In R.P. Perry and J.C. Smart (eds) Effective teaching in higher education: research and practice. New York: Agathon, pp. 61–100. Cowan, N. (2000) The magical number 4 in short-term memory: a reconsideration of mental storage capacity [plus subsequent commentaries]. Behavioral and Brain Science 24 (1): 87–185. Cowburn, M., Nelson, P. and Williams, J. (2000) Assessment of social work students: standpoint and strong objectivity. Social Work Education 19 (6): 627–637. Cowdroy, R. and de Graaf, E. (2005) Assessing highly-creative ability. Assessment and Evaluation in Higher Education 30 (5): 507–518. Cox, B. (1994) Practical pointers for university teachers. London: Kogan Page. Cresswell, M.J. (1986) Examination grades: how many should there be? British Educational Research Journal 12 (1): 37–54. Cresswell, M.J. (1988) Combining grades from different assessments: how reliable is the result? Educational Review 40 (3): 361–382. Cronbach, L.J. and Meehl, P.E. (1955) Construct validity in psychological tests. Psychological Bulletin 52 (4): 281–302. Cross, L.H., Frary, R.B. and Weber, L.J. (1993) College grading. College Teaching 41 (4): 143–148. Dale, R.R. (1959) University standards. Universities Quarterly 13 (2): 186–195. Dale, S.B. and Kruger, A. (2002) Estimating the payoff to attending a more selective college: an application of selection on observables and unobservables. Quarterly Journal of Economics 117 (4): 1491–1527. Dalziel, J. (1998) Using marks to assess student performance: some problems and alternatives Assessment and Evaluation in Higher Education 23 (4): 351–366.
214 References Dearing, R. (1996) Review of qualifications for 16–19 year olds: full report. London: School Curriculum and Assessment Authority. Deci, E.L., Koestner, R. and Ryan, R.M. (1999) A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin 125 (6): 627–668. DEST (2002) Striving for quality: learning, teaching and scholarship. At www.backingaustraliasfuture.gov.au/publications/striving_for_quality/default.htm. DfES (2003) The future of higher education (Cm. 5735). Norwich: The Stationery Office. Division of Evaluation, Testing and Certification (2000) The evaluation of students in the classroom: a handbook and policy guide, second edn. St. John’s, Newfoundland: Department of Education, Dracup, C. (1997). The reliability of marking on a psychology degree. British Journal of Psychology 88 (4): 691–708. Dreyfus, H.L. and Dreyfus S.E. (2005) Expertise in real world contexts. Organization Studies 26 (5): 779–792. Driessen, E., van der Vleuten, C., Schuwirth, L., van Tartwijk, J. and Vermunt, J. (2005) The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: a case study. Medical Education 39 (2): 214–220. Duke, J.D. (1983) Disparities in grading practice, some resulting inequities, and a proposed new index of academic achievement. Psychological Reports 53 (3): 1023–1080. Dweck, C.S. (1999) Self-theories: their role in motivation, personality and development. Philadelphia, PA: Psychology Press. Ebel, R.L. (1969) The relation of scale fineness to grade accuracy. Journal of Educational Measurement 6 (4): 217–221. Ebel, R.L, (1972) The essentials of educational measurement, second edn. Englewood Cliffs, NJ: Prentice-Hall. Ebel, R.L. and Frisbie, D.A. (1991) Essentials of educational measurement, fifth edn. Englewood Cliffs, NJ: Prentice Hall. Echauz, J.R. and Vachtsevanos, G.J. (1995) Fuzzy grading system. IEEE Transactions on Education 38 (2): 158–165. Edgeworth, F.Y. (1890a). The element of chance in competitive examinations. Journal of the Royal Statistical Society 53 (3): 460–75. Edgeworth, F.Y. (1890b). The element of chance in competitive examinations. Journal of the Royal Statistical Society 53 (4): 644–63. Education Policy Committee (2000) Grade inflation at UNC–Chapel Hill: a report to Faculty Council. At www.unc.edu/faculty/faccoun/reports/R2000EPCGrdInfl.PDF (accessed 4 October 2006). Eisner, E.W. (1969) Instructional and expressive educational objectives: their formulation and use in curriculum. In W.J. Popham, E.W. Eisner, H.J. Sullivan and L.L. Tyler, Instructional objectives [AERA Monograph Series on Curriculum Evaluation, No.3]. Chicago, IL: Rand McNally, pp. 1–31. Eisner, E.W. (1979) The educational imagination: on the design and evaluation of school programs. New York: Macmillan. Eisner, E.W. (1985) The art of educational evaluation: a personal view. London: Falmer. Ekstrom, R.B. and Villegas, A.M. (1994) College grades: an exploratory study of policies and practices. New York: College Entrance Examination Board. Elias, K.S. (2003) Tough job market forces up grades. The Times Higher Education Supplement, 5 December: 12. Elliot, A.J. (2005) A conceptual history of the achievement goal construct. In A.J. Elliot
References 215 and C.S. Dweck (eds) Handbook of competence and motivation. New York: The Guilford Press, pp. 52–72. Elliott, R. and Strenta, A.C. (1988) Effects of improving the reliability of the GPA on prediction generally and on comparative predictions for gender and race particularly. Journal of Educational Measurement 25 (4): 333–347. Elton, L. (1998) Are UK degree standards going up, down or sideways? Studies in Higher Education 23 (1): 35–42. Elton, L. (2004) A challenge to established assessment practice. Higher Education Quarterly 58 (1): 43–62. Elton, L. (2005) Designing assessment for creativity: an imaginative curriculum guide. York: Higher Education Academy (in offline archive). Eraut, M. (2004a) Informal learning in the workplace. Studies in Continuing Education 26 (2): 247–73. [Electronic version consulted: available via www.tlrp-archive.org/cgi-bin/ search_oai_all.pl?pn=12&no_menu=1&short_menu=1 (accessed 29 October 2006).] Eraut, M. (2004b) A wider perspective on assessment. Medical Education 38 (8): 803– 804. Ewell, P. (2002) Keeping the value in ‘value added’. Peer Review 4 (2–3): 34–35. Ewell, P.T. and Jones, D.P. (1994) Pointing the way: indicators as policy tools in higher education. In S.S. Ruppert (ed.) Charting higher education accountability: a sourcebook on state-level performance indicators. Denver, CO: Education Commission of the States, pp. 6–16. Faculty Committee on Grading (2005) Grading at Princeton: philosophy, strategy, practice. At www.princeton.edu/~odoc/Grading_at_Princeton.doc (accessed 14 September 2006). Feldt, L.S. and Brennan, R.L. (1989) Reliability. In R.L. Linn (ed) Educational measurement, third edn. New York: Macmillan, pp. 105–146. Felton, J. and Koper, P.T. (2005) Nominal GPA and real GPA: a simple adjustment that compensates for grade inflation. Assessment and Evaluation in Higher Education 30 (6): 561–569. Fishman, J.A. (1958) Unresolved criterion problems in the selection of college students. Harvard Educational Review 28 (4): 340–349. Fransella, F. and Adams, B. (1966) An illustration of the use of repertory grid technique in a clinical setting. British Journal of Social and Clinical Psychology 5 (1): 51–62 Friedlich, M., MacRae, H., Oandasan, I., Tannenbaum, D., Batty, H., Reznick, R. and Regehr, G. (2001) Structured assessment of minor surgical skills (SAMSS) for family medicine residents. Academic Medicine 76 (12): 1241–1246. Fullan, M. (2001) The new meaning of educational change, third edn. London: RoutledgeFalmer. Furness, S. and Gilligan, P. (2004) Fit for purpose: issues from practice placements, practice teaching and the assessment of students’ practice. Social Work Education 23 (4): 465–479. Gater, D.S. (2002) A review of measures used in US News and World Report’s ‘America’s best colleges’. Gainsville, FL: The Center, University of Florida. Geisinger, K.F. (1982) Marking systems. In H.E. Mitzel (ed) Encyclopedia of educational research, Vol. 3. New York: Free Press/Collier Macmillan, pp. 1139–1149. Gibbons, M., Limoges, C., Nowotny, H., Schwartzman, S., Scott, P., and Trow, M. (1994) The new production of knowledge: the dynamics of science and research in contemporary societies. London: Sage. Gibbs, G. (1999) Using assessment strategically to change the way students learn. In S.
216 References Brown and A. Glasner (eds) Assessment matters in higher education: choosing and using diverse approaches. Buckingham: SRHE and Open University Press, pp. 41–53. Gibbs, G. and Simpson, C. (2004–5) Conditions under which assessment supports student learning. Learning and Teaching in Higher Education 1: 3–31. Gibson, F. and Senter, H. (n.d.) Summary analysis of plus/minus grading at Clemson University for the two-year trial period. At http://virtual.clemson.edu/groups/PROVOST/ PlusMinus/Plus_Minus_Final_Report.pdf (accessed 4 October 2006). Goldenberg, D. and Waddell, J. (1990) Occupational stress and coping strategies among female baccalaureate nursing faculty. Journal of Advanced Nursing 15 (5): 531–543. Goldman, L. (1985). The betrayal of the gatekeepers: grade inflation. Journal of General Education 37 (2): 97–121 Goldman, R.D. and Widawski, M.H. (1976) A within-subjects technique for comparing college grading standards: implications in the validity of the evaluation of college achievement. Educational and Psychological Measurement 36 (2): 381–390. Goleman, D. (1996) Emotional Intelligence. London: Bloomsbury. Goodison Sir N., chair (1997) National record of achievement review: report of the Steering Group. London: Department for Education and Employment. Goodliffe, T. (2005) Personal development planning: addressing the skills gap for engineers in Oman. Learning and Teaching in Higher Education: Gulf Perspectives 2 (1), unpaged. At www.zu.ac.ae/lthe/vol2no1/lthe02_03.html (accessed 22 March 2006) Gosselin, C.L. (1997) Plus minus grading study, Fall 1994, Spring 1995, Fall 1995, and Spring 1996, and Fall 1996. At http://www2.acs.ncsu.edu/UPA/otherdata/GRADANAL.HTM (accessed 29 October 2006). Gough, D.A., Kiwan, D., Sutcliffe, K., Simpson, D. and Houghton, N. (2003) A systematic map and synthesis review of the effectiveness of personal development planning for improving student learning. London: EPPI-Centre, Social Science Research Unit. At http://eppi.ioe.ac.uk/EPPIWebContent/reel/review_groups/EPPI/LTSN/LTSN_June03. pdf (accessed 17 October 2006) Greenwood, M., Hayes, A., Turner, C. and Vorhaus, J., eds (2001) Recognising and validating outcomes of non-accredited learning: a practical approach. London: Learning and Skills Development Agency. Guba, E.G. and Lincoln, Y.S. (1989) Fourth generation evaluation. London: Sage. Hager, P. (1998) Recognition of informal learning: challenges and issues. Journal of Vocational Education and Training 50 (4): 521–535. Hager, P. (2004a) The competence affair: or why vocational education and training urgently need a new understanding of learning. Journal of Vocational Education and Training 56 (3): 409–433. Hager, P. (2004b) The conceptualization and measurement of learning at work. In H. Rainbird, A. Fuller and A. Munro (eds) Workplace learning in context. London: Routledge, pp. 242–58. Hager, P. and Beckett, D. (1995) Philosophical underpinnings of the integrated conception of competence. Educational Philosophy and Theory 27 (1): 1–24. Hager, P. and Butler, J. (1996) Two models of educational assessment. Assessment and Evaluation in Higher Education 21 (4): 367–378. Hager, P., Gonczi, A. and Athanasou, J. (1994) General issues about assessment of competence. Assessment and Evaluation in Higher Education 19 (1): 3–16. Hand, L., and Clewes, B. (2000) Marking the difference: an investigation of the criteria used for assessing undergraduate dissertations in a business school. Assessment and Evaluation in Higher Education 25 (1): 5–21.
References 217 Harackiewicz, J.M., Barron, K.E. and Elliot, A.J. (1998) Rethinking achievement goals: when are they adaptive for college students and why? Educational Psychologist 33 (1): 1–21. Hartog, Sir P. and Rhodes, E.C. (1935) An examination of examinations. London: Macmillan. Hartog, Sir P. and Rhodes, E.C. (1936) The marks of examiners. London: Macmillan. Haug, G. (1997) Capturing the message conveyed by grades: interpreting foreign grades. World Education News and Reviews 10 (2): 12–17. Available via www.wes.org/gradeconversionguide/ (accessed 18 September 2006). Haug, G. and Tauch, C. (2001) Trends in learning structures in higher education (II) [Followup Report prepared for the Salamanca and Prague Conferences of March/May 2001]. At www.oph.fi/publications/trends2/trends2.html (accessed 22 November 2006). Hawe, E. (2003) ‘It’s pretty difficult to fail’: the reluctance of lecturers to award a failing grade. Assessment and Evaluation in Higher Education 28 (4): 371–382. Hays, R.B., Davies, H.A., Beard, J.D., Caldon, L.J.M., Farmer, E.A., Finucane, P.M., McCrorie, P., Newble, D.I., Schuwirth, L.W.T. and Sibbald, G.R. (2002) Selecting performance assessment methods for experienced physicians. Medical Education 36 (10): 910–917. Hedges, C. (2004) Public lives; an a for effort to restore meaning to the grade. The New York Times, 6 May: B.2. HEFCE (2003) Schooling effects on higher education achievement [Report 2003/32]. Bristol: Higher Education Funding Council for England. HEFCE (2005) Schooling effects on higher education achievement: further analysis – entry at19 [Report 2005/09]. Bristol: Higher Education Funding Council for England. HEFCW (1996) Quality Assessment Programme 1996/97: Guidelines for assessment, second edn. Cardiff: Higher Education Funding Council for Wales (mimeo). HEQC (1994) Learning from audit. London: Higher Education Quality Council. HEQC (1996a) Inter-institutional variability of degree results: an analysis in selected subjects. London: Higher Education Quality Council. HEQC (1996b) Learning from audit 2. London: Higher Education Quality Council. HEQC (1997a) Graduate Standards Programme: final report (2 vols.). London: Higher Education Quality Council. HEQC (1997b) Assessment in higher education and the role of ‘graduateness’. London: Higher Education Quality Council. Heritage, G.L., Thomas, A.D. and Chappell, A. (2007) Institutional bias and the degree class system. Journal of Geography in Higher Education 31 (2): 285–297. Hersh, R. (2005) What does college teach? The Atlantic Monthly 296 (4): 140–143. At www.theatlantic.com/doc/200511/measuring-college-quality/ (accessed 10 October 2006). Hersh, R.H. and Benjamin, R. (2001) Assessing the quality of student learning: an imperative for state policy and practice. At www.nga.org/cda/files/HIGHEREDQUALITY.pdf (accessed 30 October 2006). Hesketh, A.J. (2000) Recruiting an elite? Employers’ perceptions of graduate education and training. Journal of Education and Work 13 (3): 245–271. Heywood, J. (2000) Assessment in higher education: student learning, teaching, programmes and institutions. London: Jessica Kingsley. Hodges, B., Regehr, G., Hanson, M. and McNaughton, N. (1997) An objective structured clinical examination for evaluating psychiatric clinical clerks. Academic Medicine 72 (8): 715–721.
218 References Hodges, B., Regehr, G., McNaughton, N., Tiberius, R. and Hanson, M. (1999) OSCE checklists do not capture increasing levels of expertise. Academic Medicine 74 (10): 1129–1134. Holroyd, C. (2000) Are assessors professional? Student assessment and the professionalism of academics. Active Learning in Higher Education 1 (1): 28–44. Hornby, W. (2003) Assessing using grade-related criteria: a single currency for universities? Assessment and Evaluation in Higher Education 28 (4): 435–454. Hoskins, S.L. and Newstead, S.E. (1997) Degree performance as a function of age, gender, prior qualifications and discipline studied. Assessment and Evaluation in Higher Education 22 (3): 317–28. Hounsell, D. (2007) Balancing assessment of and assessment for learning [Guides to integrative assessment No.2]. Gloucester: Quality Assurance Agency for Higher Education. Hounsell, D., McCulloch, M. and Scott, M., eds, (1996) The ASSHE inventory: changing assessment practices in Scottish higher education. Edinburgh: The University of Edinburgh and Napier University, in association with UCoSDA. Howarth, I. and Croudace, T. (1995) Improving the quality of teaching in universities: a problem for occupational psychologists? Psychology Teaching Review 4 (1): 1–11. Hudson, L. (1967) Contrary imaginations: a psychological study of the English schoolboy. Harmondsworth: Penguin. Humphreys, L.G. (1968) The fleeting nature of college academic success. Journal of Educational Psychology 59 (5): 375–380. Huot, B. (1990) The literature of direct writing assessment: major concerns and prevailing trends. Review of Educational Research 60 (2): 237–263. Hussey, T. and Smith, P. (2002) The trouble with learning outcomes. Active Learning in Higher Education 3 (2): 220–234. Hyland, T. (1994) Competence, education and NVQs: dissenting perspectives. London: Cassell. IHEP (1998) Reaping the benefits: defining the public and private benefits of going to college. Washington, DC: The Institute for Higher Education Policy. Ilott, I. and Murphy, R. (1997) Feelings and failing in professional training: the assessor’s dilemma. Assessment and Evaluation in Higher Education 22 (3): 307–316. Jackson, N. and Ward, R. (2004) A fresh perspective on progress files – a way of representing complex learning and achievement in higher education. Assessment and Evaluation in Higher Education 29 (4): 423–449. James, R. (2003) Academic standards and the assessment of student learning: some current issues in Australian higher education. Tertiary Education and Management 9 (3): 187–198. James, R., McInnis, C. and Devlin, M. (2002a) Assessing learning in Australian universities: ideas, strategies and resources for quality in student assessment. Available at www. cshe.unimelb.edu.au/assessinglearning/ (accessed 23 April 2007). James, R., McInnis, C. and Devlin, M. (2002b) Options for a national process to articulate and monitor academic standards across Australian universities [Submission to the Higher Education Review, 2002]. At www.backingaustraliasfuture.gov.au/submissions/ crossroads/pdf/11.pdf (accessed 13 October 2006). James, S. and Hayward, G. (2004) Becoming a chef: the politics and culture of learning. In G. Hayward and S. James, Balancing the skills equation: key issues and challenges for practice. Bristol: The Policy Press, pp. 219–43.
References 219 Jessup, G. (1991) Outcomes: NVQs and the emerging model of education and training. London: Falmer. Johnes, J. and Taylor, J. (1990) Performance indicators in higher education. Buckingham: SRHE and Open University Press. Johnson, B. (2004) Higher education credit practice in England, Wales and Northern Ireland. Derby: EWNI Credit Forum. Johnson, V.E. (1997) An alternative to traditional GPA for evaluating student performance. Statistical Science 12 (4): 251–269. Johnson, V.E. (2003) Grade inflation: a crisis in college education. New York: Springer. Johnston, B. (2004) Summative assessment of portfolios: an examination of different approaches to agreement over outcomes. Studies in Higher Education 29 (3): 395–412. Jones, A. (1999) The place of judgement in competency-based assessment. Journal of Vocational Education and Training 51 (1): 145–160. Jones, M. (2001) Mentors’ perceptions of their roles in school-based teacher training in England and Germany. Journal of Education for Teaching 27 (1): 75–94 Juola, A. (1980) Grade inflation in higher education – 1979: is it over? ERIC ED189129. Kahn, P.E. and Hoyles, C. (1997) The changing undergraduate experience: a case study of single honours mathematics in England and Wales. Studies in Higher Education 22 (3): 349–362. Kane, M. (1994) Validating the performance standards associated with passing scores. Review of Educational Research 64 (3): 425–461. Karran, T. (2005) Pan-European grading scales: lessons from national systems and the ECTS. Higher Education in Europe 30 (1): 5–22. Kemshall, H. (1993) Assessing competence: scientific process or subjective inference? Do we really see it? Social Work Education 12 (1): 36–45. Klein, S.P., Kuh, G.D., Chun, M., Hamilton, L. and Shavelson, R. (2005) An outcomes approach to measuring cognitive outcomes across higher education institutions. Research in Higher Education 46 (3): 251–276. Klenowski, V. (2002) Developing portfolios for learning and assessment: processes and principles. London: RoutledgeFalmer. Kneale, P. (1997) The rise of the ‘strategic student’: how can we adapt to cope? In S. Armstrong, G. Thompson and S. Brown (eds) Facing up to radical change in universities and colleges. London: Kogan Page, pp. 119–130. Knight, P., ed., (1995) Assessment for learning in higher education. London: Kogan Page. Knight, P.T. (2002) Summative assessment in higher education: practices in disarray. Studies in Higher Education 27 (3): 275–286. Knight, P.T. (2006) The local practices of assessment. Assessment and Evaluation in Higher Education 31 (4): 435–452. Knight, P.T. and Yorke, M. (2003) Assessment, learning and employability. Maidenhead: Society for Research in Higher Education and the Open University Press. Knight, P. and Yorke, M. (2004) Learning, curriculum and employability in higher education. London: RoutledgeFalmer. Knight, P.T. and Yorke, M. (2004/06) Employability: judging and communicating achievements. York: The Higher Education Academy. Also available at www.heacademy.ac.uk/assets/York/documents/ourwork/tla/employability/id459_employability_ %20judging_and_communicating_achievements_337.pdf (accessed 2 July 2007). Kohn, A. (2002) The dangerous myth of grade inflation. The Chronicle of Higher Education 49 (11): B7. At www.alfiekohn.org/teaching/gi.htm (accessed 4 November 2006).
220 References Kornbrot, D.E. (1987) Degree performance as a function of discipline studied, parental occupation and gender. Higher Education 16 (5): 513–534. Kuh, G. and Hu, S. (1999) Unraveling the complexity of the increase in college grades from the mid-1980s to the mid-1990s. Education Evaluation and Policy Analysis 21 (3): 297–320. Lambert, R. (2003) Lambert review of business–university collaboration [Final Report]. Norwich: HMSO. At www.hm‑treasury.gov.uk/media/EA556/lambert_review_final_ 450.pdf (accessed 10 October 2006). Lang, J. and Millar, D.J. (2003) Accredited work-related learning programmes for students: a guide for employers. York: Learning and Teaching Support Network. Lang, K. and Woolston, R. (2005) Workplace learning and assessment of probationary constables in the New South Wales Police. Paper presented at the First International Conference on Enhancing Teaching and Learning through Assessment, Hong Kong. Lankshear, A. (1990) Failure to fail: the teacher’s dilemma. Nursing Standard 4 (20): 35–37. Lanning, W. and Perkins, P. (1995) Grade inflation: A consideration of additional causes. Journal of Instructional Psychology 22 (2): 163–168. Larkey, P.D. (1997) Comment: adjusting grades at Duke University. Statistical Science 12 (4): 269–271. Laurillard, D. (1997) Styles and approaches in problem-solving. In F. Marton, D. Hounsell and N. Entwistle (eds) The experience of learning, second edn. Edinburgh: Scottish Academic Press, pp. 126–45. Levine, A. and Cureton, J.S. (1998) When hope and fear collide: a portrait of today’s college student. San Francisco, CA: Jossey Bass. Lewis, H.R. (2006) Excellence without a soul: how a great university forgot education. New York: Public Affairs. Linke, R., chair (1991) Performance indicators in higher education (Report of a trial evaluation study commissioned by the Commonwealth Department of Employment, Education and Training, 2 vols). Canberra: Australian Government Publishing Service. Lipsett, A. (2006) League tables aid pupil choice. The Times Higher Education Supplement, 1 September: 4. Looney, M.A. (2004) Evaluating judge performance in sport. Journal of Applied Measurement 5 (1): 31–47. Lord, F.M. (1963) Elementary models for measuring change. In C.W. Harris (ed.) Problems in measuring change. Madison, WI: University of Wisconsin Press, pp. 21–38. Lum, G. (1999) Where’s the competence in competence-based education and training? Journal of Philosophy of Education 33 (3): 403–418. Ma, J. and Zhou, D. (2000) Fuzzy set approach to the assessment of student-centred learning. IEEE Transactions on Education 43 (2): 237–241. McCaffrey, D.F., Lockwood, J.R., Koretz, D.M. and Hamilton, L.M. (2003) Evaluating value-added models for teacher accountability. Santa Monica, CA: The RAND Corporation. McCrum, N.G. (1994) The academic gender deficit at Oxford and Cambridge. Oxford Review of Education 20 (1): 3–26. McCulloch, R. (2005) From the classroom to Kajulu and beyond: authentic assessment within an industry-professional context. Paper presented at the First International Conference on Enhancing Teaching and Learning through Assessment, Hong Kong. Macfarlane, B. (1992) The ‘Thatcherite’ generation and university degree results. Journal of Further and Higher Education 16 (2): 60–70.
References 221 Macfarlane, B. (1993) The results of recession: students and university degree performances during the 1980s. Research in Education 49 (May): 1–10. McGuire, M.D. (1995) Validity issues for reputational studies. In R.D. Walleri and M.K. Moss (eds) Evaluating and responding to college guidebooks and rankings (New Directions for Institutional Research No. 88). San Francisco, CA: Jossey-Bass, pp. 45–59. Machung, A. (1995) Changes in college rankings: how real are they? Paper presented at the 35th Forum of the Association for Institutional Research, Boston (mimeo). McIlroy, J.H., Hodges, B., McNaughton, N. and Regehr, G. (2002) The effect of candidates’ perceptions of the evaluation method on reliability of checklist and global rating scores in an objective structured clinical examination. Academic Medicine 77 (7): 725–728. McInnis, C. (2001) Signs of disengagement? The changing undergraduate experience in Australian universities. Inaugural Professorial Lecture, University of Melbourne, 13 August. At http://eprints.unimelb.edu.au/archive/00000094/01/InaugLec23_8_01.pdf (accessed 15 June 2006). McInnis, C., Griffin, P., James, R. and Coates, H. (2001) Development of the Course Experience Questionnaire (CEQ). Canberra: Department of Education, Training and Youth Affairs. McKeachie, W.J. (2002) McKeachie’s teaching tips: strategies, research, and theory for college and university teachers, eleventh edn. Boston: Houghton Mifflin. McLachlan, J.C. and Whiten, S.C. (2000) Marks, scores and grades: scaling and aggregating student assessment scores. Medical Education 34 (2): 788–797. Maclellan, E. (2001) Assessment for Learning: the differing perceptions of tutors and students. Assessment and Evaluation in Higher Education 26 (4): 307–318. Maclellan, E. (2004) How convincing is alternative assessment for use in higher education? Assessment and Evaluation in Higher Education 29 (3): 311–321. McSpirit, S. and Jones, K.E. (1999) Grade inflation rates among different ability students, controlling for other factors. Education Policy Analysis Archives 7 (30). At http://epaa. asu.edu/epaa/v7n30.html (accessed 12 September 2006). McVey, P.J. (1975) The errors in marking examination scripts. International Journal of Electrical Engineering Education 12 (3): 203–216 McVey, P.J. (1976a) The paper error of two examinations in electronic engineering. Physics Education 11 (January): 58–60. McVey, P.J. (1976b) Standard error of the mark for an examination paper in electronic engineering. Proceedings of the Institution of Electrical Engineers 123 (8): 843–844. Mager, R.F. (1962) Preparing instructional objectives. Palo Alto, CA: Fearon. Manhire, B. (2005) Grade inflation in engineering education at Ohio University. Proceedings of the 2005 American Society for Engineering Education Annual Conference and Exposition. At www.ent.ohiou.edu/~manhire/grade/BM050228_Final.pdf (accessed 10 August 2006). Marcus, J. (2002) Grade bandwagon runs wild. The Times Higher Education Supplement, 19 July: 9. Marcus, J. (2003) Harvard saves face by deflating grades. The Times Higher Education Supplement, 7 February: 11. Marsh, H.W. (1983) Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics. Journal of Educational Psychology 75 (1): 150–166. Martin, S. and Cloke, C. (2000) Standards for the award of qualified teacher status: reflections on assessment implications. 
Assessment and Evaluation in Higher Education 25 (2): 183–190.
222 References Mayhew, K., Deer, C. and Dua, M. (2004) The move to mass higher education in the UK: many questions and some answers. Oxford Review of Education 30 (1): 65–82. Mentkowski, M. and Associates (2000) Learning that lasts: integrating learning, development, and performance in college and beyond. San Francisco: Jossey-Bass. Messick, S. (1989) Validity. In R.L. Linn (ed.) Educational Measurement, third edn. Washington, DC: American Council on Education/Macmillan, pp. 13–103. Miller, C.M.L. and Parlett, M. (1974) Up to the mark: a study of the examination game [Monograph 21]. London: Society for Research into Higher Education. Miller, D. (2006) Letter. The Times Higher Education Supplement, 24 November: 15. Miller, G.A. (1956) The magic number seven, plus or minus two. Psychological Review 63 (2): 81–97. Miller, G.E. (1990) The assessment of clinical skills/competence/performance. Academic Medicine 65 (9: Supplement): S63–S67. Milton, O., Pollio, H.R. and Eison, J. (1986) Making sense of college grades. San Francisco, CA: Jossey-Bass. Mitchelmore, M.C. (1981) Reporting student achievement: how many grades? British Journal of Educational Psychology 51 (2): 218–227. Morgan, C.K., Watson, G.K., Roberts, D.W., McKenzie, A.D. and Cochrane, K.W. (2004) Scholarship neglected? How levels are assigned for units of study in Australian undergraduate courses. Assessment and Evaluation in Higher Education 29 (3): 283–298. Morrison, H., Cowan, P. and Harte, S. (1997) The impact of modular aggregation on the reliability of final degrees and the transparency of European credit transfer. Assessment and Evaluation in Higher Education 22 (4): 405–417. Morrison, H.G., Magennis, S.P. and Carey, L.J. (1995) Performance indicators and league tables: a call for standards. Higher Education Quarterly 49 (2): 128–145. Murrell, J. (1993) Judgement of professional competence: bags of bias. Social Work Education (Special publication: ‘Assessment of competence in social work law’): 5–19. NCIHE (1997) Higher education in the learning society (Report of the National Committee of Inquiry into Higher Education: ‘The Dearing Report’). Norwich: HMSO. NCPPHE (2000) Measuring up: the state-by-state report card for higher education. San Jose, CA: National Center for Public Policy and Higher Education. NCPPHE (2002) Measuring up: the state-by-state report card for higher education. San Jose, CA: National Center for Public Policy and Higher Education. NCPPHE (2004) Measuring up: the national report card on higher education. San Jose, CA: National Center for Public Policy and Higher Education. NCPPHE (2006) Measuring up: the national report card on higher education. San Jose, CA: National Center for Public Policy and Higher Education. Nelson, B. (2002) Higher education at the crossroads: an overview paper. Canberra: Department of Education, Science and Training. At www.backingaustraliasfuture.gov. au/publications/crossroads/default.htm (accessed 26 October 2006). Newble, D. and Cannon, R. (2000) A handbook for teachers in universities and colleges: a guide to improving teaching method, fourth edn. London: Kogan Page. Newble, D.I. and Jaeger, K. (1983) The effect of assessments and examinations on the learning of medical students. Medical Education 17 (3): 165–171. Newble, D., Jolly, B. and Wakeford. R., eds, (1994) The certification and recertification of doctors: issues in the assessment of clinical competence. Cambridge: Cambridge University Press. Newstead, S.E. 
(2002) Examining the examiners: why are we so bad at assessing students? Psychology Learning and Teaching 2 (2): 70–75.
References 223 Newton, P.E. (1996) The reliability of marking of General Certificate of Secondary Education scripts: mathematics and English. British Educational Research Journal 22 (4): 405–420. Nicklin, P.J. and Kenworthy, N., eds, (2000) Teaching and assessing in nursing practice: an experiential approach, third edn. Edinburgh: Baillère Tindall in association with the RCN. Norcini, J.J., and Shea, J.A. (1997) The credibility and comparability of standards, Applied Measurement in Education 10 (1): 39–59. Nuttall, D. (1982) Criteria for successful combination of teacher assessed and external elements in combining teacher assessment and examining board assessment. Aldershot: Associated Examining Board. Oakeshott, M. (1962) Rationalism in politics, and other essays. London: Methuen O’Donovan, B., Price, M. and Rust, C. (2004) Know what I mean? Enhancing student understanding of assessment standards and criteria. Teaching in Higher Education 9 (3): 325–335. OIA (2005) Annual Report 2004. London: Office of the Independent Adjudicator for Higher Education. OIA (2006) Annual Report 2005. London: Office of the Independent Adjudicator for Higher Education. O’Leary, N.C. and Sloane: J. (2005a) The changing wage return to an undergraduate education. At http://cee.lse.ac.uk/conference_papers/20_05_2005/peter_sloane.pdf (accessed 25 November 2006). O’Leary, N.C. and Sloane, P.J. (2005b) The return to a university education in Great Britain. National Institute Economic Review 193 (July): 75–89. Orr, S. (2005) ‘Justify 66 to me!’ An investigation into the social practice of agreeing marks in an HE Art and Design department. Paper presented at the Thirteenth Improving Student Learning Symposium, Imperial College, London, 5–7 September. Orrell, J. (2004) Congruence and disjunctions between academics’ thinking when assessing and their beliefs abiout assessment practice. In C. Rust (ed.) Improving student learning: theory, research and scholarship [Proceedings of the 2003 International Symposium]. Oxford: Oxford Centre for Staff and Learning Development, pp. 186–200. Owens, C. (1995) How the assessment of competence in DipSW is changing the culture of practice teaching. Social Work Education 14 (3): 61–78. Parlour, J.W. (1996) A critical analysis of degree classification procedures and outcomes. Higher Education Review 28 (2): 25–39. Pascarella, E.T. and Terenzini, P.T. (1991) How college affects students. San Francisco: Jossey-Bass. Pascarella, E.T. and Terenzini, P.T. (2005) How college affects students: a third decade of research. San Francisco: Jossey-Bass. Phillips, M. (2003) Eat your heart out, Matthew Arnold. Daily Mail, 23 January. At www. melaniephillips.com/articles-new/?p=96 (accessed 9 August 2006). Pintrich P.R. (2000) The role of goal orientation in self-regulated learning. In M. Boekaerts, P. Pintrich and M. Zeidner (eds) Handbook of self-regulation. New York: Academic Press, pp. 451–502. Please, N.W. (1971) Estimation of the proportion of candidates who are wrongly graded. British Journal of Mathematical and Statistical Psychology 24 (2): 230–238. Pollio, H.R. and Beck, H.P. (2000) When the tail wags the dog: perceptions of learning and grade orientation in, and by, contemporary college students and faculty. Journal of Higher Education 71 (1): 84–102.
224 References Prather, J.E., Smith, G., and Kodras, J.E. (1979) A longitudinal study of grades in 144 undergraduate courses. Research in Higher Education 10 (1): 11–24. Price, M. and Rust, C. (1999) The experience of introducing a common assessment grid across an academic department. Quality in Higher Education 5 (2): 133–144. PricewaterhouseCoopers (2007) The economic benefits of a degree. London: Universities UK. Purcell, K. and Pitcher, J. (1996) Great expectations: the new diversity of graduate skills and aspirations. Coventry: Institute of Employment Research, University of Warwick. QAA (2002) Goldsmiths College, University of London, Quality audit report, June 2002. At www.qaa.ac.uk/reviews/reports/institutional/Goldsmith/goldsmith.asp (accessed 23 November 2006). QAA (2006a) Outcomes from institutional audit: assessment of students. Gloucester: Quality Assurance Agency for Higher Education. At www.qaa.ac.uk/reviews/institutionalAudit/outcomes/Assessmentofstudents.pdf (accessed 3 August 2006). QAA (2006b) The classification of degree awards [Background briefing note]. Gloucester: Quality Assurance Agency for Higher Education (mimeo). Raffe, D., Howieson, C. and Croxford, L. (2001) The viability of a value-added performance indicator in Scottish FE. Report to the Scottish Further Education Funding Council. Ram, P., Grol, R., Rethans, J.J., Schouten, B., van der Vleuten, C., and Kester, A. (1999) Assessment of general practitioners by video observation of communicative and medical performance in daily practice: issues of validity, reliability and feasibility. Medical Education 33 (6): 447–454. Ramsden, P. (1992) Learning to teach in higher education. London: Routledge. Ramsden, P. (2003) Student surveys and quality assurance. Proceedings of the Australian Universities Quality Forum. At www.auqa.edu.au/auqf/2003/program/papers/Ramsden.pdf (accessed 5 November 2006). Redfern, S., Norman, I., Calman, L., Watson, R. and Murrells, T. (2002) Assessing competence to practise in nursing: a review of the literature. Research Papers in Education Policy and Practice 17 (1): 51–77. Regehr, G., MacRae, H., Reznick, R.K. and Szalay, D. (1998) Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCEformat examination. Academic Medicine 73 (9), 993–997. Reich, R.B. (1991) The work of nations. London: Simon and Schuster. Reich, R.B. (2002) The future of success. London: Vintage. Reznick, R., Regehr, G., MacRae, H., Martin, J. and McCulloch, W. (1997) Testing technical skill via an innovative ‘bench station’ examination. American Journal of Surgery 173 (3): 226–230. Richardson, J.T.E. (2004) The National Student Survey: final report from the 2003 pilot project. Milton Keynes: Institute of Educational Technology, the Open University (mimeo). Richmond, W.K. (1968) Readings in education: a sequence. London: Methuen. Rickard, W. (2002) Work-based learning in health: evaluating the experience of learners, community agencies and teachers. Teaching in Higher Education 7 (1): 47–63. Riley, H.J., Checca, R.C., Singer, T.S., and Worthington, D.F. (1994) Current trends in grades and grading practices in undergraduate higher education: the results of the 1992 AACRAO Survey. Washington: American Association of Collegiate Registrars and Admissions Officers. Robertson, D. (2002) Intermediate-level qualifications in higher education: an interna-
References 225 tional assessment. At www.hefce.ac.uk/Pubs/rdreports/2002/rd10_02/ (accessed 10 October 2006). Rosovsky, H. and Hartley, M. (2002) Evaluation and the academy: are we doing the right thing? Cambridge, MA: American Academy of Arts and Sciences. Rothblatt, S. (1991) The American modular system. In R.O. Berdahl, G.C. Moodie and I.J. Spitzberg, Jr (eds) Quality and access in higher education: comparing Britain and the United States. Buckingham: SRHE and Open University Press, pp. 129–141. Rust, C. (2002) How can the research literature practically help to inform the development of departmental assessment strategies and learner-centred assessment practices? Active Learning in Higher Education 3 (2): 145–158. Rust, C., Price, M. and O’Donovan, B. (2003) Improving students’ learning by developing their understanding of assessment criteria and processes. Assessment and Evaluation in Higher Education 28 (2): 147–164. Ryle, G. (1949) The concept of mind. London: Hutchinson. Sabot, R. and Wakeman-Linn, J. (1991) Grade inflation and course choice. Journal of Economic Perspectives 5 (1): 159–170. Sadler, D.R. (1987) Specifying and promulgating achievement standards. Oxford Review of Education 13 (2): 191–209. Sadler, D.R. (2005) Interpretations of criteria-based assessment and grading in higher education. Assessment and Evaluation in Higher Education 30 (2): 176–194. Salamonson, Y. and Andrew, S. (2006) Academic performance in nursing students: influence of part-time employment, age and ethnicity. Journal of Advanced Nursing 55 (3): 342–351. Saliu, S. (2005) Constrained subjective assessment of student learning. Journal of Science Education and Technology 14 (3): 271–284. Salovey, P. and Mayer, J.D. (1990) Emotional intelligence. Imagination, Cognition, and Personality 9: 185–211. Schön, D.A. (1983). The reflective practitioner: how professionals think in action. New York: Basic Books. Schuwirth, L.W.T., Southgate, L., Page, G.G., Paget, N.S., Lescop, J.M.J., Lew, S.R., Wade, W.B. and Barón-Maldonado, M. (2002) When enough is enough: a conceptual basis for fair and defensible practice performance assessment. Medical Education 36 (10): 925–930. SED (1986) Assessment in Standard Grade courses – proposals for simplification. Edinburgh: Scottish Education Department. Senge, P. (1992) The fifth discipline:the art and practice of the learning organization. London: Century Business. Sennett, R. (2006) The culture of the new capitalism. New Haven, CT: Yale University Press. Shavelson, R.J. and Huang, L. (2003) Responding responsibly to the frenzy to assess learning in higher education. Change 35 (1): 10–19. Shay, S.B. (2003) The assessment of undergraduate final year projects: a study of academic professional judgment. Unpublished PhD Thesis, University of Cape Town. Shepard, L.A. (2000) The role of assessment in a learning culture. Educational Researcher 29 (7): 4–14. Shulock, N. and Moore, C. (2002) An accountability framework for California higher education: informing public policy and improving outcomes. Sacramento, CA: Institute for Higher Education Leadership and Policy. Shumway, J.M. and Harden, R.M. (2003) The assessment of learning outcomes for the
226 References competent and effective physician [AMEE Guide No.25]. Medical Teacher 25 (6): 569–584. Simon, H.A. (1957) Models of man. New York: Wiley. Simon, M. and Forgette-Giroux, R. (2000) Impact of a content selection framework on portfolio assessment at the classroom level. Assessment in Education: Principles, Policy and Practice 7 (1): 83–101. Simon, S.J. (1945) Why you lose at bridge. London: Nicholson and Watson. Simonite, V. (2000) The effects of aggregation method and variations in the performance of individual students on degree classifications in modular degree courses. Studies in Higher Education 25 (2): 197–209. Simonite, V. (2003) The impact of coursework on degree classifications and the performance of individual students. Assessment and Evaluation in Higher Education 28 (3): 459–470. Simonite, V. (2004) Multilevel analysis of relationship between entry qualifications and trends in degree classifications in mathematical sciences: 1994–2000. International Journal of Mathematics Education in Science and Technology 35 (3): 335–344. Simonite, V. (2005) Gender difference added? Institutional variations in the gender gap in first class awards in mathematical sciences. British Educational Research Journal 31 (6): 737–759. Singleton, R. and Smith, E.R. (1978) Does grade inflation decrease the reliability of grades? Journal of Educational Measurement 15 (1): 37–41. Sinkinson, A. and Jones, K. (2001) The validity and reliability of OFSTED judgements of the quality of secondary mathematics initial teacher education courses. Cambridge Journal of Education 31 (2): 221–237. Smith, D.L. (1992) Validity of faculty judgments of student performance: relationship between grades and credits earned and external criterion measures. The Journal of Higher Education 63 (3): 329–340. Smith, J. and Naylor, R. (2001) Determinants of degree performance in UK universities: a statistical analysis of the 1993 cohort. Oxford Bulletin of Economics and Statistics 63 (1): 29–60. Snyder, B.R. (1971) The hidden curriculum. Cambridge, MA: The MIT Press. Starch, D. and Elliott, E.C. (1912) Reliability of the grading of high school work in English. School Review 20 (7): 442–457. Starch, D. and Elliott, E.C. (1913a) Reliability of grading work in History. School Review 21 (10): 676–681. Starch, D. and Elliott, E.C. (1913b) Reliability of grading work in Mathematics. School Review 21 (4): 254–259. Stecher, B. (1998) The local benefits and burdens of large-scale portfolio assessment. Assessment in Education: Principles, Policy and Practice 5 (3): 335–351. Stephenson, J. (1992) Capability and quality in higher education. In J. Stephenson and S. Weil (eds) Quality in learning: a capability approach to higher education. London: Kogan Page, pp. 1–9. Stephenson, J. (1998) The concept of capability and its importance in higher education. In J. Stephenson and M. Yorke (eds) Capability and quality in higher education. London: Kogan Page, pp. 1–13. Sternberg, R.J. (1997) Successful intelligence: how practical and creative intelligence determine success in life. New York: Plume. Stiggins, R.J., Frisbie, D.A. and Griswold, P.A. (1989) Inside high school grading prac-
Stone, J.E. (1995) Inflated grades, inflated enrollment, and inflated budgets: an analysis and call for review at the state level. Education Policy Analysis Archives 3 (11). At http://epaa.asu.edu/epaa/v3n11.html (accessed 4 November 2006).
Stones, E. (1994) Assessment of a complex skill: improving teacher education. Assessment in Education: Principles, Policy and Practice 1 (2): 235–251.
Stowell, M. (2004) Equity, justice and standards: assessment decision making in higher education. Assessment and Evaluation in Higher Education 29 (4): 495–510.
Strenta, A.C. and Elliott, R. (1987) Differential grading standards revisited. Journal of Educational Measurement 24 (4): 281–291.
Tan, K.H.K. and Prosser, M. (2004) Qualitatively different ways of differentiating student achievement: a phenomenographic study of academics’ conceptions of grade descriptors. Assessment and Evaluation in Higher Education 29 (3): 267–281.
Thomson, D.G. (1992) Grading modular curricula. Final Report of the GCSE Modular Aggregation Research and Comparability Study. Cambridge: Midland Examining Group.
Thyne, J.M. (1974) Principles of examining. London: University of London Press.
Tognolini, J. and Andrich, D. (1995) Differential subject performance and the problem of selection. Assessment and Evaluation in Higher Education 20 (2): 161–174.
Tyler, R.W. (1949) Basic principles of curriculum and instruction. Chicago: Chicago University Press.
Usher, A. and Savino, M. (2006) A world of difference: a global survey of university league tables. Toronto: Education Policy Institute.
UUK and SCoP (2004) Measuring and recording student achievement. London: Universities UK and Standing Conference of Principals. At http://bookshop.universitiesuk.ac.uk/downloads/measuringachievement.pdf (accessed 29 October 2006).
UUK and SCoP (2005) The UK honours degree: provision of information. London: Universities UK and Standing Conference of Principals. Available via www.universitiesuk.ac.uk/consultations/universitiesuk/ (accessed 29 October 2006).
UUK and GuildHE (2006) The UK honours degree: provision of information – second consultation. London: Universities UK and GuildHE. Available via www.universitiesuk.ac.uk/consultations/universitiesuk/ (accessed 29 October 2006).
van der Vleuten, C.P.M. and Schuwirth, L.W.T. (2005) Assessing professional competence: from methods to programmes. Medical Education 39 (3): 309–317.
van der Vleuten, C.P.M., Norman, G.R. and de Graaf, E. (1991) Pitfalls in the pursuit of objectivity: issues of reliability. Medical Education 25 (2): 110–118.
Voorhees, R.A. and Harvey, L., eds (2005) Workforce development and higher education: a strategic role for institutional research [New Directions in Institutional Research No. 128]. San Francisco: Jossey-Bass.
Wagner, L. (1998) Made to measure. The Times Higher Education Supplement, 25 September [Higher Education Trends Supplement]: I–II.
Walker, I. and Zhu, Y. (2001) The returns to education: evidence from the Labour Force Surveys [Research Report 313]. London: Department for Education and Skills.
Walshe, J. (2002) Irish look to widen grading bands. The Times Higher Education Supplement, 29 March: 9.
Walvoord, B.E. (2004) Assessment clear and simple: a practical guide for institutions, departments, and general education. San Francisco, CA: Jossey-Bass.
Walvoord, B.E. and Anderson, V.J. (1998) Effective grading: a tool for learning and assessment. San Francisco: Jossey-Bass.
Warren Piper, D. (1994) Are professors professional? London: Jessica Kingsley.
Waterfield, J., West, R. and Parker, M. (2006) Supporting inclusive practice: developing an assessment toolkit. In M. Adams and S. Brown (eds) Towards inclusive learning in higher education: developing curricula for disabled students. Abingdon: Routledge, pp. 79–94.
Watson, R., Stimpson, A., Topping, A. and Porock, D. (2002) Clinical competence assessment in nursing: a systematic review of the literature. Journal of Advanced Nursing 39 (5): 421–431.
Webster, F., Pepper, D. and Jenkins, A. (2000) Assessing the undergraduate dissertation. Assessment and Evaluation in Higher Education 25 (1): 71–80.
Weko, T. (2004) New dogs and old tricks: what can the UK teach the US about university education? Report of an Atlantic Fellowship in Public Policy, presented at the British Council on 30 March. At www.hepi.ac.uk/pubdetail.asp?ID=123&DOC=Reports (accessed 8 August 2006).
Willmott, A.S. and Nuttall, D.L. (1975) The reliability of examinations at 16+. Basingstoke: Macmillan.
Winter, R. (1993) Education or grading? Arguments for a non-divided honours degree. Studies in Higher Education 18 (3): 363–377.
Winter, R. (2003) Contextualising the Patchwork Text: addressing problems of course work assessment in higher education. Innovations in Education and Teaching International 40 (2): 112–122.
Wiseman, S. (1949) The marking of English composition in grammar school selection. British Journal of Educational Psychology 19 (3): 200–209.
Wolf, A. (1995) Competence-based assessment. Buckingham: Open University Press.
Wolf, A. (2002) Does education matter? Myths about education and economic growth. London: Penguin.
Wood, D., Bruner, J.S. and Ross, G. (1976) The role of tutoring in problem-solving. Journal of Child Psychology and Psychiatry 17 (2): 89–100.
Woodfield, R., Earl-Novell, S. and Solomon, L. (2005) Gender and mode of assessment at university: should we assume female students are better suited to coursework and males to unseen examinations? Assessment and Evaluation in Higher Education 30 (1): 35–50.
Woodley, A. and Richardson, J.T.E. (2003) Another look at the role of age, gender and subject as predictors of academic attainment in higher education. Studies in Higher Education 28 (4): 475–493.
Woodward, W. (2003) ‘Mickey mouse’ courses jibe angers students. The Guardian, 14 January. At http://education.guardian.co.uk/higher/news/story/0,9830,874230,00.html (accessed 8 August 2006).
Woolf, H. and Turner, D. (1997) Honours classifications: the need for transparency. The New Academic (Autumn): 10–12.
Woolf, H. (2004) Assessment criteria: reflections on current practices. Assessment and Evaluation in Higher Education 29 (4): 479–493.
Woolf, H. (2005) Developing assessment criteria: a small scale case study. In G. Timmins, K. Vernon and C. Kinealy (eds) Teaching and learning history. London: Sage, pp. 185–192.
Worth-Butler, M.M., Murphy, R.J.L. and Fraser, D.M. (1994) Towards an integrated model of competence in midwifery. Midwifery 10 (4): 225–231.
Yorke, D.M. (1985) Administration, analysis and assumption: some aspects of validity. In N. Beail (ed.) Repertory grid technique and personal constructs: applications in clinical and educational settings. London: Croom Helm, pp. 383–399.
Yorke, M. (1996) Indicators of programme quality. London: Higher Education Quality Council.
Yorke, M. (1997) A good league table guide? Quality Assurance in Education 5 (2): 61–72.
Yorke, M. (1998a) The Times ‘league table’ of universities, 1997: a statistical appraisal. Quality Assurance in Education 6 (1): 58–60.
Yorke, M. (1998b) Performance indicators relating to student development: can they be trusted? Quality in Higher Education 4 (1): 45–61.
Yorke, M. (1998c) Assessing capability. In J. Stephenson and M. Yorke (eds) Capability and quality in higher education. London: Kogan Page, pp. 174–191.
Yorke, M. (2002a) Subject benchmarking and the assessment of student learning. Quality Assurance in Education 10 (3): 155–171.
Yorke, M. (2002b) Degree classifications in English, Welsh and Northern Irish universities: trends, 1994–95 to 1998–99. Higher Education Quarterly 56 (1): 92–108.
Yorke, M. (2003) Transition into higher education: some implications for the ‘employability agenda’. York: Higher Education Academy (in offline archive).
Yorke, M. (2004/06) Employability in higher education: what it is – what it is not. York: The Higher Education Academy. Available at www.heacademy.ac.uk/assets/York/documents/ourwork/tla/employability/id116_employability_in_higher_education_336.pdf (accessed 2 July 2007).
Yorke, M. (2005) Issues in the assessment of practice-based professional learning. At www.open.ac.uk/cetl-workspace//cetlcontent/documents/464428aa20.pdf (accessed 6 June 2007).
Yorke, M. (2006) The whole truth? National surveys of student experience and their utility. Paper presented at the EAIR Forum, Rome, August.
Yorke, M. and Harvey, L. (2005) Graduate attributes and their development. In R.A. Voorhees and L. Harvey (eds) Workforce development and higher education: a strategic role for institutional research [New Directions in Institutional Research No. 128]. San Francisco: Jossey-Bass, pp. 41–58.
Yorke, M. and Knight, P. (2004/06) Embedding employability into the curriculum. York: The Higher Education Academy. Available at www.heacademy.ac.uk/assets/York/documents/ourwork/tla/employability/id460_embedding_employability_into_the_curriculum_338.pdf (accessed 12 July 2007).
Yorke, M. and Longden, B. (2005) Significant figures: performance indicators and ‘league tables’. London: Standing Conference of Principals.
Yorke, M. and Longden, B. (2007) The first-year experience in higher education in the UK: Report on Phase 1 of a project funded by the Higher Education Academy. At www.heacademy.ac.uk/assets/York/documents/ourwork/research/FYE/web0573_the_first_year_experience.pdf (accessed 12 July 2007).
Yorke, M., Bridges, P. and Woolf, H. with others from the Student Assessment and Classification Working Group (2000) Mark distributions and marking practices in UK higher education. Active Learning in Higher Education 1 (1): 7–27.
Yorke, M., Barnett, G., Bridges, P., Evanson, P., Haines, C., Jenkins, D., Knight, P., Scurry, D., Stowell, M. and Woolf, H. (2002) Does grading method influence Honours degree classification? Assessment and Evaluation in Higher Education 27 (3): 269–279.
Yorke, M., Barnett, G., Evanson, P., Haines, C., Jenkins, D., Knight, P., Scurry, D., Stowell, M. and Woolf, H. (2004) Some effects of the award algorithm on honours degree classifications in UK higher education. Assessment and Evaluation in Higher Education 29 (4): 401–413.
Yorke, M., Barnett, G., Evanson, P., Haines, C., Jenkins, D., Knight, P., Scurry, D., Stowell, M. and Woolf, H. (2005) Mining institutional datasets to support policy-making and implementation. Journal of Higher Education Policy and Management 27 (2): 285–298.
Yorke, M., Cooper, A., Fox, W., Haines, C., McHugh, P., Turner, D. and Woolf, H. (1996) Module mark distributions in eight subject areas and some issues they raise. In N. Jackson (ed.) Modular higher education in the UK. London: Higher Education Quality Council, pp. 105–107.
Young, C. (2003) Grade inflation in higher education. ERIC ED482558. At http://eric.ed.gov/ERICDocs/data/ericdocs2/content_storage_01/0000000b/80/2a/3b/0e.pdf (accessed 7 August 2006).
Zadeh, L.A. (1973) Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics 3 (1): 28–44.
Index
A-level: examination 88, 96–7, 100, 106, 111, 150, 164, 167, 170; grades 97, 112, 160n, 162; points 93–5, 170 academic bankruptcy (forgiveness) 72 Academic Policies Committee (University of North Carolina, Asheville) 131, 209 Access Economics Pty Limited 170, 209 accuracy (in marking) see precision achievement index 199, 200 Adams, B. 21, 215 Adelman, C. 58, 69, 71, 107, 113–17, 122–3, 130, 163n, 185, 191, 199, 209 affirmative action 111, 120–1, 123 agreement between markers see reliability algorithm (classification or award) 34, 74, 76, 96, 152, 179 alignment (curricular) 21, 120, 126, 183 Alverno College Faculty 181, 205, 209 American Association of Collegiate Registrars and Admissions Officers (AACRAO) 69, 72n American College Testing (ACT) 107, 122, 185 Amis, K. 119, 209 Anderson, V.J. 10, 29, 48–9, 59, 68, 73, 108, 134, 228 Andrew, S. 99, 225 Andrich, D. 61, 227 Angoff, W.H. 39, 209 anonymity in assessment 53 Appleton, J. 25, 212 approach(es) to learning 129–30 arithmetical manipulation (of marks) 3, 33, 36, 140, 143 Armitage, C. 186, 209 Armstrong, M. 74, 145, 209 Ashby, E. 32n, 209 Ashworth, P.D. 28, 209
assessment: authentic 10, 190; challenge of multiple methods 140–3; criteria 3, 17n, 18–20, 24–5, 43, 45–6, 50–3, 56, 61, 64, 108, 112, 117, 119, 137, 143, 151, 172–5, 178–80, 182, 184, 189–90, 204–6; criterion-referenced 16–19, 35, 43–5, 49, 52, 61, 102, 119, 173, 205; formative 3, 10–12, 43, 120, 181; high-stakes 12, 20, 22, 180; holistic 44–6, 49, 116; ipsative 52, 163; ‘local’ 24–5; low-stakes 12; methods 20, 30, 60, 139–41, 183, 203, 206; normreferenced 16–17, 38, 49, 52, 61, 66 ; outcomes 3, 20, 61, 68, 102, 153, 179; practices 2–3, 6–7, 116, 204; purposes of 10–11, 42, 180, 206; regulations 3, 8, 68–80, 116, 119, 146; requirements 12–13, 120, 168; self-referenced 49; summative 2–3, 7, 10–13, 20, 25, 40–1, 54, 101, 120–1, 165, 180–1, 187–88, 190–2, 196, 201–2, 206–7 assessment centre 41, 112, 194 assessment methodology assessment of prior learning (APL) 77 Assessment and Qualifications Alliance (AQA) 23 Assiter, A. 192, 209 Astin, A.W. 157–8, 165, 209 Aston, L. 166, 209 Atkins, M. 11, 209 attributes (personal) 14–15, 41, 128, 188, 195 attribution theory 123 Australian Council for Educational Research (ACER) 186 Australian Vice Chancellors’ Committee (AVCC) 1, 78, 196, 209
bachelor’s: degree 3, 8, 19, 74, 78, 82, 95, 110, 118, 167; level 1, 13, 58, 68, 74, 78, 125, 132, 167, 185 Baird, P. 121n, 210 Balchin, T. 48, 210 Barke, M. 99, 131, 210 Baty, P. 83, 107, 149n, 210 Baume, D. 18, 48, 50–1, 56, 119, 120n, 210 Beck, H.P. 62, 223 Becker, H.S. 43n, 62, 120–1, 210 Beckett, D. 142, 216 Bee, M. 159, 210 behaviourism 15, 26 benchmarking (institutional) 98, 103, 125 Benjamin, R. 156–7, 217 Berenger, R.D. 117, 210 Berlins, M. 111n, 164–5, 210 bias 7, 23, 31, 43–4, 51, 53, 100–1, 107, 113–15, 121–2, 124, 147n, 161, 173, 197 Biggs, J.B. 21, 126, 173, 183, 190, 210 Birnbaum, R. 109, 112–3, 210 Biswas, R. 173, 210 Bloom, A. 107, 120, 210 Bloom, B.S. 29, 210 Blundell, R. 128, 210 Bogo, M. 190, 210 Bok, D. 111, 211 Bologna: Declaration 5, 192; process 4, 164 borderline(s) 34, 52, 64, 75–6, 119, 140, 143–4, 149–50, 177–8, 198 Boud, D. 7, 10, 12, 211 Bowden, R. 82n, 211 Bowen, W.G. 111, 211 Bowley, R.L. 60, 211 Bradley, S. 192, 213 Brandon, J. 122, 211 Brennan, J. 99, 131, 211 Brennan, R.L. (1989) 20, 215 Bressette, A. 42, 211 Bridges, D. 26, 211 Bridges, P. 13, 34, 102, 117, 211 Bright, N. 83, 211 Britton, J.N. 22, 211 Broadfoot, P. 66, 192, 211 Brooks, V. 22, 211 Brown, G. 10, 38, 211 Brown, J.S. 203, 211 Brown, P. 112, 194, 211 Brown, S. 10, 212 Brumfield, C. 69, 212 Bullock, K. 192, 212
Burgess Group, The 5, 74, 152n, 193, 196 Burton, N.W. 95–6, 212 Buscall, J. 35, 212 Butler, J. 15, 26–7, 187, 216 Butler, R. 63, 212 Bynner, J. 165, 212 Cameron, D. 53, 212 Cameron, H. 46, 59, 212 Campaign for Mainstream Universities (CMU) 156 Campbell, D. 25, 212 Campbell, D.T. 158, 212 Cannings, R. 22, 212 Cannon, R. 60, 222 cap on/capping of grades 70, 77, 102 capability 6, 14–15, 42, 165, 186 Carr, P.G. 106, 111, 212 Carrick Institute for Learning and Teaching in Higher Education 202 Carroll, J. 25, 212 Carroll, L. 133, 212 Cave, M. 170, 212 certification 11–12 Chambers, M.A. 14n, 212 Chapman, K. 162, 170, 212 cheat-proofness (in assessment) 20, 25 cheating 20, 25, 41, 121, 127 Chevalier, A. 167, 212, 213 Clegg, S. 192, 213 Clewes, B. 50n, 51, 216 Cloke, C. 185, 204, 221 co-curricular activity, awards for 168–9 Coll, R.K. 16, 213 College Board 125n, 213 colleges: 76, 83, 84n, 88–9, 91–3, 96–8, 107–9, 156, 203; further education 86, 88, 106, 178, 185, 192 Collegiate Learning Assessment (CLA) 162–3 Collis, K.F. 173, 190, 210 compensation (between assessment outcomes) 34n, 48, 74, 76–7, 102, 135, 137, 151, 173, 204 competence, competency 14–16, 18, 39–41, 54n, 102, 138, 141, 158, 178, 191 competence-based: assessment 176; curricula 102; programmes 184 completion (of study programme) 19, 24, 84n, 112, 128, 161, 169 complex achievements, assessment of 16, 24, 66, 139, 164, 191, 193 condonement 74, 76–7, 102
Conlon, G. 167n, 212 connoisseurship 47–8 Connor, H. 99, 213 consistency 55–6, 59–60, 62, 151, 185, 204–5; lack of 2, 51, 56, 59, 82, 111 constructivism 26, 183, 187 constructivist educational methodology 183 context, relevance of, to assessment 27–9, 33, 40, 42, 47, 53, 66, 132, 164, 184–6, 199, 201, 204 continuing professional development (CPD) 40, 202 Cooke, R. 83, 213 Cope, P. 16, 19, 28, 175, 184–5, 213 Council for Aid to Education (CAE) 163 Council for National Academic Awards (CNAA) 161, 203, 213 Course Experience Questionnaire (CEQ) 59, 99, 170–1, 186 coursework 1, 13, 16, 101–2, 104, 116, 126–7, 139, 153 Covington, M.V. 62, 213 Cowan, N. 45, 213 Cowburn, M. 39, 213 Cowdroy, R. 47–8, 213 Cox, B. 60, 213 creativity 21, 26, 46–8 credit 4–5, 36, 59, 72, 74–6, 79, 81, 113, 128, 135, 145–8, 150, 152, 168; hours 200; system 4, 128; transfer 17, 201 Cresswell, M.J. 23, 35–6, 42, 136–7, 138n, 213 criteria, unarticulated 52 critical analysis 7, 126 Cronbach, L.J. 21, 213 Cross, L.H. 49, 52, 213 Croudace, T. 163, 218 cue-seeking 62, 130 cumulation of grades, assessment outcomes 8, 20, 33, 47, 134–154, 180, 192, 201 Cureton, J.S. 106, 220 Curriculum, Evaluation and Management (CEM) Centre 164 Dale, R.R. 32, 117, 213 Dale, S.B. 111, 213 Dalziel, J. 33–4, 36, 143, 178–9, 213 Davies, M. 122, 211 Dearing, R. 192, 214 Dearing Report, The (see also NCIHE) 4–5, 169, 191–2, 206 Deci, E.L. 63, 202, 214
Department for Education and Employment 185, 204 Department for Education and Skills (DfES) 98, 106n, 164, 166 Department of Education (US) 69 Department of Education, Science and Training (DEST) 170 derived grade 2 derived scale 35 descriptors (of achievement, performance) 15, 19, 34–6, 46–7, 60, 137–8, 172 Diploma Supplement 5, 192 distribution: mark, grade, etc. 18, 49, 55, 58–9, 100, 102, 117, 132, 134, 136, 138, 141, 144, 147, 149–50, 152, 154, 160, 178, 193, 201; normal 17–8, 135 diversity: in assessment 23, 48, 75; institutional 4, 163; reporting grades 37; subject disciplines 61 Division of Evaluation, Testing and Certification (St John’s, Canada) 33, 214 Dolton, P. 159, 210 Donovan, C. 152n double marking 22–3, 50 Dracup, C. 22, 214 Dressel, P. 31 Dreyfus, H.L. 40, 54, 190–1, 214 Dreyfus, S.E. 40, 54, 190–1, 214 Driessen, E. 189, 191, 214 Duguid, P. 203, 211 Duke, J.D. (1983) 109, 214 Dweck, C.S. 62–3, 66, 214 Ebel, R.L. 7, 32–6, 38, 40–1, 68, 180, 205, 214 Echauz, J.R. 173, 214 Edgeworth, F.Y. 54, 214 Education Policy Committee (University of North Carolina–Chapel Hill) 105–7; 119–20, 124–5, 214 effective teacher theory 123, 127 efficiency (of assessment) 20, 25–6 Egerton, M. 165, 212 Eisner, E.W. 47, 183, 190, 193, 214 Ekstrom, R.B. 49, 51, 202, 204, 214 Elias, K.S. 122, 214 Elliot, A.J. 62, 66, 214 Elliott, E.C. 54, 226 Elliott, R. 116, 215, 227 Elton, L. 38, 47, 83, 102, 194, 207, 215 emotional intelligence 61, 195 employability 6, 13–4, 41, 61, 110, 164, 184, 195
Enhancing student Employability Coordination Team (ESECT) 14, 164 enterprise 13–4, 188 equity 25, 60 Eraut, M. 14n, 16, 40, 190, 215 error variance 25, 79 ethnicity/ethnic group 53, 120 Europass 5 European Credit Transfer and Accumulation System (ECTS) 17–8, 35, 138, 196, 201 Ewell, P.T. 155, 200, 215 examination(s) 1, 13, 16, 22, 32, 36, 43, 46, 49, 55–6, 67–8, 73, 77, 101–2, 104, 116, 122–3, 126–7, 134, 139, 142–3, 153; ‘open book’ 13; oral (viva voce) 77; public 22–5, 35–6, 42, 106, 164 exit velocity 75, 135, 144, 146 external examiner, examining 48, 50, 83, 103, 139–40, 170, 177 Faculty Committee on Grading (Princeton) 90, 215 fairness 13, 20, 24–5, 47, 56 Feldt, L.S. 20, 215 Felton, J. 31, 199–200, 215 final language 12 Fishman, J.A. 54, 215 Forgette-Giroux, R. 138, 226 forgiveness 72 formative assessment see assessment, formative Foundation Programme (in Medicine) 19 Fransella, F. 21, 215 Friedlich, M. 24, 215 Frisbie, D.A. 32–4, 68, 214 Fullan, M. 196, 215 functional analysis 15–16 Furness, S. 40, 121, 215 fuzziness 3, 8–9, 19, 34, 66–7, 134, 137, 147–9, 172–81, 183–4, 193, 198, 201, 207 fuzzy set(s) 172–5, 178 Gater, D.S. 82n, 112, 215 Geisinger, K.F. 2, 202, 215 General Certificate of Secondary Education (GCSE) 22, 25, 106, 138, 164 generalizability (of assessment) 20, 25, 187 gender 83, 100–1, 164, 170 Gibbons, M. 27, 215
Gibbs, G. 22, 41, 173, 206, 216 Gibson, F. 131, 216 Gilligan, P. 40, 121, 215 goals 189–90; learning 62–3, 66, 201–2, performance 62–3, 66, 201 Goldenberg, D. 120, 122, 216 Goldman, L. 102, 126, 216 Goldman, R.D. 116, 216 Goleman, D. 195, 216 Goodison, Sir N. 192, 216 Goodliffe, T. 192, 216 Gosselin, C.L. 131, 216 Gough, D.A. 192, 216 de Graaf, E. 47–8, 213 grade increase (non-inflationary) 126–30 grade inflation 3, 8, 73, 83, 90, 105–33, 192, 199; adverse effects of 110–2; avoidance of low grades 120–2; definitions of 108–110; Kuschelnoten 122–3; origins of 114–5; possible causes of 115–25; validity of perceptions of 112–14 grade leniency theory 123 grading on the curve 17, 49 Graduate Skills Assessment (GSA) 186 graduateness 188, 190 Greenwood, M. 12, 216 Group of 8 99 Guba, E.G. 187, 216 GuildHE 5, 196, 227 Hager, P. 14–15, 16n, 26–7, 142, 187, 216 Hand, L. 50n, 51, 216 Harackiewicz, J.M. 63n, 66, 216 Harden, R.M. 141, 226 Hartley, M. 7, 107, 111, 114, 126, 199, 202, 225 Hartog, Sir P. 54, 150, 217 Harvey, L. 14n, 41, 227, 229 Haug, G. 164, 200–1, 217 Hawe, E. 119–20, 121n, 217 Hays, R.B. 16, 217 Hayward, G. 18, 218 Hedges, C. 111, 132, 217 Heritage, G.L. 74, 217 Hersh, R.H. 156–7, 217 Hesketh, A.J. 194, 217 Heywood, J. 10, 36, 42, 217 high school record (HSR) 95 Higher Education Academy 14n, 74n, 99n, 203 Higher Education Funding Council for England (HEFCE) 93, 96, 159n, 166n, 174, 217
Higher Education Funding Council for Wales (HEFCW) 174, 217 Higher Education Quality Council (HEQC) 33n, 58, 83, 96, 108, 139–40, 162, 164, 175n, 188, 217 Higher Education Statistics Agency (HESA) 58, 82–6, 88, 93, 98, 100, 117–8 Hodges, B. 24, 53–4, 217, 218 holistic: judgement 24, 50, 116; marking 44–6, 49 honors 9, 42, 69, 73, 146 Hornby, W. 18, 45–7, 49, 218 Hoskins, S.L. 100, 218 Hounsell, D. 10–11, 121, 218 Howarth, I. 163, 218 Hoyles, C. 110, 219 Hu, S. 58, 106, 114, 117, 220 Huang, L. 107n, 225 Hudson, L. 30, 111n, 218 human capital 6, 13, 41, 66 Humphreys, L.G. 55, 218 Huot, B. 23, 218 Hussey, T. 205, 218 Hyland, T. 15, 204, 218 Ilott, I. 121n, 122, 218 inconsistency between markers see reliability Institute for Higher Education Policy (IHEP) 165–6, 218 institutions: highly selective, elite 96, 113, 115, 126; less selective 115, 204; non-selective, ‘open door’ 113; specialist 92, 96n; see also colleges, universities intelligibility (of assessment) 20, 26 internship 121, 142; see also placement Jackson, N. 192, 218 Jaeger, K. 62, 222 James, R. 2, 107, 202–4, 218 James, S. 18, 218 Jamieson, I. 192, 212 Jessup, G. 15, 219 Johnes, J. 83, 219 Johnson, B. 74, 102n, 116, 146, 219 Johnson, V.E. 58, 73, 107, 116, 119, 120n, 123, 125, 129, 132, 199, 219 Johnston, B. 47, 219 Jones, A. 16, 219 Jones, D.P. 200, 215
Jones, K. 203, 226 Jones, K.E. 117, 122, 221 Jones, M. 16, 121, 219 judgemental model 27–8, 185, 187 Juola, A. 106, 219 Kahn, P.E. 110, 219 Kane, M. 32, 219 Karran, T. 1, 35, 196, 219 Kemshall, H. 14, 40, 219 Klein, S.P. 162, 219 Klenowski, V. 189, 219 Kneale, P. 130, 219 Knight, P.T. 2, 10, 14n, 24–5, 29, 41, 43n, 139, 164–5, 181, 184, 186, 195, 212, 219 Kohn, A. 109, 220 Koper, P.T. 31, 199–200, 215 Kornbrot, D.E. 101n, 220 Kruger, A. 111, 213 Kuh, G. 58, 106, 114, 117, 220 labour market 123, 127, 156, 167, 171 Lambert, R. 165, 220 Lang, J. 168, 220 Lang, K. 16, 121, 220 Lankshear, A. 121n, 122n, 220 Lanning, W. 102, 126, 220 Larkey, P.D. 200, 220 Laurillard, D. 62, 220 league tables (rankings) 82, 103, 112, 124, 169, 170 Learning and Teaching Performance Fund (LTPF) 170 Learning and Teaching Support Network (LTSN) 203 learning outcomes 3, 6, 17–18, 21, 43, 47, 66, 101–2, 104, 116, 120, 136, 152–3, 175–6, 183, 203, 205 Levine, A. 106, 220 Lewis, H.R. 59, 90, 107, 115, 145–6, 152, 182, 220 licence to practise 11, 40 lifelong learning 40, 202 Lincoln, Y.S. 187, 216 Linke, R. 171, 220 Lipsett, A. 124, 220 Longden, B. 82n, 99n, 124, 229 Looney, M.A. 23n, 220 Lord, F.M. 64, 158, 220 Lum, G. 14, 220 Ma, J. 173, 220
McCaffrey, D.F. 171, 220 McCrum, N.G. 100, 220 McCulloch, R. 142, 220 Macfarlane, B. 128, 221 McGuire, M.D. 82n, 221 Machung, A. 82n, 221 McIlroy, J.H. 53, 221 McInnis, C. 65, 99, 130, 221 McKeachie, W.J. 108, 221 McLachlan, J.C. 141–3, 221 Maclellan, E. 62, 189–90, 221 McSpirit, S. 117, 122, 221 McVey, P.J. 55–6, 221 Mager, R.F. 15, 183, 221 Manhire, B. 114, 125, 221 mapping rules 76, 140, 178–80 Marcus, J. 107, 132, 221 mark, raw 2, 8, 60, 135, 138, 141, 146, 198, 200 mark(ing) scheme, template 23, 29, 50–1, 55, 173 marking: additive 44–6; holistic 44–6, 49; menu 44, 49–50, 178, 180–1; negative 46 Marsh, H.W. 124, 221 Martin, S. 185, 204, 221 mass(ified) higher education system 4, 127 Mayer, J.D. 195, 225 Mayhew, K., Deer, C. and Dua, M. 167, 222 measurement error 33, 136 Measuring and Recording Student Achievement Scoping Group 3–4, 134 Meehl, P.E. 21, 213 Mentkowski, M. 64, 222 mentor 121, 187, 189 Messick, S. 20, 222 metacognition 14, 64, 189–90 Millar, D.J. 168, 220 Miller, C.M.L. 62, 130, 222 Miller, D. 33, 222 Miller, G.A. 45, 222 Miller, G.E. 141, 222 Milton, O. 6, 31, 108–9, 222 minority groups 123 Mitchelmore, M.C. 38, 222 moderation (of assessments) 50, 52 modular scheme 42, 117, 139 modularized curricula, programmes of study 101, 103, 145 Moore, C. 157, 169, 225 Morgan, C.K. 46, 222 Morrison, H. 82n, 138–9, 222 motivation 24, 42, 63–4, 66, 111, 123,
126–7, 190, 194 Murphy, R. 121n, 122, 218 Murrell, J. 40, 222 multiple choice (items, tests, etc.) 12, 22, 39n, 141, 162, 184 National Advisory Body for Public Sector Higher Education (NAB) 156 National Center for Public Policy and Higher Education (NCPPHE) 155, 162, 222 National Committee of Inquiry into Higher Education (NCIHE) 4–5, 189, 191, 206, 222 National Governors Association (NGA) 157 National Student Survey 148, 165, 171 National Vocational Qualification(s) (NVQ) 15, 18 Naylor, R. 100, 226 Nelson, B. 107, 222 Newble, D. 39, 60, 62, 222 Newstead, S.E. 62, 100, 121, 218, 223 Newton, P.E. 22–3, 223 Norcini, J.J. 39, 223 Northern Universities Consortium for Credit Accumulation and Transfer (NUCCAT) 74, 145 Nuttall, D. 35, 38, 137, 223, 228 Oakeshott, M. 6, 223 objective structured clinical examination (OSCE) 53, 141 objectives: behavioural 15, 184; expressive 183–4; instructional 6, 183–4; problemsolving 183–4 O’Donovan, B. 205, 223 Office for Standards in Education (OFSTED) 203 Office of the Independent Adjudicator for Higher Education 197 O’Leary, N.C. 167, 223 Organisation for Economic Co-operation and Development (OECD) 186 Orr, S. 52, 223 Orrell, J. 50, 223 outcomes (learning) 3, 6, 17–8, 21, 43, 47, 66, 101–2, 104, 116, 120, 136, 152–3, 175–6, 183, 203, 205 overall index of achievement 1, 4, 6, 8, 26, 41, 47, 104, 136, 138, 141, 149, 152, 157, 185, 192–3, 195–8, 207 Owens, C. 16, 223
Index 237 Parlett, M. 62, 130, 222 Parlour, J.W. 139–40, 223 part-time employment 99 Pascarella, E.T. 21, 24, 30, 157–8, 165, 194, 223 pass/fail grading 5, 35, 38–9, 41, 48, 61, 70–1, 113, 116, 137, 196, 198 Patchwork Text 12 penalty/ies 77, 130, 152, 176 performance indicator(s) 83, 100, 169–70, 186, 200; see also value added Perkins, P. 102, 126, 220 personal development: planning (PDP) 187, 192; portfolio 4–5 personal mitigating circumstances 74, 77, 102, 140 Phillips, M. 119, 123 Pintrich P.R. 62, 63n, 66, 201, 223 Pitcher, J. 120, 145, 224 placement: 16, 121, 142, 168–9, 191; cooperative 160, sandwich 61, 160; see also internship plagiarism 20, 25, 102, 127, 130 Please, N.W. 35, 150, 223 plus/minus grading 208
Pollio, H.R. 62, 223
polytechnics 83n, 84n, 156, 203 Polytechnics and Colleges Funding Council (PCFC) 161 practical intelligence 30, 165 Prather, J.E. 57, 119, 124 precision (of marking, grading) 18–19, 31–2, 34, 46, 55, 70, 137, 141, 146, 172–3, 178, 180, 185, 198, 200–1, 206 prediction 20, 39, 184–5 predictive value of grades 30, 95, 156 Price, M. 205, 224 PricewaterhouseCoopers 167, 224 professional (and/or statutory) body 11, 28, 74, 192 professional development (of academics) 6, 203 Programme for International Student Assessment (PISA) 186 Progress File 5, 187, 191–2 profile (of achievement) 3, 43, 58, 76, 82, 100, 103, 116, 119, 126, 129, 135, 140–4, 152, 178, 185, 188, 193–4, 198 Prosser, M. 172, 227 psychometric testing, psychometrics 21, 24, 34, 40–1 Purcell, K. 120, 145, 224 quality assurance 10–11, 44, 58–60, 62
Quality Assurance Agency for Higher Education (QAA) 2, 28, 40, 58–9, 75, 191n, 204, 224 quizzes 12, 73 Raffe, D. 158, 224 Ram, P. 142n, 224 Ramist, L. 95–6, 212 Ramsden, P. 62, 171, 224 rankings see league tables Rasch analysis 23n reactivity (in assessment) 20, 187 recognition and reward 124, 127 Redfern, S. 40, 224 Regehr, G. 24, 224 Reich, R.B. 183, 224 reliability in marking 3, 12, 20, 22–6, 32–3, 36, 38, 48, 54–6, 85n, 94n, 114, 131, 136, 150, 157, 165, 185, 187; see also consistency repeat(ing) of module 12, 43, 69, 72, 102 repeated assessments 102 Research Assessment Exercise (RAE) 103, 127, 169 resit(ting) of assessments 74, 77, 102 retaking of module 72, 74, 77 retention 24, 112, 169 retrieval (of failure) 74, 77 Reznick, R. 53, 137, 224 Rhodes, E.C. 54, 150, 217 Richardson, J.T.E. 100–1, 149n, 171, 224, 228 Richmond, W.K. 60n, 224 Rickard, W. 142, 224 Riley, H.J. 69, 224 Robertson, D. 167, 225 robustness: of trends 83, 88; technical 1–5, 12, 36, 54, 74, 79, 82, 114, 132, 157, 162, 165, 170–1, 182, 197, 200 Rojstaczer, S. 106–7, 115, 123, 125 Rosovsky, H. 7, 107, 111, 114, 126, 199, 202, 225 Rothblatt, S. 129, 225 rounding 70, 75–7, 118–19, 143, 174 Russo, M.J. 25, 212 Rust, C. 62, 205, 224, 225 Ryle, G. 6, 225 Sabot, R. 102, 126, 225 Sadler, D.R. 56, 172, 175–6, 225 Salamonson, Y. 99, 225 Saliu, S. 173–4, 225 Salovey, P. 195, 225
sampling 21, 32, 43–4, 143, 162 satisficing 65, 67, 130, 180–1 Savino, M. 124, 227 scaffolding 29, 41 Scholastic Aptitude Test (SAT) 95–6, 107, 160n, 162–3, 185 Schön, D.A. 6n, 225 Schuwirth, L.W.T. 16, 225, 227 scientific measurement 26, 40–1, 183 scientific model 27–8, 187 score, raw see mark, raw Scottish Education Department (SED) 138 Scurry, D. 64 Scurry hypothesis 64–6 self-assessment 11 semesterized curricula 101 Senge, P. 197, 225 Sennett, R. 66, 225 Senter, H. 131, 216 Shavelson, R.J. 107n, 225 Shaw, E. 192, 209 Shay, S.B. 45, 225 Shea, J.A. 39, 223 Shepard, L.A. 26, 183, 225 Shulock, N. 157, 169, 225 Shumway, J.M. 141, 226 signal(ling) of student achievement 3, 16, 29, 33, 38, 42–3, 79, 139, 150, 158, 172, 195, 198–9, 207 Simon, H.A. 67, 130, 180, 226 Simon, M. 138, 226 Simon, S.J. 67, 226 Simonite, V. 13, 74, 97, 101–2, 136, 138, 151, 226 Simpson, C. 22, 41, 216 single index of achievement see overall index of achievement Singleton, R. 113n, 131, 226 Sinkinson, A. 203, 226 skilful practices 14, 41, 165, 187 skills 14–6, 26, 120, 128, 168, 170–1, 185–9, 195 Sloane, J. 167, 223 Smith, D.L. 54, 226 Smith, E.R. 113n, 131, 226 Smith, J. 100, 226 Smith, P. 205, 218 social context (of assessment) 33, 53 SOLO Taxonomy 173, 190–1 specialist institutions see institutions standard(s): 2, 10, 19, 25, 27, 32, 39–40, 47, 59, 61, 64, 77, 82–3, 103, 106, 116, 119–20, 129, 134, 162–3, 172, 185, 189, 203–5; comparability of 22, 53–4,
56, 60; entry 93, 160; threshold, ‘good enough’ 18, 39–40, 47–8, 102, 138, 147, 151, 180 standard error 57, 137, 147 standardized marking 36 standardized test(ing) 24, 41, 53 standardizing of marks 24, 60–2, 145 Standing Conference of Principals (SCoP) 3–5, 134, 193, 196 Stanley, J.C. 158, 212 Starch, D. 54, 226 Stecher, B. 138, 226 Stephenson, J. 14, 226 Stiggins, R.J. 32, 227 Stone, J.E. 107, 109, 227 Stones, E. 40, 227 Stowell, M. 25, 74, 227 Strenta, A.C. 116, 215, 227 stress (of failing students) 120 Student Assessment and Classification Working Group (SACWG) 55, 64, 73–4, 102, 146 Student Evaluation of Educational Quality (SEEQ) 124 student evaluation (of teachers) 73 subject benchmarks 28, 39–40, 60, 203 subject mix 83 summative assessment see assessment, summative Tan, K.H.K. 172, 227 Tauch, C. 164, 217 Taylor, J. 83, 219 Teaching Quality Information (TQi) 161 technical: aspects of assessment 10, 20–6, 53, 180; quality/ies of assessment 3, 10, 12, 30, 33, 41, 82, 134, 155, 165, 171, 180, 186 technical knowledge 6 Terenzini, P.T. 21, 24, 30, 157–8, 165, 194, 223 Tesoriero, F. 46, 59, 212 Thomson, D.G. 138, 227 Thyne, J.M. 42, 134, 227 Tognolini, J. 61, 227 transcript 4–6, 17, 28, 79, 113–4, 135, 181, 191–6, 199, 200 transfer/ability (of achievements) 25–6, 183, 186–7 Turner, D. 74, 102, 228 Tyler, R.W. 183, 227 Unified Marks System 138
universities: non-Russell Group 92, 94; post-1992 56–7, 84, 88–9, 91–3, 96–8, 108, 117, 135, 137, 167n; pre-1992 84, 88–9, 91–7, 108; Russell Group 89, 92, 94–5, 99, 167n; ‘top’ 92, 98, 111 Universities and Colleges Admissions Service (UCAS) 93n, 160n, 170n Universities UK (UUK) 3–5, 135, 193, 196, 227 USEM approach to employability 14–5, 186 Usher, A. 124, 227 utility (of assessment) 20, 26, 28 Vachtsevanos, G.J. 173, 214 validity 3–4, 12, 18, 21–4, 33, 45, 54, 86, 114, 139, 157, 165, 187; concurrent 21, 187; construct 21, 171, 187; content 21, 187; face 21, 47; predictive 21, 187 value(s) 14–15, 21, 27, 32, 48, 51, 126, 132–3, 154, 205; positional 169 value added 3–4, 8, 155–71; modelling (VAM) 171 value judgement 32, 42 variability (in grading or assessment) 1–2, 19, 24, 30, 54, 58–9, 68, 74, 137, 146, 151, 201 Villegas, A.M. 49, 51, 202, 204, 214 van der Vleuten, C.P.M. 16, 22, 227 vocational education, programmes 13–14, 16, 126, 185, 195n Voorhees, R.A. 14n, 41, 227 Waddell, J. 120, 122, 216 Wagner, L. 161, 227 Wakeman-Linn, J. 102, 126, 225 Walker, I. 167, 227 Walshe, J. 107, 227 Walvoord, B.E. 10, 29, 48–9, 59, 68, 73, 108, 134, 228
Ward, R. 192, 218 warrant(ing) of achievement 6, 25, 41, 111 Warren Piper, D. 139–40, 176–8, 228 Waterfield, J. 25, 228 Watson, R. 14, 121, 228 Webster, F. 18, 41n, 51–2, 119, 172, 228 Weko, T. 128, 228 weight(ing) 12, 23, 29, 48, 53, 61, 75, 77, 79, 102, 112, 116–7, 143–4, 152–3, 174, 176, 179, 200 White Paper: The Future of Higher Education 3 Whiten, S.C. 141–3, 221 Widawski, M.H. 116, 216 Willmott, A.S. 35–6, 228 Winter, R. 12, 38, 208, 228 Wiseman, S. 23, 45–6, 228 withdrawn/no credit 71, 113 Wolf, A. 43, 52, 101n, 119, 137, 186, 204, 228 Wood, D. 29, 41, 228 Woodfield, R. 101, 228 Woodley, A. 100–1, 228 Woodward, W. 119, 228 Woolf, H. 74, 102, 172, 204, 228 Woolston, R. 16, 121, 220 work-based learning 142, 168 workforce development 14n, 41, 164, 169 Worth-Butler, M.M. 15, 229 Yorke, M. 6, 10–11, 13–14, 21n, 27, 29, 34, 39, 41, 43–4, 48–9, 56–7, 60, 74, 82n, 83–4, 99n, 100n, 102, 117, 121, 124, 139, 142, 146, 152, 164–5, 170–1, 184, 186, 188, 195, 210, 219, 229 Young, C. 109, 230 Zadeh, L.A. 172–3, 230 Zhou, D. 173, 220 Zhu, Y. 167, 227