CONTENTS

CHRISTOPHER A. PISSARIDES: The Unemployment Volatility Puzzle: Is Wage Stickiness the Answer? . . . . . 1339
NABIL I. AL-NAJJAR: Decision Makers as Statisticians: Diversity, Ambiguity, and Learning . . . . . 1371
PER A. MYKLAND AND LAN ZHANG: Inference for Continuous Semimartingales Observed at High Frequency . . . . . 1403
ALEXEI ONATSKI: Testing Hypotheses About the Number of Factors in Large Factor Models . . . . . 1447
GUIDO W. IMBENS AND WHITNEY K. NEWEY: Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity . . . . . 1481
DARRELL DUFFIE, SEMYON MALAMUD, AND GUSTAVO MANSO: Information Percolation With Equilibrium Search Dynamics . . . . . 1513
KYOUNGWON SEO: Ambiguity and Second-Order Belief . . . . . 1575
JAMES ANDREONI AND B. DOUGLAS BERNHEIM: Social Image and the 50–50 Norm: A Theoretical and Experimental Analysis of Audience Effects . . . . . 1607
URI GNEEZY, KENNETH L. LEONARD, AND JOHN A. LIST: Gender Differences in Competition: Evidence From a Matrilineal and a Patriarchal Society . . . . . 1637

NOTES AND COMMENTS:
ANDRIY NORETS: Inference in Dynamic Discrete Choice Models With Serially Correlated Unobserved State Variables . . . . . 1665
KEISUKE HIRANO AND JACK R. PORTER: Asymptotics for Statistical Treatment Rules . . . . . 1683
ANDERS RYGH SWENSEN: Corrigendum to “Bootstrap Algorithms for Testing and Determining the Cointegration Rank in VAR Models” . . . . . 1703

ANNOUNCEMENTS . . . . . 1705
FORTHCOMING PAPERS . . . . . 1709
VOL. 77, NO. 5 — September, 2009
An International Society for the Advancement of Economic Theory in its Relation to Statistics and Mathematics
Founded December 29, 1930
Website: www.econometricsociety.org

EDITOR
STEPHEN MORRIS, Dept. of Economics, Princeton University, Fisher Hall, Prospect Avenue, Princeton, NJ 08544-1021, U.S.A.; [email protected]

MANAGING EDITOR
GERI MATTSON, 2002 Holly Neck Road, Baltimore, MD 21221, U.S.A.; [email protected]

CO-EDITORS
DARON ACEMOGLU, Dept. of Economics, MIT, E52-380B, 50 Memorial Drive, Cambridge, MA 02142-1347, U.S.A.; [email protected]
WOLFGANG PESENDORFER, Dept. of Economics, Princeton University, Fisher Hall, Prospect Avenue, Princeton, NJ 08544-1021, U.S.A.; [email protected]
JEAN-MARC ROBIN, Maison des Sciences Economiques, Université Paris 1 Panthéon–Sorbonne, 106/112 bd de l’Hôpital, 75647 Paris Cedex 13, France and University College London, U.K.; [email protected]
LARRY SAMUELSON, Dept. of Economics, Yale University, 20 Hillhouse Avenue, New Haven, CT 06520-8281, U.S.A.; [email protected]
JAMES H. STOCK, Dept. of Economics, Harvard University, Littauer M-24, 1830 Cambridge Street, Cambridge, MA 02138, U.S.A.; [email protected]
HARALD UHLIG, Dept. of Economics, University of Chicago, 1126 East 59th Street, Chicago, IL 60637, U.S.A.; [email protected]

ASSOCIATE EDITORS
YACINE AÏT-SAHALIA, Princeton University; JOSEPH G. ALTONJI, Yale University; JAMES ANDREONI, University of California, San Diego; JUSHAN BAI, Columbia University; MARCO BATTAGLINI, Princeton University; PIERPAOLO BATTIGALLI, Università Bocconi; DIRK BERGEMANN, Yale University; XIAOHONG CHEN, Yale University; VICTOR CHERNOZHUKOV, Massachusetts Institute of Technology; J. DARRELL DUFFIE, Stanford University; JEFFREY ELY, Northwestern University; HALUK ERGIN, Washington University in St. Louis; MIKHAIL GOLOSOV, Yale University; FARUK GUL, Princeton University; JINYONG HAHN, University of California, Los Angeles; PHILIP A. HAILE, Yale University; MICHAEL JANSSON, University of California, Berkeley; PHILIPPE JEHIEL, Paris School of Economics and University College London; PER KRUSELL, Princeton University and Stockholm University; FELIX KUBLER, University of Zurich; OLIVER LINTON, London School of Economics; BART LIPMAN, Boston University; THIERRY MAGNAC, Toulouse School of Economics (GREMAQ and IDEI); GEORGE J. MAILATH, University of Pennsylvania; DAVID MARTIMORT, IDEI-GREMAQ, Université des Sciences Sociales de Toulouse; STEVEN A. MATTHEWS, University of Pennsylvania; ROSA L. MATZKIN, University of California, Los Angeles; LEE OHANIAN, University of California, Los Angeles; WOJCIECH OLSZEWSKI, Northwestern University; NICOLA PERSICO, New York University; BENJAMIN POLAK, Yale University; PHILIP J. RENY, University of Chicago; SUSANNE M. SCHENNACH, University of Chicago; UZI SEGAL, Boston College; NEIL SHEPHARD, University of Oxford; MARCIANO SINISCALCHI, Northwestern University; JEROEN M. SWINKELS, Northwestern University; ELIE TAMER, Northwestern University; EDWARD J. VYTLACIL, Yale University; IVÁN WERNING, Massachusetts Institute of Technology; ASHER WOLINSKY, Northwestern University

EDITORIAL ASSISTANT: MARY BETH BELLANDO, Dept. of Economics, Princeton University, Fisher Hall, Princeton, NJ 08544-1021, U.S.A.; [email protected]

Information on MANUSCRIPT SUBMISSION is provided in the last two pages. Information on MEMBERSHIP, SUBSCRIPTIONS, AND CLAIMS is provided in the inside back cover.
SUBMISSION OF MANUSCRIPTS TO ECONOMETRICA

1. Members of the Econometric Society may submit papers to Econometrica electronically in pdf format according to the guidelines at the Society’s website: http://www.econometricsociety.org/submissions.asp. Only electronic submissions will be accepted. In exceptional cases for those who are unable to submit electronic files in pdf format, one copy of a paper prepared according to the guidelines at the website above can be submitted, with a cover letter, by mail addressed to Professor Stephen Morris, Dept. of Economics, Princeton University, Fisher Hall, Prospect Avenue, Princeton, NJ 08544-1021, U.S.A.

2. There is no charge for submission to Econometrica, but only members of the Econometric Society may submit papers for consideration. In the case of coauthored manuscripts, at least one author must be a member of the Econometric Society. Nonmembers wishing to submit a paper may join the Society immediately via Blackwell Publishing’s website. Note that Econometrica rejects a substantial number of submissions without consulting outside referees.

3. It is a condition of publication in Econometrica that copyright of any published article be transferred to the Econometric Society. Submission of a paper will be taken to imply that the author agrees that copyright of the material will be transferred to the Econometric Society if and when the article is accepted for publication, and that the contents of the paper represent original and unpublished work that has not been submitted for publication elsewhere. If the author has submitted related work elsewhere, or if he does so during the term in which Econometrica is considering the manuscript, then it is the author’s responsibility to provide Econometrica with details. There is no page fee; nor is any payment made to the authors.

4. Econometrica has the policy that all empirical and experimental results as well as simulation experiments must be replicable. For this purpose the Journal editors require that all authors submit datasets, programs, and information on experiments that are needed for replication and some limited sensitivity analysis. (Authors of experimental papers can consult the posted list of what is required.) This material for replication will be made available through the Econometrica supplementary material website. The format is described in the posted information for authors. Submitting this material indicates that you license users to download, copy, and modify it; when doing so such users must acknowledge all authors as the original creators and Econometrica as the original publisher. If you have a compelling reason, we may post restrictions regarding such usage. At the same time the Journal understands that there may be some practical difficulties, such as in the case of proprietary datasets with limited access as well as public use datasets that require consent forms to be signed before use. In these cases the editors require that detailed data description and the programs used to generate the estimation datasets are deposited, as well as information on the source of the data so that researchers who do obtain access may be able to replicate the results. This exemption is offered on the understanding that the authors made reasonable effort to obtain permission to make available the final data used in estimation, but were not granted permission. We also understand that in some particularly complicated cases the estimation programs may have value in themselves and the authors may not make them public. This, together with any other difficulties relating to depositing data or restricting usage, should be stated clearly when the paper is first submitted for review. In each case it will be at the editors’ discretion whether the paper can be reviewed.

5. Papers may be rejected, returned for specified revision, or accepted. Approximately 10% of submitted papers are eventually accepted. Currently, a paper will appear approximately six months from the date of acceptance. In 2002, 90% of new submissions were reviewed in six months or less.

6. Submitted manuscripts should be formatted for paper of standard size with margins of at least 1.25 inches on all sides, 1.5 or double spaced with text in 12 point font (i.e., under about 2,000 characters, 380 words, or 30 lines per page). Material should be organized to maximize readability; for instance footnotes, figures, etc., should not be placed at the end of the manuscript. We strongly encourage authors to submit manuscripts that are under 45 pages (17,000 words) including everything (except appendices containing extensive and detailed data and experimental instructions). While we understand some papers must be longer, if the main body of a manuscript (excluding appendices) is more than the aforementioned length, it will typically be rejected without review.

7. Additional information that may be of use to authors is contained in the “Manual for Econometrica Authors, Revised” written by Drew Fudenberg and Dorothy Hodges, and published in the July, 1997 issue of Econometrica. It explains editorial policy regarding style and standards of craftsmanship. One change from the procedures discussed in this document is that authors are not immediately told which coeditor is handling a manuscript. The manual also describes how footnotes, diagrams, tables, etc. need to be formatted once papers are accepted. It is not necessary to follow the formatting guidelines when first submitting a paper. Initial submissions need only be 1.5 or double-spaced and clearly organized.

8. Papers should be accompanied by an abstract of no more than 150 words that is full enough to convey the main results of the paper. On the same sheet as the abstract should appear the title of the paper, the name(s) and full address(es) of the author(s), and a list of keywords.

9. If you plan to submit a comment on an article which has appeared in Econometrica, we recommend corresponding with the author, but require this only if the comment indicates an error in the original paper. When you submit your comment, please include any correspondence with the author. Regarding comments pointing out errors, if an author does not respond to you after a reasonable amount of time, then indicate this when submitting. Authors will be invited to submit for consideration a reply to any accepted comment.

10. Manuscripts on experimental economics should adhere to the “Guidelines for Manuscripts on Experimental Economics” written by Thomas Palfrey and Robert Porter, and published in the July, 1991 issue of Econometrica.

Typeset at VTEX, Akademijos Str. 4, 08412 Vilnius, Lithuania. Printed at The Sheridan Press, 450 Fame Avenue, Hanover, PA 17331, USA.

Copyright © 2009 by The Econometric Society (ISSN 0012-9682). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation, including the name of the author. Copyrights for components of this work owned by others than the Econometric Society must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Posting of an article on the author’s own website is allowed subject to the inclusion of a copyright statement; the text of this statement can be downloaded from the copyright page on the website www.econometricsociety.org/permis.asp. Any other permission requests or questions should be addressed to Claire Sashi, General Manager, The Econometric Society, Dept. of Economics, New York University, 19 West 4th Street, New York, NY 10012, USA. Email: [email protected].

Econometrica (ISSN 0012-9682) is published bi-monthly by the Econometric Society, Department of Economics, New York University, 19 West 4th Street, New York, NY 10012. Mailing agent: Sheridan Press, 450 Fame Avenue, Hanover, PA 17331. Periodicals postage paid at New York, NY and additional mailing offices. U.S. POSTMASTER: Send all address changes to Econometrica, Blackwell Publishing Inc., Journals Dept., 350 Main St., Malden, MA 02148, USA.
An International Society for the Advancement of Economic Theory in its Relation to Statistics and Mathematics Founded December 29, 1930 Website: www.econometricsociety.org Membership, Subscriptions, and Claims Membership, subscriptions, and claims are handled by Blackwell Publishing, P.O. Box 1269, 9600 Garsington Rd., Oxford, OX4 2ZE, U.K.; Tel. (+44) 1865-778171; Fax (+44) 1865-471776; Email
[email protected]. North American members and subscribers may write to Blackwell Publishing, Journals Department, 350 Main St., Malden, MA 02148, USA; Tel. 781-3888200; Fax 781-3888232. Credit card payments can be made at www.econometricsociety.org. Please make checks/money orders payable to Blackwell Publishing. Memberships and subscriptions are accepted on a calendar year basis only; however, the Society welcomes new members and subscribers at any time of the year and will promptly send any missed issues published earlier in the same calendar year.

Individual Membership Rates

                                                                     $ (a)   € (b)   £ (c)   Concessionary (d)
Ordinary Member 2009, Print + Online, 1933 to date                    $60     €40     £32     $45
Ordinary Member 2009, Online only, 1933 to date                       $25     €18     £14     $10
Student Member 2009, Print + Online, 1933 to date                     $45     €30     £25     $45
Student Member 2009, Online only, 1933 to date                        $10     €8      £6      $10
Ordinary Member, 3 years (2009–2011), Print + Online, 1933 to date    $175    €115    £92
Ordinary Member, 3 years (2009–2011), Online only, 1933 to date       $70     €50     £38

Subscription Rates for Libraries and Other Institutions

                                                                     $ (a)   € (b)   £ (c)   Concessionary (d)
Premium 2009, Print + Online, 1999 to date                            $550    €360    £290    $50
Online 2009, Online only, 1999 to date                                $500    €325    £260    Free

(a) All countries, excluding U.K., Euro area, and countries not classified as high income economies by the World Bank (http://www.worldbank.org/data/countryclass/classgroups.htm), pay the US$ rate. High income economies are: Andorra, Antigua and Barbuda, Aruba, Australia, Austria, The Bahamas, Bahrain, Barbados, Belgium, Bermuda, Brunei, Canada, Cayman Islands, Channel Islands, Cyprus, Czech Republic, Denmark, Equatorial Guinea, Estonia, Faeroe Islands, Finland, France, French Polynesia, Germany, Greece, Greenland, Guam, Hong Kong (China), Hungary, Iceland, Ireland, Isle of Man, Israel, Italy, Japan, Rep. of Korea, Kuwait, Liechtenstein, Luxembourg, Macao (China), Malta, Monaco, Netherlands, Netherlands Antilles, New Caledonia, New Zealand, Northern Mariana Islands, Norway, Oman, Portugal, Puerto Rico, Qatar, San Marino, Saudi Arabia, Singapore, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Taiwan (China), Trinidad and Tobago, United Arab Emirates, United Kingdom, United States, Virgin Islands (US). Canadian customers will have 6% GST added to the prices above.
(b) Euro area countries only.
(c) UK only.
(d) Countries not classified as high income economies by the World Bank only.

Back Issues: Single issues from the current and previous two volumes are available from Blackwell Publishing; see address above. Earlier issues from 1986 (Vol. 54) onward may be obtained from Periodicals Service Co., 11 Main St., Germantown, NY 12526, USA; Tel. 518-5374700; Fax 518-5375899; Email
[email protected].
An International Society for the Advancement of Economic Theory in its Relation to Statistics and Mathematics
Founded December 29, 1930
Website: www.econometricsociety.org
Administrative Office: Department of Economics, New York University, 19 West 4th Street, New York, NY 10012, USA; Tel. 212-9983820; Fax 212-9954487
General Manager: Claire Sashi ([email protected])

2009 OFFICERS
ROGER B. MYERSON, University of Chicago, PRESIDENT
JOHN MOORE, University of Edinburgh and London School of Economics, FIRST VICE-PRESIDENT
BENGT HOLMSTRÖM, Massachusetts Institute of Technology, SECOND VICE-PRESIDENT
TORSTEN PERSSON, Stockholm University, PAST PRESIDENT
RAFAEL REPULLO, CEMFI, EXECUTIVE VICE-PRESIDENT
2009 COUNCIL (*)DARON ACEMOGLU, Massachusetts Institute of Technology MANUEL ARELLANO, CEMFI SUSAN ATHEY, Harvard University ORAZIO ATTANASIO, University College London (*)TIMOTHY J. BESLEY, London School of Economics KENNETH BINMORE, University College London TREVOR S. BREUSCH, Australian National University DAVID CARD, University of California, Berkeley JACQUES CRÉMER, Toulouse School of Economics (*)EDDIE DEKEL, Tel Aviv University and Northwestern University MATHIAS DEWATRIPONT, Free University of Brussels DARRELL DUFFIE, Stanford University HIDEHIKO ICHIMURA, University of Tokyo MATTHEW O. JACKSON, Stanford University
LAWRENCE J. LAU, The Chinese University of Hong Kong CESAR MARTINELLI, ITAM HITOSHI MATSUSHIMA, University of Tokyo MARGARET MEYER, University of Oxford PAUL R. MILGROM, Stanford University STEPHEN MORRIS, Princeton University ADRIAN R. PAGAN, Queensland University of Technology JOON Y. PARK, Texas A&M University and Sungkyunkwan University CHRISTOPHER A. PISSARIDES, London School of Economics ROBERT PORTER, Northwestern University ALVIN E. ROTH, Harvard University LARRY SAMUELSON, Yale University ARUNAVA SEN, Indian Statistical Institute MARILDA SOTOMAYOR, University of São Paulo JÖRGEN W. WEIBULL, Stockholm School of Economics
The Executive Committee consists of the Officers, the Editor, and the starred (*) members of the Council.
REGIONAL STANDING COMMITTEES Australasia: Trevor S. Breusch, Australian National University, CHAIR; Maxwell L. King, Monash University, SECRETARY. Europe and Other Areas: John Moore, University of Edinburgh and London School of Economics, CHAIR; Helmut Bester, Free University Berlin, SECRETARY; Enrique Sentana, CEMFI, TREASURER. Far East: Joon Y. Park, Texas A&M University and Sungkyunkwan University, CHAIR. Latin America: Pablo Andres Neumeyer, Universidad Torcuato Di Tella, CHAIR; Juan Dubra, University of Montevideo, SECRETARY. North America: Roger B. Myerson, University of Chicago, CHAIR; Claire Sashi, New York University, SECRETARY. South and Southeast Asia: Arunava Sen, Indian Statistical Institute, CHAIR.
Econometrica, Vol. 77, No. 5 (September, 2009), 1339–1369
THE UNEMPLOYMENT VOLATILITY PUZZLE: IS WAGE STICKINESS THE ANSWER?

BY CHRISTOPHER A. PISSARIDES1

I discuss the failure of the canonical search and matching model to match the cyclical volatility in the job finding rate. I show that job creation in the model is influenced by wages in new matches. I summarize microeconometric evidence and find that wages in new matches are volatile and consistent with the model’s key predictions. Therefore, explanations of the unemployment volatility puzzle have to preserve the cyclical volatility of wages. I discuss a modification of the model, based on fixed matching costs, that can increase cyclical unemployment volatility and is consistent with wage flexibility in new matches.

KEYWORDS: Unemployment volatility puzzle, wage stickiness, search and matching, Nash wage equation.
JOBS IN THE SEARCH AND MATCHING MODEL are characterized by monopoly rents, due to the matching frictions that give rise to search costs and unemployment. Despite notable recent exceptions, most of the literature assumes that the rents are shared through continuous recontracting between the firm and the worker, and uses the Nash solution to the wage bargain to derive the wage rate. The outcome is the “Nash wage equation,” which gives the wage rate as a linear combination of the productivity of the match and the worker’s returns from search and other nonmarket activities. Pissarides (1985) and Mortensen and Pissarides (1994) have shown that because the returns from nonmarket activities are less cyclical than the value of labor product, wages are less cyclical and employment is more cyclical in search and matching equilibrium than in a competitive market-clearing model. Shimer (2005a), however, argued that under common parameter values, the Nash wage rate is close to being as cyclical as productivity, and so the model does not have enough power to generate the observed cyclical volatility in its key variable—the ratio of job vacancies to unemployment (“tightness”). The model can explain more volatility in employment and less in wages than the competitive model does, as shown in calibrated business cycles models, but the volatility in unemployment that it can explain is tiny compared with the data.2

1 The Walras–Bowley lecture, North American Summer Meetings of the Econometric Society, Duke University, June 2007. I am grateful to the editor Daron Acemoglu and the referees for their extensive comments, and to Antonio Antunes, Christian Haefke, Robert Hall, John Kennan, Per Krusell, Iourii Manovskii, Rachel Ngai, Michael Reiter, Robert Shimer and Gary Solon for comments and discussion. Pedro Gomes provided research assistance. Partial funding for this study was provided by the Centre for Economic Performance, a designated research center of the ESRC.

2 See Hornstein, Krusell, and Violante (2005), Mortensen and Nagypal (2007), and Yashiv (2007) for a discussion of several issues related to this controversy. When matching frictions are incorporated into the conventional real business cycle model, the standard results are replicated and there is an improvement in the model’s performance with respect to employment. See Langot (1995), Merz (1995), Andolfatto (1996), and den Haan, Ramey, and Watson (2000). See also Cole and Rogerson (1999). But none of these papers explicitly addressed the issue of unemployment volatility.
© 2009 The Econometric Society. DOI: 10.3982/ECTA7562
I call the failure of the model to match the observed volatility of unemployment the unemployment volatility puzzle. I discuss this puzzle in a simple search and matching model, focusing on the role of wages. The model has only one driving force—the average product of labor (which in the canonical model is always equal to the marginal product). With this restriction, it is easy to show that the canonical model can deliver nontrivial volatility in unemployment only if there is at least some wage stickiness, defined as a wage rate that changes less than in proportion to the average product of labor over the cycle (see Hall and Milgrom (2008)).

In the context of Shimer’s claims, the search and matching model has one big advantage over the competitive model: it is immune to Barro’s (1977) critique that in a rational equilibrium wage stickiness should not cause employment volatility. Moreover, as Hall (2005a) noted, there are rent division rules that stabilize the wage without violating either side’s production participation constraints, and thus make the employer’s profit from a new hire more cyclical than implied by the Nash wage rate. These rules imply a wide range of volatility of labor-market tightness, including the observed level of volatility. These findings, and the commonly held view that wages are sticky over the cycle, seemed to point to the conclusion that a solution to the unemployment volatility puzzle is an alternative to the Nash wage equation that delivers more wage stickiness.

Whether or not another wage equation is the answer to the unemployment volatility puzzle depends on the consistency between the model’s wage equation and the empirical evidence. The commonly held view that wages are sticky over the cycle is derived mainly from time-series evidence, starting with the famous Keynes–Tarshis–Dunlop controversy, but this evidence is not relevant for the search and matching model.3 I reexamine the evidence in the context of the search and matching model, and find that the answer is not as simple as implied by the argument of the preceding paragraph.

3 See Brandolini (1995) and Abraham and Haltiwanger (1995) for surveys of the time-series evidence.

In the search and matching model, the timing of wage payments during the job’s tenure is not important for job creation. Job creation is driven by the difference between the expected productivity and the expected cost of labor in new matches. I demonstrate that as long as the firm and the worker use the Nash wage rule to split rents at the time of job creation, the job creation conditions are unaffected by the rule used to split rents in ongoing jobs. So wages in continuing jobs may be completely fixed, and yet, if wages in new matches satisfy the Nash wage equation, the volatility of job creation will be unaffected by their wage stickiness. The wage stickiness that matters in this model
is therefore wage stickiness in new matches, and the model’s Nash wage equation should be compared with the empirical evidence relating only to wages in new matches.4

4 A similar argument was independently put forward by Haefke, Sonntag, and van Rens (2007), who also reported empirical estimates motivated by the model. See also Shimer (2004) for a discussion of similar issues.

I summarize the existing empirical evidence about the cyclicality of wages in new matches and find that the model’s Nash wage equation gets it about right: there is as much cyclicality in the empirical wage equations for new jobs as in the simple wage equation derived in the canonical model. The Nash wage equation implies too much cyclical volatility for wages in ongoing jobs, but this cyclicality is irrelevant for job creation. Moreover, this is true both in the United States and, perhaps more surprisingly, in the main European economies for which there are relevant data. I conclude that a good explanation of the unemployment volatility puzzle needs to be consistent with the observed proportionality (or near proportionality) between wages in new matches and labor productivity. Models that imply nontrivial departures from unit elasticity between wages in new matches and productivity go against a large body of evidence.

I show in this paper that a small modification to the model, that maintains the usual parameter values used in calibrations, can deliver more volatility in the job finding rate without departing from wage flexibility in new jobs. The modification is in the way in which matching costs are modeled. In the canonical model, matching costs (other than foregone output) are proportional to the duration of a vacancy. This is a very strong assumption, because if, following a positive productivity shock, the duration of vacancies increases, the firm’s cost of meeting a worker increases in proportion to the increase in duration. This discourages firms from posting many more vacancies when positive shocks arrive. If instead costs rise less than in proportion to the duration of vacancies, the firm’s incentives to post vacancies remain high. I show that a simple remodeling of the costs from proportional to partly fixed and partly proportional can increase the volatility of tightness and job finding, virtually matching the observed magnitudes, without violating wage flexibility. I argue that the assumption that there are fixed costs to job creation is in itself a realistic assumption. These costs include the costs of negotiating with the successful job applicant, putting her on the firm’s payroll, and training her.5

5 Hall and Milgrom (2008) changed the wage bargain to one of strategic (sequential) bargaining with delay costs. They showed that the model delivers the required unemployment volatility and avoids unrealistic wage stickiness. The fixed costs that I introduce in the model with the Nash wage equation play a similar role to their negotiation costs, but in their model the negotiation costs are not paid in equilibrium. Mortensen and Nagypal (2007) also showed that training costs increase the volatility of job finding.

Hagedorn and Manovskii (2008) have used a calibration technique different from Shimer’s (2005a) to conclude that the typical worker’s nonmarket
returns are high, about 95% of market returns, and the share of labor in the wage bargain is low. With these parameters, the model can calibrate the observed cyclical volatility in tightness with near proportional wages. The model (at least in its canonical form) is subject to two other criticisms, however, even if one accepts that the improvement in a person’s welfare from job acceptance can be that small. Costain and Reiter (2008) noted, in a paper that anticipated to some extent both the Shimer (2005a) critique and the Hagedorn and Manovskii (2008) response, that if nonmarket returns are high, the response of unemployment to labor-market policy, in particular unemployment insurance, is too large.6 Hall and Milgrom (2008) also noted that the Hagedorn and Manovskii calibrations imply too high a labor supply elasticity, given empirical estimates.

6 Policy effects might be dampened if the production function is such that a policy-induced fall in employment increases the average and marginal product of labor. Hagedorn, Manovskii, and Stetsenko (2008) obtained such an effect from the assumption of labor heterogeneity with imperfect substitutability between skilled and unskilled workers, and from complementarity between capital and skilled labor.

This research has brought out a general point: if a calibration of the canonical model implies that, on average, equilibrium wages are closer to productivity, it also implies an amplification of the impact of productivity shocks on unemployment without violating the near proportionality of wages in new jobs. The reason for this is that the profit margin becomes small, so small productivity shocks cause large proportional changes in profits, even if wages are near proportional to productivity.

In the remainder of this paper, I first briefly discuss some issues in the dynamic evolution of unemployment, in particular, the role of movements in and out of the labor force, and flows between employment and unemployment (Section 1). Following this, I derive the key cyclical elasticities of tightness and wages from a simple search and matching model with endogenous job finding but exogenous job separations (Sections 2 and 3). I then survey the econometric evidence on wages and show that the estimated elasticities for new jobs match the model’s calibrated elasticities (Section 4). I finally discuss the role of matching costs and nonmarket returns for the cyclical volatilities of the model (Section 5).

1. WHAT DRIVES THE DYNAMICS OF UNEMPLOYMENT?

The approach that I follow in this paper is to derive the impact of cyclical shocks on unemployment by modeling the flows in and out of unemployment. Two questions immediately arise: First, do we lose essential generality if we ignore transitions between unemployment and out of the labor force? Second, should we model cyclicality in both the flows in and the flows out of unemployment? The answer to these questions for the conventional rate of unemployment is no to the first and yes to the second.
The flow rates between activity and inactivity show some cyclicality, but several investigators have concluded that they do not contribute substantially to the cyclical volatility of unemployment (see Shimer (2005b), Hall (2005b), Braun, De Bock, and DiCecio (2006), Elsby, Michaels, and Solon (2009), and Fujita and Ramey (2009)). The rate of inactivity itself is nearly acyclical, and the correlation between the cyclical components of the rates of unemployment and employment is −0.95. I therefore focus on a simple two-state model, with workers moving between the states of unemployment and employment in response to shocks. This is also the focus of the canonical model that has recently come under scrutiny.

In a two-state model, I define the change in the unemployment rate, Δut+1 = st(1 − ut) − ft ut, where st is the flow rate between employment and unemployment during period t (the inflow, defined as the total number of workers who move from employment to unemployment divided by the number of employed workers) and ft is the flow rate in the other direction (the outflow, defined as the number of workers who move from unemployment to employment divided by the number of unemployed workers). If the two flow rates remain constant at s and f for a sufficiently long time, unemployment converges to the steady-state rate (1)
u = s/(s + f)
With quarterly data on unemployment stocks and flows, constructed under the assumption that s and f are constant during the quarter, the unemployment rate obtained from (1) is virtually indistinguishable from the actual unemployment rate.7 I therefore use (1) as my unemployment equation throughout this paper.

7 See Shimer (2005b), where constructed quarterly time-series data for the flows are also available for downloading (http://robert.shimer.googlepages.com/flows). The same appears true of economies with longer durations in each state, as in pre-1980s Britain. See Pissarides (1986).

Taking first differences of (1), I find that the change in the rate of unemployment from quarter t − 1 to quarter t is given by (2)
Δut = (1 − ut)ut−1 Δst/st−1 − ut(1 − ut−1) Δft/ft−1
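The two terms on the right side of (2) sum exactly to the change in the steady-state rate (1). A minimal numerical sketch (in Python; not from the paper’s replication files, and the downturn quarter below is hypothetical):

def steady_state_u(s, f):
    """Equation (1): u = s / (s + f)."""
    return s / (s + f)

def decompose_du(u_prev, u_curr, s_prev, s_curr, f_prev, f_curr):
    """Equation (2): contributions of the inflow and outflow rates."""
    inflow = (1 - u_curr) * u_prev * (s_curr - s_prev) / s_prev
    outflow = -u_curr * (1 - u_prev) * (f_curr - f_prev) / f_prev
    return inflow, outflow

# Mean U.S. flow rates used later in Table I: s = 0.036, f = 0.594.
u0 = steady_state_u(0.036, 0.594)
print(round(u0, 3))                 # 0.057, i.e., 5.7% unemployment

# Hypothetical downturn: separations rise, job finding falls.
u1 = steady_state_u(0.040, 0.520)
infl, outfl = decompose_du(u0, u1, 0.036, 0.040, 0.594, 0.520)
print(round(infl, 4), round(outfl, 4), round(u1 - u0, 4))   # 0.0059 0.0084 0.0143

The two printed contributions add up to the total change, which is the decomposition plotted in Figure 1.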
Figure 1 shows the contribution of each flow rate to the change in unemployment. The flow rates are derived from the quarterly job finding and job exit probabilities constructed by Shimer (2005b) and available online, and the two series shown in Figure 1 are for each of the terms on the right side of (2). Clearly, both rates contribute to the change in the unemployment rate. Their correlation coefficient is −0.5, so on average their contribution is in the same
[Figure 1: quarterly time series, 1948–2003, of the two terms on the right side of (2); vertical axis in percentage points (−3 to 2); series labeled “finding” and “separation.”]
FIGURE 1.—Contribution of job finding and job separation rates to changes in unemployment.
direction. A consensus estimate in the literature for the contribution of the inflow rate lies between one-third and one-half of the total.8

8 See the references cited earlier in this section. Of course, the question of the cyclicality of unemployment flows is quite distinct from the question of the cyclicality of job flows. Concerning especially the flow out of jobs, one can distinguish between the flow of workers from employment to unemployment (the focus of this paper), the flow of workers out of jobs (the separation rate), and the flow of jobs from activity to inactivity (the job destruction rate). See, for example, Davis, Haltiwanger, and Schuh (1996) and Hall (2005b). Because of the abstractions of the model, I often refer to the unemployment outflow rate as the job creation or job finding rate and refer to the unemployment inflow rate as the job destruction or job separation rate.

2. THE CANONICAL MODEL

The recent literature has either ignored the inflow rate when studying the cyclical dynamics of unemployment or treated shocks to the inflow rate as one of the exogenous forces driving changes in the outflow rate. But because more low-productivity jobs are destroyed in recession (e.g., Solon, Barsky, and Parker (1994)), at least some part of job separations is driven by endogenous decisions in response to aggregate productivity shocks. If all job destruction were driven by exogenous separation shocks, the jobs destroyed in recession would be a random draw from the productivity distribution. The model of Mortensen and Pissarides (1994) accounts for the volatility in the flow into unemployment through the endogenous job destruction decisions of firms.

An important implication of the endogeneity of job destruction, related to the composition bias of the empirical literature, is the following. Compare the impact of exogenous and endogenous job destruction shocks on the expected profit from a new job position. Exogenous job destruction shocks are equivalent to shocks to the discount rate applied to the flow of output from all
jobs, so they have an impact on the present discounted value of profits and on job creation. But endogenous job destruction due to productivity shocks does not have an impact on job creation in the neighborhood of equilibrium, essentially because of the envelope theorem. In the canonical model, both the entry of new vacancies and the choice of reservation productivity that governs job destruction maximize the value of a vacant job. The impact that small productivity shocks have on the entry of vacancies is unaffected by the response of the reservation productivity to the same productivity shocks. Intuitively, following a small negative shock to productivity, the productivity of the jobs destroyed is close to the reservation value and so their profit flow is zero. Their disappearance does not affect the present discounted value of the profit from the average job that governs job creation.9

9 This claim is spelt out more fully and more formally in the longer version of this paper that circulated as Centre for Economic Performance (LSE) Discussion Paper 0839 (November 2007). It is also available on my web page, http://personal.lse.ac.uk/pissarid/.

The only types of shocks that I study in this paper are productivity shocks. By the argument of the preceding paragraph, I can derive their impact on job creation from a model with constant job destruction rate, which helps make my argument more transparent. Their impact on job destruction requires, in addition, the endogenous job destruction margin of the Mortensen–Pissarides model. To compute the quantitative impact of productivity shocks on job destruction in the Mortensen–Pissarides model, I require knowledge of the parameters of the process that brings idiosyncratic productivity shocks to active jobs. We do not yet know enough about the quantitative properties of this process to calibrate it independently of the observed job creation and job destruction rates. For this reason, I focus here on the unemployment outflow rate shown in Figure 1, for which the model makes strong quantitative predictions, and leave the quantitative evaluation of the model’s inflow rate for future research.10

10 In the longer version of the paper cited in the preceding footnote, I calibrated a particular version of the model with endogenous job destruction and uniform idiosyncratic productivity shocks. The calibration results for the job creation rate are the same as in the shorter model in this paper. Recently, Menzio and Shi (2008) used the tenure distribution to “back out” the distribution of idiosyncratic productivities across active jobs. Although this is a promising avenue for calibrating the steady-state distribution, it still leaves open the question of shocks to the idiosyncratic productivity, their frequency, and their intensity. Menzio and Shi bypassed this difficulty by assuming that there are no idiosyncratic shocks.

My objective is to compare the second moments of the endogenous variables, in particular unemployment, vacancies, and wages, with the second moment of labor productivity. But because the matching flows are large and there is a lot of persistence in productivity when compared with the speed at which unemployment approaches its steady state, I can approximate the cyclical results by comparative static results with a continuous-time model that compares steady states at different realizations of labor productivity (Shimer (2005a)).
In the simple version of the model (Pissarides (2000, Chap. 1)), the flow of workers from employment to unemployment is the result of a negative shock that hits occupied jobs at constant rate s; I refer to this as the job separation rate. The flow of workers from unemployment to employment is derived from the rate at which unemployed workers are matched to vacant jobs; I refer to this as the job finding rate. Matching is pairwise and random, and is given by the aggregate matching function m(u, v), which is concave in its arguments and homogeneous of degree 1. The arguments are the measures of unemployment and vacancies, the first describing the state of the system at any point in time and the second resulting from the profit maximization decisions of firms. The transition rate for each vacant job is the average m(u/v, 1) ≡ q(θ), where θ ≡ v/u is the tightness of the market and q′(θ) < 0. The transition rate for unemployed workers is f(θ) ≡ m(1, v/u) = θq(θ), and f′(θ) > 0. With knowledge of s and f, we get the unemployment rate from (1).

2.1. Job Creation

The utility function of both workers and firms is linear. Unemployed workers enjoy some imputed income z during unemployment, which has to be given up when they take a job. The job creation decision is initiated by an employer when she posts a vacancy, at a flow cost c for the duration of the vacancy. A search equilibrium is a pair (θ, w) that simultaneously solves the job creation condition and the wage rule.

To derive the job creation condition, let V be the value of a new vacancy to an employer. It satisfies the Bellman equation (3)
rV = −c + q(θ)(J − V )
J is the value of an occupied job and satisfies (4)
rJ = p − w − sJ
where r represents the risk-free interest rate, and the assumption is made that a destroyed job has zero value to the employer. Vacancy creation exhausts all available profits, so the job creation condition is (5)
V = 0 ⟺ (p − w)/(r + s) = c/q(θ)

2.2. Wages
The canonical model assumes that wages share the surplus from the job in fixed proportions at all times. If we let W be the worker’s expected returns from holding a job and let U be the expected returns from unemployment, wages solve (6)
W − U = β(J + W − V − U), β ∈ [0, 1)
This sharing rule can be derived as the solution to a generalized Nash bargaining problem w = arg max{(W − U)^β (J − V)^(1−β)} and is referred to as the Nash sharing rule or simply as the Nash wage. Given (4) and the equivalent equation satisfied by W, (7)
rW = w − s(W − U)
the wage equation also satisfies, in general, (8)
w = rU + β(p − rU − (r + s)V )
This equation makes clear that there are three separate channels through which a shock to productivity is transmitted to wages. First, there is a direct effect from the own-job productivity that is due to the sharing assumptions, the p term in (8); second and third, there are indirect effects that transmit shocks to p through changes in the reservation values of the firm and the worker. The controversy surrounding wages centers on the role of the reservation values in wage determination, which in the Nash wage rule have maximum impact because they define the “threat points” of the firm and the worker.11 The solution commonly found in the literature is derived from (8) by using the expression for the value of unemployment and the job creation condition to substitute out the reservation values. We first note that because of (3), (5), and (6), (9)
rU = z + f(θ)(W − U) = z + f(θ)[β/(1 − β)](J − V) = z + [β/(1 − β)]cθ
The wage equation is (10)
w = (1 − β)z + β(p + cθ)
The job creation condition (5) slopes down in (θ, w) space and the wage equation (10) slopes up, giving a unique equilibrium tightness–wage pair. Given tightness, equation (1) delivers the unemployment rate.
11 Note that because the vacancy value is lost at rate s whereas the value of unemployment is never lost, the value of a vacancy is discounted at rate r + s whereas that of unemployment is discounted only at rate r.
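The unique crossing point can be computed directly. A minimal sketch (in Python, not the paper’s code) that substitutes the wage equation (10) into the job creation condition (5) and bisects on tightness, using the parameter values reported in Table I of Section 3 and the Cobb–Douglas matching technology assumed there, q(θ) = m0 θ^(−η):

def solve_equilibrium(p=1.0, z=0.71, beta=0.5, c=0.356,
                      r=0.004, s=0.036, m0=0.7, eta=0.5):
    q = lambda th: m0 * th ** (-eta)             # vacancy matching rate
    # Entry profit net of expected hiring cost, after substituting (10) into (5):
    gap = lambda th: (1 - beta) * (p - z) - beta * c * th - (r + s) * c / q(th)
    lo, hi = 1e-9, 100.0                         # gap > 0 at lo, gap < 0 at hi
    for _ in range(100):                         # bisection on theta
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
    theta = 0.5 * (lo + hi)
    w = (1 - beta) * z + beta * (p + c * theta)  # Nash wage, equation (10)
    u = s / (s + m0 * theta ** (1 - eta))        # steady-state rate, equation (1)
    return theta, w, u

theta, w, u = solve_equilibrium()
print(round(theta, 2), round(w, 3), round(u, 3))   # 0.72 0.983 0.057

The printed values anticipate the steady-state solutions reported in Section 3.1.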
2.3. Wages in New and Continuing Jobs

The Nash sharing rule in the canonical model holds at all times, irrespective of the tenure of the job. But this is not important for job creation in the model. If wages in new jobs are fixed by the Nash wage rule, the job creation condition is the same as that in the canonical model, irrespective of how wages are determined in subsequent job tenures. When I examine the empirical evidence on wages later in this paper, I find a sharp distinction between the wages in new matches and the wages in continuing jobs. In anticipation of the discussion of the empirical results that follows, I now demonstrate the equivalence of the job creation condition in the canonical model and in an alternative model where the Nash sharing rule holds only in new jobs.

A job is new when the firm and the worker first match, and remains “new” for an average period of length 1/λ. At some constant rate λ it changes status to a continuing job. The event that changes the status of the job from new to continuing need not be specified. It might be connected with the arrival of aggregate or idiosyncratic shocks to productivity. To make the point more transparent, I first derive the wage equation and job creation condition under the assumption that the market is in stationary equilibrium at productivity p and tightness θ throughout the life of the job.12 The asset values of belonging to a new or continuing job are now, respectively, distinguished by superscript n or c. In continuing jobs, the asset values are as in (4) and (7), except that J, W, and w are distinguished by superscript c. In new jobs we now have (11)
(r + s)J^n = p − w^n + λ(J^c − J^n)
(12)
(r + s)(W^n − U) = w^n + λ(W^c − W^n) − rU
I define the surplus from new jobs as S^n = J^n + W^n − U. Making use of all asset-value equations for new and continuing jobs to calculate S^n, I obtain (13)
S^n = (p − rU)/(r + s)
The duration of new jobs does not influence their net surplus. I assume that the Nash sharing rule holds for new jobs but not necessarily for continuing jobs. The rule in (6) implies J^n = (1 − β)S^n and so job creation is given by (14)
(1 − β)S^n = (1 − β)(p − rU)/(r + s) = c/q(θ)
12 Less strictly, the wage equation that I derive under this restriction is a good approximation to a wage equation for new jobs when the event that changes the status of the job arrives with much higher frequency than the event that changes productivity. But the main result holds in more general setups with appropriate modification of the Bellman equations.
From (9), which still holds for new jobs, I get the job creation condition for this model: (15)
[(1 − β)(p − z) − βcθ]/(r + s) = c/q(θ)
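This condition can be checked against the canonical model symbolically. A minimal sketch (assuming the sympy library is available; not part of the paper):

import sympy as sp

# Substituting the Nash wage (10) into the job creation condition (5)
# should reproduce the left side of (15) exactly.
p, z, beta, c, theta, r, s = sp.symbols('p z beta c theta r s', positive=True)

w_nash = (1 - beta) * z + beta * (p + c * theta)                  # equation (10)
canonical = (p - w_nash) / (r + s)                                # left side of (5)
new_jobs = ((1 - beta) * (p - z) - beta * c * theta) / (r + s)    # left side of (15)

print(sp.simplify(canonical - new_jobs))   # prints 0: the conditions coincide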
The interesting property of (15) is that it is identical to the job creation condition of the canonical model, obtained when (10) is substituted into (5). Reversing this argument, we can define a “mean” wage w̄ for a job by (16)

w̄ = (1 − β)z + β(p + cθ)
use this as if it were the wage rate at every tenure, and derive the full solution for job creation. But this wage is identical to the Nash wage equation (10), although the Nash wage rule in this model holds only for new jobs with an arbitrary mean duration 1/λ. Wages in continuing jobs can take on arbitrary values, since we derived (15) without imposing any restrictions on w^c. Of course, wages in new and continuing jobs will not, in general, behave like the Nash wage if the Nash sharing rule is applied only to new jobs. Making use of the Nash sharing rule and the asset valuation equations to solve for wages in new jobs, I obtain (17)
w^n = (1 − β)z + β(p + cθ) + λ(βJ^c − (1 − β)(W^c − U))
This wage equation coincides with the Nash wage equation of the canonical model if wages in continuing jobs also satisfy the Nash sharing rule, because in this case βJ^c = (1 − β)(W^c − U). But if the share of the firm in continuing jobs is bigger than required by Nash, the wage rate in new jobs is higher to compensate the worker, whereas if it is smaller, the opposite holds.

Now it is straightforward to show that if the change of job status coincides with a change in productivity and tightness (or any other variable), a wage equation like (17) still holds.13 For the benefit of the analysis that follows, I therefore now assume that a match is formed in some state i but its status changes to a continuing match when a new state j arrives, which coincides with the arrival of a new productivity and tightness for all jobs. States alternate between i and j at the same constant rate λ. I can derive, with obvious notation, (18)
w_i^n = (1 − β)z + β(p_i + cθ_i) + λ(βJ_j^c − (1 − β)(W_j^c − U_j))
13 The reason that I did not use this more general setup to make the previous point (that when wages in new jobs satisfy the Nash rule, job creation is the same as in the canonical model) is that in the more general setup, the mean wage equation in (16) is considerably more complicated. However, it still coincides with the equivalent equation when the Nash sharing rule applies to all jobs and there are productivity differences across jobs.
Let w_j^c be the continuation wage in state j, that is, the wage offered to those hired in state i or in state j in earlier periods, and assume that it is exogenous. Then substituting out of (18) the continuation values from the Bellman equations, I obtain (19)
w_i^n = w_i^N + [λ/(r + s + λ)](w_j^n − w_j^c)
where w_i^N is the Nash wage in state i, that is, the wage that satisfies (10) with p and θ qualified by subscript i. Equation (19) is intuitive. If w_j^c = w_j^n, the Nash sharing rule is used for both continuing and new jobs, so the wage equation is the equation of the canonical model. If continuation wages are expected to be higher than wages in new matches, new wages offered now are lower to compensate. I will return to this equation when I study wage volatility in new and old jobs.

3. SOLVING THE MODEL

3.1. Parameter Values and Steady-State Solutions

Given our interest in the job finding rate only, we can ignore for now the Beveridge curve (1) and focus on the two-equation system (5) and (10) in the two unknowns θ and w. I solve the model for monthly data with the parameters shown in Table I. I will return to the model with two wages after I derive the canonical solutions.

TABLE I
PARAMETER VALUES, MONTHLY DATA

Parameter     Value    Description                 Source/Target
r             0.004    Interest rate               Data
s             0.036    Exog. separations           Shimer (2005b)
z             0.71     Leisure and UI comp.        Hall–Milgrom (2008)
c             0.356    Vacancy cost                Mean θ
m0            0.7      Matching fn. scale          Job finding probability
η             0.5      Matching fn. elasticity     Petrongolo–Pissarides (2001)
β             0.5      Share of labor              β = η (efficiency)

Data, Mean Values
θ             0.72     Mean v/u (tightness)        JOLTS, HWI
m0 θ^(1−η)    0.594    Job finding prob.           Shimer (2005b)

The matching function is assumed to be Cobb–Douglas, m = m0 u^η v^(1−η), with unemployment elasticity η = 0.5 (see Petrongolo and Pissarides (2001)). Following common practice, I also assume β = 0.5, which internalizes the search
externalities.14 The job finding rate is f(θ) = m0 θ^0.5. The sample mean for θ in 1960–2006 was 0.72, derived by making use of Job Openings and Labor Turnover Survey (JOLTS) data since December 2000 and the Help-Wanted Index (HWI) adjusted to the JOLTS units of measurement before then. Shimer’s (2005b) monthly transitions data (under the assumption that monthly transition rates are constant within the quarter) give a mean value for 1960–2004 of 0.594 for the job finding rate and 0.036 for the job separation rate. The implied unemployment rate is 5.7%, very close to the actual mean. I set s = 0.036 and make use of the mean job finding rate and mean tightness value to solve for m0. The result is m0 = 0.7.

14 See Hall and Milgrom (2008) for a different motivation for β ≅ 0.5, and see Hagedorn and Manovskii (2008) for reasons to select a much smaller value for β. As with β, the elasticity η plays some role in the quantitative solutions of the model, but within the small range of 0.5–0.7, which conforms to the empirical estimates reported in Petrongolo and Pissarides (2001), the solutions are robust to it.

All costs and returns are normalized by the value of output, which is set to 1 in the initial equilibrium. The income equivalent that the unemployed give up to take a job was recently calculated by Hall and Milgrom (2008) to be 0.71; it includes both unemployment insurance and the value of time. The cost of advertising vacancies and recruiting is obtained from the steady-state solutions of the model; it is the value that gives θ = 0.72 when all the other parameter values are set as above. We can check how plausible it is by computing the expected recruitment costs, c/q(θ) = 0.43, giving that the expected recruitment cost is 43% of a month’s output. The solution for wages obtained in the steady state is w = 0.98. The percentage gain in flow receipts when a worker accepts a job is substantial, 100(0.98/0.71 − 1) = 38.4%. But the “permanent income” of employed workers is only marginally above the “permanent income” of unemployed workers, a consequence of the assumption of infinite horizons, short unemployment durations, and uniform unemployment incidence.

3.2. Elasticities

To compute the impact of productivity shocks on the model’s unknowns, I differentiate the job creation condition and wage equation, and compute the elasticities at the steady-state solutions. The computed θ elasticity is 3.67. In his original critique of the search and matching model, Shimer (2005a) found an elasticity of 1.71. The target for this elasticity is 7.56, which would be the regression coefficient in a simple regression with tightness as the dependent variable and productivity as the independent variable.15

15 Shimer (2005a) set the target at 1.91, which is the ratio of the standard deviations of tightness and productivity. But this should be the target if there were no measurement or other random errors in the two variables and if no other shocks influenced tightness. To get the 7.56 target, I multiply by their correlation coefficient. See Shimer’s paper for the data used here.

Virtually the only reason for the difference in the two elasticities is due to z. The bigger z in our
calibration has an impact because it reduces the firm’s steady-state profit and so implies that cyclical shocks have a bigger proportional impact on profits. The change of z from 0.4 (Shimer’s number) to our 0.71 increases the elasticity in our version of the model from 1.74 to 3.67.

The response of wages to the change in productivity is very close to proportionality. The wage–productivity elasticity is 0.985, and in the version of the model with z = 0.4, it is 1. This implication of the Nash wage equation was used to criticize it, because time-series data show substantial wage stickiness (see Section 4).16 Two features of the model encouraged research in modifications of the model that might yield a stickier wage equation. First, the model can accommodate substantial wage stickiness without violating rationality, because of the local monopoly rents created by a job match (Hall’s (2005a) point); second, even small amounts of wage stickiness can make a lot of difference in the response of job creation to productivity shocks.

To illustrate how much wage stickiness is needed to match the response of tightness to productivity shocks, let the elasticity of θ with respect to p be εθ and let the elasticity of wages be εw. From (5), (20)
εθ =
1 p − εw w η p−w
If εw = 1, then εθ = 1/η = 2, essentially the Shimer critique. Obviously, reducing εw increases εθ. As we noted before, in this model εθ = 3.67, achieved by a wage elasticity εw = 0.985 and a solution for w = 0.983. How much wage stickiness is needed to hit the required θ elasticity of 7.56? As is clear from (20), the answer to this question depends critically on how close w is to p. I rearrange (20) to obtain

(21)  εw* = (p − η(p − w)εθ)/w.
Using the same solution for w as in the model with the Nash wage equation and setting εθ = 7.56 gives εw* = 0.885. For solutions that yield a w closer to p, the required wage elasticity is closer to 1. In this simple model the distance of w from p is dictated largely by z, so its role is crucial in computing the wage elasticity needed to deliver the required volatility in the job finding rate. For z = 0.4 the required wage stickiness is εw* = 0.758, whereas for z = 0.9, the computed Nash wage elasticity of 0.955 delivers even more volatility in θ than the required elasticity, with εθ = 1.107.17

17 The latter is the point made by Hagedorn and Manovskii (2008).
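A minimal numerical check of (20), assuming η = 0.5 and p = 1 and using the rounded solutions quoted above, so the outputs only approximate the reported figures:

    # Tightness elasticity from (20).
    eta, p, w = 0.5, 1.0, 0.983

    def eps_theta(eps_w):
        return (p - eps_w * w) / (eta * (p - w))

    print(eps_theta(1.0))    # 1/eta = 2 (up to rounding): the Shimer-critique case
    print(eps_theta(0.985))  # ~3.7, close to the reported 3.67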
3.3. Wages in New Jobs

I have argued that the job creation condition of the canonical model also holds in a model where the Nash sharing rule applies only to new jobs. The wage equation, however, differs in this case, as shown by equation (19), so its cyclical properties are also likely to differ. To derive precise dynamic properties of the initial wage, we need to know the dynamic properties of the continuation wage. Different models of the determination of the continuation wage, which have no implications for the volatility of job creation, will imply different wage elasticities for new matches. I derive here the elasticity of wages in new jobs for a popular class of models.

Suppose that when workers are hired in some state i, the wage rate is chosen to satisfy the Nash division rule, but then it is not varied again. So the continuation wage in a job created in state i is w_i^n, and so in (19), w_j^c = w_i^n. This wage rule is inspired by the implicit contract models of Azariadis (1975) and Baily (1974). In their models workers are risk averse but firms are risk neutral, and there are stochastic productivity shocks. Once a worker is hired, it is to the advantage of the firm to hold the wage rate constant when a productivity shock arrives. The risk averse worker is made better off by the absence of income fluctuations and the risk neutral firm is not made worse off, as long as the mean value of profits is not affected by the productivity shocks over long periods of time. The solution for wages when w_j^c = w_i^n is substituted into (19) is

(22)  w_i^n = [(r + s + λ)/(r + s + 2λ)]w_i^N + [λ/(r + s + 2λ)]w_j^n.
What is the response of new wages to changes in productivity p_i with this equation? We can answer this question in two ways. One, in keeping with the approach of the previous analysis, obtains the effect by log differentiating (22) with respect to p_i and p_j, that is, displaces productivities throughout the job tenure by the same proportion. A second approach, however, is more appropriate to this particular example. Since w_i^n are wages in new jobs when productivity is p_i and w_j^n are wages in new jobs when productivity is p_j, the cyclicality of new wages can be approximated by the ratio of the log difference between w_j^n and w_i^n to the log difference between p_j and p_i.

For the first approach, I assume that the elasticity of new wages with respect to same-state productivity is common across states and denoted by εn, and that the Nash wage elasticity is, as before, εw. Differentiating (22), I derive

(23)  εn = εw [(r + s + λ)w_i^N/w_i^n] / [r + s + λ + λ(1 − w_j^n/w_i^n)].
Calibrating this wage elasticity precisely requires parameters for which we do not have information, but we can argue that it is likely to be very close to the Nash wage elasticity. First, suppose we evaluate it, as an approximation, at
the point where new wages are equal (in level) to the Nash wage equation. Then the elasticity is precisely equal to the Nash wage elasticity. Next, suppose p_i > p_j, and so w_i^n > w_j^n and w_i^n < w_i^N. Then both the numerator of (23) and λ in the denominator are multiplied by numbers bigger than 1, which have an offsetting impact on the ratio εn/εw. The same happens if p_i < p_j, as can easily be checked.

To calculate the cyclical response of new wages by making use of the log difference between w_i^n and w_j^n, I note that the ratio of the elasticities εn/εw is to a good approximation equal to the ratio of the proportional changes in wages in new jobs and the Nash wage:

(24)  εn = εw [(w_j^n − w_i^n)/(w_j^n + w_i^n)] / [(w_j^N − w_i^N)/(w_j^N + w_i^N)].
To compute this ratio, I substitute w_i^n from (22) and w_j^n from the equivalent equation for it into (24), and noting that the economy alternates periodically between the two states, I obtain

(25)  εn = εw (r + s + λ)/(r + s + 3λ).
Under this formulation the ratio of the elasticities is constant and less than 1. If we assume that the periodic productivity shocks have typical cyclical durations, about four years, then a reasonable assumption for monthly data is λ = 0.02. The elasticity ratio in (25) is then 0.6, so if the model predicts a Nash elasticity of about 1, the predicted elasticity of wages in new jobs in this particular example is about 0.6.

Of course, wages in ongoing jobs do show some cyclicality. A modification to the implicit contract model by Beaudry and DiNardo (1991) to take into account quitting behavior by workers employed in continuing jobs implies the following. When workers are hired in a good state, for example, when p_i > p_j, wage behavior is described well by the assumptions underlying (22). Firms insure these workers against negative shocks. But when workers are hired in a low state, the outside wage offers when the good state arrives are higher than their own continuation wage, so they have an incentive to quit. Quitting is suboptimal in this model for both the individual agents and the social planner, because all jobs have the same productivity, and there are costs to recruiting and (possibly) job changing. The firm's optimal response is to raise the wage and match outside wage offers. It follows that the Nash division rule holds also for the continuation state when workers are hired in the poor state. Therefore, wages in new jobs in this case satisfy the Nash wage equation precisely and so does their elasticity (see equation (18)).
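As a quick check of the elasticity ratio in (25): the monthly interest rate r is not stated in this section, so r = 0.004 is an assumption (only r + s matters here), with s = 0.036 and λ = 0.02 from the text:

    # Elasticity ratio (25): (r + s + lam) / (r + s + 3 * lam).
    r, s, lam = 0.004, 0.036, 0.02
    print((r + s + lam) / (r + s + 3 * lam))  # ~0.6, as reported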
More generally, if wages in continuing jobs respond to productivity shocks at some fraction k of the response of wages in new jobs, that is, if w_j^c − w_i^n = k(w_j^n − w_i^n), the elasticity ratio in (25) becomes

(26)  εn = εw (r + s + λ)/[r + s + λ + 2λ(1 − k)].
So the bigger is the response of the continuation wage to productivity shocks, the closer the elasticity of wages in new matches gets to the Nash wage elasticity of the model, with an upper limit of 1.18

18 There are other explanations in the literature that draw a distinction between starting and continuation wages. Shaked and Sutton (1984), Thomas and Worrall (1988), MacLeod and Malcomson (1993), and others argued that once a job is formed, wages change only when a shock makes the participation constraints of either side binding. The cyclicality of wages should fall with tenure because many jobs do not hit the participation constraints in response to cyclical shocks. Arozamena and Centeno (2006) built on the common argument that incumbents with long tenures accumulate job-specific capital to give another reason why the cyclicality of wages should fall with tenure.

4. WHAT DO WAGE EQUATIONS SHOW?

The first and most influential studies of cyclical wage stickiness were based on time-series regressions derived either from single-equation or small aggregate models of the economy.19 These studies were stimulated by the controversy between Keynes, his followers, and his critics (in particular Dunlop and Tarshis) about the role of wage stickiness in the business cycle. They continued well into the 1980s. Their findings are mixed. Results are sensitive to the specification used and to the sample period. Time-series data before 1960 show less wage cyclicality than data since 1970. A robust finding of these studies is that whichever way the cyclicality of wages goes, it is not very much; that is, wages are sticky, and may exhibit a limited degree of pro- or countercyclicality depending on time period, deflator used, coverage, and other issues.

19 This is not a comprehensive survey of the empirical literature, but a selective discussion of results that bear directly on the model. For good surveys of the main issues and the main empirical findings, see Brandolini (1995) and Abraham and Haltiwanger (1995).

These time-series studies have been extremely influential in shaping the opinions of macroeconomists about wage stickiness, giving rise to a consensus that made it into most textbooks. But their findings are not relevant to the search and matching model. As we argued, the profit measure that matters for job creation in the search and matching model is the share of a new match claimed by the firm. Given this share, the timing of wage payments is irrelevant, and it is in this spirit that we were able to show that once wages in new matches satisfy the Nash wage rule, job creation also satisfies the formulas of the canonical model, irrespective of what happens to wages in ongoing jobs. Moreover,
even if the distinction between new and old matches is overlooked, the search and matching model is concerned with the cyclicality of wages in individual matches, not the average in the economy as a whole. In this connection there appears to be a strong countercyclical bias in the mean wages analyzed in the aggregate studies, at least during the 1970s and 1980s.20 The bias is due mainly to the fact that low-wage, low-skill workers bear the brunt of cyclical adjustments, and so their weight in aggregate data is bigger in cyclical peaks than in troughs.

20 See Solon, Barsky, and Parker (1994) and the discussion in Abraham and Haltiwanger (1995, Section V).

In view of this, the results of panel regressions of individual workers, or matches, are more relevant to the search and matching model than the results of aggregate studies. These results favor strong procyclicality of wages in new matches.21 Panel studies typically run a wage log change regression for the individuals in their panel on a set of personal characteristics, such as tenure, experience and education, regional or industry dummies, and time dummies. The coefficients on the time dummies are then used in a second regression as the dependent variable with a time trend and a cyclical indicator variable as regressors.

21 The panel studies cover data from 1970 onward and the recession of the 1970s appears to be a particularly procyclical wage episode. The discussion that follows is entirely about wages since the late 1960s; earlier cycles may be different.

Tables II and III summarize the results of individual studies of wage behavior, focusing on studies that draw a distinction between continuing jobs and new matches. The tables give the coefficient estimated in the second regression for the cyclical indicator, which is the change in national unemployment. The numbers given are the annual percentage change in wages when national unemployment falls by 1 percentage point from one year to the next. Figure 2 shows the estimated cyclical component for wages in new and continuing matches from Devereux's (2001) Panel Study of Income Dynamics (PSID) study.22

22 I am grateful to Paul Devereux for making these data available.

Some facts readily emerge. First, the wages of job changers are always substantially more procyclical than the wages of job stayers. The same fact is reflected in studies that draw a distinction between the wages of stayers and the wages of all workers. The wages of all workers are always more procyclical than the wages of job stayers. Second, the wages of job stayers, and even of those who remain in the same job with the same employer (Devereux (2001), Shin and Solon (2006)), are still mildly procyclical. Perhaps surprisingly, there is more procyclicality in the wages of stayers in Europe than in the United States. The procyclicality of job stayers' wages is sometimes due to bonuses, overtime pay, and the like, but it still reflects a rise in the hourly cost of labor to the firm in cyclical peaks.

The cyclical indicator variable used in the panel studies is usually national unemployment, following the lead of Bils (1985).
TABLE II
ESTIMATES OF THE CYCLICALITY OF HOURLY WAGES, UNITED STATESᵃ

Author                            Data               Sample                  Coefficient on −u × 100
Bils (1985)                       NLSY 1966–1980     All (whites/nonwh.)     1.6/1.8
                                                     Stayers                 0.6/0.4
                                                     Changers                3.0/4.0
Shin (1994)                       NLSY 1966–1981     All (whites/nonwh.)     1.7/1.4
                                                     Stayers                 1.2/0.2
                                                     Changers                2.7/3.8
Barlevy (2001)                    PSID 1968–1993     Changers                2.59
                                  NLSY 1979–1993     Changers                3.00
Beaudry and DiNardo (1991)        PSID 1976–1984     All, cont. u            0.7
                                                     All, initial u          0.6
                                                     All, min. u             2.9
                                  CPS 1979, 1983     All, cont. u            0.0
                                                     All, initial u          0.0
                                                     All, min. u             3.1
Grant (2003)                      NLSY 1966–1981     All, cont. u            2.37
                                                     All, initial u          0.60
                                                     All, min. u             2.29
Solon, Barsky, and Parker (1994)  PSID 1968–1987     All men                 1.40
                                                     All women               0.53
                                                     Stayers, men            1.24
Devereux (2001)                   PSID 1970–1991     All                     1.16
                                                     Stayers                 0.81
                                                     Single job holders      0.54
Shin and Solon (2006)             NLSY 1979–1993     All                     1.37
                                                     Stayers                 1.17
                                                     Single job holders      1.13

ᵃ The dependent variable is the annual change in the log of hourly earnings, obtained from the estimated coefficients on annual time dummies in individual wage regressions. Results are for men, unless otherwise stated. Unemployment is national unemployment in percent, except for Barlevy's study, which uses state unemployment. In Beaudry and DiNardo's and Grant's studies the results shown are from regressions with three independent unemployment variables: (i) contemporaneous unemployment; (ii) unemployment at start of job; (iii) lowest unemployment since start of job. Acronyms used: National Longitudinal Surveys of Youths (NLSY); Current Population Survey (CPS).
A consensus estimate of the coefficient in wage regressions for job changers is close to 3; that is, for every percentage point rise in unemployment, the wages in new matches are lower by about 3%. Converting the empirical estimates to an overall impact of the cyclical component of hourly productivity on wages gives results that are very close to the model's predictions.

In the model, the total elasticity of wages with respect to mean productivity is in the range 0.98–1.
TABLE III
ESTIMATES OF THE CYCLICALITY OF HOURLY WAGES, EUROPEᵃ

Author                        Dataᵇ                            Sample                  Coefficient on −u × 100
Devereux and Hart (2006)      U.K. NESPD (admin.) 1975–2001    Stayers (men/women)     1.93/1.93
                                                               Movers within co.       2.28/2.31
                                                               Movers between co.      2.96/2.84
Peng and Siebert (2007)       U.K. BHPS 1991–2004              Stayers                 2.19
                                                               Movers within co.       2.27
                                                               Movers between co.      2.89
                              W. Germany GSOEP 1984–2002       Stayers                 1.61
                                                               Movers within co.       3.43
                                                               Movers between co.      3.44
Peng and Siebert (2006)       N. Italy ECHP 1994–2001          Stayers                 3.60
                                                               Movers within co.       6.63
                                                               Movers between co.      5.61
Carneiro and Portugal (2007)  Portugal QP (admin.) 1986–1998   Stayers (men/women)     1.20/0.85
                                                               New hires (no panel)    2.08/1.78

ᵃ The dependent variable is the annual change in the log of hourly earnings, obtained from the estimated coefficients on annual time dummies in individual wage regressions. Results are for men, except for the studies of Devereux and Hart and Carneiro and Portugal, which report results separately for men and women. The data sets described as admin. are based on employer data; the others are from household surveys. Acronyms used: New Earnings Survey Panel Data (NESPD); British Household Panel Survey (BHPS); German Socio-Economic Panel (GSOEP); European Community Household Panel (ECHP); Quadros de Pessoal (QP).
ᵇ Results for East Germany and South/Central Italy not significant.
If wages in ongoing jobs do not satisfy the Nash sharing rule, this elasticity is approximately the elasticity for initial wages when continuation wages are not too sticky.

[FIGURE 2.—Cyclical wage and productivity changes. Annual deviations from trend (percent), 1971–1991, for productivity, job changers' wages, and job stayers' wages.]
Given the empirical estimates of an unemployment semielasticity, one could argue that to compare the model's prediction with the data, one should calculate the unemployment semielasticity implied by the model and compare it with the estimate. This, however, would not be the correct comparison, for two reasons. First, we deliberately focused on testing the model's performance with respect to the job creation rate, although we argued that the job destruction rate is also important in an overall explanation of the volatility of unemployment. Our quantitative model does not have a complete model of unemployment volatility, so its predictions cannot be expected to match those in the data. Second, even if the model had a complete model of unemployment volatility, the fact that unemployment is one of the model unknowns implies that its prediction of the wage–unemployment semielasticity is a composite of more than one model prediction. In particular, since the driving force is p, we can write

(27)  d log w/du = (d log w/d log p)(du/d log p)^{−1}.

If the model underpredicts du/d log p, then it will overpredict d log w/du even if it made an accurate prediction of the wage elasticity.

For these reasons I evaluate the model's wage predictions by converting the estimated unemployment elasticities to productivity elasticities using actual time-series data.23 I use annual observations for 1948–2006 for the same unemployment variable as in the panel regressions and annual observations for the deviation of productivity from trend to run a simple ordinary least squares regression, which yields a coefficient of −0.34. This result is remarkably robust to small changes in specification, such as using the log change in labor productivity instead of its deviation from Hodrick–Prescott (HP) trend. When the sample is restricted to 1970–1993, as in the panel studies, the coefficient goes up to −0.49. These estimates appear to confirm a stable "Okun law" of hourly productivity on unemployment, although usually Okun's law is between aggregate GDP, which includes the change in hours, and unemployment.

23 In the discussion of results in the empirical literature there is reference to "low" or "high" wage cyclicality through an implicit reference to an Okun-type relation that converts the cyclicality of unemployment to the cyclicality of gross domestic product (GDP). The correctness of this indirect approach is not questioned and standard errors are not reported. I follow a similar approach here, but I also obtain a direct measure of cyclicality for the series in Figure 2 further down in this section.

Applying the estimated Okun coefficients to convert the estimated cyclical impact on the wages of job changers to a wage–productivity elasticity, I find that for the estimated semielasticity of 3 the productivity elasticity of wages is 3 × 0.34 = 1.02, and for the estimates over the 1970–1993 period it is 1.47. These numbers are very close to the predictions of the model for the cyclicality of Nash wages, and if anything they exceed the model's elasticities by a small (but statistically insignificant) margin.
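The Okun-style conversion just described is a one-line calculation; the sketch below simply restates it with the inputs given in the text:

    # Wage-productivity elasticity = unemployment semielasticity of wages
    # times the estimated effect of log productivity on unemployment.
    semielasticity = 3.0                 # consensus coefficient for job changers
    okun_full, okun_70_93 = 0.34, 0.49   # estimated Okun coefficients above

    print(semielasticity * okun_full)    # 1.02, full 1948-2006 sample
    print(semielasticity * okun_70_93)   # 1.47, 1970-1993 subsample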
More support for the volatility of wages in new jobs can be obtained from more direct estimates. I HP-filtered Devereux's PSID data shown in Figure 2 and reran his second-stage regression, obtaining, for job changers,

Δ log w_t = 0.00 − 2.67 Δu_t,   R² = 0.47,
                  (0.65)

The coefficient estimate is very similar to those obtained in the literature (the literature uses a time trend instead of HP filtering). Running a second regression with the data in Figure 2 to obtain a more direct estimate of the productivity elasticity of job changers, I obtain

Δ log w_t = 0.00 + 1.70 Δ log p_{t−1},   R² = 0.33,
                  (0.56)

Both variables are deviations from HP trend.
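For readers who want to replicate the two-stage procedure, the following sketch runs it on synthetic annual data; the series, the HP smoothing parameter (100, a common choice for annual data, not reported in the paper), and the seed are illustrative assumptions:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.tsa.filters.hp_filter import hpfilter

    rng = np.random.default_rng(0)
    T = 21  # 1971-1991, as in Figure 2
    u = 0.06 + 0.01 * np.sin(np.linspace(0, 6, T)) + 0.002 * rng.standard_normal(T)
    dlogw = -2.7 * (u - u.mean()) + 0.005 * rng.standard_normal(T)  # wage changes

    # Detrend both series with the HP filter, then run the second-stage OLS.
    u_cyc, _ = hpfilter(u, lamb=100)
    w_cyc, _ = hpfilter(dlogw, lamb=100)
    res = sm.OLS(w_cyc, sm.add_constant(u_cyc)).fit()
    print(res.params[1], res.bse[1], res.rsquared)  # slope, s.e., R-squared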
The estimated elasticity is above those computed indirectly from the unemployment estimates, but not significantly different. However, the point estimate obtained here is above the predictions of the model for wages in new matches when wages in continuing jobs are not as volatile as in new jobs.

More recently, Haefke, Sonntag, and van Rens (2007) computed a quarterly wage series from the CPS for matches that originate from nonemployment. They used the outgoing rotation group for 1979–2006 and defined new matches as those that involve a worker who declared himself or herself unemployed in one of the preceding three months. This contrasts with the samples in the two tables above, which cover all new matches, including job-to-job transitions. They found that the variance of the wage series for new hires is significantly higher than the variance of wages in ongoing jobs, in contrast to persistence, which is lower. More importantly for the model, they found near proportionality between the wages of new hires and productivity. Their point estimates for the elasticity of wages with respect to productivity in a variety of specifications are about 0.9 or above for new hires and about 0.3 for all workers.24 Their 3 to 1 ratio implies an approximate estimate of k = 0.33 in (26), although given the results of panel studies and the data sources that they used, k = 0.5 is a more plausible figure. With k = 0.33 we get that the model's prediction of the elasticity of wages in new matches is about 0.7, whereas k = 0.5 gives 0.75 for this elasticity. This is below the estimated elasticity of 0.9, but well within the one standard deviation interval.

24 They also argue for strong composition effects, which bias downward any estimate of the coefficient of the log wage on unemployment, as estimated in the panel regressions. Their estimates are based on a mean or median constructed hourly wage series, which behaves very similarly to the Bureau of Labor Statistics (BLS) measure of mean wages.
The evidence shown in Tables II and III, and also the more direct estimates of Haefke, Sonntag, and van Rens (2007), show much less cyclicality in the wages of continuing jobs.25 The wages of continually employed workers increase in cyclical peaks, with an estimated unemployment coefficient of 1 to 1.5. This implies a wage–productivity elasticity of 0.3–0.5.26 The fact that the elasticities in continuing jobs are about half of what they are in new jobs implies that the losses suffered by workers who form new matches in recession are not immediately reversed. However, this is only indirect evidence for this important property. The work of Beaudry and DiNardo (1991) yields more direct evidence on this issue.

25 An issue I did not address at all is taxation. If a firm can get more tax breaks in recession or if overall company taxation is progressive, this gives a reason for more procyclicality in labor costs than estimated in wage equations.
26 Blank (1990), who, unlike much of the literature, used the percent change in GDP as her cyclical indicator, estimated elasticities of that order of magnitude for repeated cross sections of the PSID or panels derived from it.

Beaudry and DiNardo ran the usual set of panel regressions with the PSID and the 1979 and 1982 Pension Supplements of the May CPS, but tried three different unemployment rates as cyclical indicators: contemporaneous unemployment, as in the other studies, unemployment at the time of hire, and the lowest unemployment rate during the tenure of the job. They found that the dominant influence on wages was exerted by the lowest unemployment rate during the job's tenure. The estimated coefficient on this variable implied a unit wage–productivity elasticity. Grant (2003) replicated their results with a different data set—the various cohorts of the NLS—and also found the strongest influence coming from the lowest unemployment rate since the formation of the match, although contemporaneous unemployment was also significant in his estimates.27

27 Similar results regarding the lowest past unemployment were obtained by McDonald and Worswick (1999) for Canada and Bell, Nickell, and Quintini (2002) for the United Kingdom. Devereux and Hart (2007), using the superior New Earnings Survey Panel Data Set for Britain, found some supportive evidence, but they also found that the spot market is more important than in the original Beaudry and DiNardo study.

This evidence is strongly supportive of the argument that outside labor-market conditions exert a strong and asymmetric influence on wage negotiations, because incumbents' wages respond to the most favorable outside labor-market conditions, but do not reverse those gains when labor-market conditions deteriorate. The authors interpret this as evidence in favor of long-term implicit contracts, with the firm shielding wages from adverse outside conditions, and low mobility costs. When outside conditions improve, the firm raises wages to stop the workers from quitting.

Yet more evidence supporting the strong procyclicality of wages was found by Blanchflower and Oswald (1994).28

28 The main objective of Blanchflower and Oswald's (1994) study was to show that there is a "wage curve," a negative relation between real wages and local unemployment. Their main tests use repeated cross sections. Although a wage curve is certainly consistent with the wage equation of the search and matching model, I did not include their study in Table II because their evidence is not about hourly wages, but about annual earnings, it does not distinguish between stayers and job changers, and it does not focus on the cyclical dimension of wages. However, as Card (1995) in his review of Blanchflower and Oswald pointed out, their point estimates are consistent with the estimates of the cyclicality literature and provide further support for cyclicality. In estimates by Card (1995, Table 3), hourly wages in the Blanchflower and Oswald samples also exhibit cyclicality for a variety of worker types and, importantly, the unemployment elasticity of wages doubles for workers who had more than one employer during the previous year (when compared with the wages of workers who had the same employer throughout the year).
They estimated a "wage curve" for industry-aggregated wages across a panel of 19 U.S. manufacturing industries and found that industry profits per employee exert a strong positive influence on total compensation per employee when controlling for industry and time effects. They interpreted this finding as evidence in favor of the bargaining model of wage determination. Their result implies that there is comovement between the cyclicality of profits and the cyclicality of wages, as in the search and matching model, with the cyclicality in profits driving job creation.

5. THE ROLE OF MATCHING COSTS

The results of panel regressions contain one clear message: the wages of workers who change jobs during the year are at least as cyclical as labor productivity, but the wages of those in ongoing jobs are one-third to one-half as cyclical (in terms of the wage–productivity elasticity). The Nash wage equation of search models does not quite match these facts, but we have shown that extensions that distinguish between wages in new and continuing jobs yield results that are a good match, at least for wages in new jobs. The evidence certainly does not support explanations of the unemployment volatility puzzle that imply substantially less volatility in wages in new jobs than in productivity.

I conclude with a discussion of the role of matching costs in the cyclical volatility of tightness. Given that the timing of wage payments is not relevant to the firm's job creation and job destruction decisions, I simplify the modeling by applying the Nash solution to the wage bargain of all workers. However, the wage volatility results obtained from the model should be compared with the estimated wage volatility in new matches only.

Matching costs in the model are of two kinds: the worker's foregone leisure and unemployment income, and the vacancy posting costs of the firm. The worker's foregone costs have been discussed extensively in the context of Hagedorn and Manovskii's (2008) work, and I will refer to them briefly. But the nature of the firm's matching costs has received a lot less attention, although, as I show here, these costs play a critical role in the volatility results.

Tightness does two things in the canonical job creation model: it drives job creation through the matching function and influences the expected cost of hiring a worker. These properties are important and interconnected; one could argue that they are the identifying features of the model.
Whereas the matching function has been the topic of extensive research, the relation between tightness and mean recruitment costs has been much less researched; yet the precise relation between the two plays an important role in the cyclical volatility of tightness. To see why, suppose there is a positive shock to productivity. Firms post more vacancies at cost c each, and through the search externalities, the entry of more vacancies increases the average duration of vacancies, 1/q(θ). The model's assumptions imply that the expected cost of hiring a worker increases in proportion to the increase in the mean duration of vacancies, since it is c/q(θ). The increase in average hiring costs checks the growth in vacancies and so reduces the response of tightness to the productivity shock.

Of course, the original motivation for making the vacancy cost depend on tightness is the realism of the assumption—a firm that expects a vacant job position to remain vacant longer should also expect the total cost of filling that position to be higher (in present-discounted-value terms). However, the proportionality relation was more a matter of convenience in the absence of more information about the precise relation between costs and tightness. Other matching costs, such as training, negotiation, and one-off administrative costs of adding a worker to the payroll, are neglected by the model. I show that if fixed costs of this type are taken into account, the tightness volatility results can change substantially.

To demonstrate this claim, suppose that in addition to the vacancy posting cost of the canonical model, there is a fixed matching cost. For example, it may be a cost of interviewing or negotiating with the worker after she arrives but before she is hired, or it may be a fixed cost of training her after she is hired. The important property that needs to be satisfied by this component of costs is that it is independent of the duration of vacancies.29

29 The argument that follows does not go through if the fixed cost is paid when the vacancy is first created, independently of a worker's arrival. If there is a fixed posting cost K, the Bellman equations and the Nash sharing rule are all the same as in the canonical case, but the zero-profit condition on vacancies is replaced by V = K. This is not enough to give more volatility in tightness. The argument requires that the fixed costs are matching costs, but it is irrelevant whether they are paid before or after the wage bargain takes place. Thus, waiting and negotiation costs before wage agreement play an important role in enhancing cyclicality in the strategic bargaining model of Hall and Milgrom (2008), although in their model the costs need not be paid. Mortensen and Nagypal (2007) emphasized training costs that are paid after the wage bargain takes place as a source of volatility. Rotemberg (2006) assumed that the average cost of posting a vacancy falls in the number of vacancies posted by the firm, which has the same qualitative implications for volatility as fixed costs.

I consider here, within the Nash framework of the canonical model, the implications of adding fixed matching costs to the proportional posting cost of the canonical model. For the purposes of the modeling, I interpret these costs as costs that are paid after the worker who is eventually hired arrives but before the wage bargain takes place.
For example, they may be the costs of finding out about the qualities of the particular worker, of interviewing, and of negotiating with her. They are sunk before the wage bargain is concluded and the worker takes up the position, but this property is not important for volatility, because training costs that are not sunk play a similar role. The attractive feature of making them sunk, however, is that they can be interpreted as a component of the cost of the frictions that characterize search models, so they are an alternative way of calibrating frictions to the conventional proportional cost.

Suppose that when the worker arrives, the firm pays a fixed fee H before the Nash wage is agreed. Because the cost is sunk at the time of the bargain, it is ignored in the Nash bargain equations; its only impact on the formal structure of the model is to introduce a cost of taking up the worker in the vacancy equation. The vacancy equation now becomes30

rV = −c + q(θ)(J − H − V ).

The important property of this reformulation is that the constant posting cost c is now effectively replaced by the cost c + q(θ)H, which falls in tightness. The job creation and wage equations with this vacancy equation become

(28)  (p − w)/(r + s) = c/q(θ) + H,
(29)  w = (1 − β)z + β(p + cθ + f(θ)H).
The intuition for the first condition is obvious: job creation entails two costs, the proportional cost c and the fixed cost H. The fixed cost increases wages with coefficient βf(θ) because, if the negotiation fails, the firm has to pay H when it meets another worker—an event that takes place at rate f(θ). So by staying in the match, the worker saves the firm an expected cost f(θ)H, and wages increase by a fraction β of that saving, by the Nash assumptions.

Equations (28) and (29) are solved for the two endogenous variables θ and w. Table IV gives a sample of results for different combinations of the two hiring costs, c and H, constructed such that the solutions for θ and w are in all cases the same as in the model with H = 0. As before, the notation is εθ for the elasticity of θ with respect to p, calculated from

εθ = (1/η)(p − εw w)/[p − w − (r + s)H],
30 Given that the costs are interpreted as costs needed to find out more about the worker and to negotiate with her, it may be argued that the underlying assumption in this equation is that once the cost is paid, the worker is always recruited. However, it would make no difference to the argument if we introduced a probability φ < 1 that the match is successful after the firm pays the H cost. The vacancy equation would then be rV = −c − q(θ)H + q(θ)φ(J − V ), with only a trivial modification to the job creation condition (28).
TABLE IV
MODEL RESULTS AT DIFFERENT COMBINATIONS OF JOB CREATION COSTS

H      c      εθ     εw     εw*
0      0.36   3.67   0.98   0.88
0.1    0.27   4.18   0.99   0.97
0.2    0.20   4.87   0.99   0.98
0.3    0.11   5.82   1.00   1.00
0.4    0.02   7.25   1.01   1.01
and εw* is the wage elasticity required to raise the θ elasticity to the data point 7.56, calculated from

εw* = (p − η[p − w − (r + s)H]εθ)/w.
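Equations (28) and (29) are simple enough to solve numerically. The sketch below assumes p = 1, z = 0.71, β = η = 0.5, m0 = 0.7, s = 0.036, and a monthly interest rate r = 0.004 (r is not reported in this section); it approximately reproduces the H = 0 row of Table IV, with the elasticity obtained by a small perturbation of p rather than analytically:

    import math
    from scipy.optimize import brentq

    z, beta, m0, r, s = 0.71, 0.5, 0.7, 0.004, 0.036

    def solve(c, H, p=1.0):
        def gap(theta):
            f = m0 * theta ** 0.5                                 # f(theta)
            w = (1 - beta) * z + beta * (p + c * theta + f * H)   # eq. (29)
            return (p - w) / (r + s) - c * theta ** 0.5 / m0 - H  # eq. (28)
        theta = brentq(gap, 1e-9, 50.0)
        f = m0 * theta ** 0.5
        return theta, (1 - beta) * z + beta * (p + c * theta + f * H)

    theta0, w0 = solve(c=0.36, H=0.0)
    theta1, _ = solve(c=0.36, H=0.0, p=1.01)
    print(round(theta0, 2), round(w0, 3))              # ~0.71 and ~0.983
    print(math.log(theta1 / theta0) / math.log(1.01))  # ~3.6, near the 3.67 in the table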
It is clear from the table that as the hiring costs are shifted from the proportional to the fixed component, the volatility of job creation increases, whereas the wage elasticity hardly changes. At very small values of the proportional component, the observed elasticities are consistent with the data. Since we do not have information about how the job creation costs are split between the costs that depend on the duration of vacancies and the costs that do not, we cannot choose one combination over another on the basis of independent evidence. Hall and Milgrom (2008) derived a combination of job creation costs consistent with their strategic bargaining approach that produces results that are very similar to those shown in the bottom two rows of Table IV.

What role does the second type of matching cost—the foregone nonmarket value of the unemployed—play in this model? As noted by Hagedorn and Manovskii (2008), a higher leisure value z can also increase the cyclicality of wages, but as Costain and Reiter (2008) emphasized, this also increases to unreasonable levels the responsiveness of unemployment to changes in unemployment compensation. How unreasonable? At H = 0 the value of z required to match the θ elasticity in the data is about 0.85. I compare the unemployment outcome for two unemployment compensation levels, 0.2 and 0.3. If the value of leisure is 0.45 (i.e., centering it on the plausible overall value of z = 0.7), equilibrium unemployment at 0.2 compensation is 5.2% and at 0.3 it is 6.2%. But if the value of leisure is 0.6 and I recalibrate the model to give the sample unemployment mean at z = 0.85, unemployment at compensation level 0.2 is 4.9% and at 0.3 it is 7.0%. In other words, the impact of unemployment compensation on unemployment doubles.
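The first of the two compensation experiments can be reproduced under the same assumed parameters as in the previous sketch (c = 0.36, H = 0, r = 0.004), with z = 0.45 + b for benefit levels b = 0.2 and 0.3; the z = 0.85 case requires a recalibration that is not spelled out here:

    from scipy.optimize import brentq

    p, beta, m0, r, s, c = 1.0, 0.5, 0.7, 0.004, 0.036, 0.36

    def unemployment(z):
        # Solve (28)-(29) with H = 0 for theta, then u = s / (s + f(theta)).
        def gap(theta):
            w = (1 - beta) * z + beta * (p + c * theta)
            return (p - w) / (r + s) - c * theta ** 0.5 / m0
        theta = brentq(gap, 1e-9, 50.0)
        return s / (s + m0 * theta ** 0.5)

    for b in (0.2, 0.3):
        print(b, round(100 * unemployment(0.45 + b), 1))  # ~5.2 and ~6.2 percent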
Nickell, Nunziata, and Ochel (2005) summarized cross-country econometric evidence and found that a 10 percentage point difference in unemployment compensation is associated with a 1.1 percentage point difference in unemployment. So the canonical model with z = 0.71 gets this response about right, whereas in the case of z = 0.85, the unemployment response is too high.31

31 As we noted in the Introduction, a way to break the tight link between policy effects and cyclical volatilities is to explore nonlinearities in the production function (Hagedorn, Manovskii, and Stetsenko (2008)).

Does the model with H > 0 also suffer from this criticism? The case that matches the θ elasticity in the data with z = 0.71 is shown in the bottom row of Table IV. In this case, unemployment at compensation level 0.2 is 4.75% and at 0.3 it is 6.6%. The impact is nearly 1.9 percentage points, compared with the Nickell, Nunziata, and Ochel estimate of 1.1. It is closer to the data than the 2.1 response at z = 0.85, but still above the data estimate.

6. CONCLUSIONS

The main aim of this paper was to examine the cyclical volatility of wages and its implications for unemployment volatility in the search and matching model. I have shown that the job creation condition that drives the volatility of the job finding rate depends on the wage bargain in new jobs. Even if wages in new jobs only are fixed by the Nash bargaining solution, the volatility of the job finding rate is still the same as in the canonical model, where all wages are fixed by the Nash bargaining solution.

Time-series and panel studies of the cyclical volatility of wages show considerable stickiness, but this evidence is dominated by wages in ongoing jobs and is not relevant for job creation in the search and matching model. An examination of panel data evidence on the volatility of wages in new jobs shows that their volatility is about the same as in the Nash wage equation of the canonical model. It follows that the explanation for the unemployment volatility puzzle—the observation that the response of unemployment to cyclical productivity shocks is bigger than implied by the canonical model—has to be one that preserves the wage elasticities implied by the model.

A simple modification of the model can deliver this result: breaking up the proportional vacancy costs of the model into a proportional vacancy cost and a fixed matching component delivers more volatility in the job finding rate for the same volatility in wages. The fixed component is justified by the existence of one-off negotiating, administrative, or training costs, whereas the proportional component is justified by the advertising and capital idleness costs associated with a vacancy.

REFERENCES

ABRAHAM, K. G., AND J. C. HALTIWANGER (1995): "Real Wages and the Business Cycle," Journal of Economic Literature, 33, 1215–1264. [1340,1355,1356]
ANDOLFATTO, D. (1996): "Business Cycles and Labor Market Search," American Economic Review, 86, 112–132. [1340]
AROZAMENA, L., AND M. CENTENO (2006): "Tenure, Business Cycle and the Wage Setting Process," European Economic Review, 50, 401–424. [1355]
AZARIADIS, C. (1975): "Implicit Contracts and Underemployment Equilibria," Journal of Political Economy, 83, 1183–1201. [1353]
BAILY, N. (1974): "Wages and Employment Under Uncertain Demand," Review of Economic Studies, 42, 37–50. [1353]
BARLEVY, G. (2001): "Why Are the Wages of Job Changers so Procyclical?" Journal of Labor Economics, 19, 837–878. [1357]
BARRO, R. J. (1977): "Long-Term Contracting, Sticky Prices, and Monetary Policy," Journal of Monetary Economics, 3, 305–316. [1340]
BEAUDRY, P., AND J. DINARDO (1991): "The Effect of Implicit Contracts on the Movement of Wages Over the Business Cycle: Evidence From Micro Data," Journal of Political Economy, 99, 665–688. [1354,1357,1361]
BELL, B., S. NICKELL, AND G. QUINTINI (2002): "Wage Equations, Wage Curves and All That," Labour Economics, 9, 341–360. [1361]
BILS, M. J. (1985): "Real Wages Over the Business Cycle: Evidence From Panel Data," Journal of Political Economy, 93, 666–689. [1356,1357]
BLANCHFLOWER, D. G., AND A. J. OSWALD (1994): The Wage Curve. Cambridge, MA: MIT Press. [1361]
BLANK, R. M. (1990): "Why Are Wages Cyclical in the 1970s?" Journal of Labor Economics, 8, 16–47. [1361]
BRANDOLINI, A. (1995): "In Search of a Stylised Fact: Do Real Wages Exhibit a Consistent Pattern of Cyclical Variability?" Journal of Economic Surveys, 9, 103–163. [1340,1355]
BRAUN, H., R. DE BOCK, AND R. DICECIO (2006): "Aggregate Shocks and Labor Market Fluctuations," Working Paper 2006-004A, Federal Reserve Bank of St. Louis. [1343]
CARD, D. (1995): "The Wage Curve: A Review," Journal of Economic Literature, 33, 785–799. [1362]
CARNEIRO, A., AND P. PORTUGAL (2007): "Workers' Flows and Real Wage Cyclicality," Discussion Paper 2604, Institute for the Study of Labor (IZA), Bonn. [1358]
COLE, H., AND R. ROGERSON (1999): "Can the Mortensen–Pissarides Matching Model Match the Business Cycle Facts?" International Economic Review, 40, 933–959. [1340]
COSTAIN, J. S., AND M. REITER (2008): "Business Cycles, Unemployment Insurance, and the Calibration of Matching Models," Journal of Economic Dynamics and Control, 32, 1120–1155. [1342,1365]
DAVIS, S. J., J. C. HALTIWANGER, AND S. SCHUH (1996): Job Creation and Destruction. Cambridge, MA: MIT Press. [1344]
DEN HAAN, W., G. RAMEY, AND J. WATSON (2000): "Job Destruction and the Propagation of Shocks," American Economic Review, 90, 482–498. [1340]
DEVEREUX, P. J. (2001): "The Cyclicality of Real Wages Within Employer–Employee Matches," Industrial and Labor Relations Review, 54, 835–850. [1356,1357]
DEVEREUX, P. J., AND R. A. HART (2006): "Real Wage Cyclicality of Job Stayers, Within-Company Job Movers, and Between-Company Job Movers," Industrial and Labor Relations Review, 60, 105–119. [1358]
——— (2007): "The Spot Market Matters: Evidence on Implicit Contracts From Britain," Scottish Journal of Political Economy, 54, 661–683. [1361]
ELSBY, M. W., R. MICHAELS, AND G. SOLON (2009): "The Ins and Outs of Cyclical Unemployment," American Economic Journal: Macroeconomics, 1, 84–110. [1343]
FUJITA, S., AND G. RAMEY (2009): "The Cyclicality of Separation and Job Finding Rates," International Economic Review (forthcoming). [1343]
GRANT, D. (2003): "The Effect of Implicit Contracts on the Movement of Wages Over the Business Cycle: Evidence From the National Longitudinal Surveys," Industrial and Labor Relations Review, 56, 393–408. [1357,1361]
HAEFKE, C., M. SONNTAG, AND T. VAN RENS (2007): "Wage Rigidity and Job Creation," Discussion Paper 3714, Institute for the Study of Labor (IZA), Bonn. [1341,1360,1361]
HAGEDORN, M., AND I. MANOVSKII (2008): "The Cyclical Behavior of Equilibrium Unemployment and Vacancies Revisited," American Economic Review, 98, 1692–1706. [1341,1342,1351,1352,1362,1365]
HAGEDORN, M., I. MANOVSKII, AND S. STETSENKO (2008): "The Cyclical Behavior of Equilibrium Unemployment and Vacancies With Worker Heterogeneity," Unpublished Paper, University of Pennsylvania. [1342,1366]
HALL, R. E. (2005a): "Employment Fluctuations With Equilibrium Wage Stickiness," American Economic Review, 95, 50–65. [1340,1352]
——— (2005b): "Job Loss, Job Finding, and Unemployment in the U.S. Economy Over the Past Fifty Years," in NBER Macroeconomics Annual, ed. by M. Gertler and K. Rogoff. Cambridge, MA: MIT Press, 101–137. [1343,1344]
HALL, R. E., AND P. MILGROM (2008): "The Limited Influence of Unemployment on the Wage Bargain," American Economic Review, 98, 1653–1674. [1340-1342,1350,1351,1363,1365]
HORNSTEIN, A., P. KRUSELL, AND G. L. VIOLANTE (2005): "Unemployment and Vacancy Fluctuations in the Matching Model: Inspecting the Mechanisms," Federal Reserve Bank of Richmond Economic Quarterly, 91, 19–51. [1339]
LANGOT, F. (1995): "Unemployment and Business Cycle: A General Equilibrium Matching Model," in Advances in Business Cycle Research, ed. by P.-Y. Henin. Berlin: Springer, 287–322. [1340]
MACLEOD, B. W., AND J. M. MALCOMSON (1993): "Investments, Holdup, and the Form of Market Contracts," American Economic Review, 83, 811–837. [1355]
MCDONALD, J. T., AND C. WORSWICK (1999): "Wages, Implicit Contracts, and the Business Cycle: Evidence From Canadian Micro Data," Journal of Political Economy, 107, 884–892. [1361]
MENZIO, G., AND S. SHI (2008): "Efficient Search on the Job and the Business Cycle," Working Paper 08-029, Penn Institute for Economic Research, University of Pennsylvania. [1345]
MERZ, M. (1995): "Search in the Labor Market and the Real Business Cycle," Journal of Monetary Economics, 36, 269–300. [1340]
MORTENSEN, D. T., AND E. NAGYPAL (2007): "More on Unemployment and Vacancy Fluctuations," Review of Economic Dynamics, 10, 327–347. [1339,1341,1363]
MORTENSEN, D. T., AND C. A. PISSARIDES (1994): "Job Creation and Job Destruction in the Theory of Unemployment," Review of Economic Studies, 61, 397–415. [1339,1344]
NICKELL, S., L. NUNZIATA, AND W. OCHEL (2005): "Unemployment in the OECD Since the 1960s. What Do We Know?" Economic Journal, 115, 1–27. [1365]
PENG, F., AND W. S. SIEBERT (2006): "Real Wage Cyclicality in Italy," Discussion Paper 2465, Institute for the Study of Labor (IZA), Bonn. [1358]
——— (2007): "Real Wage Cyclicality in Germany and the UK: New Results Using Panel Data," Discussion Paper 2688, Institute for the Study of Labor (IZA), Bonn. [1358]
PETRONGOLO, B., AND C. A. PISSARIDES (2001): "Looking Into the Black Box: A Survey of the Matching Function," Journal of Economic Literature, 39, 390–431. [1350,1351]
PISSARIDES, C. A. (1985): "Short-Run Equilibrium Dynamics of Unemployment, Vacancies, and Real Wages," American Economic Review, 75, 676–690. [1339]
——— (1986): "Unemployment and Vacancies in Britain," Economic Policy, 3, 499–559. [1343]
——— (2000): Equilibrium Unemployment Theory (Second Ed.). Cambridge, MA: MIT Press. [1346]
ROTEMBERG, J. (2006): "Cyclical Wages in a Search-and-Bargaining Model With Large Firms," Discussion Paper 5791, Centre for Economic Policy Research, London. [1363]
SHAKED, A., AND J. SUTTON (1984): "Involuntary Unemployment as a Perfect Equilibrium in a Bargaining Model," Econometrica, 52, 1351–1364. [1355]
SHIMER, R. (2004): "The Consequences of Rigid Wages in Search Models," Journal of the European Economic Association (Papers and Proceedings), 2, 469–479. [1341]
——— (2005a): "The Cyclical Behavior of Equilibrium Unemployment and Vacancies," American Economic Review, 95, 25–49. [1339,1341,1342,1345,1351]
——— (2005b): "Reassessing the Ins and Outs of Unemployment," Unpublished Paper, University of Chicago. [1343,1350,1351]
SHIN, D. (1994): "Cyclicality of Real Wages Among Young Men," Economics Letters, 46, 137–142. [1357]
SHIN, D., AND G. SOLON (2006): "New Evidence on Real Wage Cyclicality Within Employer–Employee Matches," Working Paper 12262, National Bureau of Economic Research. [1356,1357]
SOLON, G., R. BARSKY, AND J. A. PARKER (1994): "Measuring the Cyclicality of Real Wages: How Important Is Composition Bias?" Quarterly Journal of Economics, 109, 1–25. [1344,1356,1357]
THOMAS, J., AND T. WORRALL (1988): "Self-Enforcing Wage Contracts," Review of Economic Studies, 55, 541–554. [1355]
YASHIV, E. (2007): "Labor Search and Matching in Macroeconomics," European Economic Review, 51, 1859–1895. [1339]
Centre for Economic Performance, London School of Economics, Houghton Street, London WC2A 2AE, U.K. and IZA, Bonn and CEPR, London;
[email protected]. Manuscript received November, 2007; final revision received November, 2008.
Econometrica, Vol. 77, No. 5 (September, 2009), 1371–1401
DECISION MAKERS AS STATISTICIANS: DIVERSITY, AMBIGUITY, AND LEARNING

BY NABIL I. AL-NAJJAR1

I study individuals who use frequentist models to draw uniform inferences from independent and identically distributed data. The main contribution of this paper is to show that distinct models may be consistent with empirical evidence, even in the limit when data increases without bound. Decision makers may then hold different beliefs and interpret their environment differently even though they know each other's model and base their inferences on the same evidence. The behavior modeled here is that of rational individuals confronting an environment in which learning is hard, rather than individuals beset by cognitive limitations or behavioral biases.

KEYWORDS: Learning, statistical complexity, belief formation.
The crowning intellectual accomplishment of the brain is the real world—Miller (1981).
1. INTRODUCTION

WHILE CLASSICAL SUBJECTIVIST DECISION THEORY allows for virtually unlimited freedom in how beliefs are specified, this freedom is all but extinguished in economic modeling. Most equilibrium concepts in economics—be it Nash, sequential, or rational expectations equilibrium—require beliefs to coincide with the true data generating process. As a result, disagreements and differences in beliefs are reduced to differences in information.2 On the other hand, there is no shortage of examples in the sciences, business, or politics where the way individuals look at a problem and interpret the evidence is just as important in determining beliefs as the data on which these beliefs are based.

To capture this and other related phenomena, I study individuals facing the most classical of statistical learning problems, namely inference from independent and identically distributed (i.i.d.) data. These individuals are modeled as classical, frequentist statisticians concerned with drawing uniform inferences that do not depend on prior beliefs. The main contribution of the paper is to show that distinct models can be consistent with the same empirical evidence, even asymptotically when data increases without bound.
1 I am grateful to a co-editor and four referees for extensive and thoughtful feedback that substantially improved the paper. I also thank Drew Fudenberg, Ehud Kalai, Peter Klibanoff, Nenad Kos, Charles Manski, Pablo Schenone, and Jonathan Weinstein for their comments. I owe a special debt to Lance Fortnow and Mallesh Pai without whom this project would not have even started.
2 In games with incomplete information, this also requires the common prior assumption, which dominates both theoretical and applied literatures.

© 2009 The Econometric Society   DOI: 10.3982/ECTA7501
Individuals may then hold different beliefs and interpret their environment differently even though they know each other's model and base their inferences on identical data.

Decision makers are assumed to be as rational as anyone can reasonably be. But rationality cannot eliminate the constraints inherent in statistical inference—any more than it can eliminate other objective constraints like lack of information. The approach advocated in this paper is to model rational individuals as seeking uniform, distribution-free inferences in environments where learning is hard. No appeal to computational complexity, cognitive limitations, or behavioral biases is made.

What makes learning hard? It is intuitive that two individuals with common experience driving on U.S. highways will agree on which side of the road other drivers will use. It is far less obvious that two nutritionists, exposed to a large common pool of data, will necessarily reach the same theories about the impact of diet on health. These, and countless other examples like them, suggest that some learning problems can be vastly more difficult than others. It is, however, not at all clear what this formally means: learning the probability of any event in an i.i.d. setting is equivalent to learning from a sequence of coin flips. This is so regardless of how "complicated" the event, the true distribution, or the outcome space is.

Focusing on learning probabilities one event at a time misses the point, however. Decision making is, by definition, about choosing from a family of feasible acts. From a learning perspective, this raises the radically different and difficult problem of using one sample to learn the probabilities of a family of events simultaneously. In this paper, I use the theory of uniform learning, also known as Vapnik–Chervonenkis theory, as the formal framework to model intuitive concepts like "a learning problem is hard" or "a set of events is statistically complex."3

3 This theory occupies a central role in modern statistics, but is relatively unknown to economic theorists. Two exceptions I am aware of are Kalai (2003) and Salant (2007). Section 2.5 provides a brief, self-contained exposition.

In Section 2, I introduce the idea that decision makers use frequentist models to interpret evidence.4 Theorem 1 identifies the essential tension between the amount of data available and the richness, or statistical complexity, of the set of events evaluated by the decision maker. Learning is straightforward in settings like repeated i.i.d. coin flips, where data is abundant and the set of alternatives to choose from is narrowly defined. In this case, frequentist, Bayesian, and just about any other sensible inference agree.

4 The term "model" in this paper is used to refer both to the models used by decision makers to learn from their environments and to our formal description of that environment. The intended meaning will be clear from the context.

More interesting are situations where data is scarce relative to the statistical complexity of the set of alternatives being evaluated. Learning is hard in the impact-of-diet-on-health problem because we are concerned with learning
about many events simultaneously—namely how different diets affect individuals with different characteristics. In this case, a decision maker compensates for the scarcity of data by limiting inference to a statistically simple family of events.5 Beliefs, which are pinned down only on a subset of events, are thus statistically ambiguous. As a result, different individuals with different models may draw different inferences and hold different beliefs based on the same data.

5 A Bayesian decision maker, on the other hand, draws inferences about all events (by updating), but what he learns is highly sensitive to his prior.

In Section 3, I turn to asymptotic properties of uniform learning as data increases without bound. On a practical level, large sample theories permit greater tractability and clearer intuitions. Another motivation is that equilibrium notions in economics are usually interpreted as capturing insights about steady-state or long-run behavior. A theory of learning in which statistical ambiguity is nothing more than a passing phenomenon will have little to say about steady-state behavior.6

6 This point appears in Bewley (1988), who introduced the notion "undiscoverability" to capture the idea of stochastic processes that cannot be learned from data. His model and analysis are quite different from what is reported here.

Theorem 3 shows that the known theory of uniform learning has no bite in the limit. Specifically, in standard outcome spaces, which I take to be complete separable metric spaces with countably additive probabilities, all statistical ambiguity disappears in the limit. In these spaces, the tension between the availability of data and statistical complexity disappears. This is at odds with the central role this tension plays in finite settings in distinguishing between simple and hard learning problems.

I argue that the asymptotic elimination of statistical ambiguity in standard outcome spaces is a consequence of implicit structural restrictions these spaces impose. For example, the Borel events on [0, 1] are defined in terms of a topology that embeds a notion of similarity between outcomes. By restricting learning to Borel events, we in effect overcome statistical ambiguity through a substantive similarity assumption that should be made explicit, rather than built into the mathematical structure of the model. To model a structure-free environment, I consider an arbitrary set of outcomes with the algebra of all events and all finitely additive probability distributions. Like the finite outcome case, this model is free from any inductive biases involving notions of distance, ordering, or similarity. In Theorem 4, I show that statistical ambiguity persists in the form of a set of probability measures representing beliefs that are not contradicted by data. Finally, in Section 4, I show how this set of beliefs can be integrated into standard models of decision making.
2. UNIFORM LEARNING AND CONSISTENCY WITH EMPIRICAL EVIDENCE

2.1. Basic Setup

A decision maker uses i.i.d. observations to learn about the unknown probability distribution on a set of outcomes. My focus is on statistical inference and belief formation; decision making is discussed in Section 4. To better convey the motivation, this section focuses on finite settings, where both the set of outcomes and the amount of data are finite.

BASIC MODEL—Finite Outcome Spaces (X_f, 2^{X_f}, P_f): X_f is a finite set, the set of events is the set of all subsets 2^{X_f}, and P_f is the set of all probability measures.

The decision maker bases his inference on repeated i.i.d. samples from P ∈ P_f. Formally, let S denote the set of all infinite sequences of elements in X_f, interpreted as outcomes of infinite sampling. Under P, i.i.d. sampling corresponds to the product probability measure P^∞ on (S, 𝒮), where 𝒮 is the σ-algebra generated by the product topology. For an element s = (x_1, x_2, …) ∈ S, let s^t denote the finite sample that consists of the first t observations from s. When discussing finite outcome spaces, I will assume that data is limited to finite samples of t observations.

2.2. Informal Motivation and Intuition

A decision maker is interested in learning the probabilities of events A ⊂ X_f. This decision maker does not have a prior belief about the probabilities of various events in X_f, but seeks instead uniform, that is, distribution-free, inferences about these probabilities.7 To be more specific, the decision maker observes a sequence s^t = (x_1, …, x_t) drawn i.i.d. from the true but unknown distribution P.8 Define the empirical frequency of the event A relative to s^t by

$$\nu^t(A, s) \equiv \frac{\#\{i : x_i \in A,\ i \le t\}}{t}, \tag{1}$$

where # denotes the cardinality of a finite set. The weak law of large numbers implies that the probability of any event can be estimated uniformly over all distributions when t is large. This can be stated formally as:9

7 A discussion of the difficulties with the Bayesian procedure of starting with a prior and updating it using the data can be found in the working paper version of this paper.

8 Many decision problems may be usefully modeled as stationary, while some nonstationary problems become stationary in a richer outcome space. In any event, if the underlying distribution is nonstationary, then one would expect learning to be even harder, and for reasons quite distinct from those we wish to emphasize here. In particular, failure of learning would hold a fortiori in nonstationary settings where the object to be learned is constantly changing.

9 This follows directly from Billingsley (1995, p. 86) applied to the indicator function of A.
LEMMA 1: For every ε > 0 there is an integer t such that, for all A ⊂ X_f and all P ∈ P_f,

$$P^\infty\{s : |P(A) - \nu^t(A, s)| < \varepsilon\} > 1 - \varepsilon. \tag{2}$$

Note that the sample size t in the lemma is independent of P, A, and #X_f (in fact, the lemma also holds for infinite outcome spaces). Inference about any single event is equivalent to learning from independent coin tosses, so it is not meaningful to talk about an event A being simple or complicated if all we are concerned about is learning the probability of A in isolation.

Choice involves, almost by definition, evaluating many acts simultaneously. To appreciate this point, define the set of ε-good samples for an event A as

$$\mathrm{Good}^t_{\varepsilon P}(A) \equiv \{s : |P(A) - \nu^t(A, s)| < \varepsilon\}.$$

This is the set of representative samples for the event A, that is, those samples on which the empirical frequency of A is close to the true probability. Suppose now that the decision maker is choosing between bets f_i, i = 1, …, I, with f_i paying 1 if the event A_i occurs and 0 otherwise. To make the problem interesting, use Lemma 1 to assume t large enough so that P^∞[Good^t_{εP}(A_i)] > 1 − ε for each A_i. This says that we can accurately estimate the expected payoff of each bet f_i, but says little about accurately comparing these expected payoffs. The latter is possible only at samples that are representative for all of the events A_1, …, A_I simultaneously. That is, what we need is for the probability

$$P^\infty\left[\bigcap_i \mathrm{Good}^t_{\varepsilon P}(A_i)\right] \tag{3}$$

to be large. Our assumption that P^∞[Good^t_{εP}(A_i)] > 1 − ε for each A_i only ensures that the probability of the intersection in (3) is at least 1 − Iε, a conclusion that quickly becomes useless as the number of events being compared increases.

Roughly, a family of events {A_1, …, A_I} is statistically simple if the sets of samples Good^t_{εP}(A_i), i = 1, …, I, overlap, so that if each event has high probability, then so would their intersection. In this case, the amount of data needed to learn the entire family is not larger than what is needed to learn any one of its members in isolation. By contrast, a family of events {A_1, …, A_I} is statistically complex if the intersection ∩_i Good^t_{εP}(A_i) has low probability, even though there is enough data to guarantee that each set of samples Good^t_{εP}(A_i) has probability at least 1 − ε. In this case, learning the entire family requires considerably more observations than what would have been sufficient to learn any one of its members.

What determines whether a family of sets is statistically simple or complex? The answer is supplied by the beautiful and powerful theory of Vapnik and Chervonenkis (1971), which characterizes the learning complexity of a family of events. Section 2.5 provides a brief account of this theory.
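A small simulation may help fix ideas. The following sketch is purely illustrative and not from the paper: the uniform true distribution, the randomly drawn events, and all parameter values are invented. It estimates empirical frequencies as in (1) and then checks how often a single sample of size t is representative for one event versus for many events simultaneously, as in (3).

import random
from collections import Counter

random.seed(0)
n_outcomes = 100      # X_f = {0, ..., 99}
t = 400               # sample size
eps = 0.05            # accuracy level in the Good sets
I = 100               # number of events evaluated simultaneously

outcomes = range(n_outcomes)
# Events A_1, ..., A_I: random halves of the outcome space.
events = [random.sample(outcomes, n_outcomes // 2) for _ in range(I)]
true_prob = [len(A) / n_outcomes for A in events]  # P(A) under the uniform P

n_reps = 500
good_one = good_all = 0
for _ in range(n_reps):
    counts = Counter(random.choices(outcomes, k=t))
    # nu^t(A_i, s) as in (1): empirical frequency of each event in the sample
    errs = [abs(sum(counts[x] for x in A) / t - p)
            for A, p in zip(events, true_prob)]
    good_one += errs[0] < eps      # s lies in Good(A_1)
    good_all += max(errs) < eps    # s lies in the intersection of all Good(A_i)

print(f"share of samples in Good(A_1):             {good_one / n_reps:.2f}")
print(f"share in the intersection over {I} events: {good_all / n_reps:.2f}")

The first share is close to 1, while the second is visibly smaller: accuracy event by event does not deliver accuracy for the whole family at once.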
2.3. Uniform Learning

DEFINITION 1—Uniform Learnability: A family of subsets C ⊂ 2^{X_f} is ε-uniformly learnable by data of size t, ε > 0, if for all P ∈ P_f,

$$P^\infty\left\{s : \sup_{A \in \mathcal{C}} |P(A) - \nu^t(A, s)| < \varepsilon\right\} > 1 - \varepsilon. \tag{4}$$

C is uniformly learnable if for every ε ∈ (0, 1) there is t such that (4) holds.

The crucial aspect of the definition is that learning is uniform over the events, so the sup over A ∈ C is inside the probability statement. This comes at the expense of limiting the scope of learning to a subset of events C. The probability being evaluated in (4) is that of samples that are representative for all events in C simultaneously, that is, samples at which the empirical frequency of each event A ∈ C is close to its true probability. This suggests the following definition:

DEFINITION 2: A (feasible) model is a triple (C, ε, t), where C is ε-uniformly learnable with data of size t.

For each event A, think of ν^t(A, s) as a point estimate of P(A) and of ε as the size of a confidence interval around ν^t(A, s). Extending this intuition, we define

$$\mu^t_{\mathcal{C}\varepsilon}(s) = \left\{p \in \mathcal{P}_f : \sup_{A \in \mathcal{C}} |p(A) - \nu^t(A, s)| \le \varepsilon\right\} \tag{5}$$

as the set of distributions consistent with empirical evidence. A probability measure that does not belong to μ^t_{Cε}(s) is one that can be rejected with high confidence as inconsistent with the data.

In a model (C, ε, t), we shall interpret the collection of events C and the degree of confidence ε as reflecting the decision maker’s model of his environment. The amount of available data t, on the other hand, is an objective constraint. The feasibility of a model, by itself, is a hopelessly weak criterion; it is, for instance, trivially satisfied when C = ∅ or ε = 1. It is normatively compelling to think of the decision maker as selecting models that are maximal in the sense that they do not overlook additional inferences that could have been drawn using the same amount of data t. Concerns about maximality are orthogonal to the main results of this paper. The interested reader will find a formal treatment of these ideas in the working paper version.
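To make the object in (5) concrete, here is a brute-force sketch on a three-outcome space: it searches a grid on the probability simplex and keeps every distribution within ε of the empirical frequencies on C. The sample counts, the family C, and the grid step are hypothetical choices, not taken from the paper.

import itertools

t = 60
sample_counts = {0: 24, 1: 21, 2: 15}        # a hypothetical sample s^t
C = [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}]  # the family of events C
eps = 0.1
step = 0.01

def nu(A):
    """Empirical frequency nu^t(A, s) of the event A in the sample."""
    return sum(sample_counts[x] for x in A) / t

consistent = []
grid = range(int(1 / step) + 1)
for i, j in itertools.product(grid, repeat=2):
    p = (i * step, j * step, 1 - (i + j) * step)
    if p[2] < 0:
        continue
    # keep p if sup_{A in C} |p(A) - nu^t(A, s)| <= eps, as in (5)
    if all(abs(sum(p[x] for x in A) - nu(A)) <= eps for A in C):
        consistent.append(p)

print(f"grid points in the set (5): {len(consistent)}")
print("one member:", consistent[0])

The output is a whole region of distributions, not a point: with limited data, many distributions survive the confrontation with the evidence.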
2.4. Learning, Scarcity of Data, and the Order of Limits

A central question of this paper is “What might lead individuals to adopt a model (C, ε, t) that involves a coarse representation of the true environment?” Here, “coarse” means a model where C ⊊ 2^{X_f}. Our aim is to answer this question without appealing to bounded rationality or behavioral biases. We envision instead a frequentist decision maker with no prior beliefs who desires to draw uniform inferences from limited data. When data is scarce, the criterion of uniform learnability captures the intuition that a decision maker may limit the richness of the set of events he draws inferences about, possibly to a set much smaller than the power set 2^{X_f}.10 Formally:

THEOREM 1: (i) For every X_f and ε > 0, there is t̄ such that 2^{X_f} is ε-uniformly learnable with data of size t ≥ t̄.
(ii) For every t, ε > 0, and α > 0, there is n̄ such that #X_f > n̄ implies

$$\frac{\#\mathcal{C}}{\#2^{X_f}} < \alpha$$

for any C that is ε-uniformly learnable with data of size t.

Part (i) reflects a setting where data is plentiful: one fixes the finite outcome space X_f and then, taking the amount of data to infinity, guarantees uniform learning of the power set. Part (ii) reflects situations where data is scarce relative to the richness of X_f. In this case, the set of events that can be uniformly learned is a small fraction of the set of all events.

To further clarify these points, we note that the bound (2) corresponds to a statistical experiment in which a new sample of t observations is drawn to evaluate each event A, potentially requiring a preposterous amount of data when the number of events to be evaluated is large. The uniform learning criterion (4), on the other hand, requires the set of representative samples to be the same for all events in C. It therefore corresponds to a statistical experiment in which inference is based on one shot at sampling t observations. When data is scarce, this forces the decision maker to restrict attention to a narrower, statistically simple family of events.

10 A numerical example may help the reader appreciate the difficulty: suppose there are z1 binary attributes that define an individual’s characteristics, z2 binary attributes that define diet characteristics, and z3 binary attributes that define health consequences, so the cardinality of the finite outcome space is 2^{z1+z2+z3}. For entirely conservative values of, say, z1 + z2 + z3 = 50, the cardinality of the set of events is the incomprehensibly large number 2^{2^{50}}. While learning the probability of any single event may require only a manageable amount of data, uniformly learning the probability of all events would require an amount of data that is in the realm of fantasy—even by the standard of idealized economic models. For example, with ε = 0.01, using (A.8) in the Appendix, a lower bound on the required number of observations is 3.5 × 10^{15}, which is of the same order of magnitude as the estimated number of minutes since the Big Bang (roughly, 7.35 × 10^{15}).
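The magnitudes in footnote 10 can be checked with a few lines of arithmetic using the lower bound (A.8) from the Appendix. The age-of-the-universe figure below (roughly 13.8 billion years) is an outside assumption, not something the paper supplies.

from math import log10

n_attr = 50                 # z1 + z2 + z3 binary attributes
n_outcomes = 2 ** n_attr    # #X_f = 2^50 outcomes
# The number of events is 2^(2^50); report its size through log10:
print(f"#events = 2^(2^50) ~ 10^({log10(2) * n_outcomes:.2e})")

# Lower bound (A.8) with VC dimension = #X_f and eps = 0.01:
eps = 0.01
t_lower = (n_outcomes - 1) / (32 * eps)
print(f"observations needed   >= {t_lower:.2e}")      # ~3.5e15

# Minutes since the Big Bang, assuming an age of ~13.8e9 years:
minutes = 13.8e9 * 365.25 * 24 * 60
print(f"minutes since Big Bang ~ {minutes:.2e}")      # ~7.3e15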
To sum up, when data is scarce, individuals seeking uniform inference restrict the scope of the events and acts they consider.11 For example, an investor may rely on macroeconomic or finance theories to restrict the distributions of returns. But despite decades of extensive and commonly shared evidence, even the best theories in these fields leave ample room for disagreement, as seen daily in conflicting policy recommendations, forecasts, and investment strategies. In environments like these, one is left with considerable freedom to choose which set of nonfactual, theoretically based restrictions to impose. It should therefore not be surprising that rational individuals may disagree even when facing identical information.

2.5. Vapnik–Chervonenkis Theory

Uniform learning can be given an elegant and insightful characterization using the theory of Vapnik and Chervonenkis. Since the concepts that follow can be defined for outcome spaces of any cardinality, let X be an arbitrary (possibly infinite) set. The key concept of the theory is the shattering capacity of a family of sets C. In the remainder of the paper, assume that C is closed under complements. Define the nth shatter coefficient of such a C to be

$$s(\mathcal{C}, n) = \max_{\{x_1,\ldots,x_n\} \subset X} \#\{A \cap \{x_1,\ldots,x_n\} : A \in \mathcal{C}\}.$$

Here, interpret {x_1, …, x_n} as a potential sample drawn from X. Then #{A ∩ {x_1, …, x_n} : A ∈ C} is the number of subsets that can be obtained by intersecting the sample with some member of C. The shatter coefficient s(C, n) is a measure of the complexity of C. Clearly, s(C, n) ≤ 2^n. The Vapnik–Chervonenkis (or VC) dimension of C is

$$V_{\mathcal{C}} \equiv \max\{n : s(\mathcal{C}, n) = 2^n\}.$$

If there is no such n, we write V_C = ∞. In words, the VC dimension is the largest cardinality n such that there exists a set of n points that can be shattered by C. The central result in statistical learning theory is given by the following theorem.

VC THEOREM: A family of events C ⊂ 2^X is uniformly learnable if and only if it has finite VC dimension.

11 Al-Najjar and Pai (2008) study coarse decision making along these lines, spelling out in greater detail the relationship between uniform learning and the problem of overfitting. Their paper then applies the framework to cognitive phenomena, like rules of thumb, categorization, linear orders, and satisficing, that appear anomalous from a Bayesian perspective.
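As an illustration of these definitions (anticipating the half-intervals of Example 1 and footnote 19 below), the following sketch computes shatter coefficients by brute force for the class of half-intervals [0, r] and their complements; the random point configurations are arbitrary choices made here for illustration.

import random

random.seed(1)

def cut_subsets(points):
    """Distinct subsets of `points` of the form A intersect points, A in C."""
    pts = sorted(points)
    mids = [-0.1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [1.1]
    out = set()
    for r in mids:                                    # thresholds realize every cut
        out.add(frozenset(x for x in pts if x <= r))  # A = [0, r]
        out.add(frozenset(x for x in pts if x > r))   # A = complement (r, 1]
    return out

for n in range(1, 6):
    best = 0
    for _ in range(200):                # maximize over point configurations
        best = max(best, len(cut_subsets([random.random() for _ in range(n)])))
    print(f"n = {n}: s(C, n) = {best}   (2^n = {2 ** n})")
# s(C, n) equals 2^n only for n <= 2 and grows linearly afterwards, so V_C = 2.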
A version of the VC Theorem appeared in Vapnik and Chervonenkis (1971).12 For a textbook treatment, see Theorems 12.5 and 13.3 in Devroye, Gyorfi, and Lugosi (1996). A consequence of the theorem, stated as Equation (A.7) in the Appendix, relates the speed of learning C to its VC dimension.

For a family of events C to have a small VC dimension means that it is not “too rich” to be uniformly learned. But the cardinality of a family C has at best a tangential relationship to its statistical complexity. For example, the family of half-intervals appearing in Example 1 below is uncountable, yet it has a VC dimension of 2 and thus is easy to learn.

3. LARGE SAMPLE THEORY

I now turn to the asymptotic properties of uniform learning as the amount of data increases to infinity. There are at least three reasons why large sample theory is important. First, large samples make it possible to provide sharp definitions of concepts like statistical ambiguity and indeterminacy of beliefs. Second, one would like to know whether statistical ambiguity is robust in the limit as the amount of available data increases. Finally, equilibria in economic and game theoretic models are often viewed as steady states that arise as limits of learning processes.

To introduce scarcity of data and statistical ambiguity in the limit, I consider two models of infinite outcome spaces.

MODEL 1—Continuous Outcome Spaces (X_c, B, P_c): X_c is a complete separable metric space, the set of events is the family of Borel sets B, and P_c is the set of countably additive probability measures on B. A prototypical example is the continuum [0, 1] with the usual metric topology.13

MODEL 2—Discrete Outcome Spaces (X_d, 2^{X_d}, P_d): X_d is an arbitrary infinite set with the discrete topology, the set of events is the set of all subsets 2^{X_d}, and P_d is the set of finitely additive probability measures on 2^{X_d}. These spaces are “discrete” in the sense that there is no extraneous metric or measurable structure that restricts the set of events or probabilities.

As before, samples are drawn according to the product probability measure P^∞ on (S, 𝒮), where 𝒮 is the σ-algebra generated by the product topology on the set of infinite samples S.14

12 An English translation of an earlier paper in Russian.

13 Any continuous outcome space is, in a sense, equivalent to a subset of [0, 1] with the metric topology, hence the use of the term “continuous” in describing these spaces. See Royden (1968, Theorem 8, p. 326) and the proof of Theorem 3.
Lemma 1, the concepts of uniform learning, shattering, the VC dimension, and the VC Theorem all apply without change to infinite outcome spaces, assuming finite samples.

3.1. Exact Learning and Statistical Ambiguity

To minimize repetition, in this subsection I use (X, Σ, P) to stand for either the continuous or the discrete outcome model. As the decision maker is given more data, he can sharpen his model by either decreasing ε, enlarging the family of events C, or both. We formalize this using the notion of a learning strategy:

DEFINITION 3: A learning strategy is a sequence {(C_n, ε_n, t_n)}_{n=1}^∞ of models that satisfy the following conditions:
(i) ε_n → 0;
(ii) C_n ⊆ C_{n+1} for every n;
(iii) C_n is an ε_n-uniformly learnable family by data of size t_n.
The learning strategy is simple if there is n̄ such that C_n = C_{n+1} for every n ≥ n̄.

As more data become available, the set of models that can be uniformly learned increases. Simple strategies increase confidence while holding C constant. Given a learning strategy σ = {(C_n, ε_n, t_n)}_{n=1}^∞ and infinite sample s, the set of beliefs consistent with empirical evidence is

$$\mu_\sigma(s) \equiv \left\{p : \forall n,\ \lim_{t\to\infty} \sup_{A \in \mathcal{C}_n} |p(A) - \nu^t(A, s)| = 0\right\}.$$

The next theorem is a law of large numbers for limiting beliefs: on a “typical” sample, any probability distribution p ∈ μ_σ(s) assigns to each event A ∈ ∪_n C_n a probability equal to its true probability:

THEOREM 2—Exact Learning: Fix any learning strategy σ = {(C_n, ε_n, t_n)}_{n=1}^∞ and write C_σ = ∪_n C_n. Then for any P ∈ P,

$$\mu_\sigma(s) = \{p : p(A) = P(A)\ \forall A \in \mathcal{C}_\sigma\} \quad P^\infty\text{-a.s.}^{15} \tag{6}$$

In particular, μ_σ(s) is a nonempty, convex set of probability measures, almost surely.

14 These are standard concepts in the case of X_c. Appendix A.1 provides the requisite background to cover the less familiar case of (X_d, 2^{X_d}).

15 When P is only finitely additive, the notation P^∞ denotes the strategic product of P. See Appendix A.1 for details.
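A simulation in the spirit of Theorem 2: along a single i.i.d. sample path, the sup-deviation of empirical frequencies from true probabilities over a uniformly learnable family (here the half-intervals of Example 1 below) shrinks as t grows. The true distribution (the law of U² for U uniform, so P([0, r]) = √r) and the evaluation grid are invented for illustration.

import random
from bisect import bisect_right
from math import sqrt

random.seed(2)
s = [random.random() ** 2 for _ in range(100_000)]   # a single sample path

for t in (100, 1_000, 10_000, 100_000):
    prefix = sorted(s[:t])
    # approximate sup over half-intervals [0, r] on a grid of r values
    dev = max(abs(bisect_right(prefix, r) / t - sqrt(r))
              for r in (i / 500 for i in range(501)))
    print(f"t = {t:>6}: sup_r |nu^t([0,r], s) - P([0,r])| ~ {dev:.4f}")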
The main challenge in proving Theorem 2 is to show that it holds for finitely additive probabilities, as required in Section 3.3 below.

Knowledge of the probabilities of events in C_σ may have implications for events outside C_σ. For instance, if we know the probabilities of two disjoint events A, B ∈ C_σ, then we can unambiguously deduce the probability of the event A ∪ B even if it does not belong to C_σ. To make this formal, call a function p : C_σ → [0, 1] a partial probability if it is the restriction to C_σ of some probability measure p̄ on Σ.16 We can now give a formal definition of statistical ambiguity:

DEFINITION 4: An event A ∈ Σ is (statistically) unambiguous relative to C_σ if, for any partial probability p on C_σ and any two extensions p′ and p″ of p to Σ, p′(A) = p″(A). Let C̄_σ denote the set of all statistically unambiguous events (relative to C_σ).17

In light of the definition and Theorem 2, we may therefore conclude that

$$\mu_\sigma(s) = \{p : p(A) = P(A)\ \forall A \in \overline{\mathcal{C}}_\sigma\} \quad P^\infty\text{-a.s.} \tag{7}$$

Determinacy of beliefs can be defined in terms of the existence of learning strategies that eliminate statistical ambiguity:

DEFINITION 5: Beliefs are (asymptotically) determinate if there is a learning strategy σ such that C̄_σ = Σ. That is, under the strategy σ, for every P ∈ P,

$$\mu_\sigma(s) = \{P\} \quad P^\infty\text{-a.s.} \tag{8}$$

The question we turn to next is the determinacy of beliefs in the continuous versus the discrete model.

3.2. Determinacy of Beliefs in Continuous Outcome Spaces

The following theorem shows that no meaningful indeterminacy persists in the continuous model in the limit:

THEOREM 3: In the continuous model (X_c, B, P_c), beliefs are determinate via a simple learning strategy.

16 A more direct condition defining partial probabilities was identified by Horn and Tarski (1948). See also Bhaskara Rao and Bhaskara Rao (1983, Definition 3.2.2).

17 While C_σ need not have any particular structure, C̄_σ is easily seen to be a λ-system, that is, a family of events closed under complements and disjoint unions (Billingsley (1995)). For example, the set C of half-intervals in Example 1 is not closed under unions, but C̄ = B. The importance of λ-systems in the study of ambiguity was, to my knowledge, first pointed out by Zhang (1999).
That is, there is always a simple strategy that can “learn” the true distribution. The following example illustrates the theorem when X_c = [0, 1]:

EXAMPLE 1: Let X_c = [0, 1], B the Borel sets on [0, 1], and P the set of countably additive probabilities on B. Consider the set C of half-intervals [0, r], r ∈ [0, 1], and their complements. Let σ be the simple strategy with C_n = C for each n. Then: (i) C is uniformly learnable; (ii) agreement on C implies agreement on all Borel sets. By Theorem 2, any p ∈ μ_σ(s) must agree with the true P on C almost surely. Therefore p and P define identical distribution functions, hence identical probability measures on B.18

There are two distinct learning principles at play in this example:
• Statistical Learning: The set of half-intervals in [0, 1] is uniformly learnable. This is the classical Glivenko–Cantelli theorem.19
• Deduction: The half-intervals are sufficient to determine beliefs on all Borel events.

18 Note that B itself is not uniformly learnable. This can be easily seen from the fact that B has infinite VC dimension. What matters for eliminating disagreements in the limit is that there is a uniformly learnable family (the half-intervals) that is sufficient to determine beliefs on B.

19 The Glivenko–Cantelli theorem states that the empirical distribution function converges to the true distribution function uniformly, almost surely. This theorem follows from the Vapnik–Chervonenkis theorem by noting that the half-intervals have a VC dimension of 2. To see this, any pair of points x1 < x2 can be shattered by C, so V_C ≥ 2. Given any set of three points x1 < x2 < x3, intersections with elements of C generate the sets {x1}, {x3}, {x1, x2}, and {x2, x3}, but no intersection can generate the singleton set {x2}. Since no set with three points can be shattered, we have V_C = 2.

The theorem generalizes the intuition in Example 1 by showing that any complete separable metric space contains a uniformly learnable family that determines beliefs in the limit. As shown in the proof, a belief-determining family can be found whose structure is similar to that of half-intervals. It is difficult to think of bounded rationality reasons that would prevent a decision maker from using simple learning procedures like these.

Theorem 3 reveals that continuous outcome spaces fail to capture the limiting behavior of finite settings, where indeterminacy of beliefs is natural. In the next subsection, I will argue that the conclusion of Theorem 3 is an artifact of the structure of X_c, which distorts the learning problem by restricting the sets of permissible events and distributions. These restrictions are artificial in the sense that they have no counterparts in finite models.

3.3. Indeterminacy of Beliefs in Discrete Outcome Spaces

In this section, I consider asymptotic learning in the discrete outcome space (X_d, 2^{X_d}, P_d). First we need the following definition:
DEFINITION 6: Beliefs are (asymptotically) indeterminate if there exists P such that for every learning strategy σ,

$$\mu_\sigma(s) \ne \{P\} \quad P^\infty\text{-a.s.}$$

Indeterminacy is stronger than the negation of determinacy in two ways. First, the quantifiers are reversed: one can find a single “difficult-to-learn” distribution P that cannot be identified from the data regardless of the learning strategy used. Second, the failure to identify P occurs with probability 1, rather than just with positive probability. The relationship with statistical ambiguity is that if beliefs are indeterminate, then there are (statistically) ambiguous events under any learning strategy.

THEOREM 4: Beliefs are statistically indeterminate in any discrete outcome space (X_d, 2^{X_d}, P_d).

To compare Theorems 3 and 4, note first that statistical inference in X_d works just as it did in continuous outcome spaces. What changes here is that beliefs on 2^{X_d} are no longer determined by a uniformly learnable C. The proof builds on a fundamental combinatorial result, known as Sauer’s lemma, that bounds the cardinality of uniformly learnable families in finite outcome spaces. This result cannot be directly used here because we must consider infinite families of events, where information about cardinality is not very useful. This necessitates a more delicate indirect argument in which finitely additive probabilities are used in an essential way.

The scope of disagreement asserted in the theorem can be substantial:

COROLLARY 1: Given any discrete outcome space (X_d, 2^{X_d}, P_d), uniformly learnable C, and α ∈ (0, 0.5], there is a pair of probability measures λ and γ that agree on C, yet |λ(B) − γ(B)| = α for uncountably many events B.

3.4. The Role of Finite Additivity

We use infinite outcome spaces and infinite data to gain new insights into settings with finite outcomes and scarce data. The contrast between Theorems 3 and 4 thus reflects that continuous models fail, and discrete models succeed, as idealizations of finite settings.

Why Asymptotic Learning Is Easy in Continuous Models

In the continuous outcome space (X_c, B, P_c), the amount of data tends to infinity, suggesting that learning is easier than in finite settings. On the other hand, B contains infinitely many events, suggesting that learning should be harder and statistical ambiguity more severe. The conclusion of Theorem 3 that statistical ambiguity always disappears in the limit may therefore seem puzzling.20
The puzzle is explained by noting that the continuous model is loaded with structural assumptions and inductive biases. Although they may appear as innocuous regularity conditions, these assumptions and biases substantively drive (and, in my view, mislead) our intuition. I illustrate with two prototypical examples.

Consider first the case X_c = [0, 1] with the Borel sets generated by the usual metric topology. Here, a decision maker can eliminate statistical ambiguity in the limit by first learning the probabilities of the half-intervals and then using them to deduce those of the remaining Borel events. Non-Borel events are cast out as illegitimate, thus simplifying the learning problem by limiting the range of events the decision maker is able to contemplate.

Consider next the case where X_c is countable with the discrete topology. In this case, the set of events B is 2^{X_c}, so no event is a priori ruled out. Here, the mismatch with the finite-outcome-space intuition is that countable additivity requires probability distributions to be concentrated on negligible subsets of the outcome space. To make this precise, let {x_1, x_2, …} be an arbitrary enumeration of X_c. Fix a small α > 0 and define the (random) integer

$$N(s) = \min\{n : \nu(\{x_1,\ldots,x_n\}, s) > 1 - \alpha\}.$$

This is the smallest integer n such that the empirical distribution ν concentrates 1 − α mass on the finite set {x_1, …, x_n}. The family C = {{x_1, …, x_n} : n = 1, 2, …} of initial segments has a finite VC dimension and thus is uniformly learnable.21 Using the VC theorem, for any ε > 0 there is t̄ such that for all P ∈ P_c and t ≥ t̄,

$$P^\infty\left\{s : P(\{x_1,\ldots,x_{N(s)}\}) > 1 - \alpha - \varepsilon\right\} > 1 - \varepsilon.$$

In words, without prior knowledge of P (other than that it belongs to P_c), the decision maker can determine from finite-sample information the integer N(s), and hence the initial segment {x_1, …, x_{N(s)}} on which the true distribution is concentrated. Once this initial segment is known, the problem all but reduces to one with a fixed finite set of outcomes. Increasing the amount of data beyond t̄ corresponds (approximately) to case (i) of Theorem 1, where the set of outcomes is fixed but data increase without bound. This conflicts with the intuition, formalized in case (ii) of that theorem, that scarcity of data can be important when the set of outcomes is finite but rich enough.

20 Commenting on Theorem 4, a referee noted that “one might have conjectured a possibility result, presumably because intuitions live in metric spaces.”

21 This can be shown using an argument similar to that appearing in footnote 19.
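A sketch of the initial-segment construction just described; the geometric true distribution and the parameter values are invented, and the code simply reads N(s) off the empirical distribution of a finite sample.

import random
from collections import Counter
from math import ceil, log

random.seed(3)
alpha, t, p = 0.05, 2_000, 0.3

# i.i.d. geometric draws on {1, 2, ...} via inversion: x = ceil(log U / log(1-p))
sample = [max(1, ceil(log(random.random()) / log(1 - p))) for _ in range(t)]
counts = Counter(sample)

cum, n = 0.0, 0
while cum <= 1 - alpha:       # N(s): first n with empirical mass > 1 - alpha
    n += 1
    cum += counts[n] / t

print(f"N(s) = {n}; empirical mass of the segment = {cum:.3f}")
print(f"true mass of that segment: {1 - (1 - p) ** n:.3f}")   # = 1 - (1-p)^n

The sample alone identifies a short initial segment carrying almost all of the mass, which is the sense in which the countable case collapses back to a fixed finite problem.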
The Finite-Outcome-Space Motivation

There is little doubt that individuals rely on cognitive devices, such as ordering or similarity, to organize information and guide learning when data is scarce (hence the opening quote of this paper). But to understand why these cognitive devices look the way they do, our model of an outcome space should act as a neutral backdrop against which they may arise as objects of choice, rather than being built into the primitives. Finite outcome spaces represent one such class of models, as they embed no a priori inductive biases like notions of distance, ordering, or similarity. It would be odd to build into the primitives of a finite model a distinguished family of events as the only legitimate ones to consider, or to restrict attention to distributions that place most of their mass on a small fraction of the total number of outcomes. Yet this is what the mathematical structure of events B and distributions P_c imposes in the continuous model. These restrictions limit the scope of statistical ambiguity and diversity of beliefs by fiat. By contrast, the discrete space (X_d, 2^{X_d}, P_d), just like finite outcome spaces, is free from any such a priori structures.

The Foundational Case for Finite Additivity

Do finitely additive probabilities have a meaningful interpretation?22 Paradoxically, in the axiomatic foundations of decision theory, it is the requirement of countable additivity that is viewed as questionable and demands justification. Savage and de Finetti held that the fundamental axioms from which subjective probability is derived imply only finite additivity. Savage’s celebrated axiomatization, as well as many subsequent ones, was cast in a finitely additive setting. Countable additivity of subjective probability is an assumption introduced for expedience, not for foundational considerations.23

de Finetti and Savage’s insistence on finite additivity is not an expression of a desire for technical generality or idiosyncratic modeling taste. Rather, it reflects the methodological separation between (a) what constitutes structural assumptions about the choice setting and (b) the feasibility constraints facing the decision maker in a particular choice problem.

22 Another concern is whether finitely additive models are tractable. They are certainly not as tractable as countably additive probabilities. However, the widespread misperception that none of the classical results of probability theory applies to them is just that—a misperception. In Appendix A.1 and in Al-Najjar (2007), I indicate that much of the classical theory applies once natural technical conditions are imposed.

23 de Finetti’s (1974, p. 123) view is reflected in the following quote: “Suppose we are given a countable partition into events Ei, and let us put ourselves into the subjectivistic position. An individual wishes to evaluate the pi: he is free to choose them as he pleases [...] Someone tells him that in order to be coherent he can choose the pi in any way he likes, so long as the sum Σ pi = 1 (it is the same thing as in the finite case, anyway!). The same thing?!!! You must be joking, the other will answer. In the finite case, this condition allowed me to choose the probabilities to be all equal, or slightly different, or very different; in short, I could express any opinion whatsoever. [Now] I am obliged to pick ‘at random’ a convergent series which, however I choose it, is in absolute contrast to what I think. If not, you call me incoherent! In leaving the finite domain, is it I who has ceased to understand anything, or is it you who has gone mad?”
As an example, take an outcome space X that has the cardinality of the continuum. In Savage’s model, the set of acts F is the set of all functions that map points in X to consequences. Suppose, for whatever reason, that the decision maker wants to introduce a metric structure based on a linear order, perhaps to incorporate a notion of similarity between outcomes, or some other concern. This can be formalized as a choice of a bijection φ : X → [0, 1] that imports the metric topology of [0, 1] onto X. It would then be natural to consider the restriction to the set of acts F_φ that are measurable with respect to the Borel structure implied by φ. In Savage’s theory, the selection of a specific φ to represent, say, a notion of similarity between outcomes is modeled as the constraint that the decision maker must choose from the feasible set F_φ. The de Finetti–Savage case for finite additivity is that one should not confuse constraints like F_φ with the structure of the choice problem, where all acts are permitted. This structure is invariant, while constraints are not.

A common argument used to justify the removal of non-Borel sets from consideration is that they cannot be described in terms of finite sets of intervals and their limits.24 In Savage’s theory, describability is not a primitive but a constraint like any other. For example, one may find a linear order on [0, 1], and the describability constraints it implies, intuitive. This paper takes a different point of view: structures like linear orders are devices decision makers use to facilitate learning. While linear orders may seem natural or canonical structures when describing prices or quantities, it is just as easy to think of examples where no obvious a priori structures exist: What is a natural linear order on a set of players in a large game, on the set of diets or medical conditions, or on past experiences with presidential elections or military contests?25 When we model these problems using finite outcomes, we choose to introduce structures like orders, metrics, or similarities, since without them learning would be impossible. But it would seem unreasonable to have these structures appear as part of the primitives.

4. DIVERSITY, AMBIGUITY, AND DECISION MAKING

The main concern of this paper is with belief formation, that is, with questions like “Where do beliefs come from and what makes them ‘reasonable’?” An orthogonal, but equally important, question is “What decisions would individuals make given their beliefs?” Here, I sketch how uniform learning may be integrated into standard models of decision making.26

24 For a formal model of undescribability, see Al-Najjar, Anderlini, and Felli (2006).

25 To put this in perspective, there are 52! ≈ 8 × 10^{67} possible linear orders of a deck of 52 cards, roughly the number of atoms in a typical galaxy.

26 The working paper version contains a more formal and detailed discussion.
An informal outline may be helpful. Many models in the literature represent beliefs as sets of probability measures to reflect decision makers with insufficient knowledge to form precise probabilistic beliefs. The set of probabilities in these models is usually derived axiomatically and interpreted as capturing the decision maker’s limited understanding of his environment. This paper proceeds in a different direction: I use an explicit model of learning to derive a set of probability measures μ_σ(s) consistent with empirical evidence; I then combine this objective information with subjective decision-making criteria to produce choice behavior.

To minimize repetition, we continue to use (X, Σ, P) to stand for either the continuous outcome or the discrete outcome model. I also limit attention to infinite samples to streamline the discussion. Our focus will be on acts of the form f : X → R, where we interpret f to be valued in utils in order to abstract from the decision maker’s risk attitude.

BEWLEY’S INCOMPLETE PREFERENCES CRITERION:

$$f \succeq^{*}_{\sigma s} g \iff \int_X f\,dP \ge \int_X g\,dP \quad \forall P \in \mu_\sigma(s). \tag{9}$$

Bewley (1986) axiomatized the behavior of a decision maker whose preference may be incomplete. His representation consists of a set of probability measures K and the criterion that f is preferred to g if and only if f yields a higher expected payoff under every P ∈ K. Criterion (9) coincides with Bewley’s when K = μ_σ(s).

Bewley’s model is sometimes informally interpreted as saying that27 (i) the set K is a set of “objective distributions” representing the decision maker’s information and (ii) the decision maker prefers f to g if and only if f has a higher expected payoff under every objective distribution. Although intuitively appealing, this interpretation has no formal basis in Bewley’s setup and axioms. The set K in his model is derived axiomatically from the decision maker’s preference and need not have any objective interpretation.

Using the framework of this paper, we can formally interpret the set of measures μ_σ(s) as resulting from a learning process. When beliefs are determinate, μ_σ(s) collapses to a single measure P, in which case learning is complete and so is the preference ⪰*_{σs}. By contrast, when beliefs are indeterminate, ⪰*_{σs} is necessarily incomplete. Learning provides a motivation for what makes events (un)ambiguous and sheds light on how μ_σ(s) varies with samples.28

27 See, for instance, Bewley (1988) and, more recently, Gilboa, Maccheroni, Marinacci, and Schmeidler (2008).
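A minimal sketch of criterion (9), with a three-point outcome space, a hypothetical set mu standing in for μ_σ(s), and two acts expressed directly in utils; all numbers are invented. It exhibits the characteristic feature of the Bewley criterion: two acts can be incomparable.

# All numbers invented: three measures left undistinguished by the data
# (standing in for mu_sigma(s)) and two acts valued directly in utils.
mu = [
    (0.40, 0.35, 0.25),
    (0.45, 0.30, 0.25),
    (0.35, 0.40, 0.25),
]
f = (1.0, 0.0, 0.5)
g = (0.5, 0.5, 0.5)

def E(act, P):
    """Expected utility of the act under the measure P."""
    return sum(u * p for u, p in zip(act, P))

f_beats_g = all(E(f, P) >= E(g, P) for P in mu)   # unanimity, as in (9)
g_beats_f = all(E(g, P) >= E(f, P) for P in mu)
print(f"f preferred to g: {f_beats_g}; g preferred to f: {g_beats_f}")
# Both are False here: f and g are incomparable, so the preference is incomplete.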
THE MAXIMIN EXPECTED UTILITY CRITERION:

$$f \succeq^{\circ}_{\sigma s} g \iff \inf_{P \in \mu_\sigma(s)} \int_X f\,dP \ge \inf_{P \in \mu_\sigma(s)} \int_X g\,dP. \tag{10}$$

This is the functional form introduced by Gilboa and Schmeidler (1989), with μ_σ(s) substituting for their subjectively derived set of measures. Gajdos, Hayashi, Tallon, and Vergnaud (2008) provided an axiomatic model of how objective information, in the form of a set of measures, can be incorporated into the subjective maximin expected utility setting. The difference is that the set μ_σ(s) in our case has a specific motivation in terms of frequentist learning, a motivation lacking in these authors’ more abstract formulation.29

If beliefs are asymptotically determinate, then μ_σ(s) is a singleton, ambiguity disappears, and the decision maker behaves exactly as a Bayesian. The framework of this paper makes it possible to relate the persistence of ambiguity to the failure of learning to pin down a unique distribution.
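The same invented ingredients under criterion (10): each act is now evaluated by its worst-case expected utility over the set, which completes the ranking.

# Same invented ingredients as the previous sketch.
mu = [(0.40, 0.35, 0.25), (0.45, 0.30, 0.25), (0.35, 0.40, 0.25)]
f, g = (1.0, 0.0, 0.5), (0.5, 0.5, 0.5)

def worst_case(act):
    """inf over P in mu of the expected utility of the act, as in (10)."""
    return min(sum(u * p for u, p in zip(act, P)) for P in mu)

print(f"worst-case E[f] = {worst_case(f):.3f}")   # 0.475
print(f"worst-case E[g] = {worst_case(g):.3f}")   # 0.500
# g is now strictly chosen: the incomparability under (9) is resolved, at the
# price of evaluating each act under its least favorable consistent measure.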
THE BAYESIAN CRITERION:

$$f \succeq^{\bullet}_{\sigma\varphi s} g \iff \int_X f\,dP \ge \int_X g\,dP,$$

where P = φ(μ_σ(s)) and φ is a selection from the correspondence s ↠ μ_σ(s).

Here the decision maker selects an element of the set of measures μ_σ(s) and behaves as a Bayesian given this selection. This amounts to selecting a Bayesian completion of the incomplete preference ⪰*_{σs} in the Bewley formulation (9).

We can then shed some light on the question: Should individuals who have observed a large, common pool of data hold the same beliefs? A commonly held view is that differences in opinions are due only to differences in information. This is best expressed by Aumann (1987, pp. 12–13):

People with different information may legitimately entertain different probabilities, but there is no rational basis for people who have always been fed precisely the same information to do so.

If beliefs are asymptotically determinate, μ_σ(s) is a singleton and the selection φ(μ_σ(s)) is unique. In this case, learning forces all individuals who observe a common pool of data to hold identical beliefs in the limit.

28 In particular, any two measures P, P′ ∈ μ_σ(s) must agree on C_σ almost surely. Lehrer (2005) made a similar point in a very different context.

29 Gajdos et al. (2008) axiomatized a more general form where the inf in (10) is taken over a subset of μ_σ(s).
But if beliefs are asymptotically indeterminate, then two individuals who observe the same data may hold different beliefs, either because their subjective φ’s differ or because they use different learning strategies. In both cases, they draw different inferences from the same evidence, even in the limit.

APPENDIX: PROOFS

A.1. Strategic Product Measures

Defining sampling for a continuous outcome space X_c is standard: we take as the sample space Ω the product X_c × X_c × ··· endowed with the Borel σ-algebra generated by the product topology. In the discrete case X_d, on the other hand, we must appeal to concepts and results that may be unfamiliar to some readers. Here we give each coordinate the discrete topology and define the sample space as the product Ω = X_d × X_d × ··· with the product topology. As in the countably additive case, we take as the set of events the Borel σ-algebra generated by the product topology on Ω.

Suppose we are given a finitely additive probability measure λ on X_d. We are interested in defining the product measure λ^∞ on Ω. If λ happens to be countably additive, a standard result is that a countably additive λ^∞ can be uniquely defined. When λ is only finitely additive, the product measure need not be uniquely defined. Dubins and Savage (1965) dealt with this problem in their book on stochastic processes by introducing the concept of strategic products. These are product measures that satisfy natural disintegration properties (trivially satisfied when λ is countably additive). In a classic paper, Purves and Sudderth (1976) showed that any finitely additive λ on X_d has a unique extension to a strategic product λ^∞ on the Borel σ-algebra on Ω.

I do not provide the details of the Dubins and Savage (1965) concept of strategic products or of Purves and Sudderth’s (1976) constructions because they are not essential for what follows. For the purpose of the present paper, what the reader should bear in mind is that (a) the concept of strategic products is a natural restriction (for example, all product measures in the countably additive setting are strategic) and (b) Purves and Sudderth’s result permits extensions to the finitely additive setting of many of the major results in stochastic processes, including the Borel–Cantelli lemma, the strong law of large numbers, the Glivenko–Cantelli theorem, and the Kolmogorov 0–1 law.

A.2. Proof of Theorem 2

The theorem is standard when the outcome space is finite or continuous. The main challenge is to provide arguments that do not require countable additivity. To avoid repetition, this proof applies to an outcome space X that stands
for either X_c or X_d with the corresponding structures. Also, the notation P^∞ will always denote the strategic product of P (which, in the case of a countably additive P, coincides with the usual product).

LEMMA A.1: Fix any uniformly learnable C and probability measure P. Then

$$P^\infty\left\{s : \lim_{t\to\infty} \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| = 0\right\} = 1.$$

PROOF: From (A.7) we have that for every P ∈ P and ε > 0,

$$\sum_{t=1}^\infty P^\infty\left\{s : \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| > \varepsilon\right\} < \infty.$$

As shown by Purves and Sudderth (1976), the Borel–Cantelli lemma applies in the strategic setting. This implies

$$P^\infty\left\{s : \exists \bar t\ \forall t > \bar t,\ \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| \le \varepsilon\right\} = 1.$$

Take a sequence ε_n ↓ 0 and note that each event

$$\left\{s : \exists \bar t\ \forall t > \bar t,\ \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| \le \varepsilon_n\right\}$$

is a tail event. Purves and Sudderth (1983) showed that P^∞ is countably additive on tail events, so

$$P^\infty\left(\bigcap_n \left\{s : \exists \bar t\ \forall t > \bar t,\ \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| \le \varepsilon_n\right\}\right) = 1;$$

hence

$$P^\infty\left\{s : \lim_{t\to\infty} \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| = 0\right\} = 1. \qquad Q.E.D.$$
This is just the counterpart of μσ (s) for a single family of events C . For an arbitrary sample, this set of measures can be badly behaved or even empty. The following lemma characterizes it on a typical sample:
LEMMA A.2: For any uniformly learnable C and probability measure P, we have, P^∞-a.s.,

$$\mu_{\mathcal{C}}(s) = \left\{p : \sup_{A \in \mathcal{C}} |p(A) - P(A)| = 0\right\}. \tag{A.1}$$

PROOF: Lemma A.1 states that the event

$$\left\{s : \lim_{t\to\infty} \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| = 0\right\}$$

has P^∞-probability 1. Thus, in the argument below, we restrict attention to samples s in this event. For any such s, given ε > 0, we have sup_{A∈C} |ν^t(A, s) − P(A)| < ε/2 for all large t. If p ∈ μ_C(s), then sup_{A∈C} |p(A) − ν^t(A, s)| < ε/2 for all large enough t. Then, for all large t, we have

$$\sup_{A \in \mathcal{C}} |p(A) - P(A)| \le \sup_{A \in \mathcal{C}} |p(A) - \nu^t(A, s)| + \sup_{A \in \mathcal{C}} |\nu^t(A, s) - P(A)| \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,$$

so p is in the right-hand side of (A.1). Conversely, if p belongs to the set on the right-hand side of (A.1), then fixing α > 0 and taking t large enough, we have

$$\sup_{A \in \mathcal{C}} |p(A) - \nu^t(A, s)| \le \sup_{A \in \mathcal{C}} |p(A) - P(A)| + \sup_{A \in \mathcal{C}} |P(A) - \nu^t(A, s)| \le 0 + \alpha = \alpha.$$

Since α is arbitrary, the conclusion follows. Q.E.D.
PROOF OF THEOREM 2: For a learning strategy {(C_n, ε_n, t_n)}_{n=1}^∞ and integers n̄ = 1, 2, …, we note that

$$\mu_\sigma(s) = \bigcap_{\bar n = 1, 2, \ldots} \mu_{\mathcal{C}_{\bar n}}(s).$$

Any event of the form

$$\left\{s : \mu_{\mathcal{C}_{\bar n}}(s) = \left\{p : \sup_{A \in \mathcal{C}_{\bar n}} |p(A) - P(A)| = 0\right\}\right\}$$

is a tail event and, by Lemma A.2, has P^∞-probability 1. By Purves and Sudderth’s (1983) result that P^∞ is countably additive on tail events, the event

$$\bigcap_{\bar n = 1, 2, \ldots} \left\{s : \mu_{\mathcal{C}_{\bar n}}(s) = \left\{p : \sup_{A \in \mathcal{C}_{\bar n}} |p(A) - P(A)| = 0\right\}\right\}$$

also has P^∞-probability 1. From this it follows that

$$P^\infty\left\{s : \bigcap_{\bar n = 1, 2, \ldots} \mu_{\mathcal{C}_{\bar n}}(s) = \bigcap_{\bar n = 1, 2, \ldots} \left\{p : \sup_{A \in \mathcal{C}_{\bar n}} |p(A) - P(A)| = 0\right\}\right\} = 1. \qquad Q.E.D.$$
A.3. Proof of Theorem 3

This is essentially a consequence of two facts: (i) all complete separable metric spaces are “equivalent” to a Borel subset of [0, 1] and (ii) on [0, 1], knowing the probabilities of half-intervals is sufficient to determine the probability of all Borel sets. The technical details are as follows.

By Royden (1968, Theorem 8, p. 326), there is a Borel subset B ⊂ [0, 1] and a measurable bijection φ : X_c → B such that φ^{−1} is also measurable. For each r ∈ [0, 1] define A_r = φ^{−1}([0, r]) and let C = {A_r : r ∈ [0, 1]}. That is, the collection C mimics the structure of half-intervals in [0, 1]. Note that these sets need not preserve the geometric properties of the half-intervals (e.g., connectedness). They are, however, nested: A_r ⊂ A_{r′} whenever r < r′. It is easy to verify that the family of sets C has a VC dimension of 1.30

Consider any simple learning strategy {(C_n, ε_n, t_n)}_{n=1}^∞ with constant C_n = C. Using Theorem 2, we have

$$\mu_\sigma(s) = \{p : p(A) = P(A)\ \forall A \in \mathcal{C}\} \quad P^\infty\text{-a.s.}$$

Fix any sample path s for which the above holds and fix p ∈ μ_σ(s). To show that p and P are identical, we “transfer” p and P to the interval [0, 1]. For every Borel set A ⊂ [0, 1], define p̃(A) ≡ p(φ^{−1}(A)) and P̃(A) ≡ P(φ^{−1}(A)). Then by Royden (1968, Proposition 1, p. 318), P̃ and p̃ are probability measures on [0, 1] that agree on the values they assign to all half-intervals, and thus must have the same distribution functions. From this, it follows that p̃ = P̃, hence p = P since φ is a Borel equivalence.

30 See Problem 13.15 of Devroye, Gyorfi, and Lugosi (1996, p. 231) for this obvious fact and its (slightly less obvious) converse.

A.4. Proof of Theorem 4

Fix a discrete space (X_d, 2^{X_d}, P_d) and let (X_d′, 2^{X_d′}, P_d′) be a subspace, where X_d′ is an infinite subset of X_d and P_d′ is the set of all finitely additive probabilities that put unit mass on X_d′.
To show that beliefs on X_d are indeterminate, it suffices to display a subspace X_d′ on which they are. The strategy I follow is to focus on an increasing sequence {X_N}_{N=1}^∞ of finite subsets of X_d and prove indeterminacy for the outcome space X_d′ ≡ ∪_N X_N. Since this procedure is applicable in any (infinite) outcome space X_d, to avoid redundant notation, assume for the remainder of the proof that X_d is countable.

I start with the following proposition, which establishes the result for a single uniformly learnable family C. The general case will follow as a corollary:

PROPOSITION A.1: There is a finitely additive probability measure λ on the discrete outcome space (X_d, 2^{X_d}) such that for every uniformly learnable family of events C, there are uncountably many distinct (finitely additive) probability measures that agree with λ on C.

The proof proceeds in three steps: (i) construct a “nice” finitely additive probability measure λ on (X_d, 2^{X_d}); (ii) given any C, construct a class of perturbations of the density of λ with the property that they leave λ unaffected on C; (iii) show that each such perturbation defines a finitely additive probability measure distinct from λ.

A.4.1. Constructing λ

Let {X_N}_{N=1}^∞ be an increasing sequence of finite subsets of X_d such that

$$\eta_{N-1} < \frac{\eta_N}{N}, \quad\text{where } \eta_N \equiv \# X_N.$$

This says that the cardinality of X_N increases rapidly with N. Define the probability measure λ_N on 2^{X_d} by

$$\lambda_N(A) = \frac{\#(A \cap X_N)}{\# X_N}.$$

That is, λ_N(A) is the frequency of the set A in X_N. Let U be a free ultrafilter on the integers and, for any sequence of real numbers x_N, define the expression

$$\mathcal{U}\text{-}\lim_{N\to\infty} x_N = x$$

to mean that the set {N : |x_N − x| < ε} belongs to U for every ε > 0. Then for any event A, define

$$\lambda(A) \equiv \mathcal{U}\text{-}\lim_{N\to\infty} \lambda_N(A).$$
Intuitively, λ is a uniform distribution on the integers. It is immediate that λ is atomless (i.e., assigns zero mass to each point) and purely finitely additive. For readers not familiar with these concepts, the idea is to define the probability of the event A, λ(A), as a limit of the finite probabilities λ_N(A). If the sequence {λ_N(A), N = 1, 2, …} converges, then the statement that λ(A) ≡ lim_{N→∞} λ_N(A) is equivalent to saying that the set of integers {N : |λ_N(A) − λ(A)| < ε} is cofinite (i.e., the complement of a finite set) for every ε > 0. That is, “λ_N(A) converges to λ(A)” means that the set of N’s on which λ_N(A) and λ(A) are more than ε apart is small for every ε > 0, where “small” here means finite. The notion of an ultrafilter generalizes this intuition by identifying a collection U of large subsets of integers. That U is free means that it contains all cofinite sets; that it is ultra means that each set of integers is either in U or its complement is. This immediately implies that the operation U-lim generalizes the usual limit and that any bounded sequence must have a generalized U-lim. Ultrafilters are a standard mathematical tool that generalizes limits by selecting convergent subsequences in a consistent manner.31

31 Bhaskara Rao and Bhaskara Rao (1983) provided formal definitions. Wikipedia has a nice article on the subject.
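A finite-stage sketch of this construction: it computes λ_N(A) along a rapidly growing sequence X_N for a simple event. Here the ordinary limit exists, so the ultrafilter step adds nothing; its role is to assign a limit consistently even when the ordinary limit fails. The choice X_N = {1, …, η_N} and the event are invented.

# Invented finite stand-ins: X_N = {1, ..., eta_N} with eta_N chosen so that
# eta_{N-1} < eta_N / N, and the event A = multiples of 3.
eta = [1]
for N in range(1, 7):
    eta.append(2 * N * eta[-1] + 1)      # guarantees eta_{N-1} < eta_N / N

for N in range(1, 7):
    lam_N = sum(1 for x in range(1, eta[N] + 1) if x % 3 == 0) / eta[N]
    print(f"N = {N}: eta_N = {eta[N]:>6}, lambda_N(A) = {lam_N:.4f}")
# lambda_N(A) -> 1/3, while each singleton gets mass 1/eta_N -> 0 (atomless).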
A.4.2. Perturbations

A perturbation is any function s : X_d → {1 − ε, 1 + ε} with ε ∈ [0, 1]. Let V denote the set of all perturbations. Endow V with the σ-algebra 𝒱 generated by the product topology, that is, the one generated by all sets of the form {s : s(x) = 1 + ε} for some x ∈ X_d. Let π be the unique countably additive product measure on (V, 𝒱) assigning probability 0.5 to each of the events {s : s(x) = 1 + ε}. That is, π is constructed by taking equal-probability i.i.d. randomizations for s(x) ∈ {1 − ε, 1 + ε}. Note that (V, 𝒱, π) is a standard countably additive probability space constructed using standard methods. The only finite additivity is in the measure λ.

Fix an arbitrary N. For any event A ⊂ X_d, we use A_N to denote the finite set A ∩ X_N and define C_N ≡ {A_N : A ∈ C}. That is, C_N is the appropriate projection of C on X_N. If C has finite VC dimension v on X_d, then no subset of v + 1 points in X_d can be shattered by C. Then, a fortiori, no subset of v + 1 points in X_N can be shattered by C, so the VC dimension of the family of events C_N is at most v.

A fundamental combinatorial result, due to Sauer (1972) (see also Devroye, Gyorfi, and Lugosi (1996, Theorem 13.3, p. 218)), states that given an outcome space of η_N points, any family of events of finite VC dimension v cannot contain more than 2(η_N)^v events. That is, the cardinality of a family of subsets of finite VC dimension is polynomial in η_N with degree equal to its VC dimension. To appreciate this bound, recall that X_N contains 2^{η_N} events in all, so an implication of Sauer’s lemma is that having finite VC dimension severely restricts how rich a family of events can be. For example, with η_N = 50, if C has a VC dimension of 5, say, then the ratio of the number of events in C_N to the power set is no more than 5.5 × 10^{−7}.
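A two-line check of this magnitude, under the stated values η_N = 50 and v = 5:

# eta_N = 50 outcomes, VC dimension v = 5, as in the text above.
eta, v = 50, 5
sauer = 2 * eta ** v        # Sauer's bound on the number of events in C_N
total = 2 ** eta            # number of all events on 50 outcomes
print(f"Sauer bound {sauer:.2e} vs. power set {total:.2e}")
print(f"ratio <= {sauer / total:.1e}")    # ~5.5e-07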
This cardinality argument, while suggestive, does little for us in the limit: as the size of X_N goes to infinity, holding v fixed, both the cardinality of C_N and that of the power set go to infinity. In fact, it is possible to construct a family of events C in X_d of VC dimension 1, yet of uncountable cardinality (see Devroye, Gyorfi, and Lugosi (1996, Problem 13.14, p. 231)). This necessitates a more indirect approach than just counting sets.
Since the perturbations are independent, Hoeffding’s inequality (see, e.g., Devroye, Gyorfi, and Lugosi (1996, Theorem 8.1, p. 122)) implies that for any subset A_N ∈ C_N,

$$\pi\left\{s : \frac{1}{\#A_N}\left|\sum_{x \in A_N} s(x) - \#A_N\right| > \alpha\right\} \le 2e^{-2\,\#A_N\,\alpha^2}.$$

This inequality is not particularly useful without some bounds on #A_N. So for 0 < β ≤ 0.5, let C_{Nβ} denote the family of events {A_N : A = X_d, or A ∈ C and λ_N(A) > β}. Restricting attention to N’s with 1/N < β, we have

$$2e^{-2\,\#A_N\,\alpha^2} \le 2e^{-2\beta\eta_N\alpha^2}.$$

Since, by Sauer’s lemma, there are no more than 2(η_N)^v events in C_{Nβ}, we obtain

$$\pi(Z_{\alpha\beta N}) \le 4(\eta_N)^v e^{-(2\beta\alpha^2)\eta_N},$$

where

$$Z_{\alpha\beta N} \equiv \left\{s : \max_{A_N \in \mathcal{C}_{N\beta}} \frac{1}{\#A_N}\left|\sum_{x \in A_N} s(x) - \#A_N\right| > \alpha\right\}.$$

Summing over N, for fixed α and β, we obtain

$$\sum_{N=1}^\infty \pi(Z_{\alpha\beta N}) \le 4\sum_{N=1}^\infty (\eta_N)^v e^{-(2\beta\alpha^2)\eta_N} < \infty.$$

By the Borel–Cantelli lemma (the usual version, since π is countably additive), the set Z_{αβ} of perturbations that belong to infinitely many of the Z_{αβN}’s has π-measure 0. This implies (again using the countable additivity of π) that the event

$$Q \equiv \bigcap_{k=1}^\infty \bigcap_{k'=1}^\infty \left(Z_{\alpha=1/k,\,\beta=1/k'}\right)^c \tag{A.2}$$

has π-measure 1. In particular, Q is not empty.

The preceding argument is the heart of the proof.
Think of the indicator function χ_A of an event A with 0 < λ(A) < 1 as its density function with respect to the distribution λ. The idea is to perturb that density by tweaking it up and down by ε. Call a perturbation s neutral with respect to A if λ(A) = ∫_A s dλ. Any such perturbation s defines a new probability measure γ(A) ≡ ∫_A s dλ that leaves the probability of A intact yet differs from λ at least on the event B ≡ {x : s(x) = 1 − ε}. The proposition is proven by showing the existence of perturbations s that accomplish this not just with respect to a single event A, but with respect to all events in C simultaneously. This argument, which culminates in Appendix A.4.3, is founded on the material above.

The strategy is to draw, for each x, a value in {1 + ε, 1 − ε} with equal probability and independently across the x’s. It is straightforward to check that, given a single fixed event A, a draw s will be neutral with respect to A π-almost surely. Since the intersection of countably many π-measure 1 sets has π-measure 1, this conclusion can be extended to any countable family of events {A_1, A_2, …}. The trouble is in dealing with an uncountable family C—a case that is essential for the theory, since many standard classes, like half-intervals, half-spaces, and Borel sets, are uncountable. A less direct and more subtle argument is needed. Here, the assumption that C has finite VC dimension plays a critical role via Sauer’s lemma. It is well known from the theory of large deviations that convergence in the (weak) law of large numbers is exponential in sample size. This implies that one can estimate the probabilities of larger families of events, provided their cardinalities do not grow too quickly. Sauer’s lemma delivers the slow rate of growth, asserting that a family with finite VC dimension must have a cardinality that is polynomial in the size of the outcome space. The difficulty, of course, is that neither large deviations nor Sauer’s lemma has much meaning in the limit, when t is infinite. In the proof, I first project the (possibly uncountable) family C on the finite sets X_N, identify the (approximately) good perturbations, and bound their probabilities.
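A finite-N simulation of this step (a sketch only: the finite set X_N below stands in for the finitely additive construction, and all parameters are invented): random ±ε perturbations are nearly neutral on large fixed events, yet visibly move the mass of B = {x : s(x) = 1 − ε}.

# Invented finite stand-in: X_N = {0, ..., 99999}; s(x) i.i.d. in {1-eps, 1+eps}.
import random

random.seed(4)
eta_N, eps = 100_000, 0.5
s = [random.choice((1 - eps, 1 + eps)) for _ in range(eta_N)]

lam = lambda A: sum(1 for _ in A) / eta_N        # lambda_N(A): frequency in X_N
gam = lambda A: sum(s[x] for x in A) / eta_N     # integral of s over A

for A, name in ((range(0, eta_N, 2), "even points"),
                (range(eta_N // 2), "first half")):
    print(f"{name:>11}: lambda_N = {lam(A):.4f}, perturbed = {gam(A):.4f}")

B = [x for x in range(eta_N) if s[x] == 1 - eps]
print(f"{'B':>11}: lambda_N = {lam(B):.4f}, perturbed = {gam(B):.4f}")
# On the large fixed events the two values nearly coincide (neutrality), while
# on B the perturbed mass is (1 - eps) * lambda_N(B) ~ 0.25, not ~0.5.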
A.4.3. Perturbed Measures

For a fixed s and any event A, define

$$A^+ = \{x \in A : s(x) = 1 + \varepsilon\} \quad\text{and}\quad A^- = \{x \in A : s(x) = 1 - \varepsilon\}.$$

Although there is a well developed theory of integration with respect to finitely additive probabilities, for the purpose of this proof all we need is to define

$$\int_A s(x)\,d\lambda \equiv (1 + \varepsilon)\lambda(A^+) + (1 - \varepsilon)\lambda(A^-) \tag{A.3}$$

and

$$\int_A s(x)\,d\lambda_N \equiv \frac{1}{\eta_N}\sum_{x \in A_N} s(x) = (1 + \varepsilon)\lambda_N(A^+) + (1 - \varepsilon)\lambda_N(A^-).$$
LEMMA A.3: For all A ∈ C and s ∈ Q,

$$\int_A s(x)\,d\lambda = \lambda(A).$$

PROOF: Fix a set A with β ≡ λ(A) > 0 and an α > 0. (We deal with the case λ(A) = 0 separately.) Belonging to Q implies that for all N large enough, s ∉ Z_{αβN}; thus,

$$\max_{A_N \in \mathcal{C}_{N\beta}} \frac{1}{\#A_N}\left|\sum_{x \in A_N} s(x) - \#A_N\right| < \alpha.$$

In this case, multiplying both sides by #A_N/η_N, we get

$$\max_{A_N \in \mathcal{C}_{N\beta}} \frac{1}{\eta_N}\left|\sum_{x \in A_N} s(x) - \#A_N\right| < \frac{\#A_N}{\eta_N}\,\alpha \le \alpha.$$

Substituting in the definitions of ∫_A s(x) dλ_N and λ_N(A), we have that for all large enough N,

$$\max_{A_N \in \mathcal{C}_{N\beta}} \left|\int_A s(x)\,d\lambda_N - \lambda_N(A)\right| < \alpha. \tag{A.4}$$

From the definition (A.3) and the properties of ultrafilters, there is a subsequence {N_k} such that32

$$\left|\int_A s(x)\,d\lambda - \lambda(A)\right| \equiv \left|(1 + \varepsilon)\,\mathcal{U}\text{-}\lim_{N\to\infty} \lambda_N(A^+) + (1 - \varepsilon)\,\mathcal{U}\text{-}\lim_{N\to\infty} \lambda_N(A^-) - \mathcal{U}\text{-}\lim_{N\to\infty} \lambda_N(A)\right| \tag{A.5}$$
$$= \left|(1 + \varepsilon)\lim_{k\to\infty} \lambda_{N_k}(A^+) + (1 - \varepsilon)\lim_{k\to\infty} \lambda_{N_k}(A^-) - \lim_{k\to\infty} \lambda_{N_k}(A)\right| = \lim_{k\to\infty} \left|\int_A s(x)\,d\lambda_{N_k} - \lambda_{N_k}(A)\right|.$$

32 To see this, fix a pair of sets A1 and A2 and an integer k. Then the set U_{ik} ≡ {N : |λ(A_i) − λ_N(A_i)| < 1/k} belongs to U for i = 1, 2, and so does their intersection (since ultrafilters are closed under finite intersections). Pick N_k ∈ U_{1k} ∩ U_{2k}. Repeating the process, we generate the desired sequence by picking N_{k+1} > N_k in U_{1,k+1} ∩ U_{2,k+1}.
1398
NABIL I. AL-NAJJAR
From the fact that λ(A) = limk→∞ λNk (A), it follows that for all k large enough, ANk ∈ CNk β . Combining (A.4) and (A.5), we obtain s(x) dλ − λ(A) < α A
The conclusion of the lemma follows since α was arbitrary. Finally, the case where λ(A) = 0 follows from the fact that the above applies to the complement of A, since λ(Ac ) = 1, and the additivity of λ. Q.E.D. To conclude the proof of the proposition, fix an s ∈ Q and define γ(A) ≡ (1 + ε)λ(A+ ) + (1 − ε)λ(A− ) As noted earlier, this is just the integral A s(x) dλ of the function s with respect to λ. We first verify that γ is a finitely additive probability measure. From the additivity of the integral, it immediately follows that γ is an additive set function. Positivity of γ follows as long as ε ∈ [0 1]. Finally, note that Xd ∩ XN ∈ CN1 for each N, so s(x) dλ = 1 and (A.3) imply γ(Xd ) = 1. That λ and γ coincide on C (hence necessarily on C ) follows from Lemma A.3. All that remains to prove is that the perturbed measure γ must differ from λ on some (in fact, many) events outside C . Take the event B ≡ {x : s(x) = 1 − ε}. From s(x) dλ = 1 and (A.3), we have λ(B) = 05, yet
γ(B) ≡ s(x) dλ = (1 − ε)λ(B) = λ(B) (A.6) B
so B ∈ / C (since, by the earlier part of the argument, λ and γ coincide on C ). This completes the proof of Proposition A.1. Q.E.D. From Theorem 3, we know that this proof must break down somewhere if the outcome space were a complete separable metric space with countably additive probabilities. A natural question is, “At what stage was finite additivity needed and the implications of Theorem 3 avoided?” For example, the construction of the perturbation s by i.i.d. sampling is not possible in an uncountable, complete, separable outcome space with countably additive probabilities. The reason is that a typical sample path s is nonmeasurable so the perturbed measure γ(A) = s · χA dλ cannot be meaningfully defined. Of course, I do not claim that finding s via random sampling is the only feasible procedure to construct perturbations, but only point out that this particular procedure breaks down in standard spaces—as it should, given Theorem 3.
PROOF OF THEOREM 4: Given a learning strategy {(Cn εn tn )}∞ n=1 , index the events defined in (A.2) by n, writing each as Qn to make explicit its dependence on Cn . Consider now the event ∞ n=1
Qn
DECISION MAKERS AS STATISTICIANS
1399
and note that it must have π-probability 1. Let s be any element of this set. It is clear that the remainder of the argument in Appendix A.4.3 goes through unaltered. Q.E.D. PROOF OF COROLLARY 1: From (A.6) and the fact that λ(B) = 05, we can write γ(B) = λ(B) − 05ε so that |γ(B) − λ(B)| = 05ε Varying ε within the interval (0 1] yields the desired conclusion. That there are uncountably many such B’s follows from the fact that the distribution on admissible perturbations is atomless, and hence its support must be uncountable. Q.E.D. A.5. Proof of Theorem 1 Writing n = #Xf , the VC dimension of 2X f is n. The first claim follows from the fact that there is a constant K such that 2 sup P ∞ s : sup |ν t (A s) − P(A)| > ε < Kt VC e−tε /32 (A.7) P∈P
A∈C
See Devroye, Gyorfi, and Lugosi (1996).33 For the second part, a lower bound on the amount of data needed was shown by Ehrenfeucht, Haussler, Kearns, and Valiant (1989)34 to be (A.8)
t≥
VC − 1 32ε
Applying this bound with VC = n, and holding t and ε fixed while increasing n yields the result. REFERENCES AL -NAJJAR, N. I. (2007): “Finitely Additive Representation of Lp Spaces,” Journal of Mathematical Analysis and Applications, 330, 891–899. [1385] 33
For another take on the problem, see Pollard (1984). A characterization in terms of samples appears in Talagrand (1987). 34 See also Devroye, Gyorfi, and Lugosi (1996, Section 14.5).
1400
NABIL I. AL-NAJJAR
AL -NAJJAR, N. I., AND M. PAI (2008): “Coarse Decision Making,” Report, Northwestern University. [1378] AL -NAJJAR, N. I., L. ANDERLINI, AND L. FELLI (2006): “Undescribable Events,” Review of Economic Studies, 73, 849–868. [1386] AUMANN, R. J. (1987): “Correlated Equilibrium as an Expression of Bayesian Rationality,” Econometrica, 55, 1–18. [1388] BEWLEY, T. (1986): “Knightian Decision Theory: Part I,” Discussion Paper 807, Cowles Foundation. [1387] (1988): “Knightian Decision Theory and Econometric Inference,” Discussion Paper 868, Cowles Foundation. [1373,1387] BHASKARA RAO, K. P. S., AND M. BHASKARA RAO (1983): Theory of Charges. New York: Academic Press. [1381,1394] BILLINGSLEY, P. (1995): Probability and Measure (Third Ed.). New York: Wiley-Interscience. [1374,1381] DE FINETTI, B. (1974): Theory of Probability, Vols. 1 and 2. New York: Wiley. [1385] DEVROYE, L., L. GYORFI, AND G. LUGOSI (1996): A Probabilistic Theory of Pattern Recognition. Berlin: Springer Verlag. [1379,1392,1394,1395,1399] DUBINS, L. E., AND L. J. SAVAGE (1965): How to Gamble if You Must. Inequalities for Stochastic Processes. New York: McGraw-Hill. [1389] EHRENFEUCHT, A., D. HAUSSLER, M. KEARNS, AND L. VALIANT (1989): “A General Lower Bound on the Number of Examples Needed for Learning,” Information and Computation, 82, 247–261. [1399] GAJDOS, T., T. HAYASHI, J.-M. TALLON, AND J.-C. VERGNAUD (2008): “Attitude Toward Imprecise Information,” Journal of Economic Theory, 140, 27–65. [1388] GILBOA, I., AND D. SCHMEIDLER (1989): “Maxmin Expected Utility With Nonunique Prior,” Journal Mathematical Economics, 18, 141–153. [1388] GILBOA, I., F. MACCHERONI, M. MARINACCI, AND D. SCHMEIDLER (2008): “Objective and Subjective Rationality in a Multiple Prior Model,” Report, Collegio Carlo Alberto, Universita di Torino. [1387] HORN, A., AND A. TARSKI (1948): “Measures in Boolean Algebras,” Transactions of the American Mathematical Society, 64, 467–497. [1381] KALAI, G. (2003): “Learnability and Rationality of Choice,” Journal of Economic Theory, 113, 104–117. [1372] LEHRER, E. (2005): “Partially-Specified Probabilities: Decisions and Games,” Report, Tel-Aviv University. [1388] MILLER, G. (1981): “Trends and Debates in Cognitive Psychology,” Cognition, 10, 215–225. [1371] POLLARD, D. (1984): Convergence of Stochastic Processes. Berlin: Springer Verlag. [1399] PURVES, R. A., AND W. D. SUDDERTH (1976): “Some Finitely Additive Probability,” Annals of Probability, 4, 259–276. [1389,1390] (1983): “Finitely Additive Zero–One Laws,” Sankhy¯ a Series A, 45, 32–37. [1390,1392] ROYDEN, H. L. (1968): Real Analysis (Second Ed.). New York: MacMillan. [1379,1392] SALANT, Y. (2007): “On the Learnability of Majority Rule,” Journal of Economic Theory, 135, 196–213. [1372] SAUER, N. (1972): “On the Density of Families of Sets,” Journal of Combinatorial Theory, 13, 145–147. [1394] TALAGRAND, M. (1987): “The Glivenko–Cantelli Problem,” Annals of Probability, 15, 837–870. [1399] VAPNIK, V. N., AND A. Y. CHERVONENKIS (1971): “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities,” Theory of Probability and Its Applications, 16, 264–280. [1375,1379]
DECISION MAKERS AS STATISTICIANS
1401
ZHANG (1999): “Qualitative Probabilities on λ-Systems,” Mathematical Social Sciences, 38, 11–20. [1381]
Dept. of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University, Evanston, IL 60208, U.S.A.; al-najjar@ northwestern.edu. Manuscript received October, 2007; final revision received February, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1403–1445
INFERENCE FOR CONTINUOUS SEMIMARTINGALES OBSERVED AT HIGH FREQUENCY BY PER A. MYKLAND AND LAN ZHANG1 The econometric literature of high frequency data often relies on moment estimators which are derived from assuming local constancy of volatility and related quantities. We here study this local-constancy approximation as a general approach to estimation in such data. We show that the technique yields asymptotic properties (consistency, normality) that are correct subject to an ex post adjustment involving asymptotic likelihood ratios. These adjustments are derived and documented. Several examples of estimation are provided: powers of volatility, leverage effect, and integrated betas. The first order approximations based on local constancy can be over the period of one observation or over blocks of successive observations. It has the advantage of gaining in transparency in defining and analyzing estimators. The theory relies heavily on the interplay between stable convergence and measure change, and on asymptotic expansions for martingales. KEYWORDS: Consistency, cumulants, contiguity, continuity, discrete observation, efficiency, equivalent martingale measure, Itô process, leverage effect, likelihood inference, realized beta, realized volatility, stable convergence.
1. INTRODUCTION AN IMPORTANT DEVELOPMENT in econometrics and statistics is the invention of estimation of financial volatility on the basis of high frequency data. The econometric literature first focused on instantaneous volatility (Foster and Nelson (1996), Comte and Renault (1998)). The econometrics of integrated volatility was pioneered by Andersen, Bollerslev, Diebold, and Labys (2001, 2003), Barndorff-Nielsen and Shephard (2001, 2002), and Dacorogna, Gençay, Müller, Olsen, and Pictet (2001). Earlier results in probability theory go back to Jacod (1994) and Jacod and Protter (1998). Our own work in this area goes back to Zhang (2001) and Mykland and Zhang (2006). Further references are given throughout in the Introduction and in Section 2.5. The quantities that can be estimated from high frequency data are not confined to volatility. Problems that are attached to the estimation of covariations between two processes are discussed, for example, by Barndorff-Nielsen and Shephard (2004a), Hayashi and Yoshida (2005), and Zhang (2009). There is a literature on power variations and bi- and multipower estimation (see Examples 1 and 2 in Section 2.5 for references). There is an analysis of variance/variation (ANOVA) based on high frequency observations (see Section 4.4.2). We shall see in this paper that one can also estimate such quantities as integrated betas and the leverage effect. 1 We are grateful to Oliver Linton, Nour Meddahi, Eric Renault, Neil Shephard, Dan Christina Wang, Ting Zhang, and a co-editor and two referees for helpful comments and suggestions. Financial support from the National Science Foundation under Grants DMS 06-04758 and SES 06-31605 is also gratefully acknowledged.
© 2009 The Econometric Society
DOI: 10.3982/ECTA7417
1404
P. A. MYKLAND AND L. ZHANG
The literature on high frequency data often relies on moment estimators derived from assuming local constancy of volatility and related quantities. To be specific, if ti 0 = t0 < t1 < · · · < tn = T , are observation times, it is assumed that one can validly make one period approximations of the form ti+1 fs dWs ≈ fti Wti+1 − Wti (1) ti
where {Wt } is a standard Brownian motion. The cited work on mixed normal distributions uses similar approximations to study stochastic variances. In the case of volatility, one can, under weak regularity conditions, make the approximation 2 T ti+1 (2) σt dWt − σt2 dt ti
i
≈
i
0
2
σti Wti+1 − Wti
2
−
σt2i (ti+1 − ti )
i
without affecting asymptotic properties (the error in (2) is of op (n−1/2 )). Thus the asymptotic distribution of realized volatility (sums of squared returns) can be inferred from discrete time martingale central limit theorems. In the special case where the σt2 process is independent of Wt , one can even talk about unbiasedness of the estimator. This raises two questions: (i) Can one always invoke approximations (1) and (2) or does the approximation in formula (1) only work for a handful of cases such as volatility? (ii) If one can pretend that volatility characteristics are constant from ti−1 to ti , then can one also pretend constancy over successive blocks of M (M > 1) observations, from, say ti−M to ti ? If this were true, a whole arsenal of additional statistical techniques would become available. This paper will show that, subject to some adjustments, the answer to both these questions is yes. There are two main gains from this. One is easy derivation of asymptotic results. The other is to give a framework for how to set up inference procedures as follows. If σt is treated as constant over a block of M observations, then the returns (the first differences of the observations) are simply Gaussian, and one can therefore think “parametrically” when setting up and analyzing estimators. Once parametric techniques have been used locally in each block, estimators of integrated quantities may then be obtained by aggregating local estimators. Any error incurred from this analysis can be corrected directly in the final asymptotic distribution, using adjustments that we provide. The advantages to thinking parametrically are threefold, as illustrated by examples in Section 4. T Efficiency: In the case of quantities like 0 |σ|rt dt, there can be substantial reduction in asymptotic variance (see Section 4.1).
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1405
Transparency: Section 4.2 shows that the analysis of integrated betas reduces to ordinary least squares regression. Similar considerations apply to the examples (realized quantiles, ANOVA) in Section 4.4. Definition of New Estimators: In the case of the leverage effect, blocking is a sine qua non, as will be clear from Sections 2.5 and 4.3. Local parametric inference appears to have been introduced by Tibshirani and Hastie (1987), and there is an extensive literature on the subject. A review is given in Fan, Farmen, and Gijbels (1998), and this paper should be consulted for further references. See also Chen and Spokoiny (2007) and Cizek, Härdle, and Spokoiny (2007) for recent papers in this area involving volatility. Our current paper establishes, therefore, the connection of high-frequencydata inference to local parametric inference. We make this link with the help of contiguity. It will take time and further research to harvest the existing knowledge in the area of local likelihood for use in high frequency semimartingale inference. In fact, the estimators discussed in the applications section (Section 4) are rather obvious once a local likelihood perspective has been adapted; they are more of a beginning than an end. For example, local adaptation is not considered. We emphasize that the main outcome of this paper is to provide direction on how to create estimators, and to provide an easy way to analyze them. It is, however, perfectly possible to derive asymptotic results for such estimators by other existing methods, as used in many of the papers cited above. In fact, direct proof will permit the most careful study of the precise conditions needed for consistency and mixed asymptotic normality for any given procedure. A different kind of blocking—pre-averaging—was used by Podolskij and Vetter (2009) and Jacod, Li, Mykland, Podolskij, and Vetter (2009) in the context of inference in the presence of microstructure noise. In these papers, the (latent) semimartingale is itself given a locally constant approximation. This approximation would not give rise to contiguity in the absence of noise, but we conjecture that contiguity results can be found under common types of microstructure. In the current paper, we do not deal with microstructure. This would be a study in itself and is deferred to a later paper. A follow-up discussion on estimation with moving windows and how to use this technology for asynchronous observations can be found in Mykland and Zhang (2009). The plan for the paper is that Section 2 discusses measure changes in detail and their relationship to high frequency inference. It then analyzes the one period (M = 1) discretization. Section 3 discusses longer block sizes (M > 1). Major applications are given in Section 4, with a summary of the methodology (for the scalar case) in Section 4.5. A Reader’s Guide: We emphasize that the two approximations (to block size M = 1 and then from M = 1 to M > 1) are quite different in their methodologies. If you are only interested in the one period approximation, the material to read is Section 2 and Appendix A.1. (Though consequences for estimation
1406
P. A. MYKLAND AND L. ZHANG
of the leverage effect are discussed in Section 4.3.) The block (M > 1) approximation is mainly described in Sections 3 and 4, and Appendices A.2 and A.3. An alternative way to read this paper is to head for Section 4.5 first; this section should in any case be consulted early on and kept in mind while reading the rest of the paper. 2. APPROXIMATE SYSTEMS We here discuss the discretization to block size M = 1. As a preliminary, we define some notation, and discuss measure change and stable convergence. This section can be read independently of the rest of the paper. 2.1. Data Generating Mechanism In general, we shall work with a broad class of continuous semimartingales, namely Itô processes. (p)
DEFINITION 1: A p-variate process Xt = (Xt(1) Xt )T is called an Itô process provided it satisfies (3)
dXt = μt dt + σt dWt
X0 = x0
where μt and σt are adapted locally bounded random processes, of dimension p and p × p, respectively, and Wt is a p-dimensional Brownian motion. The underlying filtration will be called (Ft ). The probability distribution will be called P. If we set (4)
ζt = σt σtT
(where T in this case means transpose), then the (matrix) integrated covariance process is given as (5)
X Xt =
t
ζu du 0
The process (5) is also known as the quadratic covariation of X. We shall sometimes use “integrated volatility” as shorthand in the scalar (p = 1) case. We shall suppose that the process Xt is observed at times 0 = t0 < t1 < · · · < tn = T . Thus, for the moment, we assume synchronous observation of all the p components of the vector Xt . We explaine in Mykland and Zhang (2009) how the results encompass the asynchronous case.
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1407
ASSUMPTION 1 —Sampling Times: In asymptotic analysis, we suppose that tj = tnj (the additional subscript will sometimes be suppressed). The grids Gn = {0 = tn0 < tn1 < · · · < tnn = T } will not be assumed to be nested when n varies. We then do asymptotics as n → ∞. The basic assumption is that (6)
max |tnj − tnj−1 | = o(1)
1≤i≤n
We also suppose that the observation times tnj are nonrandom, but they are allowed to be irregularly spaced. By conditioning, this means that we include the case of random times independent of the Xt process. We thus preclude dependence between the observation times and the process. Such dependence does appear to exist in some cases (cf. Renault and Werker (2009)), and we hope to return to this question in a later paper. 2.2. A Simplifying Strategy for Inference When carrying out inference for observations in a fixed time interval [0 T ], the process μt cannot be consistently estimated. This follows from Girsanov’s theorem (see, for example, Chapter 3.5 of Karatzas and Shreve (1991)). For most purposes, μt simply drops out of the calculations: it is only a nuisance parameter. It is also a nuisance in that it complicates calculations substantially. To deal with this most effectively, we shall borrow an idea from asset pricing theory, and consider a probability distribution P ∗ which is measure theoretically equivalent to P and under which Xt is a (local) martingale (Ross (1976), Harrison and Kreps (1979), Harrison and Pliska (1981); see also Duffie (1996)). Specifically, under P ∗ , (7)
dXt = σt dWt ∗
X0 = x0
where Wt ∗ is a P ∗ -Brownian motion. Following Girsanov’s theorem, T dP ∗ 1 T T log =− (8) σt−1 μt dWt − μt (σt σtT )−1 μt dt dP 2 0 0 with (9)
dWt ∗ = dWt + σt−1 μt dt
Our plan is now to carry out the analysis under P ∗ and adjust results back to P using the likelihood ratio (Radon–Nikodym derivative) dP ∗ /dP. SpecifiT T cally, suppose that θ is a quantity to be estimated (such as 0 σt2 dt, 0 σt4 dt, or the leverage effect). An estimator θˆ n is then found with the help of P ∗ and an asymptotic result is established whereby, say, (10)
n1/2 (θˆ n − θ) → N(b a2 ) L
1408
P. A. MYKLAND AND L. ZHANG
under P ∗ . It then follows directly from the measure theoretic equivalence that n1/2 (θˆ n − θ) also converges in law under P. In particular, consistency and rate of convergence are unaffected by the change of measure. We emphasize that this is due to the finite (fixed) time horizon T . The asymptotic law may be different under P ∗ and P. While the normal distribution remains, the distributions of b and a2 (if random) may change. The main concept is stable convergence. DEFINITION 2: Suppose that all relevant processes (Xt , σt , etc.) are adapted to filtration (Ft ). Let Zn be a sequence of FT -measurable random variables. We say that Zn converges stably in law to Z as n → ∞ if Z is measurable with respect to an extension of FT so that for all A ∈ FT and for all bounded continuous g, EIA g(Zn ) → EIA g(Z) as n → ∞. The same definition applies to triangular arrays. In the context of (10), Zn = n1/2 (θˆ n − θ) and Z = N(b a2 ). For further discussion of stable convergence, see Rényi (1963), Aldous and Eagleson (1978), Chapter 3 of Hall and Heyde (1980, p. 56), Rootzén (1980), and Section 2 of Jacod and Protter (1998, pp. 169–170). With this tool in hand, assume that the convergence in (10) is stable. Then the same convergence holds under P. The technical result is as follows. PROPOSITION 1: Suppose that Zn is a sequence of random variables which converges stably to N(b a2 ) under P ∗ . By this we mean that N(b a2 ) = b + aN(0 1), where N(0 1) is a standard normal variable independent of FT ; also a and b are FT -measurable. Then Zn converges stably in law to b + aN(0 1) under P, where N(0 1) remains independent of FT under P. dP ∗ dP PROOF: EIA g(Zn ) = E ∗ dP ∗ IA g(Zn ) → E dP ∗ IA g(Z) = EIA g(Z) by unidP dP Q.E.D. form integrability of dP ∗ IA g(Zn ) and since dP ∗ is FT -measurable.
Proposition 1 substantially simplifies calculations and results. In fact, the same strategy will be helpful for the localization results that come next in the paper. It will turn out that the relationship between the localized and the continuous processes can also be characterized by absolute continuity and likelihood ratios. REMARK 1: It should be noted that after adjusting back from P ∗ to P, the process μt may show up in expressions for asymptotic distributions. For instances of this, see Examples 3 and 5 below. One should always keep in mind that drift most likely is present and may affect inference. To use the measure change (8) in the subsequent development, we impose the following condition.
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1409
ASSUMPTION 2—Structure of the Instantaneous Volatility: We assume that (p) the matrix process σt is itself an Itô processes and that if λt is the smallest eigen(p) value of σt , then inft λt > 0 a.s. 2.3. Main Result Concerning One Period Discretization Our main result in this section is that for the purposes of high frequency inference, one can replace the system (7) by the approximation (11)
Pn∗ :
Xtnj+1 = σtnj W˘ tnj+1
for j = 0 n − 1; X0 = x0
where Xtnj+1 = Xtnj+1 − Xtnj , and similarly for W˘ tnj+1 and tnj+1 . One can view (11) as holding σt constant for one period, from tnj to tnj+1 . We call this a one period discretization (or localization). We are not taking a position on what the W˘ t process looks like in continuous time, or even on whether it exists for other t than the sampling times tnj . The only assumption is that the random variables W˘ tnj+1 are independent for different j (for fixed n) and that W˘ tnj+1 has conditional distribution N(0 Itnj+1 ). We here follow the convention from options pricing theory whereby, when the measure changes, the process (Xt ) does not change, while the driving Brownian motion changes. To formally describe the nature of our approximations, we go through two definitions: DEFINITION 3—Specification of the Time Discrete Process Subject to Measure Change: We have (12)
= Xtnj Ut(1) nj = σtnj σ W tnj σ σ tnj Ut(2) nj Ut(2) Utnj = Ut(1) nj nj
for j = 0 n. Here, the quantity σ W t is a three-dimensional (p × p × p) object (tensor) consisting of elements σ (r1 r2 ) W (r3 ) t (r1 = 1 p r2 = 1 p r3 = 1 p), where the prime denotes differentiation with respect to time. Similarly, σ σ t is a four-dimensional tensor with elements of the form σ (r1 r2 ) σ (r3 r4 ) t . Finally, denote by Xnj the σ-field generated by Utnι , ι = 0 j. We note here that σ W t and σ σ t are the usual continuous time quadratic variations, but they are only observed at the times tnj . Through Ut(2) , nj however, we do incorporate information about the continuous time system into discrete time observations: the σt process, the leverage effect (via the tensor σ W t ), and the volatility of volatility (via σ σ t ).
1410
P. A. MYKLAND AND L. ZHANG
For each n, the approximate probability Pn∗ will live on the filtration (Xnj )0≤j≤n as follows: DEFINITION 4—Specification of the First Order Approximation: Define the probability Pn∗ recursively as follows: (i) U0 has same distribution under Pn∗ as under P ∗ . given U0 Utnj is (ii) For j ≥ 0, the conditional Pn∗ distribution of Ut(1) nj+1 given by (11). given U0 Utnj (iii) For j ≥ 0, the conditional Pn∗ distribution of Ut(2) nj+1 (1) ∗ Utnj+1 is the same as under P . To the extent that conditional densities are defined, one can describe the relationship between P ∗ and Pn∗ as f Utn1 Utnj Utnn |U0 (13) =
n n f Ut(1) |U U f Ut(2) |U0 Utnj−1 Ut(1) 0 tnj−1 nj nj nj j=1
altered from P ∗ to Pn∗
j=1
unchanged from P ∗ to Pn∗
where f (y|x) is the density of the regular conditional distribution of y given x with respect to a reference (say, Lebesgue) measure. To state the main theorem, define (14)
d ζˇ t = σt−1 dζt (σ T )−1 t
and (15)
(r r2 r3 )
kt 1
= ζˇ (r1 r2 ) W (r3 ) t [3]
where the [3] means that the right hand side of (15) is a sum over three terms, where r3 can change position with either r1 or r2 : ζˇ (r1 r2 ) W (r3 ) t [3] = ζˇ (r1 r2 ) W (r3 ) t + ζˇ (r1 r3 ) W (r2 ) t + ζˇ (r3 r2 ) W (r1 ) t (note that ζˇ (r1 r2 ) W (r3 ) t is symmetric in its two first arguments). For further discussion of this notation, (r r r ) see Chapter 2.3 of McCullagh (1987, pp. 29–30). Note that ktnj1 2 3 is measurable with respect to the σ-field Xnj generated by Utnι , ι = 0 j. Finally, set p 1 T (r1 r2 r3 ) 2 Γ0 = (16) dt kt 24 0 r r r =1 1 2 3
In the univariate case, we have the representations (17)
kt = 3
1 2 1 σ W t = 6 σ W t = 6log σ W t 2 σt σt
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1411
and (18)
1 Γ0 = 24
T
k2t dt 0
We now state the main result for one period discretization. THEOREM 1: P ∗ and Pn∗ are mutually absolutely continuous on the σ-field Xnn generated by Utnj , j = 0 n. Furthermore, let (dP ∗ /dPn∗ )(Utn0 Utnj Utnn ) be the likelihood ratio (Radon–Nikodym derivative) on Xnn . Then L dP ∗ 1 1/2 (19) Utn0 Utnj Utnn → exp Γ0 N(0 1) − Γ0 dPn∗ 2 stably in law, under Pn∗ , as n → ∞. N(0 1) is independent of FT . Based on Theorem 1, one can (for a fixed time period) carry out inference under the model (11), and asymptotic results will transfer back to the continuous model (7) by absolute continuity. This is much the same strategy as the one to eliminate the drift described in Section 2.2. The main difference is that we use an asymptotic version of absolute continuity. This concept is known as contiguity and is well known in classical statistical literature (see Remark 2 below). We state the following result in analogy with Proposition 1. A sequence Zn is called tight if every subsequence has a further subsequence which converges in law (see Chapter VI of Jacod and Shiryaev (2003)). Tightness is the compactness concept which goes along with convergence in law. COROLLARY 1: Suppose that Zn (say, n1/2 (θˆ n − θ)) is tight in the sense of stable convergence under Pn∗ . The same statement then holds under P ∗ and P. The converse is also true. In particular, if an estimator is consistent under Pn∗ , it is also consistent under P (and P). Unlike the situation in Section 2.2, the stable convergence in Corollary 1 does not assure that n1/2 (θˆ n − θ) is asymptotically independent of the normal distribution N(0 1) in Theorem 1. It only assures independence from FT -measurable quantities. The asymptotic law of n1/2 (θˆ n − θ) may, therefore, require an adjustment from Pn∗ to P ∗ . ∗
REMARK 2: Theorem 1 says that P ∗ and the approximation Pn∗ are contiguous in the sense of Hájek and Sidak (1967, Chapter IV), LeCam (1986), LeCam and Yang (2000), and Jacod and Shiryaev (2003, Chapter IV). This follows from Theorem 1 since dP ∗ /dPn∗ is uniformly integrable under Pn∗ (since the sequence dPn∗ /dP ∗ is nonnegative, the limit also integrates to 1 under P ∗ ).
1412
P. A. MYKLAND AND L. ZHANG
REMARK 3: A nonzero σ W t can occur in cases other than those what is usually termed “leverage effect.” An important instance of this occurs in Section 4.2, where σ W t can be nonzero due to the nonlinear relationship between two securities. 2.4. Adjusting for the Change From P ∗ to Pn∗ Following (11), write (20)
W˘ tnj+1 = σt−1 Xtnj+1 nj
Under the approximating measure Pn∗ , W˘ tnj+1 has distribution N(0 Itnj+1 ) and is independent of the past. Define the third order Hermite polynomials by hr1 r2 r3 (x) = xr1 xr2 xr3 − r1 r2 r3 x δ [3], where, again, [3] represents the sum over all three possible terms for this form, and δr2 r3 = 1, if r2 = r3 , and = 0, otherwise. In the univariate case, h111 (x) = x3 − 3x. Set (21)
(0) n
M
p n−1 W˘ tnj+1 1 (r1 r2 r3 ) 1/2 = (tnj+1 ) ktnj hr1 r2 r3 12 j=0 (tnj+1 )1/2 r r r =1 1 2 3
(r r2 r3 )
Note that ktnj1
is Xnj -measurable. The adjustment result is now as follows:
THEOREM 2: Assume the setup in Theorem 1. Suppose that under Pn∗ , (Zn Mn(0) ) converges stably to a bivariate distribution b + aN(0 I), where N(0 I) is a bivariate standard normal vector independent of FT , and where the vector b = (b1 b2 )T and the symmetric 2 × 2 matrix a are FT -measurable. Set A = aaT . It is then the case that Zn converges stably under P ∗ to b1 + A12 + (A11 )1/2 N(0 1), where N(0 1) is independent of FT . Note that under the conditions of Theorem 1, Mn(0) converges stably under P to a (mixed) normal distribution with mean zero and (random, but FT measurable) variance Γ0 (so b2 = 0 and A22 = Γ0 ). Thus, when adjusting from Pn∗ to P ∗ , the asymptotic variance of Zn is unchanged, while the asymptotic bias may change. ∗ n
REMARK 4: The logic behind this result is as follows. On the one hand, the asymptotic variance remains unchanged in Theorem 2 as a special case of a stochastic process property (the preservation of quadratic variation under limit operations). We refer to the discussion in Chapter VI.6 in Jacod and Shiryaev (2003, pp. 376–388), for a general treatment. On the other hand, it follows from the proof of Theorem 1 that (22)
log
1 dP ∗ = Mn(0) − Γ0 + op (1) ∗ dPn 2
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1413
Thus, to the extent that the random variables Zn are correlated with Mn(0) , their asymptotic mean will change from Pn∗ to P ∗ . This change of mean is precisely the value A12 , which is the asymptotic covariance of Zn and Mn(0) . This is a standard phenomenon in situations of contiguity (cf. Hájek and Sidak (1967)). 2.5. Some Initial Examples The following discussion is meant for illustration only. The in-depth applications are in Section 4. We here only consider one-dimensional systems (p = 1). EXAMPLE 1—Integral of Absolute Powers of X: For r > 0, it is customary T n to estimate 0 |σt |r dt by a scaled version of j=1 |Xtnj |r . A general theory for this is given by Barndorff-Nielsen and Shephard (2004b) and Jacod (1994, 2008). For the important cases r = 2 and r = 4, see also Barndorff-Nielsen and Shephard (2002), Jacod and Protter (1998), Mykland and Zhang (2006), Zhang (2001), and other work by the same authors. To reanalyze this estimator with the technology of this paper, note that under r/2 , whereby Pn∗ , the law of |Xtnj+1 |r given Xnj is |σtnj N(0 1)|r tnj+1 (23)
r r r/2 En∗ Xtnj+1 | Xnj = σtnj E|N(0 1)|r tnj+1 r 2r r Var∗n Xtnj+1 | Xnj = σtnj Var |N(0 1)|r tnj+1 r Cov∗n Xtnj+1 W˘ tnj+1 | Xnj = 0
Thus, a natural estimator of θ = (24)
θˆ n =
T 0
|σt |r dt becomes
n−1 r 1 1−r/2 Xtnj+1 tnj+1 r E|N(0 1)| j=0
Absolute normal moments can be expressed analytically as in (56) in Section 4.1 below. n−1 From (23), it follows that θˆ n − j=0 |σtnj |r tnj+1 is the end point of a martingale orthogonal to W and with discrete time quadratic variation n−1 2 (Var(|N(0 1)|r ))/(E|N(0 1)|r )2 j=0 |σtnj |2r tnj+1 . By the usual martingale central limit considerations (Jacod and Shiryaev (2003)), and since θ − n−1 r −1 j=0 |σtnj | tnj+1 = Op (n ), it follows that (25)
1/2 T Var(|N(0 1)|r ) L 2r ˆ T σt dH(t) n (θn − θ) → Z × (E|N(0 1)|r )2 0 1/2
1414
P. A. MYKLAND AND L. ZHANG
stably in law under Pn∗ , where Z is a standard normal random variable. Here, H(t) is the asymptotic quadratic variation of time (AQVT), given by (26)
n (tnj+1 − tnj )2 n→∞ T t ≤t
H(t) = lim
nj+1
provided that the limit exists. For further references on this quantity, see Zhang (2001, 2006) and Mykland and Zhang (2006). that in the case of equally spaced observations, θˆ n is proportional to Note n r j=1 |Xtnj | ; also H(t) = t. To get from convergence under Pn∗ to convergence under P ∗ , we note that |N(0 1)|r is uncorrelated with N(0 1) and N(0 1)3 . We therefore obtain from Theorems 1 and 2 that the stable convergence in (25) holds under P ∗ . The same is true under the true probability P by Proposition 1. EXAMPLE 2 —Bi- and Multipower Estimators: The same considerations as in Example 1 apply to bi- and multipower estimators (see, in particular, Barndorff-Nielsen and Shephard (2004b) and Barndorff-Nielsen, Graversen, Jacod, Podolskij, and Shephard (2006)). The derivations are much the same. In particular, no adjustment is needed from Pn∗ to P ∗ . EXAMPLE 3—Sum of Third Moments: We here consider quantities of the form (27)
Zn =
n−1 3 n Xtnj+1 T j=0
To avoid clutter, we shall look at the equally spaced case only (tnj+1 = t = T/n for all j n). We shall see in Section 4.3 that quantities similar to (27) can be parlayed into estimators of the leverage effect. For now, we just show what the simplest calculation will bring. An important issue, which sets (27) apart from most other cases, is that there is a need for an adjustment from Pn∗ to P ∗ , and also from P ∗ to P. By the same reasoning as in Example 1, En∗ Xt3nj+1 | Xnj = 0 (28) Var∗n Xt3nj+1 | Xnj = σt6nj Var(N(0 1)3 )t 3 = 15σt6nj t 3 Cov∗n Xt3nj+1 W˘ tnj+1 | Xnj = σt3nj Cov(N(0 1)3 N(0 1))t 2 = 3σt3nj t 2
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1415
L
Thus, Zn is the end point of a Pn∗ martingale and Zn → N(b a2 ) stably under Pn∗ , where T (29) σt3 dWt ∗ b=3 0
a2 = 6
T
σt6 dt 0
REMARK 5—Sample of Calculation: To see in more detail how (29) comes about, let Vt (n) be the P ∗ martingale for which VT(n) = Zn . Let (Xt Vt ) be the process corresponding to the limiting distribution of (Xt Vt (n) ) under Pn∗ . (The prelimiting process is only defined on the grid points tni .) From the two last equations in (28), and by interchanging limits and quadratic variation (Chapter VI.6 in Jacod and Shiryaev (2003, pp. 376-388), cf. Remark 4 above), we get t V V t = 15 (30) σu6 du 0
V W ∗ t = 3
t
σu6 du 0
Now consider the representation dVt = ft dWt ∗ + gt dBt where Bt is a Brownian motion independent of FT (this is by Lévy’s theorem; see, for example, Theorem II.4.4 of Jacod and Shiryaev (2003, p. 102), or Theorem 3.16 of Karatzas and Shreve (1991, p. 157).). From (30), ft2 dt + gt2 dt = 15σt6 dt ft dt = 3σt6 dt In particular, gt2 = 6σt6 . This yields (29). What happens here is that the full quadratic variation of Vt splits into a bias and a variance term. This is due to the nonzero covariation of V and W ∗ . In this example, b = 0. Even more interestingly, the distributional result needs to be adjusted from Pn∗ to P ∗ . To see this, denote h3 (x) = x3 − 3x (the third Hermite polynomial in the scalar case). Then (31) Cov∗n Xt3nj+1 h3 W˘ tnj+1 /t 1/2 | Xnj t 1/2 = σt3nj Cov N(0 1)3 h3 (N(0 1)) t 2 = 6σt3nj t 2
1416
P. A. MYKLAND AND L. ZHANG
Thus, if Mn(0) is as given in Section 2.4, it follows that (Zn Mn(0) ) converge jointly, and stably, under Pn∗ to a normal distribution, where the asymptotic covariance is 1 T (32) kt σt3 dt A12 = 2 0 3 = σ 2 XT 2 since kt σt3 dt = 3σt−2 ζ W t σt3 dt = 3 dζ Xt = 3 dσ 2 Xt . Thus, by TheoL rem 2, under P ∗ , Zn → N(b a2 ) stably, where a2 is as in (29), while T 3 b = 3 (33) σt3 dWt ∗ + σ 2 XT 2 0 We thus have a limit which relates to the leverage effect, which is interesting, but unfortunately obscured by the rest of b , and by the random term with variance a2 . There is finally a need to adjust from P ∗ to P. From (9), we have dWt ∗ = dWt + σt−1 μt dt. It follows that T 3
b =3 (34) σt3 (dWt + σt−1 μt dt) + σ 2 XT 2 0 Thus, b is unchanged from P ∗ to P, but has different distributional properties. In particular, μt now appears in the expression. This is unusual in the high frequency context. It seems to be a general phenomenon that if there is random bias under P ∗ , then μ will occur in the expression for bias under P. This happens again in Example 5 in Section 4.3. A direct derivation of this same limit is given in Example 6 of Kinnebrock and Podolskij (2008). In their notation, σt dt = 2σ −2 dσ 2 Xt . 3. HOLDING σ CONSTANT OVER LONGER TIME PERIODS 3.1. Setup We have shown in the above that it is asymptotically valid to consider systems where σ is constant from one time point to the next. We shall in the following show that it is also possible to consider approximate systems where σ is constant over longer time periods. We suppose that there are Kn intervals of constancy, of the form (τni−1 τni ], where (35) Hn = 0 = τn0 < τn1 < · · · < τnKn = T ⊆ Gn
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1417
If we set (36)
Mni = #{tnj ∈ (τni−1 τni ]} = number of intervals (tnj−1 tnj ] in (τni−1 τni ]
we shall suppose that (37)
max Mni = O(1) as i
n → ∞
from which it follows that Kn is of exact order O(n). We now define the approximate measure, called Qn , given by (38)
X0 = x0 ; for each i = 1 Kn : Q Xtnj+1 = στni−1 Wtnj+1
for tnj+1 ∈ (τni−1 τni ]
To implement this, we use a variation over Definition 4. Formally, we define the approximation as follows. DEFINITION 5—Block Approximation: Define the probability Qn recursively as follows: (i) U0 has the same distribution under Qn as under P ∗ . given U0 Utnj (ii) For j ≥ 0, the conditional Qn distribution of Ut(1) nj+1 Q is given by (38), where Wtnj+1 is conditionally normal with mean zero and variance Itnj+1 . given U0 Utnj (iii) For j ≥ 0, the conditional Qn distribution of Ut(2) nj+1 (1) ∗ Utnj+1 is the same as under P . We can now describe the relationship between Qn and Pn∗ , as follows. Let the Gaussian log likelihood be given by (39)
1 1
(x; ζ) = − log det(ζ) − xT ζ −1 x 2 2
We then obtain the following statement directly. PROPOSITION 2: The likelihood ratio between Qn and Pn∗ is given by (40)
dQn Utn0 Utnj Utnn ∗ dPn =
Xtnj+1 ; ζτni−1 tnj+1
log
i
τni−1 ≤tnj <τni
− Xtnj+1 ; ζtnj tnj+1
1418
P. A. MYKLAND AND L. ZHANG
DEFINITION 6: To measure the extent to which we hold the volatility constant, we define the asymptotic decoupling delay (ADD) by (41) (tnj − τni−1 ) K(t) = lim n→∞
i
tnj ∈(τni−1 τni )∩[0t]
provided the limit exists. From (6) and (37), every subsequence has a further subsequence for which K(·) exists (by Helly’s theorem; see, for example, Billingsley (1995, p. 336). Thus one can take the limits to exist without any major loss of generality. Also, when the limit exists, it is Lipschitz continuous. In the case of equidistant observations and equally sized blocks of M observations, the ADD takes the form (42)
1 K(t) = (M − 1)t 2 3.2. Main Contiguity Theorem for the Block Approximation
We obtain the following main result, which is proved in Appendix A.2. THEOREM 3—Contiguity of Pn∗ and Qn : Suppose that Assumptions 1 and 2 are satisfied. Assume that the asymptotic decoupling delay (K, equation (41)) exists. Set 1 −1 (43) XtTnj+1 ζt−1 X − ζτ−1 t Zn(1) = t nj+1 nj+1 nj ni−1 2 i t ∈[τ τ ) nj
ni−1
ni
and let Mn(1) be the end point of the Pn∗ -martingale part of Zn(1) (see (A.25) and (A.27) in Appendix A.2 for the explicit formulas). Define 1 T (44) tr(ζt−2 ζ ζ t ) dK(t) Γ1 = 2 0 where tr denotes the trace of the matrix. Then, as n → ∞, Mn(1) converges stably in law under Pn∗ to a normal distribution with mean zero and variance Γ1 . Also, under Pn∗ , (45)
log
1 dQn = Mn(1) − Γ1 + op (1) dPn∗ 2
Furthermore, if Mn(0) is as defined in (21), then the pair (Mn(0) Mn(1) ) converges stably under Pn∗ to (Γ01/2 V0 Γ11/2 V1 ), where V0 and V1 are independent and identically distributed (i.i.d.) N(0 1), and independent of FT .
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1419
The theorem says that Pn∗ and the approximation Qn are contiguous (cf. Remark 2 in Section 2.3). By the earlier Theorem 1, it follows that Qn and P ∗ (and P) are contiguous. In particular, as before, if an estimator is consistent under Qn , it is also consistent under P ∗ and P. Rates of convergence (typically n1/2 ) are also preserved, but the asymptotic distribution may change. EXAMPLE 4: For a scalar process of the form dXt = μt dt + σt dWt , and with equidistant observations of X, Γ1 in (44) can be written M − 1 T −4 2 2
(46) σt σ σ t dt Γ1 = 4 0 T From (17) and (18), Γ0 = 38 0 σt−6 (σ 2 X t )2 dt. Thus, Γ0 is related to the leverage effect, while Γ1 is related to the volatility of volatility. In the case of a Heston (1993) model, where dσt2 = κ(α − σt2 ) dt + γσt dBt and B is a Brownian motion correlated with W , dB W t = ρ dt, one obtains T T 3 1 2 2 −2 Γ0 = (ργ) (47) σt dt Γ1 = γ (M − 1) σt−2 dt 8 4 0 0 REMARK 6 —Which Probability?: We have now done several approximations. The true probability is P and we are proposing to behave as if it is Qn . We thus have the alterations of probability (48)
log
dP dP ∗ dPn∗ dP = log + log + log ∗ ∗ dQn dP dPn dQn
To make matters slightly more transparent, we have stated Theorem 3 under the same probability (Pn∗ ) as Theorems 1 and 2. Since computations would normally be made under Qn , however, we note that Theorem 2 applies equally Q if one replaces Pn∗ by Qn , and Mn(0) by Mn(0Q) , given as in (21), with Wtnj+1 replacing W˘ tnj+1 . (Since Mn(0Q) = Mn(0) + op (1).) Similarly, if one lets Mn(1Q) be the end point of the Qn -martingale part of −Zn(1) , one gets the same stable convergence under Qn . Obviously, (45) should be replaced by (49)
log
1 dPn∗ = Mn(1Q) − Γ1 + op (1) dQn 2
and Mn(1Q) = −Mn(1) + Γ1 + op (1). 3.3. Measure Change and Hermite Polynomials The three measure changes in Remark 6 turn out to all have a representation in terms of Hermite polynomials.
1420
P. A. MYKLAND AND L. ZHANG
Recall that the standardized Hermite polynomials are given by hr1 (x) = xr1 , hr1 r2 (x) = xr1 xr2 − δr1 r2 , and hr1 r2 r3 (x) = xr1 xr2 xr3 − xr1 δr2 r3 [3], where, again, [3] represents the sum over all three possible combinations, and δr2 r3 = 1, if r2 = r3 , and = 0 otherwise. In the scalar case, h1 (x) = x, h11 (x) = x2 − 1, and h111 (x) = x3 − 3x. From Remark 6, (50)
Q p n−1 Wtnj+1 1 (r1 r2 r3 ) 1/2 = (tnj+1 ) ktnj hr1 r2 r3 M 12 j=0 (tnj+1 )1/2 r1 r2 r3 =1 1 στni−1 tr στTni−1 ζt−1 − ζτ−1 Mn(1Q) = − nj ni−1 2 i t ∈(τ τ ] (0Q) n
nj
× h··
ni−1
ni
Q
Wtnj+1 (tnj+1 )1/2
Similarly, define a discretized version of M (G) = (51)
Mn(GQ) =
T 0
σt−1 μt dWt ∗ by
Q n−1 Wtnj+1 (tnj+1 )1/2 στ−1 μ h τni−1 · ni−1 (tnj+1 )1/2 j=0
(G is for Girsanov; h· is the vector of first order Hermite polynomials, similarly h·· is the matrix of second order such polynomials). We also set T ΓG = (52) μTt (σtT σt )−1 μt dt 0
We therefore can summarize of our results: (53)
dP 1 = Mn(GQ) − ΓG + op (1) dP ∗ 2 ∗ 1 dP = Mn(0Q) − Γ0 + op (1) log ∗ dPn 2 log
log
1 dPn∗ = Mn(1Q) − Γ1 + op (1) dQn 2
Furthermore, by the Hermite polynomial property, we obtain that these three martingales have, by construction, zero predictable covariation (under Qn ). In particular, the triplet (Mn(GQ) Mn(0) Mn(1) ) converges stably to (M (G) Γ01/2 V0 Γ11/2 V1 ), where V0 and V1 are i.i.d. N(0 1), and independent of FT . REMARK 7: The term Mn(GQ) is in many ways different from Mn(0) and Mn(1) . The convergence of the former is in probability, while the latter converge only
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1421
in law. Thus, for example, the property discussed in Remark 4 (see also Theorem 4 in the next section) does not apply to Mn(GQ) . If Zn and Mn(GQ) have joint covariation, this yields a smaller asymptotic variance for Zn , but also bias. For instances of this, see Example 3 in Section 2.5 and Example 5 in Section 4.3. 3.4. Adjusting for the Change From P ∗ to Qn The adjustment result is now similar to that of Section 2.4: THEOREM 4: Assume the setup in Theorems 1–3. Suppose that under Qn , (Zn Mn(0) Mn(1) ) converges stably to a trivariate distribution b + aN(0 I), where N(0 I) is a trivariate vector independent of FT , where the vector b = (b1 b2 b3 )T and the symmetric 3 × 3 matrix a are FT -measurable. Set A = aaT . Then Zn converges stably under P ∗ to b1 + A12 + A13 + (A11 )1/2 N(0 1), where N(0 1) is independent of FT . Recall that b2 = b3 = A23 = 0, A22 = Γ0 , and A33 = Γ1 . The proof is the same as for Theorem 2. Theorem 4 states that when adjusting from Qn to P ∗ , the asymptotic variance of Zn is unchanged, while the asymptotic bias may change. 4. FIRST APPLICATIONS We here discuss various applications of our theory. For simplicity, assume in the following that sampling is equispaced (so tnj = tn = T/n for all j). The question of irregular sampling is discussed in Mykland and Zhang (2009). Except in Sections 4.2 and 4.4.2, we also take (Xt ) to be a scalar process. We take the block size M to be independent of i (except possibly for the first and last block, and this does not matter for asymptotics). Define 2 1 Xtnj − X τni (54) σˆ τ2ni = t n (Mn − 1) t ∈(τ τ ] nj
X τni =
1 Mn
ni
ni+1
Xtnj =
tnj ∈(τni τni+1 ]
1 Xτni+1 − Xτni Mn
To analyze estimators, denote by Yni the information at time τni . Note that Yni = Xnj , where j is such that tnj = τni . 4.1. Estimation of Integrals of |σt |r We return to the question of estimating T |σt |r dt θ= 0
1422
P. A. MYKLAND AND L. ZHANG
n We shall not use estimators of the form j=1 |Xtnj |r , as in Example 1. We show how to get more efficient estimators by using the block approximation. 4.1.1. Analysis We observe that under Qn , the Xtnj+1 are i.i.d. N(0 στ2ni tn ) within each block. From the theory of uniformly minimum variance unbiased (UMVU) estimation (see, for example, Lehmann (1983)), the optimal estimator of |στni |r is r 2 r/2 στ = c −1 ˆ τni (55) M−1r σ ni This also follows from sufficiency considerations. Here, cMr is the normalizing constant which gives unbiasedness, namely 2 r/2 χM (56) cMr = E M r +M r/2 2 2 = M M 2 where χ2M has the standard χ2 distribution with M degrees of freedom, and is the Gamma function. Our estimator of θ (which is blockwise UMVU under Qn ) therefore becomes r στni (57) θˆ n = (Mt) i
It is easy to see that θˆ n asymptotically has no covariation with any of the Hermite polynomials in Section 3.3 and so, by standard arguments, T 1/2 cM−12r L (58) −1 σt2r dt n1/2 (θˆ n − θ) → N(0 1) T M 2 cM−1r 0 stably in law, under P (and P ∗ , Pn∗ , and Qn ). This is because, under Qn , r Var (Mt) (59) στni Yni i
r/2 2 χM−1 = σ (Mt) c Var (M − 1) T M cM−12r 2r = στni (Mt) −1 2 n cM−1r 2r τni
2 −2 M−1r
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1423
REMARK 8—Not Taking Out the Mean: One can replace σˆ τ2ni by (60)
σ˜ τ2ni =
1 t n Mn
2 Xtnj
tnj ∈(τni τni+1 ]
and take (61)
r στ = c −1 σ˜ 2 r/2 Mr τni ni
and define θ˜ n accordingly. The above analysis goes through. The (random) asymptotic variance becomes T cM2r TM 2 − 1 (62) σt2r dt cMr 0 4.1.2. Asymptotic Efficiency We note that for large M, (63)
asymptotic variance of n1/2 (θˆ n − θ) ↓ T
r2 2
T
σt2r dt 0
This is also the minimal asymptotic variance of the parametric maximum likelihood estimator (MLE) when σ 2 is constant. Thus, by choosing M on the large side, say M = 20, one can get close to parametric efficiency (see Figure 1). To see the gain from the procedure, compare to the asymptotic variance of the estimator in Example 1, which can be written as T c12r T − 1 σt2r dt 2 c1r 0 Compared to the variance in (63), the earlier estimator has asymptotic relative efficiency (ARE) (64)
ARE(estimator from Example 1) asymptotic variance in (63) asymptotic variance of estimator from Example 1 −1 r 2 c12r −1 = 2 2 c1r =
Note that except for r = 2, ARE < 1. Figure 1 gives a plot of the ARE as a function of r. As one can see, there can be substantial gain from using the proposed estimator (57).
1424
P. A. MYKLAND AND L. ZHANG
T FIGURE 1.—Asymptotic relative efficiency (ARE) of three estimators of θ = 0 |σ|rt dt as a function of r. The dotted curve corresponds to the traditional estimator, which is proportional to nj=1 |Xtnj |r . The solid and dashed lines are the ARE’s of the block based estimators using, respectively, σˆ (solid) and σ˜ (dashed). Block sizes M = 20 and M = 100 are given. The ideal value is ARE = 1. Blocking is seen to improve efficiency, especially away from r = 2. There is some cost to removing the mean in each block (the difference between the dashed and the solid curve).
REMARK 9: In terms of asymptotic distribution, there is further gain in using ˜ AREM (θ) ˆ = M/(M − the estimator from Remark 8. Specifically, AREM (θ)/ 1). This is borne out by Figure 1. However, it is likely that the drift μ, as well as the block size M, would show up in a higher order bias calculation. This would make σ˜ less attractive. In connection with estimating the leverage effect, it is crucial to use σˆ rather than σ˜ (cf. Section 4.3). REMARK 10: We emphasize again that M has to be fixed in the present calculation, so that the ideal asymptotic variance on the right hand side of (63) is only approximately attained. It would be desirable to build a theory where M → ∞ as n → ∞. Such a theory would presumably be able to pick up any biases due to the blocking. 4.2. Integrated Betas (1) t
(p)
Consider processes X Xt and Yt which are observed synchronously at times 0 = tn0 < tn1 < · · · < tnn = T . Suppose that these processes are related by (65)
dYt =
p k=1
(k) β(k) + dZt t dXt
with
(k) X Z t = 0 for all t and k
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1425
T We consider the question of estimating θ(k) = 0 β(k) t dt. This estimation problem is conceptually closely related to the realized regressions studied in Barndorff-Nielsen and Shephard (2004a) and Dovonon, Goncalves, and Meddahi (2008). The ANOVA in Mykland and Zhang (2006) is concerned with the residuals in this same model. Under the approximation Qn , in each block τni−1 < tnj ≤ τni the regression (65) becomes, for the observables, (66)
Ytnj =
p
(k) β(k) τni−1 Xtnj + Ztnj
k=1 (1) βˆ (p) It is therefore natural to take the estimator (βˆ τ(1) τni−1 ) of (βτni−1 ni−1 β(p) τni−1 ) to be the regular least squares estimator (without intercept) based on (p)
Xtnj Ytnj ) inside the block. The overall estithe observables (Xt(1) nj mate of the vector of θ’s is then θˆ n(k) = (67) βˆ (k) τni−1 Mt i
From the unbiasedness of linear regression, we inherit that n1/2 (θˆ n − θ) is the end point of a (Yni Qn ) martingale, with discrete time quadratic covariation matrix CovQn βˆ τni−1 − βτni−1 | Yni−1 n(Mt)2 (68) i
be the smallest σ-field To see how the martingale property follows, let Yni−1 containing Yni−1 and σ(Xtnj τni−1 < tnj ≤ τni ). The precise implication of
the classical unbiasedness is that EQn (βˆ τni−1 − βτni−1 | Yni−1 ) = 0, whence the stated martingale property follows by the law of iterated expectations (or tower property). To compute (68), note that from standard regression theory (see, e.g., Weisberg (1985, p. 44)),
= VarQn Ztnj | Yni−1 × (X T X)−1 CovQn βˆ τni−1 − βτni−1 | Yni−1 (69)
, where k = where, with some abuse of notation, X is the matrix of Xt(k) nj 1 p, and the tnj are in block number i. Now observe that under Qn , the conditional distribution of X given Yni−1 is that of M independent rows, each row being a p-variate normal distribution with mean zero and covariance matrix X X τni−1 tn . (Recall that the prime here denotes differentiation with respect to time t.) Hence, X T X has a Wishart distribution with scale matrix X X τni−1 tn and M degrees of freedom. (We refer to Mardia, Kent, and
1426
P. A. MYKLAND AND L. ZHANG
Bibby (1979, p. 66) for the definition of the Wishart distribution.) It follows that (Mardia, Kent, and Bibby (1979, p. 85)) (70)
−1 EQn (X T X)−1 | Yni−1 = X X τni−1 tn−1 /(M − p − 1)
) = Z Z τni−1 tn , we finally get that Since VarQn (Ztnj | Yni−1
(71)
CovQn βˆ τni−1 − βτni−1 | Yni−1 −1 = Z Z τni−1 X X τni−1 /(M − p − 1)
It follows that the limit of (68) is (72)
MT M −p−1
T
Z Z t (X X t )−1 dt
0
For the same reasons as in Sections 2.5 and 4.1 it then follows that n1/2 (θˆ n − θ) converges stably to a multivariate mixed normal distribution, with mean zero and covariance matrix given by (72), under all of Qn , Pn∗ , P ∗ , and P. 4.3. Estimation of Leverage Effect We here seek to estimate σ 2 XT . We have seen in Example 3 that this quantity can appear in asymptotic distributions, and we shall here see how the sum of third powers can be refined into an estimate of this quantity. The natural estimator would be (73)
2 X = σ T
σˆ τ2ni+1 − σˆ τ2ni Xτni+1 − Xτni
i
where σˆ τ2ni and X τni are given above in (54). It turns out, however, that this estimator is asymptotically biased, as follows: PROPOSITION 3: Let M ≥ 2. In the equally spaced case, under both P ∗ and P, and as n → ∞, (74)
1/2 T 4 L 1 2 6 2 X → σ σ X + N(0 1) × σ dt T 2 M −1 0 t
stably in law, where N(0 1) is independent of FT .
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1427
The derivation of this result, along with that of the result in Example 5 below, is given in Appendix A.3. This appendix gives what we think is a typical way to show results based on the general theory of Sections 2 and 3. Accordingly, we define an asymptotically unbiased estimator of leverage effect by 2 X = 2 σ (75) σˆ τ2ni+1 − σˆ τ2ni Xτni+1 − Xτni T i 2 X = 2σ 2 X . Following Proposition 3, In other words, σ T T
(76)
L 2 X − σ 2 X → c 1/2 N(0 1) σ T M
stably under P ∗ and P, where T 16 cM = (77) σ 6 dt M −1 0 t 2 X comes from error induced by It is important to note that the bias in σ T both the one period and the multiperiod discretizations (the adjustment from P ∗ to Pn∗ , then to Qn ). Thus, this is an instance where naïve discretization does not work. 2 X is not consistent. By choosing large M, For fixed M, the estimator σ T however, one can make the error as small as one wishes.
REMARK 11: It is conjectured that there is an optimal rate of M = O(n1/2 ) 2 X − σ 2 X is as n → ∞. The presumed optimal convergence rate of σ T T Op (n−1/4 ), in analogy with the results in Zhang (2006). This makes sense because there is an inherited noisy measure σˆ t2 of σt2 in the definition the estima2 X ; see (75). The problem of estimating σ 2 X is therefore similar tor σ T T to estimating volatility in the presence of microstructure noise. It would clearly be desirable to have a theory for the case where M → ∞ with n, but this is beyond the scope of this paper. EXAMPLE 5—The Role of μ: The Effect of Not Removing the Mean From the Estimate of σ 2 : In the development above, the drift μ did not surface. This example gives evidence that the drift can matter. We shall see that if one does not take out the drift when estimating σ 2 , μ can appear in the asymptotic bias. Suppose that one wishes to use the estimator (75), but replacing σˆ τ2ni by the 2 X is then estimator σ˜ τ2ni from (60). An estimator analogous to σ T (78)
with mean
2 X σ T
=2
σ˜ τ2ni+1 − σ˜ τ2ni Xτni+1 − Xτni
i
1428
P. A. MYKLAND AND L. ZHANG
We show in the proof for Example 5 in Appendix A.3 that, for M ≥ 2, with mean L M − 2 4 T 3 2 2 σ XT σ XT − (79) → σ (dWt + σt−1 μt dt) M M 0 t 1/2 M +1 T 6 σ dt + N(0 1) 16 M2 0 t Hence, with this estimator, μ does show up in asymptotic expressions. The estimation of leverage effect is therefore a case where it is important to remove the mean in each block. 4.4. Other Examples We here summarize two additional examples of application that have been studied more carefully elsewhere. 4.4.1. Realized Quantile-Based Estimation of Integrated Volatility This methodology has been studied in a recent paper by Christensen, Oomen, and Podolskij (2008). In the case of fixed block size and no microstructure, their results (Theorems 1 and 2) can be deduced from Theorems 1 and 3 of this paper. The key observation is that if V is the kth quantile among 2 , where U(k) Xtnj , with τni−1 < tnj ≤ τni , then EQn (V 2 | Yni−1 ) = στ2ni−1 EU(k) is the kth quantile of M i.i.d. standard normal random variables. Blockwise L-statistics can be constructed similarly. We emphasize that the paper by Christensen, Oomen, and Podolskij (2008) goes much further in developing the quantile-based estimation technology, including increasing block size and allowing for microstructure. 4.4.2. Analysis of Variance/Variation A related problem to the one discussed above in Section 4.2 is that of analysis of variance/variation (Zhang (2001) and Mykland and Zhang (2006)). We are again in the situation of the regression (65), but now the purpose is to estimate Z ZT , that is, the residual quadratic variation of Y after regressing on X. Blocking can here be used in much the same way as in Section 4.2. 4.5. Abstract Summary of Applications We here summarize the procedure which is implemented in the applications section above. We remain in the scalar case. In the type of problems we have considered, the parameter θ to be estimated can be written as θ= (80) θni + Op (n−1 ) i
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1429
where, under the approximating measure, θni is approximately an integral from τni−1 to τni . Estimators are of the form θˆ ni (81) θˆ n = i
where θˆ ni uses M or (in the case of the leverage effect) 2M increments. If one sets Zni = nα (θˆ ni − θni ), we need that Zni is a martingale under Qn . α can be 0, 1/2, or any other number smaller than 1. We then show in each individual case that, in probability, T (82) VarQn (Zni | Yni−1 ) → ft2 dt 0
i
Q n
−W
Q τni
Q τni−1
Cov Zni W
| Yni−1 →
T
gt dt 0
i
for some functions (processes) ft and gt . We also find the following limits in probability: 1 (83) CovQn Zni (tnj+1 )1/2 ktnj A12 = lim 12 n→∞ i t ∈(τ τ ] nj
ni−1
ni
Wtnj+1 × h3 and Yni−1 (tnj+1 )1/2 1 Q A13 = − lim Covn Zni − ζτ−1 στ2ni−1 ζt−1 nj ni−1 n→∞ 2 t ∈(τ τ ] i Q
nj
× h2
ni−1
ni
Q Wtnj+1 Yni−1 (tnj+1 )1/2
We finally obtain the following statement: THEOREM 5 —Summary of Method in the Scalar Case: In the setting described and subject to regularity conditions, T 1/2 L α ˆ 2 2 (84) (ft − gt ) dt n (θn − θn ) → b + A12 + A13 + N(0 1) 0
stably in law under P ∗ and P, with N(0 1) independent of FT . b is given by T T ∗ (85) gt dWt = gt (dWt + σt−1 μt dt) b= 0
0
1430
P. A. MYKLAND AND L. ZHANG
5. CONCLUSION The main finding of the paper is that one can in broad generality use first order approximations when defining and analyzing estimators. Such approximations require an ex post adjustment involving asymptotic likelihood ratios, and these are given. Several examples are provided in Section 4. The theory relies heavily on the interplay between stable convergence and measure change, and on asymptotic expansions for martingales. We here give a technical summary of the findings. The paper deals with two forms of discretization: to block size M = 1 and then to block size M > 1. Each of these forms has to be adjusted for by using an asymptotic measure change. Accordingly, the asymptotic likelihood ratios ∗ can be called dP∞ /dP and dQ∞ /dP. There is similarity here to the measure ∗ change dP /dP used in option pricing theory, where P ∗ is an equivalent martingale measure (a probability distribution under which the drift of an underlying process has been removed; for our purposes, discounting is not an issue); for more discussion and references, see Section 2.2. In fact, for the reasons given in that section, we can, for simplicity, assume that the probabilities Pn∗ and Qn also are such that the (observed discrete time) process has no drift. It is useful to write the likelihood ratio decomposition (86)
log
∗ dQ∞ dQ∞ dP∞ dP ∗ = log + log + log ∗ dP dP∞ dP ∗ dP
We saw in Section 3.3 that these three likelihood ratios (LR) are of similar form and can be represented in terms of Hermite polynomials of the increments of the observed process. The connections are summarized in Table I. TABLE I MEASURE CHANGES (LIKELIHOOD RATIOS) TIED TO THREE PROCEDURES MODIFYING PROPERTIES OF THE OBSERVED PROCESSa Type of Approximation
One period discretization (M = 1) Multiperiod discretization (block M > 1) Removal of drift
Compensating LR
Size of LR Is Related to
Order of Relevant Hermite Polynomial
∗ dP∞ /dP ∗
Leverage effect
3
∗ dQ∞ /dP∞
Volatility of volatility
2
dP ∗ /dP
Mean
1
a P is the true probability distribution, P ∗ is the equivalent martingale measure (as in option pricing theory). P ∗ n t is the probability for which (1) is exact, and Qn is the probability for which one can use t i−M fs dWs ≈ fti−M (Wti − i Wti−M ). The two measure changes dPn∗ /dP ∗ and dQn /dPn∗ have asymptotic limits, denoted by subscript ∞. This
connects to the statistical concept of contiguity (cf. Remark 2).
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1431
The three approximations all lead to adjustments that are absolutely continuous. This fact means that for estimators, consistency and rate of convergence are unaffected by the the approximation. It turned out that asymptotic variances are similarly unaffected (Remark 4 in Section 2.4). Asymptotic distributions can be changed through their means only (Sections 2.4 and 3.4). We emphasize that this is not the same as introducing inconsistency. A number of unsolved questions remain. The approach provides a tool for analyzing estimators; it does not always give guidance as to how to define estimators in the first place. Also, the theory requires block sizes (M) to stay bounded as the number of observations increases. It would be desirable to have a theory where M → ∞ with n. This is not possible with the likelihood ratios we consider, but may be available in other settings, such as with microstructure noise. Causality effects from observation times to the process, such as in Renault and Werker (2009), would also need an extended theory. APPENDIX: PROOFS A.1. Proofs of Theorems 1 and 2 To avoid having asterisks (∗) everywhere, use the notation P for P ∗ until the end of the proof of Theorem 1 only, and without loss of generality. This is only a matter of notation. One understands the differential σt dWt to be p (r r ) (r ) a p-dimensional vector with r1 th component r2 =1 σt 1 2 dWt 2 . To study the properties of this approximation, consider the following “strong approximation.” Set (A.1)
dσt = σ˜ t dt + ft dWt + gt dBt
where ft is a tensor and gt dBt is a matrix, with B a Brownian motion independent of W (g and B can be tensor processes). For example, component (r1 r2 ) p (r r r ) of the matrix ft dWt is r3 =1 ft 1 2 3 dW (r3 ) . Note that σt is an Itô process by Assumption 2. Then tnj+1 Xtnj+1 = σtnj Wtnj+1 + σt − σtnj dWt (A.2) tnj
= σtnj Wtnj+1 + ftnj
tnj+1 tnj
t
dWu dWt
tnj
+ dB dW term + higher order terms It will turn out that the two first terms on the right hand side will matter in our approximation. Note first that by taking quadratic covariations, one obtains
(r r r ) ft 1 2 3 = σ (r1 r2 ) W (r3 ) t (A.3)
1432
P. A. MYKLAND AND L. ZHANG
To proceed with the proof, some further notation is needed. Define (A.4)
d σ˘ t = σt−1 dσt −1 (r r ) (r4 r2 r3 ) (r r r ) f˘t 1 2 3 = σ˘ (r1 r2 ) W (r3 ) t = (σt ) 1 4 ft ; p
r4 =1 (r r r ) (r r ) σ˘ t 1 2 and f˘t 1 2 3 are not symmetric in (r1 r2 ). However, since dζt = d(σt σtT ) = σt dσt + (σt dσt )T + dt terms, we obtain from (14) that d ζˇ t = σt−1 dσt + (σt−1 dσt )T + dt terms. Hence
(A.5)
(r r )
(r r r ) (r r r ) ζˇ 1 2 W (r3 ) t = f˘t 1 2 3 + f˘t 2 1 3
Also (A.6)
(r r2 r3 )
kt 1
(r r r ) = ζˇ (r1 r2 ) W (r3 ) t [3] = f˘t 1 2 3 [6]
Finally, we let t = T/n (the average tnj+1 ). PROOF OF THEOREM 1: Note that, from (20) and (A.2), tnj+1 t ˘ ˘ Wtnj+1 = Wtnj+1 + ftnj (A.7) dWu dWt tnj
tnj
+ dB dW term + higher order terms In the representation (A.7), we obtain, up to Op (t 5/2 ), (A.8)
(r3 ) (r1 ) (r2 ) Wtnj+1 Wtnj+1 |Ftnj cum3 W˘ tnj+1 tnj+1 t (r1 s2 s3 ) (s ) (s3 ) ˘ = cum dWt 2 dWu ftnj tnj
s2 s3
tnj
(r3 ) (r2 ) Wtnj+1 Wtnj+1 Ftnj =
(r s s ) f˘tnj1 2 3 cum
tnj+1
tnj
s2 s3
(r3 ) (r2 ) Wtnj+1 Wtnj+1 Ftnj =
s2 s3
(r s s ) f˘tnj1 2 3 Cov
t
(s ) dWu(s3 ) dWt 2
tnj
tnj+1
t (s3 ) u
dW tnj
tnj
dt δ
s2 r2
Wtnj+1 Ftnj [2] (r3 )
1433
INFERENCE FOR CONTINUOUS SEMIMARTINGALES tnj+1
(r r s ) 1 2 3 ˘ Cov = ftnj
tnj
s3
tnj+1
=
dt
tnj
tnj
(r r s ) f˘tnj1 2 3 Cov
dt
tnj
(r3 ) dWu∗(s3 ) dt Wtnj+1 Ftnj [2] t
tnj
s3
tnj+1
=
t
(r3 ) dWu∗(s3 ) Wtnj+1 F tnj [2]
(r r s ) f˘tnj1 2 3 (t − tnj )δs3 r3 [2]
s3
1 2 (r r r ) = tnj+1 f˘tnj1 2 3 [2] 2 where [2] represents the swapping of r2 and r3 (see McCullagh (1987, pp. 29–30) of for a discussion of the notation). In the third transition, we have used the third Bartlett type identity for martingales. Hence (r3 ) (r1 ) (r2 ) W˘ tnj+1 W˘ tnj+1 |Ftnj cum3 W˘ tnj+1 (A.9) 1 2 (r r r ) f˘tnj1 2 3 [6] + Op t 5/2 = tnj+1 2
1 2 ˇ (r1 r2 ) = tnj+1 W (r3 ) t [3] + Op t 5/2 ζ nj 2 (r3 ) (r1 ) (r2 ) 1/2 1/2 /tnj+1 W˘ tnj+1 /tnj+1 W˘ tnj+1 / by symmetry. Set κr1 r2 r3 = cum3 (W˘ tnj+1 1/2 tnj+1 |Ftnj ), and similarly for other cumulants. From (15) and (A.9),
(A.10)
1 1/2 (r1 r2 r3 ) ktnj + Op (t) κr1 r2 r3 = tnj+1 2
˜ + d martingale), At the same time (dζ = ζdt (r1 ) (r2 ) Xtnj+1 |Ftnj (A.11) Cov Xtnj+1 tnj+1 (r r ) (r r ) (r r ) ζu 1 2 − ζtnj1 2 duFtnj = tnj+1 ζtnj1 2 + E tnj
(r1 r2 ) nj+1 tnj
= t
ζ
+E
(r r2 )
u
du tnj
= tnj+1 ζtnj1
tnj+1
˜ζv(r1 r2 ) dvFt nj
tnj
1 2 ˜ (r1 r2 ) + tnj+1 ζtnj + Op (t 3 ) 2
1 2 so that Cov(W˘ tnj+1 W˘ tnj+1 |Ftnj ) = tnj+1 δr1 r2 + Op (t 2 ) and
(r )
(A.12)
r r
(r )
1 2 + Op (t) κr1 r2 = δtnj
1434
P. A. MYKLAND AND L. ZHANG
(r) Since X is a martingale, we also have κr = E(W˘ tnj+1 |Ftnj ) = 0. In the notation of Chapter 5 of McCullagh (1987), we take λr1 r2 = δr1 r2 , and let the other λ’s be zero. From now on, we also use the summation convention. By the development in Chapter 5.2.2 of McCullagh, obtain the Edgeworth ex1/2 pansion for the density fnj+1 of W˘ tnj+1 /tnj+1 given Ftnj , on the log scale as
(A.13)
1 r1 r2 r3 κ hr1 r2 r3 (x) 3! 1 1 + (κr1 r2 − λr1 r2 )hr1 r2 (x) + κr1 r2 r3 r4 hr1 r2 r3 r4 (x) 2 4! [10] + κr1 r2 r3 κr4 r5 r6 hr1 r2 r3 r4 r5 r6 (x) 6! 2 1 r1 r2 r3 − hr1 r2 r3 (x) + Op t 3/2 κ 72
log fnj+1 (x) = log φ(x; δr1 r2 ) +
where we for simplicity have used the summation convention. Note that the three last lines contain terms of order Op (t) (or smaller). We note, following formula (5.7) in McCullagh (1987, p. 149), that hr1 r2 r3 = hr1 hr2 hr3 − hr1 δr2 r3 [3], with hr1 = δr1 r2 xr2 . Observe that (A.14)
Zr1 = hr1
W˘ tnj+1 (tnj+1 )1/2
=
r2 δr1 r2 W˘ tnj+1
(tnj+1 )1/2
Under the approximating measure, therefore, the vector consisting of elements Zr1 is conditionally normally distributed with mean zero and covariance matrix δr1 r2 . It follows that (A.15)
hr1 r2 r3
W˘ tnj+1 (tnj+1 )1/2
= Zr1 Zr2 Zr3 − Zr1 δr2 r3 [3]
Under the approximating measure, therefore, En (hr1 r2 r3 (W˘ tnj+1 /(tnj+1 )1/2 )| Ftnj ) = 0, while (A.16)
Covn hr1 r2 r3
W˘ tnj+1 (tnj+1 )1/2
= δr1 r4 δr2 r5 δr3 r6 [6]
hr4 r5 r6
W˘ tnj+1 Ftnj (tnj+1 )1/2
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1435
where the [6] refers to all six combinations where each δ has one index from {r1 r2 r3 } and one from {r4 r5 r6 }. It follows that (A.17)
Varn =
W˘ tnj+1 1 r1 r2 r3 κ hr1 r2 r3 F tnj 3! (tnj+1 )1/2
1 r1 r2 r3 r4 r5 r6 κ κ 36 W˘ tnj+1 W˘ tnj+1 hr4 r5 r6 × Covn hr1 r2 r3 Ftnj (tnj+1 )1/2 (tnj+1 )1/2
1 = κr1 r2 r3 κr4 r5 r6 δr1 r4 δr2 r5 δr3 r6 6 1 r1 r2 r3 r4 r5 r6 = tnj+1 ktnj ktnj δr1 r4 δr2 r5 δr3 r6 + Op t 3/2 24 by symmetry of the κ’s. Thus (A.18)
Varn
tnj+1 ≤t p
t
→ 0
W˘ tnj+1 1 r1 r2 r3 κ hr1 r2 r3 F tnj 3! (tnj+1 )1/2
1 r1 r2 r3 r4 r5 r6 k ku δr1 r4 δr2 r5 δr3 r6 du 24 u
under Pn∗ , still using the summation convention. Note that (A.18), with t = T , is the same as Γ0 in (16). By the same methods, and since Hermite polynomials of different orders are orthogonal under the approximating measure, (A.19)
Covn hr1 r2 r3
tnj+1 ≤t
W˘ tnj+1 (tnj+1 )1/2
hr4
p Ftnj → 0 (tnj+1 )1/2 W˘ tnj+1
By the methods of Jacod and Shiryaev (2003), it follows that (A.20)
ˇ (0) = M n
n−1 W˘ tnj+1 1 r1 r2 r3 κ hr1 r2 r3 3! (tnj+1 )1/2 j=0
converges stably in law to a normal distribution with random variance Γ0 . (Note ˇ (0) = M (0) + Op (t 1/2 ) from (21) and that we are still using the summathat M n n tion convention.) We now observe that, in the notation of (A.13), (A.21)
n−1 W˘ tnj+1 dP ∗ log = (log fnj+1 − log φ) dPn∗ (tnj+1 )1/2 j=0
1436
P. A. MYKLAND AND L. ZHANG
ˇ (0) and its discrete time By the same reasoning as above, the terms other than M n ∗ ∗ ˇ (0) − 1 Γ0 + op (1) quadratic variation (A.18) go away. Thus log(dP /dPn ) = M n 2 and the result follows. Q.E.D. REMARK 12: The proof of Theorem 1 uses the Edgeworth expansion (A.13). The proof of the broad availability of such expansions in the martingale case goes back to Mykland (1993, 1995a, 1995b), who used a test function topology. The formal existence of Edgeworth expansions in our current case is proved by iterating the expansion (A.2) as many times as necessary and bounding the remainder. In the diffusion case, similar arguments have been used in the estimation and computation theory in Aït-Sahalia (2002). PROOF OF THEOREM 2: It follows from the development in the proof of Theorem 1 that (A.22)
log
dP ∗ 1 = Mn(0) − Γ0 + op (1) ∗ dPn 2 L
where Mn(0) is as defined in equation (21). Write that, under Pn∗ , (Zn Mn(0) ) → (Z M) with M = Γ01/2 V1 and Z = b1 + c1 M + c2 V2 , where V1 and V2 are independent and standard normal (independent of FT ). Denote the distribution of ∗ to avoid confusion. (Z M) as P∞ It follows that, for bounded and continuous g, and by uniform integrability, 1 ∗ ∗ (0) (A.23) E g(Zn ) = En g(Zn ) exp Mn − Γ0 (1 + o(1)) 2 1 → Eg(Z) exp M − Γ0 2 1 ∗ = E∞ g b1 + c1 Γ01/2 V1 + c2 V2 exp Γ01/2 V1 − Γ0 2 ∞ 1 ∗ = E∞ g b1 + c1 Γ01/2 v + c2 V2 exp Γ01/2 v − Γ0 (2π)−1/2 2 −∞ 1 × exp − v2 dv 2 ∞ ∗ E∞ g b1 + c1 Γ01/2 u + Γ01/2 + c2 V2 = −∞
1 2 exp − u du (u = v − Γ01/2 ) 2
−1/2
× (2π)
∗ = E∞ g(Z + c1 Γ0 )
1437
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
The result then follows since c1 Γ0 = A12 .
Q.E.D.
A.2. Proof of Theorem 3 (1) n
Let Z
be given by (43). Set
1 (1) −1 = XtTnj+1 ζt−1 − ζτ−1 Xtnj+1 tnj+1 Znt nj nj+1 ni−1 2 (1) . Set Aj = ζt1/2 ζ −1 ζ 1/2 − I. and note that Zn(1) = j Znt nj τni−1 tnj nj+1 Since Xtnj is conditionally Gaussian, we obtain (under Pn∗ ) (A.24)
(A.25)
(1) 1 = − tr(Aj ) | X EPn∗ Znt nt nj nj+1 2
and (A.26)
conditional variance of
(1) Znt = nj+1
1 tr(A2j ) 2
Finally, let Mn(1) be the (end point of the) martingale part (under Pn∗ ) of Zn(1) , so that 1 (A.27) Mn(1) = Zn(1) + tr(Aj ) 2 j If · ·G represents discrete time predictable quadratic variation on the grid G , then equation (A.26) yields (A.28)
Mn(1) Mn(1)
G
=
1 tr(A2j ) 2 j
Now note that, analogous to the development in Zhang (2001, 2006), Mykland and Zhang (2006), and Zhang, Mykland, and Aït-Sahalia (2005), (A.29)
(1) G 1 −2 2 tr ζτni−1 ζtnj − ζτni−1 Mn Mn(1) = 2 j =
1 −2 tr ζτni−1 ζ ζtnj − ζ ζτni−1 + op (1) 2 j
=
1 −2 tr ζτni−1 ζ ζ τni−1 (tnj − τni−1 ) + op (1) 2 j
=
1 2
0
T
tr(ζt−2 ζ ζ t ) dK(t) + op (1) = Γ1 + op (1)
1438
P. A. MYKLAND AND L. ZHANG
where K is the ADD given by equation (41). At this point, observe that Assumption 2 entails, in view of Lemma 2 in Mykland and Zhang (2006), that (A.30)
sup tr(A2j ) → 0
as
n → ∞
j
Since also, (A.31)
for r > 2
|tr(Arj )| ≤ tr(A2j )r/2
it follows that (A.32)
log
dQn 1 = Zn(1) + ∗ dPn 2 i t = Zn(1) + =Z
(1) n
log det ζtnj − log det ζτni−1
nj ∈(τni−1 τni ]
1 log det(I + Aj ) 2 j
tr(A2j ) tr(A3j ) 1 tr(Aj ) − + + ··· + 2 j 2 3
= Mn(1) −
1 1 tr(A2j ) + tr(A3j ) + · · · 4 j 6 j
= Mn(1) −
G 1 (1) Mn Mn(1) + op (1) 2
Now let Mn(1) Mn(1) be the quadratic variation of the continuous martingale that coincides at points tnj with the discrete time martingale leading up to the end point Mn(1) . By a standard quarticity argument (as in the proof of Remark 2 in Mykland and Zhang (2006)), (A.29)–(A.31) and the conditional (1) yield that Mn(1) Mn(1) = Mn(1) Mn(1) G + op (1). The stanormality of Znt nj+1 ble convergence to a normal distribution with variance Γ1 then follows by the same methods as in Zhang, Mykland, and Aït-Sahalia (2005). The result is thus proved. Q.E.D. A.3. Proofs Concerning the Leverage Effect (Section 4.3) PROOF OF PROPOSITION 3 : We here show how to arrive at the final result in Proposition 3. This serves as a fairly extensive illustration of how to apply the theory developed in the earlier sections. By rearranging terms, write 2 X = (A.33) σ στ2ni+1 − στ2ni Xτni+1 − Xτni T i
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
+
i
−
1439
σˆ τ2ni − στ2ni Xτni − Xτni−1 σˆ τ2ni − στ2ni Xτni+1 − Xτni + Op (n−1 )
i
where the Op (n−1 ) term comes from edge effects. Note that by conditional Gaussianity, both the two last sums in (A.33) are Qn -martingales with respect to the σ fields Yni . They are also orthogonal in the sense that (A.34)
CovQn σˆ τ2ni − στ2ni Xτni − Xτni−1 2 σˆ τni − στ2ni Xτni+1 − Xτni | Yni = 0
Under Qn and conditionally on the information up to time τni−1 , σˆ τ2ni = στ2ni χ2M−1 /(M −1) and X τni = στni (t/M)1/2 N(0 1), where χ2M−1 and N(0 1) are independent. It follows that (A.35) VarQn σˆ τ2ni − στ2ni Xτni − Xτni−1 | Yni 2 = στ4ni (M − 1)−2 Xτni − Xτni−1 Var(χ2M−1 ) 2 = 2στ4ni (M − 1)−1 Xτni − Xτni−1 Hence, under Qn , the quadratic variation of converges to (A.36)
2 M −1
i
(σˆ τ2ni − στ2ni )(Xτni − Xτni−1 )
T
σt6 dt 0
At the same time, it is easy to see that this sum has asymptotically zero covariation the increments of Mn(0Q) and Mn(1Q) , and also with W Q . Hence 2 with ˆ τni − στ2ni )(Xτni − Xτni−1 ) converges stably under P to a normal distribui (σ tion with mean zero and variance (A.36). The situation with the other sum i (σˆ τ2ni − στ2ni )(Xτni+1 − Xτni ) is more complicated. First of all, (A.37) VarQn σˆ τ2ni − στ2ni Xτni+1 − Xτni | Yni 2 χM−1 − 1 N(0 1) = στ6ni (Mt) Var M −1 =
2 σ 6 (Mt) M − 1 τni
1440
P. A. MYKLAND AND L. ZHANG
Hence the asymptotic quadratic variation is T 2 (A.38) σ 6 dt M −1 0 t The sum is asymptotically uncorrelated with W Q , since (A.39) CovQn σˆ τ2ni − στ2ni Xτni+1 − Xτni Wτni+1 − Wτni | Yni 2 χM−1 − 1 N(0 1) N(0 1) = στ3ni (Mt) Cov M −1 = 0 Overall, under Qn , we have the stable convergence (A.40)
σˆ
2 τni
−σ
2 τni
L Xτni+1 − Xτni → N(0 1)
i
2 M −1
1/2
T 6 t
σ dt
0
There is, however, covariation between this sum and Mn(0Q) . It is shown be3 low in Remark 13 (see equation (A.47)) that A12 = 2M σ 2 XT , where A12 has the same meaning as in Theorems 2 and 4 (in Sections 2.4 and 3.4, respectively). Similarly, there is covariation with Mn(1Q) , and one can show that σ 2 XT . Thus, by Theorem 4, under P ∗ , we have (stably) A13 = M−3 2M (A.41) σˆ τ2ni − στ2ni Xτni+1 − Xτni i
1/2 T 1 2 2 σ XT + N(0 1) σt6 dt 2 M −1 0 Because of the orthogonality (A.34), and since i (στ2ni+1 − στ2ni )(Xτni+1 − Xτni ) − σ 2 XT = Op (n−1/2 ) by Proposition 1 of Mykland and Zhang (2006), 2 X − 1 σ 2 X converges stably (under P ∗ ) to a normal it follows that σ T 2 distribution with mean as in equation (A.41) and variance contributed by the second and third terms on the right hand side of (A.33). We have thus shown Proposition 3. Q.E.D. L
→
REMARK 13 —Sample of Calculation: To see how the reasoningworks in the case of covariations, consider the case of covariation between i (σˆ τ2ni − στ2ni )(Xτni+1 − Xτni ) and Mn(0Q) . We proceed as follows. If hr is the rth (scalar) Hermite polynomial, set Q hr Wtnj /t 1/2 (A.42) Gri = tnj ∈(τni τni+1 ]
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1441
note that (A.43)
Xτni+1 − Xτni = στni t 1/2 G1i στ2ni 1 2 2 2 G2i − G1i + 1 σˆ τni − στni = M −1 M
At the same time, (A.44)
Mn(0Q) =
1 (t)1/2 kτni G3i + op (1) 12 i
The covariance for each i increment becomes 1 (A.45) CovQn σˆ τ2ni − στ2ni Xτni+1 − Xτni (t)1/2 kτni G3i Yni 12 kτni στ3ni 1 1 2 Q = t Covn G2i − G1i + 1 G1i G3i Yni 12 M −1 M kτni στ3ni 1 = (Mt) 2 M since, by orthogonality of the Hermite polynomials and by normality, 1 (A.46) CovQn G2i − G21i + 1 G1i G3i Yni M = cumQ3n (G1i G2i G3i | Yni ) 1 cumQ4n (G1i G1i G1i G3i | Yni ) M = M cum3 h1 (N(0 1)) h2 (N(0 1) h3 (N(0 1))) − cum4 h1 (N(0 1)) h1 (N(0 1)) h1 (N(0 1)) h3 (N(0 1)) −
= 6(M − 1) The covariation with Mn(0Q) therefore converges to (A.47)
1 A12 = 2M =
as in (32).
T
kt σt3 dt 0
3 σ 2 XT 2M
1442
P. A. MYKLAND AND L. ZHANG
PROOF FOR EXAMPLE 5: In analogy with (73), define with mean
(A.48)
2 X σ T
=
σ˜ τ2ni+1 − σ˜ τ2ni Xτni+1 − Xτni
i
We have the representation (A.49)
σ˜ τ2ni − στ2ni =
στ2ni M
G2i
the terms analogous to those in (A.33). The analysis of We 2now consider 2 ( σ ˜ − σ )(X ) is unaffected by this change, except that (A.36) τni − Xτ τni τni i T 6 ni−1 2 is replaced by M 0 σt dt. However, this is not true for the term i (σ˜ τ2ni − στ2ni )(Xτni+1 − Xτni ), which we analyze in the following paragraph. Observe that σ˜ τ2ni = (A.50)
M−1 M
σˆ τ2ni + (X τni )2 /t. Hence,
σ˜ τ2ni − στ2ni Xτni+1 − Xτni
i
M − 1 2 σˆ τni − στ2ni Xτni+1 − Xτni M i 3 1 Kn 1 T 2 + σ dXt + op (1) Xτni+1 − Xτni − M T i M 0 t
=
where Kn = n/M and the op (1) term comes (only) from the approximation of T − M1 i στ2ni (Xτni+1 − Xτni ) by − M1 0 σt2 dXt . It is easy to see that the first two terms on the right hand side of (A.50) have zero Qn covariation and hence, asymptotically, zero P ∗ covariation (Remark 4 in Section 2.4). Since we are thus in a position to easily aggregate the normal parts of the limiting distributions, we obtain the limit of the first term from (A.41) and the limit of the second term from Example 3 in Section 2.5. Hence, stably under P ∗ , with U1 and U2 as independent standard normal, (A.51)
σ˜ τ2ni − στ2ni Xτni+1 − Xτni
i L
→
1/2 T 2 M −1 1 2 σ XT + U1 σt6 dt M 2 M −1 0 T T 1/2 1 3 2 3 ∗ 6 3 + σt dWt + σ XT + U2 6 σt dt M 2 0 0
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
T 1 σ 2 dXt M 0 t 2 T 3 M +2 2 σ XT = σ dWt ∗ + M 0 t 2M 1/2 2M + 4 T 6 + N(0 1) σt dt M2 0
1443
−
Since the terms in (A.50) have zero Qn covariation with Xτni−1 ), the result in Example 5 follows.
i
(σ˜ τ2ni − στ2ni )(Xτni − Q.E.D.
REFERENCES AÏT-SAHALIA, Y. (2002): “Maximum-Likelihood Estimation of Discretely-Sampled Diffusions: A Closed-Form Approximation Approach,” Econometrica, 70, 223–262. [1436] ALDOUS, D. J., AND G. K. EAGLESON (1978): “On Mixing and Stability of Limit Theorems,” Annals of Probability, 6, 325–331. [1408] ANDERSEN, T. G., T. BOLLERSLEV, F. X. DIEBOLD, AND P. LABYS (2001): “The Distribution of Realized Exchange Rate Volatility,” Journal of the American Statistical Association, 96, 42–55. [1403] (2003): “Modeling and Forecasting Realized Volatility,” Econometrica, 71, 579–625. [1403] BARNDORFF-NIELSEN, O., S. GRAVERSEN, J. JACOD, M. PODOLSKIJ, AND N. SHEPHARD (2006): “A Central Limit Theorem for Realised Power and Bipower Variations of Continuous Semimartingales,” in From Stochastic Calculus to Mathematical Finance, The Shiryaev Festschrift, ed. by Y. Kabanov, R. Liptser, and J. Stoyanov. Berlin: Springer Verlag, 33–69. [1414] BARNDORFF-NIELSEN, O. E., AND N. SHEPHARD (2001): “Non-Gaussian Ornstein–UhlenbeckBased Models and Some of Their Uses in Financial Economics,” Journal of the Royal Statistical Society, Series B, 63, 167–241. [1403] (2002): “Econometric Analysis of Realized Volatility and Its Use in Estimating Stochastic Volatility Models,” Journal of the Royal Statistical Society, Series B, 64, 253–280. [1403,1413] (2004a): “Econometric Analysis of Realised Covariation: High Frequency Based Covariance, Regression and Correlation in Financial Economics,” Econometrica, 72, 885–925. [1403,1425] (2004b): “Power and Bipower Variation With Stochastic Volatility and Jumps” (with discussion), Journal of Financial Econometrics, 2, 1–48. [1413,1414] BILLINGSLEY, P. (1995): Probability and Measure (Third Ed.). New York: Wiley. [1418] CHEN, Y., AND V. SPOKOINY (2007): “Robust Risk Management. Accounting for Nonstationarity and Heavy Tails,” Technical Report, Weierstrass Institute, Berlin. [1405] CHRISTENSEN, K., R. OOMEN, AND M. PODOLSKIJ (2008): “Realised Quantile-Based Estimation of the Integrated Variance,” Research Paper 2009-27, CREATES, University of Aarhus, Denmark. [1428] CIZEK, P., W. HÄRDLE, AND V. SPOKOINY (2007): “Adaptive Pointwise Estimation in TimeInhomogeneous Time Series Models,” Technical Report, Weierstrass Institute, Berlin. [1405] COMTE, F., AND E. RENAULT (1998): “Long Memory in Continuous-Time Stochastic Volatility Models,” Mathematical Finance, 8, 291–323. [1403] DACOROGNA, M. M., R. GENÇAY, U. MÜLLER, R. B. OLSEN, AND O. V. PICTET (2001): An Introduction to High-Frequency Finance. San Diego: Academic Press. [1403] DOVONON, P., S. GONCALVES, AND N. MEDDAHI (2008): “Bootstrapping Realized Multivariate Volatility Measures,” Technical Report, University of Montreal. [1425]
1444
P. A. MYKLAND AND L. ZHANG
DUFFIE, D. (1996): Dynamic Asset Pricing Theory. Princeton, NJ: Princeton University Press. [1407] FAN, J., M. FARMEN, AND I. GIJBELS (1998): “Local Maximum Likelihood Estimation and Inference,” Journal of the Royal Statistcal Society, Series B, 60, 591–608. [1405] FOSTER, D., AND D. NELSON (1996): “Continuous Record Asymptotics for Rolling Sample Variance Estimators,” Econometrica, 64, 139–174. [1403] HÁJEK, J., AND Z. SIDAK (1967): Theory of Rank Tests. New York: Academic Press. [1411,1413] HALL, P., AND C. C. HEYDE (1980): Martingale Limit Theory and Its Application. Boston: Academic Press. [1408] HARRISON, M., AND D. KREPS (1979): “Martingales and Arbitrage in Multiperiod Securities Markets,” Journal of Economic Theory, 20, 381–408. [1407] HARRISON, M., AND S. PLISKA (1981): “Martingales and Stochastic Integrals in the Theory of Continuous Trading,” Stochastic Processes and Their Applications, 11, 215–260. [1407] HAYASHI, T., AND N. YOSHIDA (2005): “Covariance Estimation of Non-Synchronously Observed Diffusion Processes,” Bernoulli, 11, 359–379. [1403] HESTON, S. (1993): “A Closed-Form Solution for Options With Stochastic Volatility With Applications to Bonds and Currency Options,” Review of Financial Studies, 6, 327–343. [1419] JACOD, J. (1994): “Limit of Random Measures Associated With the Increments of a Brownian Semimartingale,” Technical Report, Université de Paris VI. [1403,1413] (2008): “Asymptotic Properties of Realized Power Variations and Related Functionals of Semimartingales,” Stochastic Processes and Their Applications 118, 517–559. [1413] JACOD, J., AND P. PROTTER (1998): “Asymptotic Error Distributions for the Euler Method for Stochastic Differential Equations,” Annals of Probability, 26, 267–307. [1403,1408,1413] JACOD, J., AND A. N. SHIRYAEV (2003): Limit Theorems for Stochastic Processes (Second Ed.). New York: Springer Verlag. [1411-1413,1415,1435] JACOD, J., Y. LI, P. A. MYKLAND, M. PODOLSKIJ, AND M. VETTER (2009): “Microstructure Noise in the Continuous Case: The Pre-Averaging Approach,” Stochastic Processes and Their Applications, 119, 2249–2276. [1405] KARATZAS, I., AND S. E. SHREVE (1991): Brownian Motion and Stochastic Calculus (Second Ed.). New York: Springer Verlag. [1407,1415] KINNEBROCK, S., AND M. PODOLSKIJ (2008): “A Note on the Central Limit Theorem for Bipower Variation of General Functions,” Stochastic Processes and Their Applications, 118, 1056–1070. [1416] LECAM, L. (1986): Asymptotic Methods in Statistical Decision Theory. New York: Springer Verlag. [1411] LECAM, L., AND G. YANG (2000): Asymptotics in Statistics: Some Basic Concepts (Second Ed.). New York: Springer Verlag. [1411] LEHMANN, E. (1983): Theory of Point Estimation. New York: Wiley. [1422] MARDIA, K. V., J. KENT, AND J. BIBBY (1979): Multivariate Analysis. London: Academic Press. [1425,1426] MCCULLAGH, P. (1987): Tensor Methods in Statistics. London: Chapman & Hall. [1410,1433,1434] MYKLAND, P. A. (1993): “Asymptotic Expansions for Martingales,” Annals of Probability, 21, 800–818. [1436] (1995a): “Embedding and Asymptotic Expansions for Martingales,” Probability Theory and Related Fields, 103, 475–492. [1436] (1995b): “Martingale Expansions and Second Order Inference,” Annals of Statistics, 23, 707–731. [1436] MYKLAND, P. A., AND L. ZHANG (2006): “ANOVA for Diffusions and Itô Processes,” Annals of Statistics, 34, 1931–1963. [1403,1413,1414,1425,1428,1437,1438,1440] (2009): “The Econometrics of High Frequency Data,” in Statistical Methods for Stochastic Differential Equations, ed. by M. 
Kessler, A. Lindner, and M. Sørensen. New York: Chapman & Hall/CRC Press (forthcoming). [1405,1406,1421]
INFERENCE FOR CONTINUOUS SEMIMARTINGALES
1445
PODOLSKIJ, M., AND M. VETTER (2009): “Estimation of Volatility Functionals in the Simultaneous Presence of Microstructure Noise and Jumps,” Bernoulli (forthcoming). [1405] RENAULT, E., AND B. J. WERKER (2009): “Causality Effects in Return Volatility Measures With Random Times,” Journal of Econometrics (forthcoming). [1407,1431] RÉNYI, A. (1963): “On Stable Sequences of Events,” Sanky¯ a, Series A, 25, 293–302. [1408] ROOTZÉN, H. (1980): “Limit Distributions for the Error in Approximations of Stochastic Integrals,” Annals of Probability, 8, 241–251. [1408] ROSS, S. M. (1976): “The Arbitrage Theory of Capital Asset Pricing,” Journal of Economic Theory, 13, 341–360. [1407] TIBSHIRANI, R., AND T. HASTIE (1987): “Local Likelihood Estimation,” Journal of the American Statistican Association, 82, 559–567. [1405] WEISBERG, S. (1985): Applied Linear Regression (Second Ed.). New York: Wiley. [1425] ZHANG, L. (2001): “From Martingales to ANOVA: Implied and Realized Volatility,” Ph.D. Thesis, Department of Statistics, The University of Chicago. [1403,1413,1414,1428,1437] (2006): “Efficient Estimation of Stochastic Volatility Using Noisy Observations: A Multi-Scale Approach,” Bernoulli, 12, 1019–1043. [1414,1427,1437] (2009): “Estimating Covariation: Epps Effect and Microstructure Noise,” Journal of Economoetrics (forthcoming). [1403] ZHANG, L., P. A. MYKLAND, AND Y. AÏT-SAHALIA (2005): “A Tale of Two Time Scales: Determining Integrated Volatility With Noisy High-Frequency Data,” Journal of the American Statistical Association, 100, 1394–1411. [1437,1438]
Dept. of Statistics, The University of Chicago, 5734 South University Ave., Chicago, IL 60637, U.S.A.;
[email protected]; http://galton. uchicago.edu/~mykland and Dept. of Finance, The University Illinois at Chicago, 601 South Morgan Street, MC 168, Chicago, IL 60607, U.S.A.;
[email protected]; http://tigger.uic. edu/~lanzhang/. Manuscript received September, 2008; final revision received February, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1447–1479
TESTING HYPOTHESES ABOUT THE NUMBER OF FACTORS IN LARGE FACTOR MODELS BY ALEXEI ONATSKI1 In this paper we study high-dimensional time series that have the generalized dynamic factor structure. We develop a test of the null of k0 factors against the alternative that the number of factors is larger than k0 but no larger than k1 > k0 . Our test statistic equals maxk0
1. INTRODUCTION HIGH-DIMENSIONAL FACTOR MODELS with correlated idiosyncratic terms have been extensively used in recent research in finance and macroeconomics. In finance, they form the basis of the Chamberlain and Rothschild (1983) extension of the arbitrage pricing theory. They have been used in portfolio performance evaluation, in the analysis of the profitability of trading strategies, in testing implications of the arbitrage pricing theory, and in the analysis of bond risk premia. In macroeconomics, such models have been used in business cycle analysis, in forecasting, in monitoring economic activity, in construction of inflation indexes and in monetary policy analysis (see Breitung and Eickmeier (2006) for a survey of work in these areas). More recent macroeconomic applications include the analysis of international risk sharing, the identification of global shocks, the analysis of price dynamics, and the estimation of the dynamic, stochastic general equilibrium models. An important question to be addressed by any study which uses factor analysis is how many factors there are. The number of factors must be known to implement various estimation and forecasting procedures. Moreover, that number often has interesting economic interpretations and important theoretical consequences. In finance and macroeconomics, it can be interpreted as the number of the sources of nondiversifiable risk and the number of the fundamental shocks driving the macroeconomic dynamics, respectively. In consumer demand theory, the number of factors in budget share data provides crucial information about the demand system (see Lewbel (1991)). For example, the 1 I am very grateful to Serena Ng, a co-editor and three anonymous referees for helpful comments and suggestions.
© 2009 The Econometric Society
DOI: 10.3982/ECTA6964
1448
ALEXEI ONATSKI
number of factors must be exactly two for aggregate demands to exhibit the weak axiom of revealed preference. Although there have been many studies which develop consistent estimators of the number of factors (for some recent work in this area, see Bai and Ng (2007) and Hallin and Liska (2007)), the corresponding estimates of the number of factors driving stock returns and macroeconomic time series often considerably disagree. In finance, the estimated number of factors ranges from one to more than ten. In macroeconomics, there is an ongoing debate (see Stock and Watson (2005)) whether the number of factors is only two or, perhaps, as many as seven. The purpose of this paper is to develop formal statistical tests of various hypotheses about the number of factors in large factor models. Such tests can be used, for example, to decide between competing point estimates or to provide confidence intervals for the number of factors. We consider T observations X1 XT of n-dimensional vectors that have the generalized dynamic factor structure introduced by Forni, Hallin, Lippi, and Reichlin (2000). In particular, Xt = Λ(L)Ft + et where Λ(L) is an n × k matrix of possibly infinite polynomials in the lag operator L; Ft is a kdimensional vector of factors at time t; and et is an n-dimensional vector of correlated stationary idiosyncratic terms. We develop the following test of the null that there are k = k0 factors at a particular frequency of interest ω0 , say a business cycle frequency, versus the alternative that k0 < k ≤ k1 . T • √First, compute the discrete Fourier transforms (d.f.t.’s) Xˆ j ≡ t=1 Xt × e−iωj t / T of the data at frequencies ω1 ≡ 2πs1 /T ωm ≡ 2πsm /T approximating ω0 where s1 sm are integers such that sj ± sk = 0(mod T ) for j = k sj = 0 sj = T/2, and maxj |ωj − ω0 | ≤ 2π(m + 1)/T . • Next, compute a test statistic γi − γi+1 k0
R ≡ max
where γi is the ith largest eigenvalue of the smoothed periodogram estimate m 1 ˆ ˆ j=1 Xj Xj of the spectral density of the data at frequency ω0 . Here and 2πm throughout the paper, the prime attached to a complex-valued matrix denotes the conjugate-complex transpose of the matrix. • Finally, reject the null if and only if R is above a critical value given in Table I. For example, the 5% critical value for statistic R of the test of k = k0 versus k0 < k ≤ k1 , where k0 = 3 and k1 = 10 is in the fifth row from the bottom and the second column from the right in the table. It equals 8.29. A Matlab code dinamico.m, which implements the test, is available as Supplemental Material from the Econometrica website (Onatski (2009b)). We prove that our test statistic is asymptotically pivotal under the null hypothesis and that it explodes under the alternative. We find that its asymptotic distribution, as n, m, and T go to infinity so that n/m remains in a compact subset of (0 ∞) and T grows sufficiently faster than n is a function of
1449
LARGE FACTOR MODELS TABLE I CRITICAL VALUES OF THE TEST STATISTIC R k1 − k 0
The Test’s Size %
1
2
3
4
5
6
7
8
15 10 9 8 7 6 5 4 3 2 1
2.75 3.33 3.50 3.69 3.92 4.20 4.52 5.02 5.62 6.55 8.74
362 431 449 472 499 531 573 626 691 815 1052
415 491 513 537 566 603 646 697 779 906 1167
454 540 562 591 624 657 701 763 848 993 1256
489 577 603 631 662 700 750 816 906 1047 1342
520 613 639 668 700 741 795 861 964 1127 1426
545 642 667 695 732 774 829 906 1011 1175 1488
570 666 692 725 759 804 859 936 1044 1213 1525
the Tracy–Widom distribution. The Tracy–Widom distribution (see Tracy and Widom (1994)) refers to the asymptotic joint distribution of a few of the largest eigenvalues of a particular Hermitian random matrix as the dimensionality of the matrix tends to infinity. Although our assumption that T grows faster than n is not in the spirit of the recent literature which lets n grow as fast or even faster than T our Monte Carlo analysis shows that the test works well even when n is much larger than T . In Section 3, we discuss possible theoretical reasons for this Monte Carlo finding. The main idea behind our test is as follows. Suppose there are k dynamic factors in Xt Then there will be k static factors in Xˆ j . Therefore, the k largest eigenvalues γ1 γk of the sample covariance matrix of Xˆ j should explode, whereas the rest of the eigenvalues should have the same asymptotic distribution as the largest eigenvalues in the zero-factor case. Now, the zero-factor case corresponds to Xˆ j ’s being asymptotically complex normal, independent d.f.t.’s of Xt (see Theorem 4.4.1 of Brillinger (1981)). The asymptotic distribution of the scaled and centered largest eigenvalues of the corresponding sample covariance matrix is known to be Tracy–Widom (see El Karoui (2007) and Onatski (2008)). However, the common asymptotic centering and scaling for the eigenvalues do depend on the unknown details of the correlation between the entries of vector Xˆ j . Our statistic R gets rid of both the unknown centering and scaling parameters, which makes it asymptotically pivotal under the null. Under the alternative, R is larger than or equal to the ratio (γk − γk+1 )/(γk+1 − γk+2 ) which explodes because γk explodes while γk+1 and γk+2 stay bounded. Our test procedure can be interpreted as formalizing the widely used empirical method of the number of factors determination based on the visual
1450
ALEXEI ONATSKI
inspection of the scree plot introduced by Cattell (1966). The scree plot is a line that connects the decreasing eigenvalues of the sample covariance matrix of the data plotted against their respective order numbers. In practice, it often happens that the scree plot shows a sharp break where the true number of factors ends and “debris” corresponding to the idiosyncratic influences appears. Our test statistic effectively measures the curvature of the frequency-domain scree plot at a would-be break point under the alternative hypothesis. When the alternative hypothesis is true, the curvature asymptotically goes to infinity. In contrast, under the null, this curvature has a nondegenerate asymptotic distribution that does not depend on the model’s parameters. The rest of the paper is organized as follows. Section 2 describes the model and states our assumptions. Section 3 develops the test. Section 4 considers the special case of the approximate factor model. Section 5 contains Monte Carlo experiments and comparisons with procedures proposed in the previous literature. Section 6 applies our test to macroeconomic and financial data. Section 7 discusses the choice of our test statistic and the workings of the test under potential misspecifications of the factor model. Section 8 concludes. Technical proofs are contained in the Appendix. 2. THE MODEL AND ASSUMPTIONS Consider a double sequence of random variables {Xit i ∈ N t ∈ Z} which admits a version of the generalized dynamic k-factor structure of Forni et al. (2000): (1)
Xit = Λi1 (L)F1t + · · · + Λik (L)Fkt + eit
∞ (u) u where Λij (L) equals u=0 Λ(u) ij L and factor loadings Λij factors Fjt and idiosyncratic terms eit satisfy Assumptions 1, 2, 3, and 4 stated below. Suppose that we observe Xit with i = 1 n and t = 1 T . Let us denote the data vector at time t, (X1t Xnt ) as Xt (n) and denote its idiosyncratic part, (e1t ent ) , as et (n) Further, let χt (n) ≡ Xt (n) − et (n) be the systematic part of Xt (n). Then Xt (n) = χt (n) + et (n) and χt (n) =
∞
Λ(u) (n)Ft−u
u=0
where Ft ≡ (F1t Fkt ) denotes the vector of factors at time t and Λ(u) (n) denotes the n × k matrix with i jth entry Λ(u) ij .
LARGE FACTOR MODELS
1451
ASSUMPTION 1: The factors Ft follow an orthonormal white noise process. For each n, vector et (n) is independent from Fs at all lags and leads, and follows a stationary zero-mean process. This assumption is standard. It is somewhat stronger than Assumption 1 in Forni et al. (2000), which requires only the orthogonality of et (n) and Fs . ∞ ASSUMPTION 2: (i) The factor loadings are such that u=0 Λ(u) (n) (1 + u) = O(n1/2 ) where Λ(u) (n) denotes the square root of the largest eigenvalue of Λ(u) (n) Λ(u) (n). (ii) The idiosyncratic terms are jointly Gaussian with autocovariances cij (u) ≡ Eeit+u ejt satisfying u (1 + |u|)|cij (u)| < ∞ uniformly in i, j ∈ N. Assumption 2 implies regularity of the d.f.t.’s of χt (n) and et (n) local to ∞ the frequency of interest ω0 Let Λˆ s denote u=0 Λ(u) (n)e−iuωs where ωs , s = 0 1 m, are the frequencies √ defined in the Introduction. Further, for any T process yt let yˆs ≡ t=1 yt e−iωs t / T denote the d.f.t.’s of yt at the frequencies ωs . For example, Xˆ s (n) Fˆs , and eˆ s (n), are the d.f.t.’s of Xt (n) Ft , and et (n), respectively. Assumption 2(i) guarantees that, for T sufficiently larger than max(n m) χˆ s (n) is well approximated by Λˆ 0 Fˆs for all s = 1 m (see Lemma 4 in the Appendix). Note that in the special case when Λij (L) do not depend on L so that Λij (L) = Λij(0) Assumption 2(i) holds if the loadings are bounded functions of i and j Indeed, then Lemma 2 in the Appendix implies that Λ0 (n) ≤ n k ( i=1 j=1 (Λij(0) )2 )1/2 = O(n1/2 ) In this special case, χˆ s (n) equals Λˆ 0 Fˆs exactly. Assumption 2(ii) guarantees that, for T sufficiently larger than max(n m), the real and imaginary parts of vectors eˆ s (n) with s = 1 m are well approximated by independent and identically distributed (i.i.d.) Gaussian vectors (see Lemma 5 in the Appendix). To establish such an approximation without assuming the Gaussianity of eit , we will need another assumption: ASSUMPTION 2: (ii)(a) Let εjt , j ∈ N t ∈ Z, be i.i.d. random variables with Eεjt = 0 Eεjt2 = 1, and μp ≡ E(|εjt |p ) < ∞ for some p > 2. Further, let ∞ ujt , j ∈N, be linear filters of εjt so that ujt = Cj (L)εjt = k=0 cjk εjt−k with ∞ supj>0 k=0 k|cjk | < ∞. We assume that the idiosyncratic terms in (1) can be represented ∞linear combinations of such filters. That is, ∞ by square summable eit = j=1 Aij ujt , where supi>0 j=1 A2ij < ∞. Assumption 2(ii)(a) allows for rich patterns of cross-sectional and temporal dependence of the idiosyncratic terms. In addition, it allows for substantial departures from the Gaussianity. For example, the moments of order higher than p of the idiosyncratic terms may be infinite.
1452
ALEXEI ONATSKI
The remaining two assumptions concern the asymptotic behavior of the spectral density matrices at frequency ω0 of et (n) and χt (n), which we denote as Sne (ω0 ) and Snχ (ω0 ) respectively Let l1n ≥ · · · ≥ lnn be the eigenvalues of Sne (ω0 ). Denote by Hn the spectral distribution of Sne (ω0 ), that is, Hn (λ) = 1 − n1 #{i ≤ n : lin > λ}, where #{·} denotes the number of elements in −1 the let cmn be the unique root in [0 l1n ) of the equation indicated set. Further, 2 (λcmn /(1 − λcmn )) dHn (λ) = m/n ASSUMPTION 3: As n and m tend to infinity so that m/n remains in a compact subset of (0 ∞), lim sup l1n < ∞, lim inf lnn > 0, and lim sup l1n cmn < 1. The inequalities of Assumption 3 have the following meaning. The inequality lim sup l1n < ∞ guarantees that the cumulative effects of the idiosyncratic causes of variation at frequency ω0 on the n observed cross-sectional units remain bounded as n → ∞ It relaxes Assumption 3 of Forni et al. (2000) that the largest eigenvalue of Sne (ω) is uniformly bounded over ω ∈ [−π π] which is a crucial identification requirement for generalized dynamic factor models. The inequality lim inf lnn > 0 requires that the distribution of the ω0 frequency component of the stationary process et (n) does not become degenerate as n → ∞. It is needed in the technical parts of the proofs of Lemma 1, Lemma 5, and Theorem 1. The inequality lim sup l1n cmn < 1 is crucial for Onatski (2008), which we rely on in our further analysis. For this inequality to hold, it is sufficient that Hn weakly converges to a distribution H∞ with density bounded away from zero in the vicinity of the upper boundary of support lim sup l1n Hence, the inequality essentially requires that relatively large eigenvalues of Sne (ω0 ) do not scatter too much as n goes to infinity. Intuitively, such a requirement rules out situations where a few weighted averages of the idiosyncratic terms cause unusually large variation at frequency ω0 so that they can be misinterpreted as common dynamic factors. ASSUMPTION 4: The kth largest eigenvalue of Snχ (ω0 ) diverges to infinity faster than n2/3 . Assumption 4 relaxes the standard requirement that the factors’ cumulative effects on the cross-sectional units rise linearly in n (see, for example, Assumption A4 in Hallin and Liska (2007)). Allowing for cumulative factor effects which grow slower than linearly in n is important, because such a slower growth better corresponds to the current practice of creating larger macroeconomic data sets by going to higher levels of disaggregation instead of adding variables carrying genuinely new information. 3. THE TEST Before we explain the workings of our test, let us recall the notions of the complex Gaussian and complex Wishart distributions. We say that an ndimensional vector Y has a complex Gaussian distribution NnC (β Σ) if and
LARGE FACTOR MODELS
1453
only if the 2n-dimensional vector Z ≡ (Re Y Im Y ) which stacks the real and imaginary parts of Y has a usual Gaussian distribution N2n
1 Re Σ Re β Im β 2 Im Σ
− Im Σ Re Σ
Further, if Y1 Ym are independent n-dimensionalNnC (0 Σ) variates, then m we say that the n × n matrix-valued random variable j=1 Yj Yj has a complex Wishart distribution WnC (m Σ) of dimension n and degrees of freedom m. Our test is based on the following three observations. First, Assumption 2(i) implies that the vectors of the d.f.t. of our data admit an approximate factor structure in the sense of Chamberlain and Rothschild (1983). That is, Xˆ s (n) = Λˆ 0 Fˆs + eˆ s (n) + Rs (n) where, as Lemma 4 in the Appendix shows, Rs (n) with s = 1 m can be made uniformly arbitrarily small for T which is sufficiently larger than max(n m). Second, as is well known (see, for example, Theorem 4.4.1 of Brillinger (1981)), for fixed n and m, Assumption 2(ii) (or 2(ii)(a)) and our choice of the frequencies ωs , s = 1 m as multiples of 2π/T imply that eˆ 1 eˆ m converge in distribution to m independent NnC (0 2πSne (ω0 )) vectors and, hence, m the smoothed periodogram estimate Sˆne (ω0 ) ≡ s=1 eˆ s eˆ s /2πm of Sne (ω0 ) converges in distribution to a complex Wishart WnC (m Sne (ω0 )/m) random matrix. As Lemma 5 in the Appendix shows, the complex Wishart approximation of Sˆne (ω0 ) remains good even if n and m are not fixed as long as T grows sufficiently faster than n and m m Finally, let γ1 ≥ · · · ≥ γn be the eigenvalues of s=1 Xˆ s Xˆ s /2πm. Then, since Xˆ s have an approximate k-factor structure asymptotically, the eigenvalues γ1 γk must explode as n m → ∞ sufficiently slower than T while the rest of the eigenvalues must approach the eigenvalues of Sˆne (ω0 ) whose distribution becomes arbitrarily close to the distribution of the eigenvalues of a complex Wishart WnC (m Sne (ω0 )/m) matrix. The last observation implies that as n m → ∞ sufficiently slower than T our statistic R ≡ maxk0
1454
ALEXEI ONATSKI
LEMMA 1—Onatski (2008): Let Assumption 3 hold. Define n λcmn 1 μmn = 1+ dHn (λ) cmn m 1 − λcmn and 3 1/3 λcmn n 1 1+ dHn (λ) σmn = 2/3 m cmn m 1 − λcmn Then, for any positive integer r, as m and n tend to infinity so that n/m remains in a compact subset of (0 ∞), the joint distribution of the first r cen−1 −1 tered and scaled eigenvalues σmn (λ1 − μmn ) σmn (λr − μmn ) of matrix C e Wn (m Sn (ω0 )/m)weakly converges to the r-dimensional Tracy–Widom distribution of type 2. The univariate Tracy–Widom law of type 2 (henceforth denoted as TW2 ) refers to a distribution with a cumulative distribution function F(x) ≡ ∞ exp(− x (x − s)q2 (s) ds), where q(s) is the solution of an ordinary differential equation q (s) = sq(s) + 2q3 (s), which is asymptotically equivalent to the Airy function Ai(s) as s → ∞. It plays an important role in large random matrix theory because it is the asymptotic distribution of the scaled and normalized largest eigenvalue of a matrix from the so-called Gaussian unitary ensemble (GUE) as the size of the matrix tends to infinity. The GUE is the collection of all n × n Hermitian matrices with i.i.d. complex Gaussian N1C (0 1/n) lower triangular entries and (independent from them) i.i.d. real Gaussian N1 (0 1/n) diagonal entries. Let d1 ≥ · · · ≥ dn be eigenvalues of a matrix from GUE. Define d˜i = n2/3 (di − 2). Tracy and Widom (1994) studied the asymptotic distribution of a few of the largest eigenvalues of matrices from GUE when n → ∞. They described the asymptotic marginal distributions of d˜1 d˜r , where r is any fixed positive integer, in terms of a solution of a completely integrable system of partial differential equations. If we are interested in the asymptotic distribution of the largest eigenvalue only, the system simplifies to the single ordinary differential equation given above. The joint asymptotic distribution of d˜1 d˜r is called a multivariate TW2 distribution. Lemma 1 implies that as long as the distribution of γk+1 γk+r is well approximated by the distribution of λ1 λr , we can test our null hypothesis k = k0 by checking whether the scaled and centered eigenvalues γk0 +1 γk0 +r come from the multivariate TW2 distribution. Our test statistic R is designed so as to get rid of the unknown scale and center parameters σmn and μmn which makes such a testing strategy feasible. THEOREM 1: Let Assumptions 1, 2(i), 3, and 4 hold, and let n m, and T go to infinity so that n/m remains in a compact subset of (0 ∞). Then, for
LARGE FACTOR MODELS
1455
any positive integer r, if either Assumption 2(ii) holds and m = o(T 3/8 ) or Assumption 2(ii)(a) holds and m = o(T 1/2−1/p log−1 T )6/13 , the joint distribution of −1 −1 (γk+1 − μmn ) σmn (γk+r − μmn ) weakly converges to the r-dimensional σmn TW 2 distribution. The proof of the theorem is given in the Appendix. As the Monte Carlo analysis in the next section suggests, the rates required by Theorem 1 are sufficient but not necessary for the theorem to hold. Our test works well even when n is much larger than T . Two possible theoretical reasons for such good performance are as follows. First, Lemma 1 can, probably, be generalized to cases when n/m → ∞. El Karoui (2006) obtained such a generalization in the special spherical Wishart case. Second, although making n compatible with T hurts the quality of the Gaussian approximation for d.f.t.’s, the Gaussianity is likely unnecessary for the Tracy–Widom asymptotics to work. This is because the Tracy–Widom limit appears to be universal for a class of random matrices much wider than the class of complex Wishart matrices. The first important universality results were recently obtained by Soshnikov (2002) and Péché (2009). Further development of such results remains a challenge for mathematicians. Theorem 2 formally states the properties of our test. Its proof is given in the Appendix. THEOREM 2: Under the conditions of Theorem 1, if k = k0 , statistic R conλ −λ verges in distribution to max0
1456
ALEXEI ONATSKI
only if R˜ is above a critical value given in Table I. It can be shown that this test procedure is valid if the following modifications of Assumptions 1–4 hold: ASSUMPTION 1m: The factors Ft follow a fourth-order zero-mean stationary process with nondegenerate variance, with autocovariances Γij (u) ≡ EFit Fjt+u decaying to zero as u tends to infinity, and with fourth order cumulants cum(Fit1 Fjt2 Frt3 Fl0 ) decaying to zero as max(|t1 | |t2 | |t3 |) tends to infinity. For each n, vector et (n)is independent from Fs at all lags and leads, and follows a stationary Gaussian zero-mean process. ASSUMPTION 2m: (i) Λ(u) ij = 0 for any u = 0. (ii) cij (u) ≡ Eeit ejt−u are zero for any u = 0. ASSUMPTION 3m: As n and T tend to infinity so that n/T remains in a compact subset of (0 ∞), lim sup l1n < ∞, lim inf lnn > 0, and lim sup l1n cT/2n < 1, where l1n lnn refer to the eigenvalues of Eet (n)et (n) and cT/2n is defined as in Assumption 3, with Hn replaced by the spectral distribution of Eet (n)et (n). ASSUMPTION 4m: The kth largest eigenvalue of Λ(0) (n)Λ(0) (n) diverges to infinity faster than n2/3 . We have the following theorem. THEOREM 3: Under Assumptions 1m–4m, statistic R˜ behaves in the same way as statistic R in Theorem 2. Our test of the null of k = k0 approximate factors versus the alternative of k0 < k ≤ k1 is consistent and has asymptotically correct size. Proof of this theorem is similar to the proof of Theorem 2. It is available from the Supplementary Appendix posted on the Econometrica website (Onatski (2009b)). 5. A MONTE CARLO STUDY We approximate the 10-dimensional TW2 distribution by the distribution of the 10 largest eigenvalues of a 1000 × 1000 matrix from the Gaussian unitary ensemble. We obtain an approximation for the latter distribution by simulating 30,000 independent matrices from the ensemble and numerically computing their 10 first eigenvalues. The left panel of Figure 1 shows the empirical distribution function of the largest eigenvalue centered by 2 and scaled by n2/3 = 10002/3 . It approximates the univariate TW2 distribution. The right panel of Figure 1 shows the empirical distribution function of the ratio (x1 −x2 )/(x2 −x3 ), where xi denotes the ith largest eigenvalue of a matrix
LARGE FACTOR MODELS
1457
FIGURE 1.—Left panel: CDF of the univariate TW2 distribution. Right panel: CDF of the statistic (x1 − x2 )/(x2 − x3 ) where x1 x2 and x3 have joint multivariate TW2 distribution.
from GUE. It approximates the asymptotic cumulative distribution function (CDF) of our test statistic (γk0 +1 − γk0 +2 )/(γk0 +2 − γk0 +3 ) for the test of the null of k0 factors against an alternative that the number of factors is k0 + 1 The graph reveals that it is not uncommon to see large values of this statistic under the null. In particular, the first eigenvalue of the sample covariance matrix may be substantially larger than the other eigenvalues even when data have no factors at all. This observation suggests that ad hoc methods of determining the number of factors based on visual inspection and separation of the eigenvalues of the sample covariance matrix into a group of “large” and a group of “small” eigenvalues may be misleading. According to Theorems 1 and 2, the approximate asymptotic critical values of our test of k = k0 versus k0 < k ≤ k1 equal the corresponding percentiles of the empirical distribution of max0
1458
ALEXEI ONATSKI
the first-order autocorrelations of the estimated idiosyncratic components in the Stock and Watson (2002) data set. The k-dimensional factor vectors Ft are iid N(0 Ik ). The filters Λij (L) are randomly generated (independently from Ft ’s and eit ’s) by each one of the following two devices: • Moving Average (MA) Loadings: Λij (L) = aij(0) (1 + aij(1) L)(1 + aij(2) L) with i.i.d. and mutually independent coefficients aij(0) ∼ N(0 1) aij(1) ∼ U[0 1] and aij(2) ∼ U[0 1]. • Autoregressive (AR) Loadings: Λij (L) = bij(0) (1 − bij(1) L)−1 (1 − bij(2) L)−1 with i.i.d. and mutually independent coefficients bij(0) ∼ N(0 1), bij(1) ∼ U[08 09], and bij(2) ∼ U[05 06]. We borrowed the AR loadings design, including the distributional assumptions on bij(0) bij(1) , and bij(2) , from Hallin and Liska (2007). An analysis of the partial autocorrelation function for the estimated systematic part of the Stock– Watson data shows that an AR(2) is a reasonable model for the systematic part of that data. k We normalize the systematic components j=1 Λij (L)Fjt and the idiosyncratic components eit so that their variances equal 04 + 005k and 1 − (04 + 005k) respectively. Hence, for example, the factors explain 50% of the data variation in a two-factor model and 75% of the data variation in a seven-factor model. Figure 2 shows the p-value discrepancy plots and size–power curves2 for our test of H0 : k = 2 versus H1 : 2 < k ≤ 7. The power of the test is computed when k = 7. The graphs are based on 8000 replications of data with (n T ) =
FIGURE 2.—Size–power properties of the test of H0 : k = 2 versus H1 : 2 < k ≤ 7. (n m T ) = (150 65 500). 2 On the x axis and y axis of a p-value discrepancy plot, we have the nominal size of a test and the difference between the finite-sample size and the nominal size, respectively. A size–power curve is a plot of the power against the true size of a test.
1459
LARGE FACTOR MODELS
(150 500). Such a choice of n and T mimics the size of the Stock–Watson data set. The d.f.t.’s used by the test are computed at frequencies ωi = 2πi/T , i = 1 65, and, therefore, m = 65. We checked that the figure does not change much for m on a grid 35:5:125. Note that, for our choice of T = 500 and m = 65, the interval [ω1 ωm ] covers a wide range of cycles from 500/65 ≈ 8 periods per cycle to 500 periods per cycle. Therefore, we cannot unambiguously name a specific frequency ω0 which is approximated by ω1 ωm . In the rest of the paper, instead of specifying a particular ω0 we simply report the range of frequencies ω1 ωm used by the test. As can be seen from Figure 2, the size of our test is only slightly larger than the nominal size and the power is around 100% for the test of actual sizes greater than 1%. As a robustness check, we change the distribution of the idiosyncratic terms’ innovations uit from N(0 1) to the centered chi-squared distribution χ2 (1) − 1 and to the Student t(5) distribution. For AR loadings, we find that the worst size discrepancy of the nominal 5% size test equals 0.027 and corresponds to the chi-squared case. For MA loadings, the worst size discrepancy of the nominal 5% size test equals 0.021 and corresponds to the t(5) case. For both AR and MA loadings, we find no notable changes in the size–power curves. Table II describes the sensitivity to changes in the sample size and the sensitivity to the choice of the approximating frequencies of the size and power of the test of H0 : k = 2 versus H1 : 2 < k ≤ 7. Panels A and B of the table correTABLE II THE ACTUAL SIZE OF THE NOMINAL 5% SIZE TEST AND THE POWER OF THE TRUE 5% SIZE TEST WHEN k = 7 FOR DIFFERENT SAMPLE SIZES n: T: m:
1000 250 60
250 1000 70
150 500 65
A. ωs = 2πs/T AR, size AR, power % variation due to 2 factors
64 100 67
61 100 83
MA, size MA, power % variation due to 2 factors
53 100 66
B. ωs = 2π([T/2] + s − m − 1)/T AR, size 51 AR, power 44 % variation due to 2 factors 21 MA, size MA, power % variation due to 2 factors
56 88 26
500 150 40
100 250 40
250 100 40
70 150 30
150 70 30
70 70 30
67 100 76
68 92 63
76 94 73
82 33 51
89 27 69
84 15 45
87 13 46
56 100 67
62 100 67
57 100 65
72 100 67
66 99 58
68 86 67
68 81 56
65 51 56
57 45 15
47 47 17
50 49 24
54 49 18
63 56 72
53 51 19
72 58 12
56 64 12
47 52 96
56 50 14
54 23 31
59 47 17
67 66 46
62 50 21
72 25 48
71 14 48
1460
ALEXEI ONATSKI
spond to ωs = 2πs/T and ωs = 2π([T/2] + s − m − 1)/T respectively. The first of the above choices of ωs covers relatively low frequencies, while the second choice covers relatively high frequencies. Note that although the simulated data are scaled so that two factors explain 50% of the all-frequency variation, the factors’ explanatory power is larger at low frequencies and smaller at high frequencies. The raws of Table II, which are labeled “% variation due to 2 factors,” show how much of the variation of the simulated series Xit in the interval [ω1 ωm ] is due to its two-factor systematic component χit .3 As can be seen from the table, the size of the test remains good for most of the sample-size–frequency-range combinations. It remains good even in the cases when, contrary to the assumption of Theorem 1, n is larger than T . We discussed possible theoretical reasons for such good performance in Section 3 above. Not surprisingly, the power of the test strongly depends on the explanatory power of factors and on whether n is large or small. In the AR section of panel B, where factors explain only 2% of the data variation, the test does not have power at all. More important, as the table shows, the explanatory power of factors, and hence the power of the test, may strongly depend on the frequency range chosen. Such a dependence may be both advantageous and harmful for researchers. On one hand, if there is a priori knowledge about the frequency content of the factors, the frequency range can be chosen so that the test is more powerful. Further, if the interest is in studying a particular frequency range, say, business cycle frequencies, the analysis will not be contaminated by factors operating at very different frequencies. On the other hand, the test may fail to detect the factors if the systematic variation happens to be low in the frequency range chosen. One way to address this problem is to combine the tests at, say, N different frequencies. A natural combined test statistic would be maxj=1N Rj , where Rj are the test statistics of the separate tests. Assuming that N stays fixed as m, n, and T go to infinity and that the different approximating frequency ranges converge to distinct frequencies, it is possible to show4 that under the null, maxj=1N Rj is asymptotically distributed as maxj=1N max0
LARGE FACTOR MODELS
1461
longer be asymptotically independent and the asymptotic distribution of the combined test statistic will not be as above. We leave the formal analysis of combined test procedures for future research. In the remaining two Monte Carlo experiments, we focus on the sample size (n T m) = (150 500 65) and on the frequency range ωs = 2πs/T , s = 1 m. First, we explore in more detail the sensitivity of the size and power of our test to changes in the proportion of the systematic variation in the data. We renormalize the systematic and idiosyncratic components so that their variances equal α(04 + 005k) and 1 − α(04 + 005k) respectively. For k = 2 factors, our benchmark choice α = 1 corresponds to 50% of the all-frequency variation being systematic. For α = 02 the proportion of the systematic variation drops to 10%. When α decreases from 1 to 0.8, 0.6, 0.4, and 0.2, the proportion of the systematic variation at the frequency range [ω1 ωm ] decreases from 76 to 70, 61, 50, and 34% for AR loadings, and from 67 to 59, 50, 39, and 24% for MA loadings, respectively. We find that the size of our test is not sensitive to the choice of α but the power of the test is. For AR loadings and α = 1 08 02, the power of the test of actual size 5% equals 100, 100, 97, 58, and 8%, respectively. For MA loadings and α = 1 08 02, the power equals 100, 100, 96, 39, and 5%, respectively. In our last Monte Carlo experiment, we study the effect of increasing crosssectional correlation in idiosyncratic terms. We find that increasing parameter ρ in the equation υit = ρυi−1t + uit up to ρ = 08 does not notably affect either the size or the power of the test. However, for very large values of ρ, the size and the power deteriorate. For ρ = 098, the actual size of the nominal 5% size test equals 10.5 and 10.6% for AR and MA loadings, respectively, while the power of the actual 5% size test completely disappears. For ρ = 095, the actual size equals 8.6 and 7.8%, while the power equals 13.3 and 7.7% for AR and MA loadings, respectively. In addition to the above Monte Carlo exercises, we did a Monte Carlo analysis of the approximate factor model version of our test. The analysis shows that its size and power properties are very good. To save space, we will not report these results here. 5.2. Comparison to Other Tests To the best of our knowledge, there are no alternatives to our test in the general generalized dynamic factor setting.5 However, in the special case of the approximate factor models, one can also use tests proposed by Connor and 5
Jacobs and Otter (2008) developed a test that can be used when the dynamic factor loadings Λij (L) are lag polynomials of known finite order r This test uses fixed-n/large-T asymptotics. Although it works well for n < 40 it is very strongly oversized for n ≥ 70
1462
ALEXEI ONATSKI
Korajczyk (1993) and Kapetanios (2005). Let us briefly describe these alternatives. Connor and Korajczyk (1993) tested the null of p = p0 approximate factors against the alternative that p = p0 + 1 Their test uses the fixed-T /large-n asymptotics and is based on the idea that the explanatory power of an extra p0 + 1 factor added to the model should be small under the null and large under the alternative. Kapetanios (2005) tested the null of p = p0 approximate factors against the alternative that p0 < p ≤ p1 . He employed a subsampling method to approximate the asymptotic distribution of a test statistic λp0 +1 − λp1 +1 , where λi is the ith largest eigenvalue of the sample covariance matrix of the data T 1 t=1 Xt Xt . He made high-level assumptions about the existence of a scaling T of λp0 +1 − λp1 +1 which converges in distribution to some unknown limit law, about properties of such a law, and about the functional form of the scaling constant. To compare our test with the Connor–Korajczyk and the Kapetanios tests,6 k we simulate data from an approximate factor model Xit = j=1 Λij Fjt + eit , where Λij are i.i.d. N(0 1); Fjt = 085Fjt−1 + εjt with εjt i.i.d. N(0 1); eit are as in Section 5.1 with ρi ∼ U[−08 08] ρ = 02, or, alternatively, ρ = 07; and k j=1 Λij Fjt and eit are scaled to have equal variances for each i. Table III reports the actual size of the nominal 5% size tests and the power of the actual 5% size tests for the hypothesis of two versus three approximate factors. The size and power computations are based on 10,000 Monte Carlo replications. The row labeled “Onatski (dynamic)” corresponds to our dynamic factor test TABLE III SIZE–POWER PROPERTIES OF THE ALTERNATIVE TESTS Size (in %) n: T:
Power (in %)
70 70
150 70
70 150
70 70
150 70
70 150
ρ = 02 Onatski (dynamic) Onatski (approximate) Kapetanios Connor–Korajczyk
721 605 108 225
837 621 127 240
709 621 686 232
949 967 100 767
992 100 100 895
100 100 100 996
ρ = 07 Onatski (dynamic) Onatski (approximate) Kapetanios Connor–Korajczyk
701 652 243 445
709 644 193 374
703 590 176 624
575 610 100 336
931 943 100 598
949 945 100 695
6
We are grateful to George Kapetanios for sharing his codes with us.
LARGE FACTOR MODELS
1463
with ωs = 2πs/T s = 1 30 The row labeled “Onatski (approximate)” corresponds to our approximate factor test described in Section 4. As can be seen from the table, our tests are well sized for both relatively weak (ρ = 02) and relatively strong (ρ = 07) cross-sectional correlation of the idiosyncratic terms. The power of our tests is always greater than 90%, except for the case of n = T = 70 and ρ = 07 when it drops to about 60%. The size and power properties of the Connor–Korajczyk test are clearly worse than those of our tests. The power of the Kapetanios test is excellent. However, the test is substantially more oversized than our tests. The size distortion for the Kapetanios test becomes very large for the case of the relatively strong correlation of the idiosyncratic terms. 5.3. Using the Test to Determine the Number of Factors Although our test’s primary objective is testing hypotheses about the number of factors, it can also be used to determine the number of factors in a data set as follows. Suppose that it is known a priori that k1 ≤ k ≤ k2 . Using our test of asymptotic size α, test H0 : k = k1 versus H1 : k1 < k ≤ k2 . If H0 is not rejected, stop. The estimated number of factors is k1 . If H0 is rejected, test H0 : k = k1 + 1 versus H1 : k1 + 1 < k ≤ k2 . Repeat the procedure until H0 is not rejected and take the corresponding number of factors as the estimate. The estimate equals the true number of factors with probability approaching 1 − α as n grows. In the Monte Carlo simulations below, we choose k1 = 1, k2 = 4, and α equal to the maximum of 001 and the p-value of the test of H0 : k = 0 versus H1 : 0 < k ≤ k1 . That is, α is calibrated so that our test has enough power to reject the a priori false null of k = 0 Our Monte Carlo design is the same as in Section 5.1, except we now magnify the idiosyncratic part of the simulated data by σ ≥ 1 which makes the determination of the number of factors a more difficult problem. We compare the above estimator to the Bai and Ng (2007) and the Hallin and Liska (2007) estimators with the following parameters (denoted as in the ˆ 1k statistic for corresponding papers). For the Bai–Ng estimator, we use the D the residuals of VAR(4), set the maximum number of static factors at 10, and consider δ = 01 and m = 2. For the Hallin–Liska estimator, we use the information criterion ICT2;n with penalty p1 (n T ), set the truncation parameter √ MT at [07 T ], and consider the subsample sizes (nj Tj ) = (n − 10j T − 10j) with j = 0 1 3 so that the number of the subsamples is J = 4 We chose the penalty multiplier c on a grid 0.01:0.01:3 using Hallin and Liska’s second “stability interval” procedure. To make our estimator comparable to the alternatives, which impose the correct restriction that the number of factors is the same across different frequen-
1464
ALEXEI ONATSKI
cies, we compute our estimates for frequencies on a grid in the [0 π] range.7 We then weight these frequencies proportionally to their factor information content, as measured by the square of the sum of the four largest eigenvalues of the corresponding estimate of the data’s spectral density. Our final estimate of the number of factors equals the estimate which agrees for most of the frequencies on the grid counted according to their weights. Table IV reports the percentages of 500 Monte Carlo (MC) replications that ˆ The best possible estimadeliver 1 2 3, and 4 estimated number of factors k ˆ tor has k = 2 for 100% of the replications. For AR loadings, the Hallin–Liska estimator performs the best. Our estimator is almost as good as the Hallin– Liska estimator and is much better than the Bai–Ng estimator, which tends TABLE IV PERCENTAGE OF 500 MONTE CARLO REPLICATIONS RESULTING IN ESTIMATES 1 2 3, OR 4 OF THE TRUE NUMBER OF FACTORS k = 2 ACCORDING TO DIFFERENT ESTIMATORS. THE ESTIMATORS ARE CONSTRAINED TO BE IN THE RANGE FROM 1 TO 4 Onatski n
T
σ2
MA Loadings 70 70 1 70 70 2 70 70 4
kˆ =
Hallin–Liska
Bai–Ng
1
2
3
4
1
2
3
4
1
2
3
4
0 0 12
100 100 80
0 0 6
0 0 2
0 0 85
100 100 15
0 0 0
0 0 0
1 1 52
99 99 48
0 0 0
0 0 0
100 100 100
120 120 120
1 3 6
0 0 3
100 100 94
0 0 3
0 0 0
0 0 31
100 100 69
0 0 0
0 0 0
0 0 56
100 100 44
0 0 0
0 0 0
150 150 150
500 500 500
1 8 16
0 0 2
100 100 96
0 0 2
0 0 0
0 0 49
100 100 51
0 0 0
0 0 0
0 0 100
100 100 0
0 0 0
0 0 0
AR Loadings 70 70 1 70 70 2 70 70 4
16 29 62
77 66 27
6 4 7
1 1 4
12 18 88
88 82 12
0 0 0
0 0 0
21 26 81
45 74 19
32 0 0
2 0 0
100 100 100
120 120 120
1 3 6
0 10 35
95 81 51
4 7 8
1 2 6
0 4 42
100 96 58
0 0 0
0 0 0
79 11 70
16 89 30
4 0 0
1 0 0
150 150 150
500 500 500
1 8 16
0 0 5
97 98 83
2 1 7
1 1 5
0 0 6
100 100 94
0 0 0
0 0 0
88 1 96
12 99 4
0 0 0
0 0 0
√ More precisely, we consider a linear grid of [07 T ] frequencies in a [0 π − 2πm ] range. Each T frequency ω0 on the grid is approximated by frequencies of the form ω0 + 2π j with j = 1 m T We set m = 30 40, and 65 for the data sizes (n T ) = (70 70) (100 120), and (150 500) respectively. 7
LARGE FACTOR MODELS
1465
to considerably underestimate the number of factors for the smallest and the largest σ 2 considered. For MA loadings and relatively less noisy data (relatively small σ 2 ), all estimators work very well. For very noisy data, however, our estimator outperforms both the Hallin–Liska and the Bai–Ng estimators, which tend to underestimate the number of factors. The reason why the excellent performance of the Hallin–Liska estimator deteriorates as the data get noisier is as follows. The Hallin–Liska automatic determination of the penalty multiplier utilizes the fact that as long as the multiplier is too small or too large, the estimated number of factors should change from a wrong number of factors to the true number as n increases, whereas if the multiplier is appropriate, the estimate should remain equal to the true number of factors. However, for very noisy data, even if the multiplier is chosen correctly, the stabilization of the estimator at the true number of factors requires n to be very large. Hence, choosing relatively small n in our MC experiments leads to the breakdown in the automatic determination procedure. One possible remedy is to make the subsample sizes used by the Hallin– Liska estimator larger. For example, instead of nj = n − 10j and Tj = n − 10j consider nj = n−3j and Tj = n−3j. Such a change indeed improves the quality of the estimator in the noisy data cases. For MA loadings and n = T = 70 the estimator’s percentage of correct answers increases from 15 to 89% for σ 2 = 4. At the same time, the quality of the estimator for σ 2 = 1 and σ 2 = 2 remains very good. It overestimates the true number of factors only in 2% of the corresponding MC experiments. 6. APPLICATION In this section, we test different hypotheses about the number of dynamic factors in macroeconomic time series and about the number of dynamic factors driving excess stock returns. The literature is full of controversy about the number of dynamic factors driving macroeconomic and financial data. Stock and Watson (2005) estimated seven dynamic factors in their data set. Giannone, Reichlin, and Sala (2005) found evidence supporting the existence of only two dynamic factors. Very recently, Uhlig (2009), in his discussion of Boivin, Giannoni, and Mojon (2009), demonstrated that in the European data, there might be no common factors at all and that the high explanatory power of few principal components of the data reported by Boivin, Giannoni, and Mojon (2009) may be an artifact of the high persistence of the individual time series in their data. Similarly, previous studies of factors driving excess stock returns often do not agree and report from one to six such factors (see Onatski (2008) for a brief review of available results). In this section we consider two macroeconomic and one financial data sets. The first macroeconomic data set is the same as in Stock and Watson (2002). It includes n = 148 monthly time series for the United States from 1959:1 to 1998:12 (T = 480). The variables in the data set were transformed, standardized, and screened for outliers as described in Stock and Watson (2002).
1466
ALEXEI ONATSKI
The second macroeconomic data set is a subset of the data from Boivin, Giannoni, and Mojon (2009). It includes n = 243 quarterly time series for Germany, France, Italy, Spain, the Netherlands, and Belgium (33 series per country plus 12 international series) from 1987:Q1 to 2007:Q3 (T = 83). The series were regressed on current oil price inflation and short term interest rates as in Boivin, Giannoni, and Mojon (2009) and the standardized residuals were taken as the data to which we apply our test. We are grateful to Stock and Watson (2002) and Boivin, Giannoni, and Mojon (2009) for sharing their data sets with us. Our financial data set is provided by the Center for Research in Security Prices. It includes n = 972 monthly excess returns on stocks traded on the NYSE, AMEX, and NASDAQ during the entire period from January 1983 to December 2006. Since previous empirical research suggests that the number of common risk factors may be different in January and non-January months, we drop January data, which leaves us with T = 264 For our test, we use the d.f.t.’s of the Stock–Watson data at frequencies ωj = 2πsj /480 with sj ∈ {4 40}; the d.f.t.’s of the Boivin–Giannoni–Mojon data at frequencies ωj = 2πsj /83 with sj ∈ {2 30}; and the d.f.t.’s of the stock return data at frequencies ωj = 2πsj /264 with sj ∈ {1 60}. We used two criteria to make these choices of the approximating frequencies. First, for the macroeconomic data, we wanted to include the business cycle frequencies, but exclude cycles longer than 10 years. Second, we wanted to have at least 30 approximating frequencies, so that our asymptotic analysis may apply. Table V reports the values of the ratio (γi − γi+1 )/(γi+1 − γi+2 ) for different i. Using these values, a reader can compute the R statistic for the test of her favorite null against her favorite alternative. Then she can use the critical values in Table I to perform the test. For example, an interesting hypothesis to test for the stock return data is H0 : k = 1 against H1 : 1 < k ≤ 3 The value of our R statistic for this test is the maximum of (γ2 − γ3 )/(γ3 − γ4 ) and (γ3 − γ4 )/(γ4 − γ5 ), which equals 3.20. Comparing this value to the critical values from Table I, we find that we cannot reject the null with the 15% size test. TABLE V VALUES OF THE RATIO (γi − γi+1 )/(γi+1 − γi+2 ) FOR DIFFERENT i i ω0 (in years per cycle)
Stock–Watson Data ω0 ∈ [1 10]
1
2
3
4
5
6
7
9.90
5.01
0.75
3.44
1.78
0.75
1.28
Boivin–Giannoni–Mojon Data ω0 ∈ ( 128 10 124 ] 3.76
1.64
1.14
2.17
0.74
5.44
0.39
Stock Return Data ω0 ∈ ( 124 22]
3.20
1.97
0.96
2.00
1.00
2.37
7.43
LARGE FACTOR MODELS
1467
A particularly interesting hypothesis for the Stock–Watson macroeconomic data is that of two versus three to seven dynamic factors. The value of the R statistics for such a test is 3.44. We cannot reject the null with the 15% size test. At the same time, the null of zero factors versus the alternative of one or two factors can be rejected with the 2% size test. The hypothesis of zero factors is especially interesting for the Boivin– Giannoni–Mojon data (we are grateful to Harald Uhlig for suggesting that we analyze these data). As mentioned above, Uhlig (2009) pointed out that the high explanatory power of a few principal components extracted from these data may be purely an artifact of the high temporal persistence of the data. The data are very persistent because they represent year-to-year growth rates of macroeconomic indicators (in contrast, the Stock–Watson data mostly represent the monthly growth rates). For the Boivin–Giannoni–Mojon data, we cannot reject the null of zero factors versus any of the alternatives: 1 1 to 2 1 to 7 dynamic factors by the 5% size test. The null of zero factors versus the alternative of one or two factors can be rejected only by the 15% size test. The p-values for the test of zero factors versus 1 to 3 1 to 7 alternatives are even larger. The large temporal persistence of the idiosyncratic terms in the Boivin– Giannoni–Mojon data takes us somewhat outside the setting of the Monte Carlo experiments in Section 5. To check the finite-sample performance of our test in the high-persistence environment, we simulate 8000 data sets as in Uhlig (2009) (we thank Harald Uhlig for sharing his codes with us). These data sets consist of 243 independent times series (hence, no common factors exist) of length 83 with the first-order autocorrelations equal to the first-order autocorrelations of the corresponding Boivin–Giannoni–Mojon series. The data were transformed exactly as the Boivin–Giannoni–Mojon data and the tests of zero versus 1 1 to 2 1 to 5 dynamic factors were performed. The actual sizes of our 5% nominal size tests (based on 8000 simulations) were 7.9, 7.9, 8.3, 8.4, and 8.4% for the alternatives of 1, 1 to 2 1 to 5 factors, respectively. To assess the power of the test to detect the five factors extracted by Boivin, Giannoni, and Mojon (2009), we add the five-factor systematic components estimated as in that paper to the series simulated as above (normalized to match the standard deviations of the estimated idiosyncratic components). We find that the power of the nominal 5, 10, and 15% tests of H0 : k = 0 versus H1 : 0 < k ≤ 5 equals 51, 75, and 88%, respectively. We conclude that there is only very mild evidence that there are one or two common dynamic factors in the Boivin–Giannoni–Mojon data. If such factors exist, they might explain less variation in the data than suggested by the high explanatory power of the few principal components extracted from the data.
1468
ALEXEI ONATSKI
7. DISCUSSION Our test can be based on other statistics than γi − γi+1 k0
R ≡ max
Indeed, under the assumptions of Theorem 1, any function f (γk0 +1 γk0 +r ) which is invariant with respect to scaling and normalizing of its arguments converges in distribution to a function of the r-dimensional TW2 under the null. It is not difficult to see that, except for a zero probability domain where some γi coincide, all such functions have the form g(ρ), where γk0 +i − γk0 +i+1 i = 1 r − 2 ρ≡ γk0 +i+1 − γk0 +i+2 A theoretically sound choice of g(ρ) would be the ratio of the likelihood under the alternative to the likelihood under the null, L1 (ρ)/L0 (ρ). However, the alternative does not specify any information about the first k − k0 elements of ρ except that (γk − γk+1 )/(γk+1 − γk+2 ) tends to infinity. A feasible choice of g(ρ) would be a pseudolikelihood ratio when L1 (ρ) is an “improper density” that diverges to infinity as any of its first k1 − k0 arguments diverges to infinity. We tried a test based on L1 (ρ)/L0 (ρ) with L1 (ρ) = R We estimated L0 (ρ) using kernel methods and a simulated sample from the multivariate TW2 distribution. The test has comparable size and somewhat better power properties than our benchmark test. However, the necessity of estimating L0 (ρ) makes the test much more complicated than the benchmark, and thus less attractive from a practical perspective. Our test is one of many possible “simple” tests based on g(ρ) = R We have tried a variety of choices of g(ρ) in addition to g(ρ) = R For example, we considered wi g(ρ) = R˜ ≡ max
k0
i
k1 +1
γj − (1 − wi )
j=k0 +1
j=i+1
γi+1 − γk1 +2
γj
where wi = (1 − ((i − k0 )/(k1 − k0 + 1))) In contrast to R which maximizes “local” measures of the curvature of the scree plot at would-be kink points under the alternative, statistic R˜ maximizes wider measures of the curvature. However, none of the alternative statistics which we tried unambiguously outperformed R in terms of the size and power of the corresponding test Now we would like to address another important issue: it is unlikely that the factor model with few common factors literally describes the data. For example, macroeconomic data often have a clear block structure, with blocks corresponding to different categories of economic indicators, such as prices or
LARGE FACTOR MODELS
1469
output measures, included in the data set. Such data are perhaps better described by a factor model with few strong factors and many “weaker” factors that correspond to the different data categories. Although a comprehensive analysis of such a model is beyond the scope of this paper, we would like to point out two facts. First, as shown in Onatski (2009a), when the explanatory power of a factor is not much larger than that of the idiosyncratic components, it is impossible to empirically distinguish between the two. More important, a situation with few dominant and many “weaker” factors can alternatively be modeled as a situation where there are few factors and the idiosyncratic components are strongly influential. We therefore expect that if the distinction between strong and weak factors is not sharp, our test will work as if there were no factors at all. In contrast, if the distinction is sharp, the test will work as if the true number of factors were equal to the number of strong factors. To check this intuition, we generate data with two strong factors, each of which explains s · 100% of the data’s variation, and ten weak factors, each of which explains w · 100% of the data’s variation, where s > w. We consider three (s w) pairs: (1/6 1/30) (1/8 1/24), and (1/12 1/20). Figure 3 plots the power (based on 2000 MC replications of the data with MA loadings) against the nominal size of the test of H0 : k = 0 versus H1 : 0 < k ≤ 2 (left panel) and of the test of H0 : k = 2 versus H1 : 2 < k ≤ 4 factors (right panel). As expected, when the explanatory power of the two strong factors is much larger than that of the weak factors (solid line), the test has a lot of power
FIGURE 3.—Power plotted against the nominal size. Left panel: Test of H0 : k = 0 versus H1 : 0 < k ≤ 2. Right panel: Test of H0 : k = 2 versus H1 : 2 < k ≤ 4. Dotted, dashed, and solid lines in that order correspond to the increasing explanatory power of the two strong factors.
1470
ALEXEI ONATSKI
against the null of zero factors, but no power against the null of two factors. When the relative strength of the strong factors fades (dashed line corresponds to (s w) = (1/8 1/24) and dotted line corresponds to (s w) = (1/12 1/20)), the power against the null of zero factors gradually disappears. 8. CONCLUSION In this paper, we have developed a test of the null hypothesis that there are k0 factors, versus the alternative that there are more than k0 but less than k1 + 1 factors in a generalized dynamic factor model. Our test is based on the statistic maxk0
ij
|Aij |2 .
This is a well known result. See, for example, Horn and Johnson (1985, p. 421).
LARGE FACTOR MODELS
1471
LEMMA 3: Let n and m go to infinity, and let A(nm) and B(nm) be random n × m matrices such that σ12 (A(nm) − B(nm) ) = op (n−1/3 ) and σ12 (B(nm) ) = Op (n). Then |σk2 (A(nm) ) − σk2 (B(nm) )| = op (n1/3 ) uniformly over k. PROOF: To simplify notation, we omit the index (n m) over A and B. Weyl’s inequalities (see p. 423 of Horn and Johnson (1985)) for singular values of any n × m matrices F and G are σi+j−1 (F + G) ≤ σi (F) + σj (G) where 1 ≤ i j ≤ min(n m) and i + j ≤ min(n m) + 1. Taking i = 1, j = k, and F = A − B, and considering G = −A and G = B, we obtain |σk (A) − σk (B)| ≤ σ1 (A − B) = op (n−1/6 ) But |σk2 (A) − σk2 (B)| = |σk (A) − σk (B)|(σk (A) + σk (B)) ≤ σ1 (A − B)(2σ1 (B) + op (n−1/6 )) = op (n1/3 ) uniformly over k Q.E.D. LEMMA 4: Let n, m, and T go to infinity so that n ∼ m = o(T 3/7 ). Then, unˆ = op (n−1/3 ), where χˆ ≡ [χˆ 1 (n) der Assumptions 1 and 2(i), σ12 (χˆ − Λˆ 0 F) ∞ χˆ m (n)], Fˆ ≡ [Fˆ1 Fˆm ], and Λˆ s ≡ u=0 Λ(u) (n)e−iuωs . PROOF: Write χˆ − Λˆ 0 Fˆ as P + Q where P and Q are n × m matrices with sth columns Ps ≡ χˆ s (n) − Λˆ s Fˆs and Qs ≡ (Λˆ s − Λˆ 0 )Fˆs , respectively. First, consider matrix P. Interchanging the order of the sums in the definition χˆ s (n) ≡ T ∞ (u) −itωs √1 and changing summation index t to τ = t − u, t=1 u=0 Λ (n)Ft−u e T we obtain the representation ∞ 1 (u) Ps ≡ χˆ s (n) − Λˆ s Fˆs = √ Λ (n)e−iuωs ru T u=0
where
min(T −u0)
ru =
τ=−u+1
Fτ e−iτωs −
T
Fτ e−iτωs
τ=max(0T −u)+1
Using this representation, we obtain E Ps 2 ≤ ≤
∞
1
(u) (v) E ru Λ (n) Λ (n)rv
T uv=0 ∞ 1 Λ(u) (n)Λ(v) (n)E( ru rv ) T uv=0
∞ 1 Λ(u) (n)Λ(v) (n)(E ru 2 E rv 2 )1/2 ≤ T uv=0
2 ∞ 1 (u) 2 1/2 Λ (n) (E ru ) = T u=0
1472
ALEXEI ONATSKI
But
min(T −u0)
E ru 2 =
Fτ 2 +
T
Fτ 2 = 2k min(u T )
τ=max(0T −u)+1
τ=−u+1
because Ft is a k-dimensional white noise. Therefore, by Lemma 2, Eσ (P) ≤ 2 1
m s=1
2 ∞ m (u) 1/2 Λ (n)(2ku) E Ps ≤ T u=0 2
which is o(n−1/3 ) by Assumption 2(i) and by the assumption that n ∼ m = o(T 3/7 ). Hence, by Markov’s inequality, σ1 (P) = op (n−1/6 ) Next, consider matrix Q. Note that Λˆ s − Λˆ 0 ≤
∞ (u) −iuω s Λ (n)|e − e−iuω0 | u=0
≤
∞ (u) 2π(m + 1) Λ (n)u T u=0
which is o(n−5/6 ) uniformly in s by Assumption 2(i) and by the assumption that m n ∼ m = o(T 3/7 ). Further, since Ft is a k-dimensional white noise, s=1 Fˆs 2 = Op (m) Finally, by Lemma 2, σ12 (Q) ≤
m s=1
E Qs 2 ≤ max Λˆ s − Λˆ 0 2 s
= o n−5/3 · Op (m) = op n−2/3
m
Fˆs 2
s=1
so that σ1 (Q) = op (n−1/3 ). Now, the statement of the lemma follows from the ˆ ≤ (σ1 (P) + σ1 (Q))2 = op (n−1/3 ). fact that σ12 (χˆ − Λˆ 0 F) Q.E.D. LEMMA 5: Let eˆ ≡ [eˆ 1 (n) eˆ m (n)]. Then, under the assumptions of Theorem 1, there exists an n × m matrix e˜ with independent NnC (0 2πSne (ω0 )) columns, ˆ and such that σ 2 (eˆ − e) ˜ = op (n−1/3 ). independent from F 1 PROOF: First, suppose that Assumption 2(ii) holds and n ∼ m = o(T 3/8 ). Define η ≡ ((Re eˆ 1 (n)) (Im eˆ 1 (n)) (Re eˆ m (n)) (Im eˆ m (n)) ) . Using Theorem 4.3.2 of Brillinger (1981), which characterizes mixed cumulants of
LARGE FACTOR MODELS
1473
discrete Fourier transforms, it is not difficult to show8 that Eηη = V + R with a block diagonal Re Sne (ω0 ) − Im Sne (ω0 ) V = πIm ⊗ Im Sne (ω0 ) Re Sne (ω0 ) and Rij = δ[i/2n][j/2n] O(m/T ) + O(T −1 ), where δst is the Kronecker delta, and O(m/T ) and O(T −1 ) are uniform in i and j running from 1 to 2nm. Construct η˜ = V 1/2 (V + R)−1/2 η and define an n × m matrix e˜ with the sth columns e˜ s so that ((Re e˜ 1 ) (Im e˜ 1 ) (Re e˜ m ) (Im e˜ m ) ) = η. ˜ Note that e˜ has independent NnC (0 2πSne (ω0 )) columns by construction. Using inequalities BA 2 ≤ B A 2 and AB 2 ≤ A 2 B (see, for example, Horn and Johnson (1985, Problem 20, p. 313)), we obtain 2 E η − η ˜ 2 = (V + R)1/2 − V 1/2 2 4 2 1/2 ≤ V 1/4 I + V −1/2 RV −1/2 − I 2 Denote the ith largest eigenvalue of V −1/2 RV −1/2 as μi and note that |μi | ≤ 1 for large enough T . Since |(1 + μi )1/2 − 1| ≤ |μi | for any |μi | ≤ 1, the ith eigenvalue of (I + V −1/2 RV −1/2 )1/2 − I is no larger by absolute value than the ith eigenvalue of V −1/2 RV −1/2 for large enough T . Therefore, 4 2 4 4 E η − η ˜ 2 ≤ V 1/4 V −1/2 RV −1/2 2 ≤ V 1/4 V −1/2 R 22 But V 1/4 = (πl1n )1/4 and V −1/2 = (πlnn )−1/2 by construction, and R 22 =
2nm
2
δ[i/2n][j/2n] O(m/T ) + O(T −1 )
ij=1
= m(2n)2 O(m2 /T 2 ) + (2mn)2 O(T −2 ) = o n−1/3 because n ∼ m = o(T 3/8 ). Hence,
E η − η ˜ 2 ≤ (πl1n )(πlnn )−2 o n−1/3 = o n−1/3
−1 remain bounded as n m → ∞ where the last equality holds because l1n and lnn by Assumption 3. Finally, Lemma 2 and Markov’s inequality imply that σ12 (eˆ − ˜ = op (n−1/3 ). e) Now, suppose that 2(ii)(a) holds and m = o(T 1/2−1/p log−1 T )6/13 . Assumption ∞ In this case, eˆ is = j=1 Aij uˆ js , where uˆ js is the d.f.t. of ujt at frequency ωs . For fixed j and ω0 = 0, Phillips (2007) showed that there exist i.i.d. complex
8
For details of the derivation, see the Supplementary Appendix.
1474
ALEXEI ONATSKI
normal variables ξjs , s = 1 m such that uˆ js − ξjs = op (m/T 1/2−1/p ) uniformly over s ≤ m. Lemmas S1, S2, and S3 in the Supplementary Appendix extend Phillips’ proof to the case ω0 = 0 and show that there exist Gaussian processes uG jt with the same autocovariance structure as ujt and independent over j ∈ N such that the differences between the d.f.t.’s uˆ js − uˆ G js ≡ rjs satisfy 2 2 2 2/p−1 log T for large enough T , where K > 0 supj>0 E(maxs≤m |rjs |) ≤ Km T ∞ depends only on p μp supj≥1 ( k=0 k|cjk |)p , and supj≥1 |Cj (e−iω0 )|. ∞ G Assumption 2(ii)(a) implies that the process eG it = j=1 Aij ujt satisfies AsG sumption 2(ii). Let eˆ be the n × m matrix with i sth entry equal to the d.f.t. 1/2−1/p log−1 T )6/13 = o(T 3/8 ) for positive p, the of eG it at frequency ωs . Since (T above analysis of the Gaussian case implies the existence of an n × m matrix ˜ = op (n−1/3 ) On the other e˜ described in Lemma 5 and such that σ12 (eˆ G − e) G G ˆ ˜ ˆ ˆ ˆ ˜ hand, σ1 (e − e) ≤ σ1 (e − e ) + σ1 (e − e). Hence, to complete the proof, we only need to show that σ12 (eˆ − eˆ G ) = op (n−1/3 ). But we have m n
E|(eˆ − eˆ G )is |2 =
i=1 s=1
m ∞ n i=1 s=1 j=1
≤m
n ∞ i=1 j=1
which is o(n−1/3 ) because supi>0 m
A2ij E|rjs |2
∞ j=1
2 A2ij E max |rjs | s≤m
A2ij < ∞ by Assumption 2(ii)(a) and
n 2 E max |rjs | ≤ mnKm2 T 2/p−1 log2 T = o n−1/3 i=1
s≤m
if n ∼ m = o(T 1/2−1/p log−1 T )6/13 as have been assumed. Therefore, Lemma 2 and Markov’s inequality imply that σ12 (eˆ − eˆ G ) = op (n−1/3 ). Q.E.D. LEMMA 6: Let A(κ) = A + κA(1) , where A ≡ diag(a1 ak 0 0) is an n × n matrix with a1 ≥ a2 ≥ · · · ≥ ak > 0 and A(1) is a symmetric nonnegative (1) definite n × n matrix with lower right (n − k) × (n − k) block A22 . Let r0 = ak /2. (1) Then, for any real κ such that 0 < κ < r0 / A and for any i = 1 2 n − k, we have 2
κ 2 A(1)
λk+i (A(κ)) − κλi A(1) ≤ 22 r0 − κ A(1) PROOF: Let P(κ) and P be the orthogonal projections in Rn on the invariant subspaces of A(κ) and A, respectively, corresponding to their smallest (1) n − k eigenvalues. Then A22 = PA(1) P and the k + ith eigenvalue of A(κ) is
1475
LARGE FACTOR MODELS
the ith eigenvalue of A˜ (1) (κ) ≡ P(κ)A(κ)P(κ) so that by Weyl’s inequalities (1) (see Theorem 4.3.1 of Horn and Johnson (1985)), |λk+i (A(κ)) − κλi (A22 )| ≤ (1) (1) ˜ A (κ) − κPA P . Kato (1980) showed (see formulas 2.37, 2.38, and 2.17 on pp. 81 and 87 with Kato’s T˜ (1) (κ) equivalent to our κ1 A˜ (1) (κ)) that for |κ| < r0 / A(1) , κ (−κ)k A˜ (1) (κ) − κPA(1) P = 2πi k=1 ∞
k+1 R(z) A(1) R(z) z dz
Γ
where R(z) ≡ (A − zIn )−1 and Γ is a positively oriented circle in the complex plane with center 0 and radius r0 . Note that maxz∈Γ R(z) = 1/r0 so that (1) k+1 A (1) k+1 R(z) A R(z) z dz ≤ 2π rk Γ
0
and we have the desired estimate k+1 2 ∞ (1) |κ|2 A(1) |κ|k+1 A(1) A˜ (κ) − κPA(1) P ≤ = r0k r0 − |κ|A(1) k=1 Q.E.D. PROOF OF THEOREM 1: Lemmas 3, 4, and 5 imply that there exists√a matrix e˜ ˜ with i.i.d. NnC (0 2πSne (ω0 )) columns such that |γj − σj2 ((Λˆ 0 Fˆ + e)/ 2πm)| = −2/3 2/3 op (m ) uniformly over j. Then, since σmn ∼ m the asymptotic joint dis−1 (λk+i (X˜ X˜ / tribution of interest in Theorem 1 is the same as that of {σmn (2πm)) − μmn ) i = 1 r} where X˜ ≡ Λˆ 0 Fˆ + e˜ and λj (A) denotes the jth largest eigenvalue of matrix A We will prove that this distribution is the multivariate TW2 . ˆ has all but the first k columns equal Let U be a unitary matrix such that FU ˆ ˆ ˜ as [e˜ 1 e˜ 2 ], where e˜ 1 is n × k and Fˆ1 is to zero. Partition FU as [F1 0] and eU k × k. Then we have a decomposition (Λˆ 0 Fˆ1 + e˜ 1 )(Λˆ 0 Fˆ1 + e˜ 1 ) e˜ 2 e˜ 2 X˜ X˜ = + 2πm 2πm 2πm where matrix e˜ 2 has iid NnC (0 2πSne (ω0 )) columns. Let Vn AVn be a spectral decomposition of ((Λˆ 0 Fˆ1 + e˜ 1 )(Λˆ 0 Fˆ1 + e˜ 1 ) )/(2πmn) so that A is a rank-k diagonal matrix with the first k diagonal elements a1 ≥ a2 ≥ · · · ≥ ak > 0. Then
1476
ALEXEI ONATSKI
we have Vn X˜ X˜ Vn /(2πmn) = A + n1 A(1) , where A(1) ≡ Vn e˜ 2 e˜ 2 Vn /(2πm). De(1) note the matrix of the last n − k rows of Vn e˜ 2 as e˜ 22 so that A22 ≡ e˜ 22 e˜ 22 /(2πm) (1) is the lower right (n − k) × (n − k) block of A . Since ˜ ˜ ˜ ˜ XX V n X X Vn 1 λi = λi n 2πm 2πmn we have, by Lemma 6, (1) 2 2
˜ ˜ A /n
1
λk+i X X − 1 λi A(1) ≤ 22
n 2πm n r0 − A(1) /n whenever 1/n < r0 / A(1) where r0 = ak /2. Note that A(1) ≡ λ1 (e˜ 2 e˜ 2 /(2πm)) ≤ λ1 (e˜ e˜ /(2πm)) = Op (1) by Lemma 1. On the other hand, r0 decreases slower than n−1/3 Indeed, by Weyl’s inequalities for singular values (see Lemma 3), a
1/2 k
ˆ ˆ ˆ ˆ Λ 0 F F Λ0 e˜ 1 e˜ 1 1/2 − λ1 ≥λ 2πmn 2πmn 1/2 k
where λ1 (e˜ 1 e˜ 1 /(2πmn)) ≤ n1 λ1 (e˜ e˜ /(2πm)) = Op ( n1 ) and ˆ ˆ ˆ ˆ ˆ ˆ Λ 0 F F Λ0 1 FF ≥ λk λk (Λˆ 0 Λˆ 0 ) λk 2πmn n 2πm decreases slower than op (n−1/3 ) by Assumptions 1 and 4. Therefore, for large enough n the inequality 1/n < r0 / A(1) is satisfied and (1) 2
˜ ˜ A /n
(1)
X X
λk+i = op n−2/3 − λi A22
≤
2πm r0 − A(1) /n −1 × This implies that, since σmn ∼ m−2/3 ∼ n−2/3 the random variables σmn ˜ ˜ (λk+i (X X /(2πm)) − μmn ), i = 1 r, have the same asymptotic joint dis−1 (λi (e˜ 22 e˜ 22 /(2πm)) − μmn ), i = 1 r. tribution as σmn Now, note that the distribution of e˜ 22 e˜ 22 /(2πm) conditional on Vn is C (m − k (S¯ne (ω0 ))/m), where S¯ne (ω0 ) is obtained from Vn Sne (ω0 )Vn by elimWn−k inating its first k rows and columns. Matrix S¯ne (ω0 ) satisfies an assumption analogous to Assumption 3 for Sne (ω0 ). Precisely, let l¯1n ≥ · · · ≥ l¯n−kn be the eigenvalues of S¯ne (ω0 ), let H¯ n be the spectral distribution of S¯ne (ω0 ), and let c¯mn −1 be the unique root in [0 l¯1n ) of the equation (λc¯mn /(1 − λc¯mn ))2 d H¯ n (λ) = (m − k)/(n − k) Then, as n and m tend to infinity so that m/n remains in a compact subset of (0 ∞), lim sup l¯1n < ∞ lim inf l¯n−kn > 0, and
LARGE FACTOR MODELS
1477
lim sup l¯1n c¯mn < 1. The first two of the latter inequalities follow from Assumption 3 and the fact that, by Theorem 4.3.15 of Horn and Johnson (1985), lk+in ≤ l¯in ≤ lin for n − k ≤ i ≤ 1. The third inequality follows from the fact that c¯mn − cmn = o(1) This fact, and even a stronger result that on function f¯(c) ≡ c¯mn − cmn = O(1/n) can be established by finding bounds 2 ¯ (λc/(1 − λc)) d Hn (λ) in terms of function f (c) ≡ (λc/(1 − λc))2 dHn (λ) (see the Supplementary Appendix for details). Since S¯ne (ω0 ) satisfies an assumption analogous to Assumption 3, Lemma 1 −1 (λi (e˜ 22 e˜ 22 /(2πm)) − μ¯ mn ); i = implies that the joint distribution of {σ¯ mn 1 r} conditional on Vn converges to TW2 , where λc¯mn n−k 1 ¯ μ¯ mn = 1+ d Hn (λ) m − k 1 − λc¯mn c¯mn and σ¯ mn =
3 1/3 λc¯mn n−k 1 ¯ n (λ) 1 + d H (m − k)2/3 c¯mn m−k 1 − λc¯mn
We, however, would like to show that the convergence to TW2 still takes place if we replace σ¯ mn by σmn and μ¯ mn by μmn It is enough to show that −1 −1 −1 (μ¯ mn − μmn ) and (σ¯ mn − σmn )(λi (e˜ 22 e˜ 22 /(2πm)) − μmn ) are op (1). But σ¯ mn the inequality lk+in ≤ l¯in ≤ lin for n − k ≤ i ≤ 1 and the fact that c¯mn − −1 −1 − σmn = O(n−1/3 ). cmn = O(1/n) imply that μ¯ mn − μmn = O(n−1 ) and σ¯ mn −2/3 −1 Then, since σ¯ mn ∼ n we indeed have σ¯ mn (μ¯ mn − μmn ) = o(1). Further, −1 since λi (e˜ 22 e˜ 22 /(2πm)) = Op (1) and μmn = O(1) we indeed have (σ¯ mn − −1 σmn )(λi (e¯ 22 e¯ 22 /m) − μmn ) = op (1). −1 Therefore, we have {σmn (λi (e˜ 22 e˜ 22 /(2πm)) − μmn ); i = 1 r} conditional on Vn converges to TW2 In particular, the conditional probabilities
e˜ 22 e˜ 22
−1 − μmn ≤ xi ; i = 1 r Vn Pr σmn λi 2πm converge to the cumulative distribution function TW2 (x1 xr ) with probability 1. Then, by the dominated convergence theorem, the unconditional probabilities e˜ 22 e˜ 22 −1 − μmn ≤ xi ; i = 1 r Pr σmn λi 2πm which are just the expected values of the conditional probabilities, also converge to TW2 (x1 xr ). Q.E.D. PROOF OF THEOREM 2: The convergence of R to max0
1478
ALEXEI ONATSKI
(γk − γk+1 )/(γk+1 − γk+2 ). Therefore, we only need to show that (γk − γk+1 )/ p (γk+1 − γk+2 ) → ∞. As was shown in the proof of Theorem 1, |γi − λi (X˜ X˜ / (2πm))| = o(n−2/3 ) uniformly in i. Using Weyl’s inequalities for singular values (see Lemma 3), we obtain
ˆ ˆ ˆ ˆ
1/2 X˜ X˜ e˜ e˜ 1/2 Λ0 F F Λ0
1/2
λi ≤ λ − λ i 1
2πm 2πm 2πm for i = 1 n where λ1 (e˜ e˜ /(2πm)) = Op (1) by Lemma 1. Take i = k. By p p Assumption 4, λk (Λˆ 0 Fˆ Fˆ Λˆ 0 /(2πm)) → ∞. Therefore, λk (X˜ X˜ /(2πm)) → ∞ p ˆ ˆ ˆ ˆ and, hence, γk → ∞. Now, take i > k. Then λ1/2 i (Λ0 F F Λ0 /(2πm)) = 0. Therefore, λi (X˜ X˜ /(2πm)) = Op (1) and hence, γi = Op (1). Summing up, p γk − γk+1 → ∞ while γk+1 − γk+2 = Op (1). Hence, (γk − γk+1 )/(γk+1 − p Q.E.D. γk+2 ) → ∞. REFERENCES BAI, J., AND S. NG (2007): “Determining the Number of Primitive Shocks in Factor Models,” Journal of Business & Economic Statistics, 25, 52–60. [1448,1463,1470] BOIVIN, J., M. P. GIANNONI, AND B. MOJON (2009): “How Has the Euro Changed the Monetary Transmission?” in NBER Macroeconomics Annual, Vol. 23, ed. by D. Acemoglu, K. Rogoff, and M. Woodford. Chicago, IL: The University of Chicago Press. [1465-1467,1470] BREITUNG, J., AND S. EICKMEIER (2006): “Dynamic Factor Models,” Allgemeines Statistisches Archiv, 90, 27–42. [1447] BRILLINGER, D. R. (1981): Time Series. Data Analysis and Theory. San Francisco: Holden-Day. [1449,1453,1472] CATTELL, R. B. (1966): “The Scree Test for the Number of Factors,” Multivariate Behavioral Research, 1, 245–276. [1450] CHAMBERLAIN, G., AND M. ROTHSCHILD (1983): “Arbitrage, Factor Structure, and Mean– Variance Analysis on Large Asset Markets,” Econometrica, 51, 1281–1304. [1447,1453,1455] CONNOR, G., AND R. KORAJCZYK (1993): “A Test for the Number of Factors in an Approximate Factor Model,” The Journal of Finance, 58, 1263–1291. [1462,1470] EL KAROUI, N. (2006): “On the Largest Eigenvalue of Wishart Matrices With Identity Covariance When n, p and n/p Tend to Infinity,” Manuscript, Berkeley University. [1455] (2007): “Tracy–Widom Limit for the Largest Eigenvalue of a Large Class of Complex Wishart Matrices,” Annals of Probability, 35, 663–714. [1449,1453] FORNI, M., M. HALLIN, M. LIPPI, AND L. REICHLIN (2000): “The Generalized Dynamic-Factor Model: Identification and Estimation,” The Review of Economics and Statistics, 82, 540–554. [1448,1450-1452] GIANNONE, D., L. REICHLIN, AND L., SALA (2005): “Monetary Policy in Real Time,” in NBER Macroeconomic Annual 2004, ed. by M. Gertler and K. Rogoff. Cambridge: MIT Press, 161–200. [1465] HALLIN, M., AND R. LISKA (2007): “The Generalized Dynamic Factor Model: Determining the Number of Factors,” Journal of the American Statistical Association, 102, 603–617. [1448,1452, 1458,1463,1470] HORN, R. A., AND C. R. JOHNSON (1985): Matrix Analysis. Cambridge: Cambridge University Press. [1470,1471,1473,1475,1477]
LARGE FACTOR MODELS
1479
JACOBS, J. P. A. M., AND P. W. OTTER (2008): “Determining the Number of Factors and Lag Order in Dynamic Factor Models: A Minimum Entropy Approach,” Econometric Reviews, 27, 385–397. [1461] KAPETANIOS, G. (2005): “A Testing Procedure for Determining the Number of Factors in Approximate Factor Models With Large Datasets,” Working Paper 551, Department of Economics Queen Mary University of London. [1462,1470] KATO, T. (1980): Perturbation Theory for Linear Operators. Berlin: Springer-Verlag. [1475] LEWBEL, A. (1991): “The Rank of Demand Systems: Theory and Nonparametric Estimation,” Econometrica, 59, 711–730. [1447] ONATSKI, A. (2008): “The Tracy–Widom Limit for the Largest Eigenvalues of Singular Complex Wishart Matrices,” Annals of Applied Probability, 18, 470–490. [1449,1452-1454,1465] (2009a): “Asymptotics of the Principal Components Estimator of Large Factor Models With Weak Factors,” Manuscript, Columbia University. [1469] (2009b): “Supplement to ‘Testing Hypotheses About the Number of Factors in Large Factor Models’,” Econometrica Supplemental Material, 77, http://econometricsociety. org/ecta/Supmat/6964_data and programs.zip; http://econometricsociety.org/ecta/Supmat/ 6964_proofs.pdf. [1448,1456] PÉCHÉ, S. (2009): “Universality Results for the Largest Eigenvalues of Some Sample Covariance Matrix Ensembles,” Probability Theory and Related Fields, 143, 481–516. [1455] PHILLIPS, P. C. B. (2007): “Unit Root Log Periodogram Regression,” Journal of Econometrics, 138, 104–124. [1473] SOSHNIKOV, A. (2002): “A Note on Universality of the Distribution of the Largest Eigenvalues in Certain Sample Covariance Matrices,” Journal of Statistical Physics, 108, 1033–1056. [1455] STOCK, J., AND M. WATSON (2002): “Macroeconomic Forecasting Using Diffusion Indexes,” Journal of Business & Economic Statistics, 20, 147–162. [1458,1465,1466,1470] (2005): “Implications of Dynamic Factor Models for VAR Analysis,” Manuscript, Princeton University. [1448,1465] TRACY, C. A., AND H. WIDOM (1994): “Level Spacing Distributions and the Airy Kernel,” Communications in Mathematical Physics, 159, 151–174. [1449,1454] UHLIG, H. (2009): “Macroeconomic Dynamics in the Euro Area. Discussion by Harald Uhlig,” in NBER Macroeconomics Annual, Vol. 23, ed. by D. Acemoglu, K. Rogoff, and M. Woodford. Chicago, IL: The University of Chicago Press. [1465,1467]
Economics Department, Columbia University, New York, NY 10027, U.S.A.;
[email protected]. Manuscript received February, 2007; final revision received December, 2008.
Econometrica, Vol. 77, No. 5 (September, 2009), 1481–1512
IDENTIFICATION AND ESTIMATION OF TRIANGULAR SIMULTANEOUS EQUATIONS MODELS WITHOUT ADDITIVITY BY GUIDO W. IMBENS AND WHITNEY K. NEWEY1 This paper uses control variables to identify and estimate models with nonseparable, multidimensional disturbances. Triangular simultaneous equations models are considered, with instruments and disturbances that are independent and a reduced form that is strictly monotonic in a scalar disturbance. Here it is shown that the conditional cumulative distribution function of the endogenous variable given the instruments is a control variable. Also, for any control variable, identification results are given for quantile, average, and policy effects. Bounds are given when a common support assumption is not satisfied. Estimators of identified objects and bounds are provided, and a demand analysis empirical example is given. KEYWORDS: Nonseparable models, control variables, quantile effects, bounds, average derivative, policy effect, nonparametric estimation, demand analysis.
1. INTRODUCTION MODELS WITH ENDOGENEITY are central in econometrics. An intrinsic feature of many of these models, often generating the endogeneity, is nonseparability in disturbances. In this paper we provide identification and estimation results for such models via control variables. These are variables that, when conditioned on, make regressors and disturbances independent. We show that the conditional distribution function of the endogenous variable given the instruments is a control variable in a triangular simultaneous equations model with scalar, continuous endogenous variable and reduced form disturbance, and with instruments independent of disturbances. We also give identification and estimation results for outcome effects when any observable or estimable control variable is present. We focus on models where the dimension of the outcome disturbance is unspecified, allowing for individual heterogeneity and other disturbances in a fully flexible way. Since a nonseparable outcome with a general disturbance is equivalent to treatment effects models, some of our identification results apply there. We give identification and bound results for the outcome quantiles for a fixed value of the endogenous variables. Such quantiles correspond to the outcome 1 This research was partially completed while the second author was a fellow at the Center for Advanced Study in the Behavioral Sciences. The NSF provided partial financial support through Grants SES 0136789 (Imbens) and SES 0136869 (Newey). Versions were presented at seminars in March 2001 and December 2003. We are grateful for comments by S. Athey, L. Benkard, S. Berry, R. Blundell, G. Chamberlain, A. Chesher, J. Heckman, O. Linton, A. Nevo, A. Pakes, J. Powell, and participants at seminars at Stanford, University College London, Harvard, MIT, Northwestern, and Yale. We especially thank R. Blundell for providing the data and initial empirical results.
© 2009 The Econometric Society
DOI: 10.3982/ECTA7108
1482
G. W. IMBENS AND W. K. NEWEY
at quantiles of the disturbance when the outcome is monotonic in a scalar disturbance. More generally, they can be used to characterize how endogenous variables affect the distribution of outcomes. Differences of these quantiles over values of the endogenous regressors correspond to quantile treatment effects as in Lehmann (1974). We give identification and estimation results for these quantile effects under a common support condition. We also derive bounds on quantile treatment effects when the common support condition is not satisfied. Furthermore, we present identification results for averages of linear functionals of the outcome function. Such averages have long been of interest, because they summarize effects for a whole population. Early examples are Chamberlain’s (1984) average response probability, Stoker’s (1986) average derivative, and Stock’s (1989) policy effect.2 We also give identification results for average and quantile policy effects in the triangular model. In addition, we provide a control variable for the triangular model where results of Blundell and Powell (2003, 2004), Wooldridge (2002), Altonji and Matzkin (2005), and Florens, Heckman, Meghir, and Vytlacil (2008) can be applied to identify various effects. We employ a multistep approach to identification and estimation. The first step is construction of the control variable. The second step consists of obtaining the conditional distribution or expectation of the outcome given the endogenous variable and the control variable. Various structural effects are then recovered by averaging over the control variable or the endogenous and control variable together. An important feature of the triangular model is that the joint density of the endogenous variable and the control variable goes to zero at the boundary of the support of the control variable. Consequently, using nonparametric estimators with low sensitivity to edge effects may be important. We describe both locally linear and series estimators, because conventional kernel estimators are known to converge at slower rates in this setting. We give convergence rates for power series estimators. The edge effect also impacts the convergence rate of the estimators. Averaging over the control variable “upweights” the tails relative to the joint distribution. Consequently, unlike the usual results for partial means (e.g., Newey (1994)), such averages do not converge as fast as a smaller dimensional nonparametric regression. Estimators of averages over the joint distribution do not suffer from this “upweighting” and so will converge faster. Furthermore, the convergence rate of estimators that are affected by the upweighting problem will depend on how fast the joint density goes to zero on the boundary. We find that in a Gaussian model that rate is related to the r-squared of the reduced form. In a Gaussian model this leads to convergence rates that are slower than in the additive nonparametric model of Newey, Powell, and Vella (1999). 2 The average derivative results were developed independently of Altonji and Matzkin (2005) in a 2003 version of our paper.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1483
We allow for an outcome disturbance of unspecified dimension, while Chesher (2003) restricts this disturbance to be at most two dimensional. Allowing any dimension has the advantage that the interpretation of effects does not depend on exactly how many disturbances there are, but has the disadvantage that it does not identify effects for particular individuals. Such a trade-off is familiar from the treatment effects literature (e.g., Imbens and Wooldridge (2009)). To allow for general individual heterogeneity, that literature has largely opted for unrestricted dimension. Also, while Chesher (2003) only needs local independence conditions to identify his local effects, we need global ones to identify the global effects we consider. Our control variable results for the triangular model extend the work of Blundell and Powell (2003), who had extended Newey, Powell, and Vella (1999) and Pinkse (2000a) to allow for a nonseparable structural equation and a separable reduced form, to allow both the structural equation and the reduced form to be nonseparable. Chesher (2002) considered identification under index restrictions with multiple disturbances. Ma and Koenker (2006) considered identification and estimation of parametric nonseparable quantile effects using a parametric, quantile based control variable. Our triangular model results require that the endogenous variable be continuously distributed. For a discrete endogenous variable, Chesher (2005) used the assumption of a monotonic, scalar outcome disturbance to develop bounds in a triangular model; see also Imbens (2007). Vytlacil and Yildiz (2007) gave results on identification with a binary endogenous variable under instrumental variable conditions. Imbens and Angrist (1994) and Angrist, Graddy, and Imbens (2000) also allowed for nonseparable disturbances of any dimension, but focused on different effects than those we consider. Chernozhukov and Hansen (2005) and Chernozhukov, Imbens, and Newey (2007) considered identification and estimation of quantile effects without the triangular structure, but with restrictions on the dimension of the disturbances. Das (2001) also allowed for nonseparable disturbances, but considered a single index setting with monotonicity. The independence of disturbances and instruments that we impose is stronger than the conditional mean restriction of Newey and Powell (2003), Das (2004), Darolles, Florens, and Renault (2003), Hall and Horowitz (2005), and Blundell, Chen, and Christensen (2007), but they require an additive disturbance. In Section 2 of the paper we present and motivate our models. Section 3 considers identification. Section 4 describes the estimators and Section 5 gives an empirical example. Some large sample theory is presented in Section 6. 2. THE MODEL The model we consider has an outcome equation (1)
Y = g(X ε)
1484
G. W. IMBENS AND W. K. NEWEY
where X is a vector of observed variables and ε is a general disturbance vector. Here ε often represents individual heterogeneity, which may be correlated with X because X is chosen by the agent corresponding to ε or because X is an equilibrium outcome partially determined by ε. We focus on models where ε has unknown dimension, corresponding to a completely flexible specification of heterogeneity. In a triangular system there is a single endogenous variable X1 included in X along with a vector of exogenous variables Z1 , so that X = (X1 Z1 ) There is also another vector Z2 and a scalar disturbance η such that for Z = (Z1 Z2 ) , the reduced form for X1 is given by (2)
X1 = h(Z η)
where h(Z η) is strictly monotonic in η. Equations (1) and (2) form a triangular pair of nonparametric, nonseparable, simultaneous equations. We refer to equation (2) as the reduced form for X1 , though it could be thought of as a structural equation in a triangular system. This model rules out a nonseparable supply and demand model with one disturbance per equation, because that model would generally have a reduced form with two disturbances in both supply and demand equations. An economic example helps motivate this triangular model. For simplicity, suppose Z1 is absent, so that X1 = X. Let Y denote some outcome such as firm revenue or individual lifetime earnings, let X be chosen by the individual agent, and let ε represent inputs at most partially observed by agents or firms. Here g(x e) is the (educational) production function, with x and e being possible values for X and ε. The agent optimally chooses X by maximizing the expected outcome, minus the costs associated with the value of X, given her information set. Suppose the information set consists of a scalar noisy signal η of the unobserved input ε and a cost shifter Z3 The cost function is c(x z). Then X would be obtained as the solution to the individual choice problem X = arg max E[g(x ε)|η Z] − c(x Z) x
leading to X = h(Z η). Thus, this economic example leads to a triangular system of the above type. When X is schooling and Y is earnings, this example corresponds to models for educational choices with heterogenous returns such as those used by Card (2001) and Das (2001). When X is an input and Y is output, this example is a nonadditive extension of a classical problem in the estimation of production functions, for example, Mundlak (1963). Note the importance of allowing the 3 Although we do not do so in the present example, we could allow the cost to depend on the signal η, if, for example, financial aid was partly tied to test scores.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1485
production function g(x e) to be nonadditive in e (and thus allowing the mar∂g ginal returns ∂x (x ε) to vary with the unobserved heterogeneity). If the objective function g(x e) were additively separable in e, so that g(x ε) = g0 (x) + ε, the optimal level of x would be arg maxx {g0 (x) + E[ε|η] − c(x Z)}. In that case, the solution X would depend on Z, but not on η, and thus X would be exogenous. Hence in these models nonseparability is important for generating endogeneity of choices. Applying monotone comparative statics results from Milgrom and Shannon (1994) and Athey (2002), Das (2001) discussed a number of examples where monotonicity of the decision rule h(Z η) in the signal η is implied by conditions on the economic primitives. For example, assume that g(x e) is twice continuously differentiable. Suppose that (i) the educational production function is strictly increasing in ability e and education x; (ii) the marginal return to formal education is strictly increasing in ability and decreasing in education, so that ∂g/∂e > 0, ∂g/∂x > 0, ∂2 g/∂x ∂e > 0, and ∂2 g/∂x2 < 0 (this would be implied by a Cobb–Douglas production function); (iii) both the cost function and the marginal cost function are increasing in education, so that ∂c/∂x > 0, ∂2 c/∂x2 > 0; and (iv) the signal η and ability ε are affiliated. Under these conditions, the decision rule h(Z η) is monotone in η.4 The approach we adopt to identification and estimation is based on control variables. For the model Y = g(X ε) a control variable is any observable or estimable variable V satisfying the following condition: ASSUMPTION 1—Control Variable: X and ε are independent conditional on V . That is, X is independent of ε once we condition on the control variable V . This assumption makes changes in X causal, once we have conditioned on V , leading to identification of structural effects from the conditional distribution of Y given X and V . In the triangular model of equations (1) and (2), it turns out that under independence of (ε η) and Z a control variable is the uniformly distributed V = FX1 |Z (X1 Z) = Fη (η), where FX1 |Z (x1 z) is the conditional cumulative distribution function (CDF) of X1 given Z and Fη (t) is the CDF of η. Conditional independence occurs because V is a one-to-one function of η and, conditional on η, the variable X1 will only depend on Z. THEOREM 1: In the model of equations (1) and (2), suppose (i) (independence) (ε η) and Z are independent and (ii) (monotonicity) η is a continuously distributed scalar with CDF that is strictly increasing on the support of η and h(Z t) is strictly monotonic in t with probability 1. Then X and ε are independent conditional on V = FX1 |Z (X1 Z) = Fη (η). 4 Of course in this case one may wish to exploit these restrictions on the production function, as in, for example, Matzkin (1993).
1486
G. W. IMBENS AND W. K. NEWEY
In condition (i) we require full independence. In the economic example of Section 2, this assumption could be plausible if the value of the instrument was chosen at a more aggregate level rather than at the level of the agents themselves. State or county level regulations could serve as such instruments, as would natural variation in economic environment conditions, in combination with random location of agents. For independence to be plausible in economic models with optimizing agents, it is also important that the relation between the outcome of interest and the regressor, g(x ε), is distinct from the objective function that is maximized by the economic agent (g(x ε) − c(x z) in the economic example from the previous section), as pointed out in Athey and Stern (1998). To make the instrument correlated with the endogenous regressor, it should enter the latter (e.g., through the cost function), but to make the independence assumption plausible, the instrument should not enter the former. A scalar reduced form disturbance η and monotonicity of h(Z η) is essential to FX1 |Z (X1 Z) being a control variable.5 Otherwise, not all of the endogeneity can be corrected by conditioning on identifiable variables, as discussed in Imbens (2007). Condition (ii) is trivially satisfied if h(z t) is additive in t, but allows for general forms of nonadditive relations. Matzkin (2003) considered nonparametric estimation of h(z t) under conditions (i) and (ii) in a single equation exogenous regressor framework, and Pinkse (2000b) gave a multivariate version. Das (2001) used similar conditions to identify parameters in single index models with a single endogenous regressor. Our identification results that are based on the control variable V = FX1 |Z (X1 Z) = Fη (η) are related to the approach to identification in Chesher (2003). For simplicity, suppose z = z2 is a scalar, so that x = x1 and let QY |XV (τ x v), QY |XZ (τ x z), and QX|Z (τ z) be conditional quantile functions of Y given X and V , of Y given X and Z, and of X given Z, respectively. Also let ∇a denote a partial derivative with respect to a variable a THEOREM 2: If (X FX|Z (X|Z)) is a one-to-one transformation of (X Z), and for z0 x0 τ0 and v0 = FX|Z (x0 z0 ) it is the case that QY |XZ (τ0 x z) and FX|Z (x z) are continuously differentiable in (x z) in a neighborhood of (x0 z0 ), ∇z FX|Z (x0 z0 ) = 0, ∇x FX|Z (x0 z0 ) = 0, then (3)
∇x QY |XV (τ0 x0 v0 ) = ∇x QY |XZ (τ0 x0 z0 ) +
∇z QY |XZ (τ0 x0 z0 ) ∇z QX|Z (v0 z0 )
In the triangular model with two-dimensional ε = (η ξ) for a scalar ξ, Chesher (2003) showed that the right-hand side of (3) is equal to ∂g(x ε)/∂x 5 For scalar X we need scalar η. In a systems generalization we would need η to have the same dimension as X.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1487
under certain local independence conditions. Theorem 2 shows that conditioning on the control variable V = FX|Z (X Z) leads to the same local derivative, in the absence of the triangular model and without any independence restrictions. In this sense, Chesher’s (2003) approach to identification is equivalent to using the control variable V = FX|Z (X Z) but without explicit specification of this variable. Explicit conditioning on V is useful for our results, which involve averaging over V , as discussed below.6 Many of our identification results apply more generally than just to the triangular model. They rely on Assumption 1 holding for any observed or estimable V rather than on the control variable from Theorem 1 for the triangular model. To emphasize this, we will state some results by referring to Assumption 1 rather than to the reduced form equation (2). Identification of structural effects requires that X varies, while holding the control variable V constant. For identification of some effects, we need a strong condition: that the support of the control variable V conditional on X is the same as the marginal support of V . ASSUMPTION 2—Common Support: For all X ∈ X , the support of V conditional on X equals the support of V . To explain, consider the triangular system, where V = FX1 |Z (X1 Z). Here the control variable conditional on X = x = (x1 z1 ) is FX1 |Z (x1 z1 Z2 ). Thus, for Assumption 2 to be satisfied, the instrumental variable Z2 must affect FX1 |Z (x1 z1 Z2 ). This is like the rank condition that is familiar from the linear simultaneous equations model. Also, for Assumption 2 it will be required that Z2 vary sufficiently. To illustrate, suppose z = z2 is a scalar and that the reduced form is X1 = X = πZ + η, where η is continuously distributed with CDF G(u). Then FX|Z (x z) = G(x − πz) Assume that the support of FX|Z (X Z) is [0 1]. Then a necessary condition for Assumption 2 is that π = 0, because otherwise FX|Z (x Z) would be a constant. This is like the rank condition. Together with π = 0 the support of Z being the entire real line will be sufficient for Assumption 2. This example illustrates that Assumption 2 embodies two types of conditions, one being a rank condition and the other being a full support condition. 6 One can obtain an analogous result in a linear quantile model. If the conditional quantile of Y given X and Z is linear in X and Z, and the conditional quantile of X given Z is linear in Z, with residual U, then the Chesher (2003) formula equals the coefficient of X in a linear quantile regression of Y on X and U.
1488
G. W. IMBENS AND W. K. NEWEY
3. IDENTIFICATION In this section we will show identification of several objects and give some bounds. We do this by giving explicit formulas for objects of interest in terms of the distribution of observed data. As is well known, such explict formulas imply identification in the sense of Hurwicz (1950). A main contribution of this paper is to give new identification results for quantile, average, and policy effects. Identification results have previously been given for other objects when there is a control variable, including the average structural function (Blundell and Powell (2003)) and the local average response (Altonji and Matzkin (2005)). For these objects, a contribution of Theorem 1 above is to show that V = FX|Z (X Z) serves as a control variable in the triangular model of equations (1) and (2), and so can be used to identify these other functionals. We focus here on quantile, average, and policy effects. All the results are based on the fact that for any integrable function Λ(y), E[Λ(Y )|X = x V = v] = Λ(g(x e))Fε|XV (de|x v) (4) =
Λ(g(x e))Fε|V (de|v)
where the second equality follows from Assumption 1. Thus, changes in x in E[Λ(Y )|X = x V = v] correspond to changes in x in g(x ε), that is, are structural. This equation has an essential role in the identification and bounds results below. This equation is similar in form to equations on page 1273 in Chamberlain (1984) and equation (2.46) of Blundell and Powell (2003). 3.1. The Quantile Structural Function We define the quantile structural function (QSF) qY (τ x) as the τth quantile of g(x ε). In this definition, x is fixed and ε is what makes g(x ε) random. Note that because of the endogeneity of X, this is in general not equal to the conditional quantile of g(X ε) conditional on X = x, qY |X (τ|x). In treatment effects models, qY (τ x )−qY (τ x ) is the quantile treatment effect of a change in x from x to x ; see Lehmann (1974). When ε is a scalar and g(x ε) is monotonic increasing in ε, then qY (τ x) = g(x qε (τ)), where qε (τ) is the τth quantile of ε. When ε is a vector, then as the value of x changes, so may the values of ε with which the QSF is associated. This feature seems essential to distributional effects when the dimension of ε is unrestricted. To show identification of the QSF, note that equation (4) with Λ(Y ) = 1(Y ≤ y) gives FY |XV (y|x v) = 1(g(x e) ≤ y)Fε|V (de|v) (5)
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1489
Then under Assumption 2 we can integrate over the marginal distribution of V and apply iterated expectations to obtain FY |XV (y|x v)FV (dv) = 1(g(x e) ≤ y)Fε (de) (6) = Pr(g(x ε) ≤ y) def
= G(y x)
Then by the definition of the QSF we have (7)
qY (τ x) = G−1 (τ x)
Thus the QSF is the inverse of FY |XV (y|x v)FV (dv). The role of Assumption 2 is to ensure that FY |XV (y|x v) is identified over the entire support of the marginal distribution of V . We have thus shown the following result: THEOREM 3—Identification of the QSF: In a model where equation (1) and Assumptions 1 and 2 are satisfied, qY (τ x) is identified for all x ∈ X . 3.2. Bounds for the QSF and Average Structural Function Assumption 2 is a rather strong assumption that may only be satisfied on a small set X . In the empirical example below it does appear to hold, but only over part of the range of X. Thus, it would be good to be able to drop Assumption 2. When Assumption 2 is not satisfied but the structural function g(x e) is bounded, one can bound the average structural function (ASF) μ(x) = g(x e)Fε (de). (Identification of μ(x) under Assumptions 1 and 2 was shown by Blundell and Powell (2003).) Let V denote the supportof V , let V (x) denote the support of V conditional on X = x, and let P(x) = V ∩V (x)c FV (dV ). Note that given X = x, the conditional expectation function m(x v) = E[Y |X = ˜ be the identified object: x V = v] is identified for v ∈ V (x). Let μ(x) m(x v)FV (dv) μ(x) ˜ = V (x)
THEOREM 4: If Assumption 1 is satisfied and B ≤ g(x e) ≤ Bu for all x in the support of X and e in the support of ε, then def
def
˜ + B P(x) ≤ μ(x) ≤ μ(x) ˜ + Bu P(x) = μu (x) μ (x) = μ(x) and these bounds are sharp.
1490
G. W. IMBENS AND W. K. NEWEY
One example is the binary choice model where g(x e) ∈ {0 1}. In that case, B = 0 and Bu = 1, so that μ(x) ˜ ≤ μ(x) ≤ μ(x) ˜ + P(x) These same bounds apply to the ASF in the example considered below, where g(x e) is the share of expenditure on a commodity and so is bounded between zero and one. There are also bounds for the QSF. Replacing Y by 1(Y ≤ y) in the bounds for the ASF, and setting B = 0 and Bu = 1 gives a lower bound G (y x) and an upper bound Gu (y x) on the integral of equation (6): G (y x) = (8) Pr(Y ≤ y|X = x V )FV (dV ) V (x)
Gu (y x) = G (y x) + P(x) Assuming that Y is continuously distributed and inverting these bounds, G leads to the bounds for the QSF, given by −∞ τ ≤ P(x), qY (τ x) = (9) (τ x) τ > P(x), G−1 u −1 G (τ x) τ < 1 − P(x), qYu (τ x) = +∞ τ ≥ 1 − P(x). THEOREM 5—Bounds for the QSF: If Assumption 1 is satisfied, then qY (τ x) ≤ qY (τ x) ≤ qYu (τ x) These bounds on the QSF imply bounds on the quantile treatment effects in the usual way. For values x and x we have qY (τ x ) − qYu (τ x ) ≤ qY (τ x ) − qY (τ x ) ≤ qYu (τ x ) − qY (τ x ) These bounds are essentially continuous versions of selection bounds in Manski (1994) and are similar to. See also Heckman and Vytlacil (2000) and Manski (2007). Blundell, Gosling, Ichimura, and Meghir (2007) have refined the Manski (1994) bounds using monotonicity and other restrictions. It should also be possible to refine the bounds here under similar conditions, although that is beyond the scope of this paper. 3.3. Average Effects Assumption 2 is not required for identification of averages over the joint distribution of (X ε). For example, consider the policy effect γ = E g((X) ε) − Y
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1491
where (X) is some known function of X This object is analogous to the policy effect studied by Stock (1989) in the exogenous X case. For example, one might consider a policy that imposes an upper limit x¯ on the choice variable X in the economic model described above. Then, for a single peaked objective function, ¯ Assuming there are it follows that the optimal choice will be (X) = min{X x}. no general equilibrium effects, the average difference of the outcome with and without the constraint will be E[g((X) ε) − Y ]. For this example, rather than Assumption 2, we can assume that the support of (X V ) includes the support of ((X) V ). Then for m(x v) = E[Y |X = x V = v], equation (4) with Λ(Y ) = Y gives (10) E g((X) ε) = E E g((X) ε)|X V
=E g((X) e)Fε|V (de|V ) = E m((X) V ) Then γ = E[m((X) V )] − E[Y ]. Another example is the average derivative δ = E[∂g(X ε)/∂x] This object is like that studied by Stoker (1986) and Powell, Stock, and Stoker (1989) in the context of exogenous regressors. It summarizes the marginal effect of x on g over the population of X and ε In a linear random coefficients model Y = α(ε) + X β(ε), the average derivative is δ = E[β(ε)]. If the struc˜ β0 ε) then tural function satisfies a single index restriction, with g(x ε) = g(x δ will be proportional to β0 . For this example we assume that the derivatives of m(x v) and g(x ε) with respect to x are well defined objects, implying that X and Y are continuous random variables. Then differentiating equation (4) with Λ(Y ) = Y gives (11) ∂m(X V )/∂x = gx (X e)Fε|V (de|V ) for gx (x ε) = ∂g(x ε)/∂x. Then by Assumption 1,
(12) gx (X e)Fε|XV (de|X V ) δ = E[gx (X ε)] = E =E
∂ m(X V ) gx (X e)Fε|V (de|V ) = E ∂x
We give precise identification results for the policy function and average derivative in the following result: THEOREM 6: Consider a model where Assumption 1 is satisfied. If the support of ((X) V ) is a subset of the support of (X V ), then γ = E[g((X) ε) − Y ]
1492
G. W. IMBENS AND W. K. NEWEY
is identified. If (i) X has a continuous conditional distribution given V , (ii) with probability 1, g(x ε) is continuously differentiable in x at x = X, and (iii) for all x and some Δ > 0, E[ sup x−X ≤Δ gx (x ε) Fε|V (dε|V )] exists, then δ = E[gx (X ε)] is identified. Analogous identification results can be formulated for expectations of other linear transformations of g(x ε). Let h(x) denote a function of x and let T (h(·) x) be a transformation that is linear in h. Then, assuming that the order of integration and transformation can be interchanged, we obtain, from equation (4), T (m(· v) x) = T g(· ε)Fε|V (dε|v) x =
T (g(· ε) x)Fε|V (dε|v)
=
T (g(· ε) x)Fε|XV (dε|x v)
= E T (g(· ε) X)|X = x V = v Taking expectations of both sides we find that E T (m(· V ) X) = E E T (g(· ε) X)|X V = E T (g(· ε) X) This formula leads to the following general identification result: THEOREM 7: In a model where Assumption 1 is satisfied, T (m(· V ) X) is a well defined random variable, E[T (m(· V ) X)] exists, and T ( g(· ε)Fε|V (dε| V ) X) = T (g(· ε) X)Fε|V (dε|V ), the object E[T (g(· ε) X)] is identified. Theorem 6 is a special case of this result with T (h(·) x) = ∂h(x)/∂x and T (h(·) x) = h((x)). 3.4. Policy Effects in the Triangular Model In the triangular model one can consider the effects of changes in the X equation h(z v) for X, where X is a scalar and we use the normalization η = ˜ v) denote a new function. Assuming that the change has no effect V .7 Let h(z 7
Steven Berry suggested the subject of this subsection. The policy effects and cost identification considered here are similar in motivation to those of Heckman and Vytlacil (2005, 2008) for their models.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1493
on the distribution of (ε V ), the average outcome given Z = z after the change to h˜ would be ˜ ˜ θ(z) = g(h(z v) e)FεV (de dv) =
˜ g(h(z v) e)Fε|V (de|v) FV (dv)
From equation (4) with Λ(Y ) = Y , we obtain ˜ ˜ θ(z) = m(h(z v) v)FV (dv) An average, conditional policy effect of changing the Y equation from h(z v) ˜ to h(z v) is ˜ ρ(z) ˜ = θ(z) − E[Y |Z = z] An unconditional policy effect of changing both h to h˜ and the distribution of Z to F˜ is ˜ F˜Z (dz) − E[Y ] ρ˜ = θ(z) THEOREM 8: Consider a model where the conditions of Theorem 1 are satisfied ˜ and expectations exist. If the support of (h(z V ) V ) is contained in the suport of ˜ (X V ), then ρ(z) ˜ is identified. Also if the support of (h(z V ) V ) is contained in the support of (X V ) for all z in the support of F˜Z , then ρ˜ is identified. ˜ The previous policy effect γ is a special case of ρ˜ where h(z v) = (h(z v)) Here γ is obtained by integrating over the product of the marginal distributions of (ε V ) and FZ (z), while above it is obtained by integrating over the joint distribution of (X V ε). This difference could lead to different estimators in practice, although it is beyond the scope of this paper to compare their properties. One can also consider analogous quantile effects. Define the conditional ˜ CDF of Y after a change to h(z v) at a given z, to be
˜ ˜ z) = 1 g(h(z J(y v) ε) ≤ y FεV (dε dv) It follows similarly to previous results that this object is identified from ˜ ˜ z) = FY |XV (y|h(z v) v)FV (dv) J(y
1494
G. W. IMBENS AND W. K. NEWEY
The τth conditional quantile of Y following the change is Q˜ Y |Z (τ z) = J˜−1 (τ z) A quantile policy effect is Q˜ Y |Z (τ z) − QY |Z (τ z) An unconditional policy effect that includes a change in the CDF of Z to F˜ is −1 ˜ ˜ ˜ ˜ ˜ z)F˜Z (dz) QY (τ) − QY (τ) QY (τ) = J (y) J(y) = J(y where QY (τ) is the τth quantile of Y . THEOREM 9: Consider a model where the conditions of Theorem 1 are sat˜ isfied. If the support of (h(z V ) V ) is contained in the suport of (X V ), then ˜ ˜ V ) V ) is contained in the QY |Z (τ z) is identified. Also if the support of (h(z ˜ support of (X V ) for all z in the support of FZ , then Q˜ Y (τ) is identified. In the economic model of Section 2, a possible choice of a changed ˜ h(z v) corresponds to a shift in the cost function. Note that for a given x we have E[g(x ε)|V = v Z = z] = E[g(x ε)|V = v] = m(x v) ˜ ˜ Then for an alternative cost function c(x z), the value h(z v) of x that maximizes the objective function would be ˜ ˜ h(z v) = arg max{m(x v) − c(x z)} x
Also, it may be desireable to specify c˜ relative to the cost function c(x z) identified from the data. The cost function is identified, up to an additve function of Z, by the first-order conditions ∂c(X Z) ∂m(X V ) = −1 ∂x ∂x V =h (XZ) 4. ESTIMATION We follow a multistep approach to estimation from independent and identically distributed (i.i.d.) data (Yi Xi Zi ) (i = 1 n). The first step is estimation of the control variable observations Vi by Vˆi . Details of this step depend
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1495
on the form of the control variable. For the triangular simultaneous equations system, we can form Vˆi = FˆX1 |Z (X1i Zi ) where FˆX1 |Z (x1 z) is an estimator of the conditional CDF of X1 given Z. These estimates can then be used to construct an estimator FˆY |XV (y|x v) of ˆ FY |XV (y|x v) or an estimator m(x v) of E[Y |X V ], where Vˆi is used in place of Vi . Estimators of objects of interest can then be formed by plugging these estimators into the formulae of Section 3, replacing integrals with sample averages. An estimator of the QSF is given by ˆ −1 (y x) qˆ Y (τ x) = G
1 ˆ ˆ G(y x) = FY |XV (y|x Vˆi ) n i=1 n
In the triangular simultaneous equations model, where Vi is known to be uniformly distributed, the sample averages can be replaced by integrals over the uniform distribution (or simulation estimators of these integrals). Estimators of the policy effect and average derivative can be constructed by plugging in the formulae and replacing the expectation over (X V ) with a sample average, as in 1 ˆ ˆ m((X i ) Vi ) − Yi n i=1 n
γˆ =
ˆ i Vˆi ) 1 ∂m(X δˆ = n i=1 ∂x n
When Assumption 2 is not satisfied, the bounds for the ASF and QSF can be estimated in a similar way. An estimator Vˆ (x) of the support of V conditional on X is needed for these bounds. One can form that as Vˆ (x) = V : fˆV |X (v|x) ≥ δn V ∈ Vˆ where δn is a trimming parameter and Vˆ is an estimator of the support V of V containing all Vˆi . In some cases V may be known, as for the triangular model where V = [0 1]. Estimates of the ASF bounds can then be formed as sample analogs: ˆ μ(x) ˜ + B P(x) μˆ (x) =
ˆ μ(x) ˜ + Bu P(x) μˆ u (x) =
1 ˆ ˆ ˆ 1(Vi ∈ V (x))m(x Vˆi ) μ(x) ˜ = n i=1 n
1 ˆ ˆ ˆ P(x) = 1(Vi ∈ / V (x)) n i=1 n
1496
G. W. IMBENS AND W. K. NEWEY
Bounds for the QSF can be formed in an analogous way. Estimates of the upper and the lower bounds on G(y x) can be constructed as ˆ (y x) = G
n
1(Vˆi ∈ Vˆ (x))FˆY |XV (y|x Vˆi )/n
i=1
ˆ u (y x) = G ˆ (y x) + P(x) ˆ G ˆ (y x) is strictly increasing in y, we then can compute the Assuming that G ˆ (y x) and P(x) ˆ bounds for the QSF by plugging G into equation (9) to obtain ˆ −∞ τ ≤ P(x), qˆ Y (τ x) = −1 ˆ (τ x) τ > P(x), ˆ G u ˆ −1 ˆ G τ < 1 − P(x), (τ x) qˆ Yu (τ x) = ˆ +∞ τ ≥ 1 − P(x). To implement these estimators we need to be specific about each of their components, including the needed nonparametric regression estimators. Our choice of regression estimators is influenced by the potential importance of edge effects. For example, an important feature of the triangular model is that the joint density of (X V ) may go to zero on the boundary of the support of V . This can easily be seen when the reduced form is linear. Suppose that X1 = X = Z + η, and that the support of Z and η is the entire real line. Let fZ (z) and Fη (t) be the marginal probability density function (PDF) and CDF of Z and η, respectively. The joint PDF of (X V ) is fXV (x v) = fZ (x − Fη−1 (v))
0 < v < 1
Although V has a uniform marginal distribution, the joint PDF goes to zero as v goes to zero or one. In the Gaussian Z and η case, we can be specific about the rate of decrease of the joint density, as shown by the following result: LEMMA 10: If X = Z + η, where Z and η are normally distributed and independent, then for R2 = Var(Z)/[Var(X)] and α¯ = (1 − R2 )/R2 , for any B, δ > 0, there exists C such that for all |x| ≤ B, v ∈ [0 1] ¯ ¯ C[v(1 − v)]α−δ ≥ fXV (x v) ≥ C −1 [v(1 − v)]α+δ
Here the rate at which the joint density goes to zero at the boundary is a power of v that increases as the reduced form r-squared falls. Thus, the lower is the r-squared of the reduced form, the less tail information there is about the control variable V .
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1497
Locally linear regression estimators and series estimators are known to be less sensitive to edge effects than kernel estimators, so we focus on these. For instance, Hengartner and Linton (1996) showed that locally linear estimators have optimal convergence rates when regressor densities can go to zero, and kernel estimators do not. We will consider estimators that use the same method in both first and second stages. We also smooth out the indicator functions that appear as the left-hand side variables in these estimators, as has been suggested by Yu and Jones (1998). To facilitate describing both steps of each estimator, we establish some additional notation. For a random variable Y and an r × 1 random vector W , let (Yi Wˆ i ) denote a sample where the observations on W may be estimated. We will let aˆ hY (w) denote the locally linear estimator with bandwidth h, of E[Y |W = w]. For a kernel function K(u), let Kˆ ih (w) = K((w − Wˆ i )/ h) and Sˆ0w =
n
Kˆ ih (w)
i=1
Sˆ1w =
n
Kˆ ih (w)(w − Wˆ i )
i=1
Sˆ2w =
n
Kˆ ih (w)(w − Wˆ i )(w − Wˆ i )
i=1
Then aˆ hY (w) = (Sˆ0w − Sˆ1w (Sˆ2w )−1 Sˆ1w )−1 n n h w ˆ w −1 h ˆ ˆ ˆ ˆ × Ki (w)Yi − S1 (S2 ) Ki (w)(w − Wi )Yi i=1
i=1
For the first stage of the locally linear estimator, we also smooth the indicator function in FX1 |Z (x|z) = E[1(X1i ≤ x)|Zi = z]. Let b1 be a positive scalar bandwidth and let Φ(x) be a CDF for a scalar x, so that Φ(x/b1 ) is a smooth approximation to the indicator function. The estimator is a locally linear estimator where w = z and Y = Φ((x − X1 )/b1 ). For observations (X1i Zi ) i = 1 n, on X1 and Z and a positive bandwidth h1 , an estimator of FX1 |Z (x|z) is h1 FˆX1 |Z (x z) = aˆ Φ((x−X (z) 1 )/b1 )
Then Vˆi = FˆX1 |Z (X1i Zi ) (i = 1 n). For the second step, let w = (x v) Wˆ i = (Xi Vˆi ), b2 , and h2 be bandwidths. We also use Φ(x/b2 ) to approximate the indicator function for the conditional CDF estimator. The estimators will
1498
G. W. IMBENS AND W. K. NEWEY
be locally linear estimators where Y = Φ((y − Y )/b2 ) or just Y = Y These are given by h2 FˆY |XV (y|x v) = aˆ Φ((y−Y )/b2 ) (x v)
h
ˆ m(x v) = aˆ Y2 (x v)
Evidently these estimators depend on the bandwidths b1 h1 b2 , and h2 . Derivation of optimal bandwidths is beyond the scope of this paper, but we do consider sensitivity to their choice in the application. To describe a series estimator of E[Y |W = w] for any random vector W , let pK (w) = (p1K (w) pKK (w)) denote a K × 1 vector of approximating functions, such as power series or splines, and let pi = pK (Wˆ i ) Let a˜ KY (w) denote the series estimator obtained as the predicted value from regressing Yi on pi , that is, n − n K K pi pi pi Yi a˜ Y (w) = p (w) i=1
i=1
where A− denotes any generalized inverse of the matrix A Let τ(u) denote the CDF for a uniform distribution. Then a series estimator of the observations on the control variable is given by choosing w = z and calculating
K1 (z) FˆX1 |Z (x1 z) = τ a˜ 1(X 1 ≤x) Then Vˆi = FˆX1 |Z (X1i Zi ) (i = 1 n). For the second stage, let w = (X V ) Wˆ i = (Xi Vˆi ), b2 be a bandwidth and let K2 be a number of terms to be used in approximating functions of w = (X V ). Then series estimators of the conditional CDF FY |XV (y|x v) and the conditional expectation E[Y |X V ] are given by K2 FˆY |XV (y|x v) = a˜ Φ((y−Y )/b2 ) (x v)
K
ˆ m(x v) = a˜ Y 2 (x v)
Evidently these estimators depend on the bandwidth b2 and the number of approximating functions K1 and K2 . Derivation of optimal values for these tuning parameters is beyond the scope of this paper. 5. AN APPLICATION In this section we consider an application to estimation of a triangular simultaneous equations model for Engel curves. Here Y will be the share of expenditure on a commodity and X will be the log of total expenditure. We use as an instrument gross earnings of the head of household. This instrument also was used by Blundell, Chen, and Christensen (2007), who motivated it by separability of household saving and consumption decisions. In the application, we estimate the QSF and the ASF when Y is the share of expenditure
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1499
on either food or leisure. Here we may interpret the QSF as giving quantiles, across heterogenous individuals, of individual Engel curves. This interpretation depends on ε solely representing heterogeneity and no other source of randomness, such as measurement error. The data (and this description) are similar to those considered in Blundell, Chen, and Christensen (2007). The data are taken from the British Family Expenditure Survey for the year 1995. To keep some demographic homogeneity, the data are a subset of married and cohabitating couples where the head of the household is aged between 20 and 55, and those couples with three or more children are excluded. Unlike Blundell, Chen, and Christensen (2007), we do not include number of children as covariates. In this application, we exclude households where the head of household is unemployed so as to have the instrument Z available. This earnings variable is the amount that the male of the household earned in the chosen year before taxes. This leaves us with 1655 observations. In this application we use locally linear estimators as described earlier. We use Silverman’s (1986) density bandwidth throughout and carry out some sensitivity checks. We also check sensitivity of the results to the choice δn used in the bounds. As previously discussed, an important identification concern is over what values of X the common support condition might be satisfied. Similarly to the rank condition in linear models, the common support condition can be checked by examining the data. We do so in Figure 1, which gives a graph of level sets of a joint kernel density estimator for (X V ) based on Xi and the control variable estimates Vˆi = FˆX|Z (Xi |Zi ). This figure suggests that Assumption 2 may be satisfied only over a narrow range of X values, so it may be important to allow for bounds. For comparison purposes, we first give graphs of the QSF for food and leisure expenditure, respectively, assuming that the common support condition is satisfied.8 Figures 2 and 3 report graphs of these functions for the quartiles. These graphs have the shape one has come to expect of Engel curves for these commodities. In comparing the curves, it is interesting to note that there is evidence of substantial asymmetry for the leisure expenditure. There is more of a shift toward leisure at the upper quantiles of the expenditure. There is less evidence of asymmetry for food expenditure. Turning now to the bounds, we chose δn so that the probability that a Gaussian PDF (with mean equal to the sample mean μˆ of X and variance equal to the sample variance σˆ 2 ) exceeds δn is 0.975. This δn satisfies the equation φ((t − μ)/ ˆ σ) ˆ dt = 0975 φ((t−μ)/ ˆ σ)≥δ ˆ n
8
These graphs were initially derived by Richard Blundell.
1500
G. W. IMBENS AND W. K. NEWEY
FIGURE 1.—Level curves of estimated joint density of X and Vˆ .
ˆ ˆ Figure 4 graphs the P(x) for this δn . The bounds coincide when P(x) = 0, but differ when it is nonzero. Here we find that the bounds will coincide only over a small interval of x values. Figures 5 and 6 graph bounds for the median QSF for food and leisure, along with an estimator of the marginal PDF of total expenditure X. Here we find that even though the upper and lower bounds coincide only over a small range, they are quite informative. We also carried out some sensitivity analysis. We found that the QSF estimates are not very sensitive to the choice of bandwidth. Also, increasing δn
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
FIGURE 2.—Food QSF.
FIGURE 3.—Leisure QSF.
1501
1502
G. W. IMBENS AND W. K. NEWEY
FIGURE 4.—P(x) for bounds.
does widen the bounds appreciably, although δn does not have to increase ˆ much before P(x) is nonzero for all x 6. ASYMPTOTIC THEORY We have presented two kinds of estimators for a variety of functionals. A full account of asymptotic theory for all these cases is beyond the scope of this paper. As an example, we give asymptotic theory for a power series estimator of the ASF in the triangular model. Here we assume that the order of the approximating functions, that is, the sum of the exponents of the powers in pkK (w), is increasing in K, with all terms of a given order included before increasing the order. Results for the power series estimators are used to highlight two important features of the estimation problem that arise from the fact that the joint density of X and V goes to zero on the boundary of the control variable. One feature is that the rate of convergence of the ASF will depend on how fast the density goes to zero, since the ASF integrates over the control variable. The other feature is that the ASF does not necessarily converge at the same rate as a regression of Y on just X. In other words, unlike, in Newey (1994), integrating over a conditioning variable does not lead to a rate that is the same as if that variable was not present.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1503
FIGURE 5.—Density of total expenditure and median bounds for food.
The convergence rates of the estimators will depend on certain smoothness restrictions. The next assumption imposes smoothness conditions on the control variable. ASSUMPTION 3: Zi ∈ r1 has compact support and FX1 |Z (x1 z) is continuously differentiable of order d1 on the support with derivatives uniformly bounded in x and z. −d /r
This condition implies an approximation rate of K1 1 1 for the CDF that is uniform in both its arguments; see Lorentz (1986). The following result gives a convergence rate for the first step: LEMMA 11: If the conditions of Theorem 1 and Assumption 3 are satisfied, then n
1−2d /r (Vˆi − Vi )2 /n = O K1 /n + K1 1 1 E i=1
1504
G. W. IMBENS AND W. K. NEWEY
FIGURE 6.—Density of total expenditure and median bounds for leisure share.
The two terms in this rate result are variance (K1 /n) and squared bias 1−2d /r (K1 1 1 ) terms, respectively. In comparison with previous results for series 1−2d /r estimators, this convergence result has K1 1 for the squared bias term as a −2d1 /r rate rather than K1 . The extra K1 arises from the predicted values Vˆi being based on regressions with the dependent variables varying over the observations. To obtain convergence rates for series estimators, it is necessary to restrict the rate at which the density goes to zero as V approaches zero or one. The next condition fulfills this purpose. Let w = (x v) and let X denote the support of X . ASSUMPTION 4: X is a Cartesian product of compact intervals, pK2 (w) = p (x) ⊗ pKV (v) and there exist constants C α > 0 such that Kx
inf fXV (x v) ≥ C[v(1 − v)]α
x∈X
The next condition imposes smoothness of m(w), so as to obtain an approximation rate for the second step.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1505
ASSUMPTION 5: m(w) is continuously differentiable of order d2 on X × [0 1] ⊂ r2 Note that w is an r2 × 1 vector, so that x is an (r2 − 1) × 1 vector. Next, we bound the conditional variance of Y , as is often done for series estimators. ASSUMPTION 6: Var(Y |X1 Z) is bounded. With these conditions in place, we can obtain a convergence rate bound for the second-step estimator. THEOREM 12: If the conditions of Theorem 1 and Assumptions 3–6 are satis1−2d /r fied, and K22 KVα+2 (K1 /n + K1 1 ) → 0, then
−2d /r 1−2d /r ˆ [m(w) − m(w)]2 dF(w) = Op K2 /n + K2 2 2 + K1 /n + K1 1 1 ˆ − m(w)| sup |m(w) w∈W
−2d /r 1−2d /r 1/2 = Op KVα K2 K2 /n + K2 2 2 + K1 /n + K1 1 1
This result gives both mean-square and uniform convergence rates for ˆ m(x V ). It is interesting to note that the mean-square rate is the sum of the first-step convergence rate and the rate that would obtain for the second step if the first step was known. This result is similar to that of Newey, Powell, and Vella (1999), and results from conditioning on the first step in the second-step regression. Also, the first-step and the second-step rates are each the sum of a variance term and a squared bias term. The following result gives an upper bound on the rate of convergence for the 1 ˆ ASF estimator μ(x) ˆ = 0 m(x v) dv. THEOREM 13: If the conditions of Theorem 1 and Assumptions 3–6 are satis1−2d /r fied, and K22 KV2+2α (K1 /n + K1 1 1 ) −→ 0, then [μ(x) ˆ − μ(x)]2 FX (dx)
−2d /r 1−2d /r = Op KV2+2α Kx /n + K2 2 2 + K1 /n + K1 1 1 To interpret this result, we can use the fact that all terms of a given order are added before increasing the order to say that there is a constant C with r r −1 K2 ≥ KV2 /C and Kx ≤ CKV2 . In that case, we will have [μ(x) ˆ − μ(x)]2 FX (dx)
r +1+2α
−2d 1−2d /r = Op KV2 /n + KV2+2α KV 2 + K1 /n + K1 1 1
1506
G. W. IMBENS AND W. K. NEWEY
The choice of KV and K1 minimizing this expression is proportional to n1/(2d2 +r2 −1) and nr1 /2d1 , respectively. For this choice of KV and K1 , the rate hypothesis and the convergence rate are given by r1 2(r2 + α + 1) + < 1 2d2 + r2 + 1 2d1 [μ(x) ˆ − μ(x)]2 FX (dx)
= Op n2(2+α−d2 )/(2d2 +r2 −1) + n[2(1+α)/(2d2 +r2 −1)]+(r1 /2d1 )−1 The inequality requires that m(w) have more than (1 + 2α + r2 )/2 derivatives and that FX1 |Z (x1 |z) have more than r1 /2 derivatives. One can compare convergence rates of estimators in a model where several estimators are consistent. One such model is the additive disturbance model Y = g(X) + ε Z ∼ N(0 1)
X = Z + η η ∼ N(0 (1 − R2 )/R2 )
where X, Z, ε, and η are scalars and we normalize E[ε] = 0. Here additive triangular and nonparametric instrumental variable estimators will be consistent, in addition to the triangular nonseparable estimators given here. Suppose that the object of estimation is the ASF g(x). Under regularity conditions like those give above, the estimator will converge at a rate that is a power of n, but slower than the optimal one-dimensional rate. In contrast, the estimator of Newey, Powell, and Vella (1999), which imposes additivity, does converge at the optimal one-dimensional rate. Also, estimators of g(x) that only use E[ε|Z] = 0 will converge at a rate that is slower than any power of n (e.g., see Severini and Tripathi (2006)). Thus, the convergence rate we have obtained here is intermediate between that of an estimator that imposes additivity and one that is based just on the conditional mean restriction. 7. CONCLUSION The identification and bounds results for the QSF, ASF, and policy effects also apply to settings with an observable control variable V , in addition to the triangular model. For example, the set up with Y = g(X ε) for X ∈ {0 1} and unrestricted ε, and Assumption 1 for an observable V is a well known treatment effects model, where Assumption 1 is referred to as unconfoundedness or selection on observables (e.g., Imbens (2004), Heckman, Ichimura, Smith, and Todd (1998)). The QSF and other identification and bounds apply to this model, and to generalizations where X takes on more than two values.
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1507
APPENDIX PROOF OF THEOREM 1: Let h−1 (x z) denote the inverse function for h(z η) in its first argument, which exists by condition (ii). Then, as shown in the proof of Lemma 1 of Matzkin (2003), FX1 |Z (x z) = Pr(X1 ≤ x|Z = z) = Pr(h(z η) ≤ x|Z = z) = Pr(η ≤ h−1 (x z)|Z = z) = Pr(η ≤ h−1 (x z)) = Fη (h−1 (x z)) By condition (ii), η = h−1 (X1 Z), so that plugging in gives V = FX1 |Z (X1 Z) = Fη (h−1 (X1 Z)) = Fη (η) By Fη strictly monotonic on the support of η, the sigma algebra associated with η is equal to that associated with V = Fη (η) so that conditional expectations given η are identical to those given V Also, for any bounded function a(X), by independence of Z and (ε η), E[a(X)|η ε] = a(h(z η))FZ (dz) = E[a(X)|η] Therefore, for any bounded function b(ε), we have E[a(X)b(ε)|V ] = E b(ε)E[a(X)|η ε]|η = E b(ε)E[a(X)|η]|η = E[b(ε)|η]E[a(X)|η]
Q.E.D.
PROOF OF THEOREM 2: Define V = FX|Z (X Z) and let (X k(V X)) denote the inverse of (X FX|Z (X|Z)), so that Z = k(V X). It then follows by (X V ) and (X Z) being one-to-one transformations of each other that QY |XV (τ X V ) = QY |XZ (τ X Z) = QY |XZ (τ X k(V X)) Also, by the inverse function theorem, QX|Z (τ z) is differentiable at (v0 z0 ) and k(v x) is differentiable at (v0 x0 ) with ∇x k(v0 x0 ) = −∇x FX|Z (x0 z0 )/∇z FX|Z (x0 z0 ) = 1/∇z QX|Z (v0 z0 ) Then by the chain rule, ∇x QY |XV (τ0 x0 v0 )
∂ QY |XZ (τ0 x k(v0 x)) = ∂x x=x0
1508
G. W. IMBENS AND W. K. NEWEY
= ∇x QY |XZ (τ0 x0 z0 ) + ∇z QY |XZ (τ0 x0 z0 )∇x k(v0 x0 ) = ∇x QY |XZ (τ0 x0 z0 ) +
∇z QY |XZ (τ0 x0 z0 ) ∇z QX|Z (v0 z0 )
Q.E.D.
PROOF OF THEOREM 3: By Assumption 2, the support of V conditional on X = x equals the support V of V so that Pr(Y ≤ y|X = x V ) is unique with probability 1 on X × V . The conclusion then follows by the derivation in the text. Q.E.D. PROOF OF THEOREM 4: By the definition of V (x) and Assumption 1, on a set of x with probability 1 integrating equation (4) gives g(x e)Fε|V (de|V )FV (dV ) m(x v)FV (dv) = V (x)
V (x)
Also by B ≤ g(x e) ≤ Bu it follows that B ≤ g(x e)Fε|V (de|V ) ≤ Bu so that g(x e)Fε|V (de|V )FV (dV ) ≤ Bu P(x) B P(x) ≤ V ∩V (x)c
Summing up these two equations and applying iterated expectations gives μ (x) ≤ g(x e)Fε|V (de|V )FV (dV ) = μ(x) ≤ μu (x) V
To see that the bound is sharp, let ε = V and m(x V ) V ∈ V (x), u g (x ε) = V ∈ / V (x). Bu By ε constant given V , ε is independent of X conditional on V . Then μ(x) = μu (x) Defining g (x ε) similarly, with B replacing Bu , gives μ(x) = μ (x). Q.E.D. PROOF OF THEOREM 5: Note first that by Assumption 1, G (y x) = Pr(Y ≤ y|X = x V = v)FV (dv) = =
V (x)
V (x)
V (x)
Pr(g(x ε) ≤ y|X = x V = v)FV (dv) Pr(g(x ε) ≤ y|V = v)FV (dv)
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1509
Then by Pr(g(x ε) ≤ y|V ) ≥ 0 we have G (y x) ≤ Pr(g(x ε) ≤ y|V = v)FV (dv) = G(y x) Also by Pr(g(x ε) ≤ y|V ) ≤ 1, we have Pr(g(x ε) ≤ y|V = v)FV (dv) G(y x) = G (y x) + ≤ G (y x) +
V ∩V (x)c
V ∩V (x)c
FV (dv) = Gu (y x) Q.E.D.
The conclusion then follows by inverting.
PROOF OF THEOREM 6: By the fact that g(x ε) is continuously differentiable and the integrability condition, it follows that m(x v) is differentiable and equation (11) is satisfied. Then by equation (12), the average derivative is an explicit functional of the data distribution and so is identified. For the policy effect, by the assumption about it, it follows that m(x v) is well defined, with probability 1, at (x v) = ((X) V ), so that the conclusion follows as in equation (10). Q.E.D. PROOF OF THEOREM 7: By equation (4), m(X V ) = E[Y |X V ] = g(X e)Fε|V (de|V ). Then by T ( g(· e)Fε|V (de|V ) X) = T (g(· e) X) × Fε|V (de|V ) and iterated expectations,
E T (m(· V ) X) = E T g(· e)Fε|V (de|V ) X =E
T (g(· e) X)Fε|V (de|V )
= E E T (g(· ε) X)|V X = E T (g(· ε) X) Since E[T (g(· ε) X)] is equal to an explicit function of the data distribution, it is identified. Q.E.D. The proofs of Theorems 8 and 9 were sketched in the text. The proofs of Lemmas 10 and 11 and Theorems 12 and 13 are given in the Supplemental Material (Imbens and Newey (2009)). REFERENCES ALTONJI, J., AND R. MATZKIN (2005): “Cross Section and Panel Data Estimators for Nonseparable Models With Endogenous Regressors,” Econometrica, 73, 1053–1102. [1482,1488]
1510
G. W. IMBENS AND W. K. NEWEY
ANGRIST, J., K. GRADDY, AND G. W. IMBENS (2000): “The Interpretation of Instrumental Variable Estimators in Simultaneous Equations Models With an Application to the Demand for Fish,” Review of Economic Studies, 67, 499–527. [1483] ATHEY, S. (2002): “Monotone Comparative Statics Under Uncertainty,” Quarterly Journal of Economics, 187–223. [1485] ATHEY, S., AND S. STERN (1998): “An Empirical Framework for Testing Theories About Complementarity in Organizational Design,” Working Paper 6600, NBER. [1486] BLUNDELL, R., AND J. L. POWELL (2003): “Endogeneity in Nonparametric and Semiparametric Regression Models,” in Advances in Economics and Econometrics, Vol. II, ed. by M. Dewatripont, L. Hansen, and S. Turnovsky. Cambridge: Cambridge University Press, 312–357. [1482,1483,1488,1489] (2004): “Endogeneity in Semiparametric Binary Response Models,” Review of Economic Studies, 71, 581–913. [1482] BLUNDELL, R., X. CHEN, AND D. CHRISTENSEN (2007): “Semi-Nonparametric IV Estimation of Shape-Invariant Engel Curves,” Econometrica, 75, 1613–1669. [1483,1498,1499] BLUNDELL, R., A. GOSLING, H. ICHIMURA, AND C. MEGHIR (2007): “Changes in the Distribution of Male and Female Wages Accounting for Unemployment Using Bounds,” Econometrica, 75, 323–363. [1490] CARD, D. (2001): “Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems,” Econometrica, 69, 1127–1160. [1484] CHAMBERLAIN, G. (1984): “Panel Data,” in Handbook of Econometrics, Vol. 2, ed. by Z. Griliches and M. Intrilligator. New York: North-Holland. [1482,1488] CHERNOZHUKOV, V., AND C. HANSEN (2005): “An IV Model of Quantile Treatment Effects,” Econometrica, 73, 1127–1160. [1483] CHERNOZHUKOV, V., G. IMBENS, AND W. NEWEY (2007): “Instrumental Variable Estimation of Nonseparable Models,” Journal of Econometrics, 139, 4–14. [1483] CHESHER, A. (2002): “Semiparametric Identification in Duration Models,” Working Paper CWP20/02, Cemmap. [1483] (2003): “Identification in Nonseparable Models,” Econometrica, 71, 1405–1441. [1483, 1486,1487] (2005): “Nonparametric Identification Under Discrete Variation,” Econometrica, 73, 1525–1550. [1483] DAROLLES, S., J.-P. FLORENS, AND E. RENAULT (2003): “Nonparametric Instrumental Regression,” Working Paper, Toulouse University. [1483] DAS, M. (2001): “Monotone Comparative Statics and the Estimation of Behavioral Parameters,” Working Paper, Department of Economics, Columbia University. [1483-1486] (2004): “Instrumental Variable Estimators for Nonparametric Models With Discrete Endogenous Regressors,” Journal of Econometrics, 124, 335–361. [1483] FLORENS, J. P., J. J. HECKMAN, C. MEGHIR, AND E. VYTLACIL (2008): “Identification of Treatment Effects Using Control Functions in Models With Continuous, Endogenous Treatment and Heterogeneous Effects,” Econometrica, 76, 1191–1206. [1482] HALL, P., AND J. L. HOROWITZ (2005): “Nonparametric Methods for Inference in the Presence of Instrumental Variables,” The Annals of Statistics, 33, 2904–2929. [1483] HECKMAN, J. J., AND E. J. VYTLACIL (2000): “Local Instrumental Variables,” in Nonlinear Statististical Modeling, ed. by C. Hsiao, K. Morimune, and J. L. Powell. Cambridge: Cambridge University Press. [1490] (2005): “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” Econometrica, 73, 669–738. [1492] (2008): “Generalized Roy Model and Cost Benefit Analysis of Social Programs,” Working Paper, Yale University. [1492] HECKMAN, J. J., H. ICHIMURA, J. SMITH, AND P. 
TODD (1998): “Characterizing Selection Bias Using Experimental Data,” Econometrica, 66, 1017–1098. [1506]
TRIANGULAR SIMULTANEOUS EQUATIONS MODELS
1511
HENGARTNER, N. W., AND O. B. LINTON (1996): “Nonparametric Regression Estimation at Design Poles and Zeros,” Canadian Journal of Statistics, 24, 583–591. [1497] HURWICZ, L. (1950): “Generalization of the Concept of Identification,” in Statistical Inference in Dynamic Ecnomic Models, Cowles Comission Monograph, Vol. 10. New York: Wiley. [1488] IMBENS, G. (2004): “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” Review of Economics and Statistics, 86, 1–29. [1506] (2007): “Nonadditive Models With Endogenous Regressors,” in Advances in Economic and Econometrics, Theory and Applications, ed. by R. Blundell, W. Newey, and T. Persson. Cambridge: Cambridge University Press. [1483,1486] IMBENS, G. W., AND J. ANGRIST (1994): “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62, 467–476. [1483] IMBENS, G. W., AND W. K. NEWEY (2009): “Supplement to ‘Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity’,” Econometrica Supplemental Material, 77, http://www.econometricsociety.org/ecta/Supmat/7108_Proofs.pdf. [1509] IMBENS, G. W., AND J. WOOLDRIDGE (2009): “Recent Developments in the Econometrics of Program Evaluation,” Journal of Economic Literature, 47, 5–86. [1483] LEHMANN, E. L. (1974): Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day. [1482,1488] LORENTZ, G. (1986): Approximation of Functions. New York: Chelsea Publishing Company. [1503] MA, L., AND R. KOENKER (2006): “Quantile Regression Methods for Recursive Structural Equation Models,” Journal of Econometrics, 134, 471–506. [1483] MANSKI, C. (1994): “The Selection Problem,” in Advances in Economics and Econometrics, ed. by C. Sims. Cambridge, England: Cambridge University Press. [1490] (2007): Identification for Prediction and Decision. Princeton: Princeton University Press. [1490] MATZKIN, R. (1993): “Restrictions of Economic Theory in Nonparametric Models,” in Handbook of Econometrics, Vol. IV, ed. by R. Engle and D. McFadden. Amsterdam: North-Holland. [1485] (2003): “Nonparametric Estimation of Nonadditive Random Functions,” Econometrica, 71, 1339–1375. [1486,1507] MILGROM, P., AND C. SHANNON (1994): “Monotone Comparative Statics,” Econometrica, 58, 1255–1312. [1485] MUNDLAK, Y. (1963): “Estimation of Production Functions From a Combination of CrossSection and Time-Series Data,” in Measurement in Economics, Studies in Mathematical Economics and Econometrics in Memory of Yehuda Grunfeld, ed. by C. Christ. Stanford: Stanford University Press, 138–166. [1484] NEWEY, W. K. (1994): “Kernel Estimation of Partial Means and a Variance Estimator,” Econometric Theory, 10, 233–253. [1482,1502] NEWEY, W. K., AND J. L. POWELL (2003): “Nonparametric Instrumental Variables Estimation,” Econometrica, 71, 1565–1578. [1483] NEWEY, W. K., J. L. POWELL, AND F. VELLA (1999): “Nonparametric Estimation of Triangular Simultaneous Equations Models,” Econometrica, 67, 565–603. [1482,1483,1505,1506] PINKSE, J. (2000a): “Nonparametric Two-Step Regression Functions When Regressors and Error Are Dependent,” Canadian Journal of Statistics, 28, 289–300. [1483] (2000b): “Nonparametric Regression Estimation Using Weak Separability,” University of British Columbia. [1486] POWELL, J., J. STOCK, AND T. STOKER (1989): “Semiparametric Estimation of Index Coefficients,” Econometrica, 57, 1403–1430. [1491] SEVERINI, T., AND G. 
TRIPATHI (2006): “Some Identification Issues in Nonparametric Linear Models With Endogneous Regressors,” Econometric Theory, 22, 258–278. [1506] SILVERMAN, B. (1986): Density Estimation for Statistics and Data Analysis. New York: Chapman & Hall. [1499]
1512
G. W. IMBENS AND W. K. NEWEY
STOCK, J. (1989): “Nonparametric Policy Analysis: An Application to Estimating Hazardous Waste Cleanup Benefits,” in Nonparametric and Semiparametric Methods in Econometrics and Statistics, ed. by W. Barnett, J. Powell, and G. Tauchen. Cambridge: Cambridge University Press, 77–98. [1482,1491] STOKER, T. (1986): “Consistent Estimation of Scaled Coefficients,” Econometrica, 54, 1461–1481. [1482,1491] VYTLACIL, E. J., AND N. YILDIZ (2007): “Dummy Endogenous Variables in Weakly Separable Models,” Econometrica, 75, 757–779. [1483] WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. Cambridge: MIT Press. [1482] YU, K., AND M. C. JONES (1998): “Local Linear Quantile Regression,” Journal of the American Statistical Association, 93, 228–237. [1497]
Dept. of Economics, Littauer Center, Harvard University, 1805 Cambridge Street, Cambridge, MA 02138, U.S.A.;
[email protected] and Dept. of Economics, Massachusetts Institute of Technology, Cambridge, MA 02142-1347, U.S.A.;
[email protected]. Manuscript received April, 2007; final revision received January, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1513–1574
INFORMATION PERCOLATION WITH EQUILIBRIUM SEARCH DYNAMICS BY DARRELL DUFFIE, SEMYON MALAMUD, AND GUSTAVO MANSO1 We solve for the equilibrium dynamics of information sharing in a large population. Each agent is endowed with signals regarding the likely outcome of a random variable of common concern. Individuals choose the effort with which they search for others from whom they can gather additional information. When two agents meet, they share their information. The information gathered is further shared at subsequent meetings, and so on. Equilibria exist in which agents search maximally until they acquire sufficient information precision and then search minimally. A tax whose proceeds are used to subsidize the costs of search improves information sharing and can, in some cases, increase welfare. On the other hand, endowing agents with public signals reduces information sharing and can, in some cases, decrease welfare. KEYWORDS: Information percolation, search, learning, equilibrium.
1. INTRODUCTION WE CHARACTERIZE THE EQUILIBRIUM DYNAMICS of information sharing in a large population. An agent’s optimal current effort to search for information sharing opportunities depends on that agent’s current level of information and on the cross-sectional distribution of information quality and search efforts of other agents. Under stated conditions, in equilibrium, agents search maximally until their information quality reaches a trigger level and then search minimally. In general, it is not the case that raising the search-effort policies of all agents causes an improvement in information sharing. This monotonicity property does, however, apply to trigger strategies and enables a fixed-point algorithm for equilibria. In our model, each member of the population is endowed with signals regarding the likely outcome of a Gaussian random variable Y of common concern. The ultimate utility of each agent is increasing in the agent’s conditional precision of Y . Individuals therefore seek out others from whom they can gather additional information about Y . When agents meet, they share their information. The information gathered is then further shared at subsequent meetings and so on. Agents meet according to a technology for search and random matching, versions of which are common in the economics literatures covering labor markets, monetary theory, and financial asset markets. A distinction is that the search intensities in our model vary cross sectionally in a manner that depends on the endogenously chosen efforts of agents. 1 We are grateful for a conversation with Alain-Sol Sznitman, and for helpful remarks and suggestions from three referees and a co-editor. Duffie acknowledges support from the Swiss Finance Institute while visiting The University of Lausanne. Malamud gratefully acknowledges financial support from the National Centre of Competence in Research program “Financial Valuation and Risk Management” (NCCR FINRISK).
© 2009 The Econometric Society
DOI: 10.3982/ECTA8160
1514
D. DUFFIE, S. MALAMUD, AND G. MANSO
Going beyond prior work in this setting, we capture implications of the incentive to search more intensively whenever there is greater expected utility to be gained from the associated improvement in the information arrival process. Of course, the amount of information that can be gained from others depends on the efforts that the others themselves have made to search in the past. Moreover, the current expected rate at which a given agent meets others depends not only on the search efforts of that agent, but also on the current search efforts of the others. We assume complementarity in search efforts. Specifically, we suppose that the intensity of arrival of matches by a given agent increases in proportion to the current search effort of that agent, given the search efforts of the other agents. Each agent is modeled as fully rational, in a subgame-perfect Bayes–Nash equilibrium. The existence and characterization of an equilibrium involve incentive consistency conditions on the jointly determined search efforts of all members of the population simultaneously. Each agent’s lifetime search intensity process is the solution of a stochastic control problem, whose rewards depend on the search intensity processes chosen by other agents. We state conditions for a stationary equilibrium in which each agent’s search effort at a given time depends only on that agent’s current level of precision regarding the random variable Y of common concern. We show that if the cost of search is increasing and convex in effort, then, taking as given the cross-sectional distribution of other agents’ information quality and search efforts, the optimal search effort of any given agent is declining in the current information precision of that agent. This property holds, even out of equilibrium, because the marginal valuation of additional information for each agent declines as that agent gathers additional information. With proportional search costs, this property leads to equilibria with trigger policies that reduce search efforts to a minimum once a sufficient amount of information is obtained. Our proof of existence relies on a monotonicity result: Raising the assumed trigger level at which all agents reduce their search efforts leads to a first-order-dominant cross-sectional distribution of information arrivals. We show by counterexample, however, that for general forms of searcheffort policies, it is not generally true that the adoption of more intensive search policies leads to an improvement in population-wide information sharing. Just the opposite can occur. More intensive search at given levels of information can, in some cases, advance the onset of a reduction of the search efforts of agents who may be a rich source of information to others. This can lower access to richly informed agents in such a manner that, in some cases, information sharing is actually poorer. We also analyze the welfare effects of some policy interventions. First, we analyze welfare gains that can be achieved with a lump-sum tax whose proceeds are used to subsidize the costs of search efforts. Under stated conditions, we show that this promotes positive search externalities that would not otherwise arise in equilibrium. Finally, we show that, with proportional search costs, additional public information leads in equilibrium to an unambiguous reduction
INFORMATION PERCOLATION
1515
in the sharing of private information to the extent that there is, in some cases, a net negative welfare effect. 2. RELATED LITERATURE Previous research in economics has investigated the issue of information aggregation. A large literature has focused on the aggregation of information through prices. For example, Grossman (1981) proposed the concept of rational-expectations equilibrium to capture the idea that prices aggregate information that is initially dispersed across investors. Wilson (1977), Milgrom (1981), Pesendorfer and Swinkels (1997), and Reny and Perry (2006) provided strategic foundations for the rational-expectations equilibrium concept in centralized markets. In many situations, however, information aggregation occurs through local interactions rather than through common observation of market prices. For example, in decentralized markets, such as those for real estate and over-thecounter securities, agents learn from the bids of other agents in private auctions or bargaining sessions. Wolinsky (1990) and Blouin and Serrano (2001) studied information percolation in decentralized markets. In the literature on social learning, agents communicate with each other and choose actions based on information received from others. Banerjee and Fudenberg (2004), for example, studied information aggregation in a social-learning context. Previous literature has shown that some forms of information externalities may slow down or prevent information aggregation. For example, Vives (1993) showed that information aggregation may be slowed when agents base their actions on public signals (price) rather than on private signals, making inference noisier. Bikhchandani, Hirshleifer, and Welch (1992) and Banerjee (1992) showed that agents may rely on publicly observed actions, ignoring their private signals, giving rise to informational cascades that prevent social learning. Burguet and Vives (2000) and Amador and Weill (2008) posed related questions. Burguet and Vives (2000) studied a model with endogenous private information acquisition and public revelation of a noisy statistic of agents’ predictions. Improving public information reduces agents’ incentives to collect private information that could potentially slow down learning and reduce social welfare. Amador and Weill (2008) studied a model in which agents learn from public information as well as from the private observation of other agents’ actions. Improving public information allows agents to rely less heavily on their private signals when choosing their actions, thus slowing down the diffusion of information and potentially reducing social welfare. Our paper studies information aggregation in a social learning context. In contrast to previous studies, we analyze the equilibria of a game in which agents seek out other agents from whom they can gather information. This introduces a new source of information externality. If an agent chooses a high search intensity, he produces an indirect benefit to other agents by increasing both the
1516
D. DUFFIE, S. MALAMUD, AND G. MANSO
mean arrival rate at which the other agents will be matched and receive additional information, as well as the amount of information that the given agent is able to share when matched. We show that because agents do not take this externality into account when choosing their search intensities, social learning may be relatively inefficient or even collapse. We also show that endowing agents with public signals reduces their effort in searching for other agents from whom they can gather information. This reduces information sharing and can, in some cases, reduce social welfare. In addition to the information externality problem, our paper shows that coordination problems may be important in information aggregation problems. In our model, there are multiple equilibria that are Pareto-ranked in terms of the search intensities of the agents. If agents believe that other agents are searching with lower intensity, they will also search with lower intensity, producing an equilibrium with slower social learning. Pareto-dominant equilibria, in which all agents search with higher intensity, may be possible, but it is not clear how agents coordinate to achieve such equilibria. Our technology of search and matching is similar to that used in searchtheoretic models that have provided foundations for competitive general equilibrium and for models of equilibrium in markets for labor, money, and financial assets.2 Unlike these prior studies, we allow for information asymmetry about a common-value component, with learning from matching and with endogenously chosen search efforts. Our model is related to those of Duffie and Manso (2007) and Duffie, Giroux, and Manso (2009), who provided an explicit solution for the evolution of posterior beliefs when agents are randomly matched in groups over time, exchanging their information with each other when matched. In contrast to these prior studies, however, we model the endogenous choice of search intensities. Moreover, we deal with Gaussian uncertainty, as opposed to the case of binary uncertainty that is the focus of these prior two papers. Further, we allow for the entry and exit of agents, and analyze the resulting stationary equilibria. 3. MODEL PRIMITIVES A probability space (Ω F P) and a nonatomic measure space (A A α) of agents are fixed. We rely throughout on applications of the exact law of large numbers (LLN) for a continuum of random variables. A suitably precise version can be found in Sun (2006), based on technical conditions on the measurable subsets of Ω × A. As in the related literature, we also rely formally on 2
Examples of theoretical work using random matching to provide foundations for competitive equilibrium include that of Rubinstein and Wolinsky (1985) and Gale (1987). Examples in labor economics include Pissarides (1985) and Mortensen (1986); examples in monetary theory include Kiyotaki and Wright (1993) and Trejos and Wright (1995); examples in finance include Duffie, Gârleanu, and Pedersen (2005), Lagos and Rocheteau (2009), and Weill (2008).
INFORMATION PERCOLATION
1517
a continuous-time LLN for random search and matching that has only been rigorously justified in discrete-time settings.3 An alternative, which we avoid for simplicity, would be to describe limiting results for a sequence of models with discrete-time periods or finitely many agents as the lengths of time periods shrink or as the number of agents gets large. All agents benefit, in a manner to be explained, from information about a particular random variable Y . Agents are endowed with signals from a space S . The signals are jointly Gaussian with Y . Conditional on Y , the signals are pairwise independent. We assume that Y and all of the signals in S have zero mean and unit variance, which is without loss of generality because they play purely informational roles. Agent i enters the market with a random number Ni0 of signals that is independent of Y and S . The probability distribution π of Ni0 does not depend on i. For almost every pair (i j) of agents, Ni0 and Nj0 are independent, and their signal sets are disjoint. When present in the market, agents meet other agents according to endogenous search and random matching dynamics to be described. Under these dynamics, for almost every pair (i j) of agents, conditional on meeting at a given time t, there is zero probability that they meet at any other time and zero probability that the set of agents that i has met before t overlaps with the set of agents that j has met before t. Whenever two agents meet, they share with each other enough information to reveal their respective current conditional distributions of Y . Although we do not model any strict incentive for matched agents to share their information, they have no reason not to do so. We could add to the model a joint production decision that would provide a strict incentive for agents to reveal their information when matched, but we have avoided this for simplicity. By the joint Gaussian assumption and by induction in the number of prior meetings of each of a pair of currently matched agents, it is enough when sharing information relevant to Y that each of the two agents tells the other his or her immediately prior conditional mean and variance of Y . The conditional variance of Y given any n signals is v(n) =
1 − ρ2 1 + ρ2 (n − 1)
where ρ is the correlation between Y and any signal. Thus, it is equivalent for the purpose of updating the agents’ conditional distributions of Y that agent i tells his counterparty at any meeting at time t his or her current conditional mean Xit of Y and the total number Nit of signals that played a role in calculating the agent’s current conditional distribution of Y . This number of signals is initially the endowed number Ni0 , and is then incremented at each meeting 3
See Duffie and Sun (2007).
1518
D. DUFFIE, S. MALAMUD, AND G. MANSO
by the number Njt of signals that similarly influenced the information about Y that had been gathered by his counterparty j by time t. Because the precision 1/v(Nit ) of the conditional distribution of Y given the information set Fit of agent i at time t is strictly monotone in Nit , we speak of “precision” and Nit interchangeably. Agents remain in the market for exponentially distributed times that are independent (pairwise) across agents, with parameter η . If exiting at time t, agent i chooses an action A, measurable with respect to his current information Fit , with cost (Y − A)2 . Thus, to minimize the expectation of this cost, agent i optimally chooses the action A = E(Y |Fit ) and incurs an optimal expected exit cost equal to the Fit -conditional variance σit2 of Y . Thus, while in the market, the agent has an incentive to gather information about Y in order to reduce the expected exit cost. We will shortly explain how search for other agents according to a costly effort process φ influences the current mean rate of arrival of matches and, thus, the information filtration {Fit : t ≥ 0}. Given a discount rate r, the agent’s lifetime utility (measuring time from the point of that agent’s market entrance) is τ −rτ 2 −rt e K(φt ) dt Fi0 U(φ) = E −e σiτ − 0
where τ is the exit time and K(c) is the cost rate for search effort level c, which is chosen at each time from some interval [cL cH ] ⊂ R+ . We take the cost function K to be bounded and measurable, so U(φ) is bounded above and finite. As we will show, essentially any exit utility formulation that is concave and increasing in σiτ−2 would result in precisely the same characterization of equilibrium that we shall provide. The agent is randomly matched at a stochastic intensity that is proportional to the current effort of the agent, given the efforts of other agents. The particular pairings of counterparties are randomly chosen, in the sense of the law of large numbers for pairwise random matching described by Duffie and Sun (2007). The proportionality of matching intensities to effort levels means that an agent who exerts search effort c at time t has a current intensity (conditional mean arrival rate) of cbqb of being matched to some agent from the set of agents currently using effort level b at time t, where qb is the current fraction of the population using effort b. More generally, if the current crosssectional distribution of effort by other agents is given by a measure , then the intensity of a match with agents whose current effort levels are in a set B is c B b d(b). Our equilibrium analysis rests significantly on the complementarity of the search and matching technology, meaning that the more effort that an agent makes to be found, the more effective are the efforts of his counterparties to find him. One can relax the parameterization of the search technology by rescaling the “effort” variable and making a corresponding adjustment of the effort cost
INFORMATION PERCOLATION
1519
function K(·). Search incentives ultimately depend only on the mapping from the search costs of two types of agents to the expected contact rate between these types per unit mass of each.4 Agents enter the market at a rate proportional to the current mass qt of agents in the market for some proportional “birth rate” η > 0. Because agents exit the market pairwise independently at intensity η , the law of large numbers implies that the total quantity qt of agents in the market at time t is qt = q0 e(η−η )t almost surely. An advantage of the Gaussian informational setting is that the crosssectional distribution of agents’ current conditional distributions of Y can be described in terms of the joint cross-sectional distribution of conditional precisions and conditional means. We now turn to a characterization of the dynamics of the cross-sectional distribution of posteriors as the solution of a particular deterministic differential equation in time. The cross-sectional distribution μt of information precision at time t is defined, at any set B of positive integers, as the fraction μt (B) = α({i : Nit ∈ B})/qt of agents whose precisions are currently in the set B. We sometimes abuse notation by writing μt (n) for the fraction of agents with precision n. In the equilibria that we shall demonstrate, each agent chooses an effort level at time t that depends only on that agent’s current precision, according to a policy C : N → [cL cH ] used by all agents. Assuming that such a search effort policy C is used by all agents, the cross-sectional precision distribution satisfies (almost surely) the differential equation (1)
d μt = η(π − μt ) + μCt ∗ μCt − μCt μCt (N) dt
where μCt (n) = Cn μt (n) is the effort-weighted measure, μ ∗ ν denotes the convolution of two measures μ and ν, and μ (N) = C t
∞
Cn μt (n)
n=1
is the cross-sectional average search effort. The mean exit rate η plays no role in (1) because exit removes agents with a cross-sectional distribution that is the same as the current population cross-sectional distribution. The first term on the right-hand side of (1) represents the replacement of agents with newly entering agents. The convolution term μCt ∗ μCt represents the gross rate at which new agents of a given precision are created through matching and information 4 For example, some of our results allow the cost function K(·) to be increasing and convex, so for these results we can allow mean contact rates to have decreasing returns to scale in effort costs. For our characterization of equilibrium trigger strategies, however, we rely on the assumption that expected matching rates are linear with respect to the cost of each agent’s effort.
1520
D. DUFFIE, S. MALAMUD, AND G. MANSO
sharing. For example, agents of a given posterior precision n can be created by pairing agents of prior respective precisions k and n − k, for any k < n, so the total gross rate of increase of agents with precision n from this source is (μCt ∗ μCt )(n) =
n−1
μt (k)C(k)C(n − k)μt (n − k)
k=1
The final term of (1) captures the rate μCt (n)μCt (N) of replacement of agents with prior precision n with those of some new posterior precision that is obtained through matching and information sharing. We anticipate that, in each state of the world ω and at each time t, the joint cross-sectional population distribution of precisions and posterior means of Y has a density ft on N × R, with evaluation ft (n x) at precision n and posterior mean x of Y . This means that the fraction of agents whose conditional precision–mean pair (n x) is in a given measurable set B ⊂ N × R is +∞ n −∞ ft (n x)1{(nx)∈B} dx. When it is important to clarify the dependence of this density on the state of world ω ∈ Ω, we write ft (n x ω). PROPOSITION 3.1: For any search-effort policy function C, the cross-sectional distribution ft of precisions and posterior means of the agents is almost surely given by (2)
ft (n x ω) = μt (n)pn (x|Y (ω))
where μt is the unique solution of the differential equation (1) and pn (·|Y ) is the Y -conditional Gaussian density of E(Y |X1 Xn ) for any n signals X1 Xn . This density has conditional mean nρ2 Y 1 + ρ2 (n − 1) and conditional variance (3)
σn2 =
nρ2 (1 − ρ2 )
(1 + ρ2 (n − 1))2
Appendix A provides a proof based on a formal application of the law of large numbers, and an independent proof by direct solution of the differential equation for ft that arises from matching and information sharing. As n goes to infinity, the measure with density pn (·|Y ) converges, ω by ω (almost surely), to a Dirac measure at Y (ω). In other words, those agents who have collected a large number of signals have posterior means that cluster (cross sectionally) close to Y .
INFORMATION PERCOLATION
1521
4. STATIONARY MEASURE In our eventual equilibrium, all agents adopt an optimal search-effort policy function C, taking as given the presumption that all other agents adopt the same policy C and taking as given a stationary cross-sectional distribution of posterior conditional distributions of Y . Proposition 3.1 implies that this cross-sectional distribution is determined by the cross-sectional precision distribution μt . In a stationary setting, from (1), this precision distribution μ solves (4)
0 = η(π − μ) + μC ∗ μC − μC μC (N)
which can be viewed as a form of algebraic Riccati equation. We consider only solutions that have the correct total mass μ(N) of 1. For brevity, we use the notation μi = μ(i) and Ci = C(i). LEMMA 4.1: Given a policy C, there is a unique measure μ satisfying the stationary-measure equation (4). This measure μ is characterized as follows. For ¯ by the algorithm ¯ C) any C¯ ∈ [cL cH ], construct a measure μ( ¯ = μ¯ 1 (C)
ηπ1 η + C1 C¯
and then, inductively, ¯ = ηπk + μ¯ k (C)
k−1 l=1
¯ μ¯ k−l (C) ¯ Cl Ck−l μ¯ l (C)
η + Ck C¯
There is a unique solution C¯ to the equation C¯ = ¯ we have μ = μ( ¯ C).
∞ n=1
¯ C. ¯ Given such a C, ¯ μ¯ n (C)
An important question is stability. That is, if the initial condition μ0 is not sufficiently near the stationary measure, will the solution path {μt : t ≥ 0} converge to the stationary measure? The dynamic equation (1) is an infinitedimensional nonlinear dynamical system that could, in principle, have potentially complicated oscillatory behavior. In fact, a technical condition on the tail behavior of the effort policy function C(·) implies that the stationary distribution is globally attractive: From any initial condition, μt converges to the unique stationary distribution. PROPOSITION 4.2: Suppose that there is some integer N such that Cn = CN for n ≥ N, and suppose that η ≥ cH CN . Then the unique solution μt of (1) converges pointwise to the unique stationary measure μ.
1522
D. DUFFIE, S. MALAMUD, AND G. MANSO
The proof, given in Appendix B, is complicated by the factor μCt (N), which is nonlocal and involves μt (n) for each n. The proof takes the approach of representing the solution as a series {μt (1) μt (2) }, each term of which solves an equation similar to (1), but without the factor μCt (N). Convergence is proved for each term of the series. A tail estimate completes the proof. The convergence of μt does not guarantee that the limit measure is, in fact, the unique stationary measure μ. Appendix B includes a demonstration of this, based on Proposition B.13. As we later show in Proposition 5.3, the assumption that Cn = CN for all n larger than some integer N is implied merely by individual agent optimality under a mild condition on search costs. Our eventual equilibrium will, in fact, be in the form of a trigger policy C N , which for some integer N ≥ 1 is defined by c n < N, N Cn = H cL n ≥ N. In other words, a trigger policy exerts maximal search effort until sufficient information precision is reached and then exerts minimal search effort thereafter. A trigger policy automatically satisfies the “flat-tail” condition of Proposition 4.2. A key issue is whether search policies that exert more effort at each precision level actually generate more information sharing. This is an interesting question in its own right and also plays a role in obtaining a fixed-point proof of the existence of equilibria. For a given agent, access to information from others is entirely determined by the weighted measure μC , because if the given agent searches at some rate c, then the arrival rate of agents that offer n units of additional precision is cCn μn = cμCn . Thus, a first-order stochastically dominant shift in the measure μC is an unambiguous improvement in the opportunity of any agent to gather information. (A measure ν has first-order stochastic dominance (FOSD) relative to a measure θ if, for any nonnegative bounded increasing sequence f , we have n fn νn ≥ n fn θn .) The next result states that, at least when comparing trigger policies, a more intensive search policy results in an improvement in information sharing opportunities. PROPOSITION 4.3: Let μM and ν N be the unique stationary measures corre= μNn CnN denote the sponding to trigger policies C M and C N , respectively. Let μCN n CN associated search-effort-weighted measure. If N > M, then μ has the first-order dominance property over μCM . Superficially, this result may seem obvious. It says merely that if all agents extend their high-intensity search to a higher level of precision, then there will be an unambiguous upward shift in the cross-sectional distribution of information transmission rates. Our proof, shown in Appendix B, is not simple. Indeed,
INFORMATION PERCOLATION
1523
we provide a counterexample below to the similarly “obvious” conjecture that any increase in the common search-effort policy function leads to a first-orderdominant improvement in information sharing. Whether or not raising search efforts at each given level of precision improves information sharing involves two competing forces. The direct effect is that higher search efforts at a given precision increase the speed of information sharing, holding constant the precision distribution μ. The opposing effect is that if agents search harder, then they may earlier reach a level of precision at which they reduce their search efforts, which could, in principle, cause a downward shift in the cross-sectional average rate of information arrival. To make precise these competing effects, we return to the construction ¯ in Lemma 4.1 and write μ(C ¯ to show the depenof the measure μ( ¯ C) ¯ C) dence of this candidate measure on the conjectured average search effort C¯ as well as the given policy C. We emphasize that μ is the stationary measure ¯ and C¯ = Cn μn . From the algorithm stated in for C provided μ = μ(C ¯ C) n ¯ is increasing in C and decreasing in C¯ for all k. (A proof, Lemma 4.1, μ¯ k (C C) by induction in k, is given in Appendix B.) Now the relevant question is, “What ¯ solveffect does increasing C have on the stationary average search effort, C, ¯ ¯ ing the equation C = n μ¯ n (C C)Cn ?” The following proposition shows that increasing C has a positive effect on C¯ and, thus, through this channel has a ¯ negative effect on μ¯ k (C C). measures associated with poliPROPOSITION 4.4: Let μ and ν be the stationary cies C and D. If D ≥ C, then n Dn νn ≥ n Cn μn . That is, any increase in search policy increases the equilibrium average search effort. For trigger policies, the direct effect of increasing the search-effort policy C dominates the “feedback” effect on the cross-sectional average rate of effort. For other types of policies, this need not be the case, as shown by the following counterexample, whose proof is given in Appendix B. EXAMPLE 4.5: Suppose that π2 > 2π1 , that is, the probability of being endowed with two signals is more than double the probability of being endowed with only one signal. Consider a policy C with Cn = 0 for n ≥ 3. Fix C2 > 0 and consider a variation of C1 . For C1 sufficiently close to C2 , we show in Appendix B that ∞
Ck μk = C2 μ2
k=2
is monotone decreasing in C1 . Thus, if we consider the increasing sequence f1 = 0 fn = 1
(n ≥ 2)
1524
D. DUFFIE, S. MALAMUD, AND G. MANSO
we have f · μC = C2 μ2 strictly decreasing in C1 for C1 in a neighborhood of C2 , so that we do not have FOSD of μC with increasing C. In fact, more search effort by those agents with precision 1 can actually lead to poorer information sharing. To see this, consider the policies D = (1 1 0 0 ) and C = (1 − 1 0 0 ). The measure μC has FOSD over the measure μD for any5 sufficiently small > 0. 5. OPTIMALITY In this section, we study the optimal policy of a given agent who presumes that precision is distributed in the population according to some fixed measure μ and further presumes that other agents search according to a conjectured policy function C. We let C¯ = n Cn μn denote the average search effort. Given the conjectured market properties (μ C), each agent i chooses some search-effort process φ : Ω×[0 ∞) → [cL cH ] that is progressively measurable with respect to that agent’s information filtration {Fit : t ≥ 0}, meaning that φt is based only on current information. The posterior distribution of Y given Fit has conditional variance v(Nt ), where N is the agent’s precision process and v(n) is the variance of Y given any n signals. For a discount rate r on future expected benefits and given the conjectured market properties (μ C), an agent solves the problem τ −rτ φ −st (5) e K(φt ) dt Fi0 U(φ) = sup E −e v(Nτ ) − φ
0
where τ is the time of exit, exponentially distributed with parameter η , and where the agent’s precision process N φ is the pure-jump process with a given ¯ and with jump-size probinitial condition N0 , with jump-arrival intensity φt C, C ¯ ability distribution μ /C, that is, with probability C(j)μ(j)/C¯ of jump size j. We have abused notation by measuring calendar time for the agent from the time of that agent’s market entry. For generality, we relax from this point on the assumption that the exit disutility is the conditional variance v(Nτφ ) and we allow the exit utility to be of the more general form u(Nτφ ) for any bounded increasing concave6 function u(·) on the positive integers. It can be checked that u(n) = −v(n) is indeed a special case. We say that φ∗ is an optimal search-effort process given (μ C) if φ∗ attains the supremum (5). We further say that a policy function Γ : N → [cL cH ] is optimal given (μ C) if the search-effort process {Γ (Nt ) : t ≥ 0} is optimal, where the precision process N uniquely satisfies the stochastic differential equation 5 For this, we can without loss of generality take f1 = 1 and calculate that h() = f · μC is decreasing in for sufficiently small > 0. 6 We say that a real-valued function F on the integers is concave if F(j + 2) + F(j) ≤ 2F(j + 1).
1525
INFORMATION PERCOLATION
¯ (Bewith jump-arrival intensity Γ (Nt )C¯ and with jump-size distribution μC /C. cause Γ (n) is bounded by cH , there is a unique solution N to this stochastic differential equation; see Protter (2005).) We characterize agent optimality given (μ C) using the principle of dynamic programming, showing that the indirect utility, or “value,” Vn for precision n satisfies the Hamilton–Jacobi–Bellman (HJB) equation for optimal search effort given by
∞ 0 = −(r + η )Vn + η un + sup −K(c) + c (6) (Vn+m − Vn )μCm c∈[cL cH ]
m=1
A standard martingale-based verification proof of the following result is found in Appendix C. LEMMA 5.1: If V is a bounded solution of the Hamilton–Jacobi–Bellman equation (6) and Γ is a policy with the property that, for each n, the supremum in (6) is attained at Γn , then Γ is an optimal policy function given (μ C), and VN0 is the value of this policy. We begin to lay out some of the properties of optimal policies, based on conditions on the search-cost function K(·). PROPOSITION 5.2: Suppose that K is increasing, convex, and differentiable. Then, given (μ C), there is a policy Γ that is optimal for all agents and the optimal search effort Γn is monotone decreasing in the current precision n. ¯ independent of the conjectured popuTo calculate a precision threshold N, lation properties (μ C), above which it is optimal to search minimally, we let u¯ = limn u(n), which exists because un is increasing in n and bounded, and we let
N¯ = sup n : cH η (r + η )(u¯ − u(n)) ≥ K (cL ) which is finite if K (cL ) > 0. A proof of the following result is found in Appendix C. PROPOSITION 5.3: Suppose that K(·) is increasing, differentiable, and convex, with K (cL ) > 0. Then, for any optimal search-effort policy Γ , Γn = cL
¯ (n ≥ N)
In the special case of proportional and nontrivial search costs, it is, in fact, optimal for all agents to adopt a trigger policy that searches at maximal effort until a trigger level of precision is reached and searches at minimal effort
1526
D. DUFFIE, S. MALAMUD, AND G. MANSO
thereafter. This result, stated next, is a consequence of our prior results that an optimal policy is decreasing and eventually reaches cL , and of the fact that with linear search costs, an optimal policy is “bang-bang,” therefore taking the maximal effort level cH at first and then eventually switching to the minimal effort cL at a sufficiently high precision. PROPOSITION 5.4: Suppose that K(c) = κc for some scalar κ > 0. Then, given (μ C), there is a trigger policy that is optimal for all agents. 6. EQUILIBRIUM An equilibrium is a search-effort policy function C that satisfies the statements: (i) there is a unique stationary cross-sectional precision measure μ satisfying the associated equation (4) and (ii) taking as given the market properties (μ C), the search-effort policy function C is indeed optimal for each agent. Our main result is that with proportional search costs, there exists an equilibrium in the form of a trigger policy. THEOREM 6.1: Suppose that K(c) = κc for some scalar κ > 0. Then there exists a trigger policy that is an equilibrium. The theorem is proved using the following Proposition 6.2 and Corollary 6.3. We let C N be the trigger policy with trigger at precision level N and we let μN denote the associated stationary measure. We let N (N) ⊂ N be the set of trigger levels that are optimal given the conjectured market properties (μN C N ) associated with a trigger level N. We can look for an equilibrium in the form of a fixed point of the optimal trigger-level correspondence N (·), that is, some N such that N ∈ N (N). Theorem 6.1 does not rely on the stability result that from any initial condition, μt converges to μ. This stability applies, by Proposition 4.2, provided that η ≥ cH cL . PROPOSITION 6.2: Suppose that K(c) = κc for some scalar κ > 0. Then N (N) is increasing in N in the sense that if N ≥ N and if k ∈ N (N), then there exists some k ≥ k in N (N ). Further, there exists a uniform upper bound on N (N), independent of N, given by
N¯ = max j : cH η (r + η )(u¯ − u(j)) ≥ κ Theorem 6.1 then follows from the corollary: COROLLARY 6.3: The correspondence N has a fixed point N. An equilibrium is given by the associated trigger policy C N . Our proof, found in Appendix D, leads to the following algorithm for computing symmetric pure strategy equilibria of the game. The algorithm finds all such equilibria in trigger strategies.
INFORMATION PERCOLATION
1527
¯ ALGORITHM: Start by letting N = N. Step 1. Compute N (N). If N ∈ N (N), then output C N (an equilibrium of the game). Go to the next step. Step 2. If N > 0, go back to Step 1 with N = N − 1; otherwise quit. There may exist multiple equilibria of the game. The following proposition shows that the equilibria are Pareto-ranked according to their associated trigger levels and that there is never “too much” search in equilibrium. PROPOSITION 6.4: Suppose that K(c) = κc for some scalar κ > 0. If C N is an equilibrium of the game, then it Pareto dominates a setting in which all agents employ a policy C N for a trigger level N < N. In particular, an equilibrium associated with a trigger level N Pareto dominates an equilibrium with a trigger level lower than N. 6.1. Equilibria With Minimal Search We now consider conditions under which there are equilibria with minimal search, corresponding to the trigger precision N = 0. The idea is that such equilibria can arise because a presumption that other agents make minimal search efforts can lead to a conjecture of such poor information sharing opportunities that any given agent may not find it worthwhile to exert more than minimal search effort. We give an explicit sufficient condition for such equilibria, a special case of which is cL = 0. Clearly, with cL = 0, it is pointless for any agent to expend any search effort if he or she assumes that all other agents make no effort to be found. Let μ0 denote the stationary precision distribution associated with minimal search, so that C¯ = cL is the average search effort. The value function V of any agent solves (7)
(r + η + cL2 )Vn = η un − K(cL ) + cL2
∞
Vn+m μ0m
m=1
Consider the bounded increasing sequence f given by fn = (r + η + cL2 )−1 (η un − K(cL )) Define the operator A on the space of bounded sequences by (A(g))n =
∞ cL2 gn+m μ0m r + η + cL2 m=1
1528
D. DUFFIE, S. MALAMUD, AND G. MANSO
LEMMA 6.5: The unique, bounded solution V to (7) is given by −1
V = (I − A) (f ) =
∞
Aj (f )
j=0
which is concave and monotone increasing. To provide simple conditions for minimal-search equilibria, let (8)
B = cL
∞ (V1+m − V1 )μ0m ≥ 0 m=1
THEOREM 6.6: Suppose that K(·) is convex, increasing, and differentiable. Then the minimal-search policy C, that with C(n) = cL for all n, is an equilibrium if and only if K (cL ) ≥ B. In particular, if cL = 0, then B = 0 and minimal search is always an equilibrium. Intuitively, when the cost of search is small, there should exist equilibria with active search. PROPOSITION 6.7: Suppose that K(c) = κc and cL = 0. If π1 > 0 and (9)
κ−
η (u(2) − u(1))cH μ11 < 0 r + η
then there exists an equilibrium trigger policy C N with N ≥ 1. This equilibrium strictly Pareto dominates the no-search equilibrium. 7. POLICY INTERVENTIONS In this section, we discuss the potential welfare implications of policy interventions. First, we analyze the potential to improve welfare by a tax whose proceeds are used to subsidize the costs of search efforts. This has the potential benefit of positive search externalities that may not otherwise arise in equilibrium because each agent does not search unless others are searching, even though there are feasible search efforts that would make all agents better off. Then, we study the potentially adverse implications of providing all entrants with some additional common information. Although there is some direct benefit of the additional information, we show that with proportional search costs, additional public information leads to an unambiguous reduction in the sharing of private information, to the extent that there is, in some cases, a net negative welfare effect.
INFORMATION PERCOLATION
1529
In both cases, welfare implications are judged in terms of the utilities of agents as they enter the market. In this sense, the welfare effect of an intervention is said to be positive if it improves the utility of every agent at the point in time that the agent enters and to be negative if it causes a reduction in the utilities of all entering agents. 7.1. Subsidizing Search The adverse welfare implications of low information sharing may be attenuated by a tax whose proceeds are used to subsidize search costs. An example could be research subsidies aimed at the development of technologies that reduce communication costs. Another example is a subsidy that defrays some of the cost of using existing communication technologies. We assume in this subsection that each agent pays a lump-sum tax τ at entry. Search costs are assumed to be proportional, at rate κc for some κ > 0. Each agent is also offered a proportional reduction δ in search costs, so that the after-subsidy search-cost function of each agent is Kδ (c) = (κ − δ)c. The lumpsum tax has no effect on equilibrium search behavior, so we can solve for an equilibrium policy C, as before, based on an after-subsidy proportional search cost of κ − δ. Because of the law of large numbers, the total per capita rate τη of tax proceeds can then be equated to the total per capita rate of subsidy by setting τ=
1 δ μn Cn η n
The search subsidy can potentially improve welfare by addressing the failure, in a low-search equilibrium, to exploit positive search externalities. As Proposition 6.4 shows, there is never too much search in equilibrium. The following Lemma 7.1 and Proposition 7.2 show that, indeed, equilibrium search effort is increasing in the search subsidy rate δ. LEMMA 7.1: Suppose that K(c) = κc for some κ > 0. For given market conditions (μ C), the trigger precision level N of an optimal search-effort policy is increasing in the search subsidy rate δ. That is, if N is an optimal trigger level given a subsidy δ, then for any search subsidy δ ≥ δ, there exists a higher optimal trigger N ≥ N. Coupled with Proposition 4.3, Lemma 7.1 implies that an increase in the subsidy allows an increase (in the sense of first-order dominance) in information sharing. A direct consequence of this lemma is the following proposition. PROPOSITION 7.2: Suppose that K(c) = κc for some κ > 0. If C N is an equilibrium with proportional search subsidy δ, then for any δ ≥ δ, there exists some N ≥ N such that C N is an equilibrium with proportional search subsidy δ .
1530
D. DUFFIE, S. MALAMUD, AND G. MANSO
EXAMPLE: Suppose, for some integer N > 1, that π0 = 1/2, πN = 1/2, and cL = 0. This setting is equivalent to that of Proposition 6.7, after noting that every information transfer is in blocks of N signals each, resulting in a model isomorphic to one in which each agent is endowed with one private signal of a particular higher correlation. Recalling that inequality (9) determines whether zero search is optimal, we can exploit continuity of the left-hand side of this inequality to choose parameters so that, given market conditions (μN C N ), agents have a strictly positive but arbitrarily small increase in utility when choosing search policy C 0 over policy C N . With this, C 0 is the unique equilibrium. This is before considering a search subsidy. We now consider a model that is identical with the exception that each agent is taxed at entry and given search subsidies at the proportional rate δ. We can choose δ so that all agents strictly prefer C N to C 0 (the nonzero search condition (9) is satisfied) and C N is an equilibrium. For sufficiently large N, all agents have strictly higher indirect utility in the equilibrium with the search subsidy than they do in the equilibrium with the same private-signal endowments and no subsidy. 7.2. Educating Agents at Birth A policy that might superficially appear to mitigate the adverse welfare implications of low information sharing is to “educate” all agents by giving all agents additional public signals at entry. We assume for simplicity that the M ≥ 1 additional public signals are drawn from the same signal set S . When two agents meet and share information, they take into account that the information reported by the other agent contains the effect of the additional public signals. (The implications of the reported conditional mean and variance for the conditional mean and variance associated with a counterparty’s nonpublic information can be inferred from the public signals by using Lemma A.1.) Because of this, our prior analysis of information sharing dynamics can be applied without alteration merely by treating the precision level of a given agent as the total precision less the public precision and by treating the exit utility of each ˆ agent for n nonpublic signals as u(n) = u(n + M). The public signals influence optimal search efforts. Given the market conditions (μ C), the indirect utility Vn for nonpublic precision n satisfies the Hamilton–Jacobi–Bellman equation for optimal search effort given by
∞ C (10) (Vn+m − Vn )μm 0 = −(r + η )Vn + η uM+n + sup −K(c) + c c∈[cL cH ]
m=1
Educating agents at entry with public signals has two effects. On one hand, when agents enter the market they are better informed than if they had not received the extra signals. On the other hand, this extra information may reduce agents’ incentives to search for more information, slowing down information
INFORMATION PERCOLATION
1531
percolation. Below, we show an example in which the net effect is a strict welfare loss. First, however, we establish that adding public information causes an unambiguous reduction in the sharing of private information. LEMMA 7.3: Suppose that K(c) = κc for some κ > 0. For given market conditions (μ C), the trigger level N in nonpublic precision of an optimal policy C N is decreasing in the precision M of the public signals. (That is, if N is an optimal trigger level of precision given public-signal precision M, then for any higher public precision M ≥ M, there exists a lower optimal trigger N ≤ N.) Coupled with Proposition 4.3, Lemma 7.3 implies that adding public information leads to a reduction (in the sense of first-order dominance) in information sharing. A direct consequence of Lemma 7.3 is the following result. PROPOSITION 7.4: Suppose that K(c) = κc for some κ > 0. If C N is an equilibrium with M public signals, then for any M ≤ M, there exists some N ≥ N such that C N is an equilibrium with M public signals. In particular, by removing all public signals, as in the following example, we can get strictly superior information sharing and, in some cases, a strict welfare improvement. EXAMPLE: As in the previous example, suppose, for some integer N > 1, that π0 = 1/2, πN = 1/2, and cL = 0. This setting is equivalent to that of Proposition 6.7, after noting that every information transfer is in blocks of N signals each, resulting in a model isomorphic to one in which each agent is endowed with one private signal of a particular higher correlation. Analogously with the previous example, we can exploit continuity in the model parameters of the left-hand side of inequality (9), determining whether zero search is optimal, to choose the parameters so that, given market conditions (μN C N ), agents have a strict but arbitrarily small preference of policy C N over C 0 . We now consider a model that is identical with the exception that each agent is given M = 1 public signal at entry. With this public signal, again using continuity, we can choose parameters so that all agents strictly prefer C 0 to C N (the nonzero search condition (9) fails) and C 0 is the only equilibrium. For sufficiently large N or, equivalently, for any N ≥ 2 and sufficiently small signal correlation ρ, all agents have strictly lower indirect utility in the equilibrium with the public signal at entry than they do in the equilibrium with the same private-signal endowments and no public signal. 8. COMPARATIVE STATICS We conclude with a brief selection of comparative statics. We say that the set of equilibrium trigger levels is increasing in a parameter α if for any α1 ≥ α
1532
D. DUFFIE, S. MALAMUD, AND G. MANSO
and any equilibrium trigger level N for α, there exists an equilibrium trigger level N1 ≥ N corresponding to α1 . For simplicity, we take the specific exit disutility given by conditional variance, rather than allowing an arbitrary bounded concave increasing exit utility. PROPOSITION 8.1: Suppose that the exit disutility is conditional variance; that is, −un = vn . Then the set of equilibrium trigger levels • Is increasing in the exit intensity η . • Is decreasing in the discount rate r. √ • Is decreasing in the signal “quality” ρ2 provided that ρ2 ≥ 2 − 1 ≈ 0 414. A proof is given in the final Appendix F. The first two results, regarding η and r, would apply for an arbitrary bounded concave increasing exit utility. Roughly speaking, scaling up the number of primitively endowed signals by a given integer multiple has the same effect as increasing the signal quality ρ2 by a particular amount, so one can provide a corresponding comparative static concerning the distribution π of the number of endowed signals. For suffi√ ciently small ρ2 < 2 − 1, we suspect that nothing general can be said about the monotonicity of equilibrium search efforts with respect to signal quality. APPENDIX A: PROOFS FOR SECTION 3: INFORMATION SHARING MODEL LEMMA A.1: Suppose that Y X1 Xn Z1 Zm are joint Gaussian, and that X1 Xn and Z1 Zm all have correlation ρ with Y and are Y conditionally independent and identically distributed (i.i.d.). Then E(Y |X1 Xn Z1 Zm ) γn γm = E(Y |X1 Xn ) + E(Y |Z1 Zm ) γn+m γm+n where γk = 1 + ρ2 (k − 1). PROOF: The proof is by calculation. If (Y W ) are joint mean-zero Gaussian and W has an invertible covariance matrix, then by a well known result, E(Y |W ) = W cov(W )−1 cov(Y W ) It follows by calculation that (A.1)
E(Y |X1 Xn ) = βn (X1 + · · · + Xn )
where βn =
ρ
1 + ρ (n − 1) 2
INFORMATION PERCOLATION
1533
Likewise, E(Y |X1 Xn Z1 Zm ) = βn+m (X1 + · · · + Xn + Z1 + · · · + Zm ) E(Y |X1 Xn ) E(Y |Z1 Zm ) = βn+m
+ βn βm The result follows from the fact that βn+m /βn = γn /γn+m .
Q.E.D.
COROLLARY A.2: The conditional probability density pn (·|Y ) of E(Y |X1 Xn ) given Y is almost surely Gaussian with conditional mean nρ2 Y 1 + ρ2 (n − 1) and with conditional variance (A.2)
σn2 =
nρ2 (1 − ρ2 )
(1 + ρ2 (n − 1))2
PROOF OF PROPOSITION 3.1: We use the conditional law of large numbers (LLN) to calculate the cross-sectional population density ft . Later, we independently calculate ft , given the appropriate boundary condition f0 , by a direct solution of the particular dynamic equation that arises from updating beliefs at matching times. Taking the first, more abstract, approach, we fix a time t and state of the world ω, and let Wn (ω) denote the set of all agents whose current precision is n. We note that Wn (ω) depends nontrivially on ω. This set Wn (ω) has an infinite number of agents whenever μt (n) is nonzero, because the space of agents is nonatomic. In particular, the restriction of the measure on agents to Wn (ω) is nonatomic. Agent i from this set Wn (ω) has a current conditional mean of Y that is denoted Ui (ω). Now consider the cross-sectional distribution qn (ω)—a measure on the real line—of {Ui (ω) : i ∈ Wn (ω)}. Note that the random variables Ui and Uj are Y -conditionally independent for almost every distinct pair (i j), by the random matching model, which implies by induction in the number of their finitely many prior meetings that they have conditioned on distinct subsets of signals, and that the only source of correlation in Ui and Uj is the fact that each of these posteriors is a linear combination of Y and of other pairwise-independent variables that are also jointly independent of Y . Conditional on the event {Nit = n} that agent i is in the set Wn (ω) and conditional on Y , Ui has the Gaussian conditional density pn (·|Y ) recorded in Corollary A.2. This conditional density function does not depend on i. Thus, by a formal application of the law of large numbers, in almost every state of
1534
D. DUFFIE, S. MALAMUD, AND G. MANSO
the world ω, qn (ω) has the same distribution as the (Wn Y )-conditional distribution of Ui for any i. Thus, for almost every ω, the cross-sectional distribution qn (ω) of posteriors over the subset Wn (ω) of agents has the density pn (·|Y (ω)). In summary, for almost every state of the world, the fraction μt (n) of the population that has received n signals has a cross-sectional density pn (·|Y (ω)) over their posteriors for Y . We found it instructive to consider a more concrete proof based on a computation of the solution of the appropriate differential equation for ft , using the LLN to set the initial condition f0 . Lemma A.1 implies that when an agent with joint type (n x) exchanges all information with an agent whose type is (m y), both agents achieve posterior type γn γm m + n x+ y γm+n γm+n We therefore have the dynamic equation (A.3)
d ft (n x) = η(Π(n x) − ft (n x)) + (ft ◦ ft )(n x) dt ∞ − Cn ft (n x) Cm ft (m x) dx m=1
R
where Π(n x) = π(n)pn (x|Y (ω)) and n−1 γn Cn−m Cm (ft ◦ ft )(n x) = γn−m m=1 +∞ γ n x − γm y ft (m y) dy × ft n − m γn−m −∞
It remains to solve this ordinary differential equation (ODE) for ft . We will use the following calculation. LEMMA A.3: Let q1 (x) and q2 (x) be the Gaussian densities with respective means M1 M2 and variances σ12 σ22 . Then, +∞ γn x − γ m y γn q2 (y) dy = q(x) q1 γn−m −∞ γn−m where q(x) is the density of a Gaussian with mean M=
γm γn−m μ1 + μ2 γn γn
INFORMATION PERCOLATION
1535
and variance σ2 =
2 γn−m γm2 2 2 σ + σ 1 γn2 γn2 2
PROOF: Let X be a random variable with density q1 (x) and let Y be an independent variable with density q2 (x). Then Z = γn−1 (γn−m X + γm Y ) is also normal with mean M and variance σ 2 . On the other hand, γn−1 γn−m X and γn−1 γm Y are independent with densities γn γn q1 x γn−m γn−m and
γn γn q2 x γm γm
respectively. Consequently, the density of Z is the convolution γn γn γn2 (A.4) q1 (x − y) q2 y dy γn−m γm R γn−m γm +∞ γ n x − γm y γn q1 σn−m q2 (z) dz = γn−m −∞ γn−m where we have made the transformation z = γn γm−1 y.
Q.E.D.
LEMMA A.4: The density ft (n x ω) = μt (n)pn (x|Y (ω)) solves the evolution equation (A.3) if and only if the distribution μt of precisions solves the evolution equation (1). PROOF: By Lemma A.3 and Corollary A.2, +∞ γn γn x − γm y pn−m Y (ω) pm (y|Y (ω)) dy γn−m −∞ γn−m is conditionally Gaussian with mean γn−m γm nρ2 Y (n − m)ρ2 Y mρ2 Y + = γn 1 + ρ2 (n − m − 1) γn 1 + ρ2 (m − 1) 1 + ρ2 (n − 1)
1536
D. DUFFIE, S. MALAMUD, AND G. MANSO
and conditional variance σ2 = =
2 γn−m (n − m)ρ2 (1 − ρ2 ) γm2 2 mρ2 (1 − ρ2 ) + σ γn2 (1 + ρ2 (n − m − 1))2 γn2 2 (1 + ρ2 (m − 1))2
nρ2 (1 − ρ2 )
(1 + ρ2 (n − 1))2
Therefore, (ft ◦ ft )(n x) +∞ n−1 γ n x − γm y γn = ft (m y) dy Cn−m Cm ft n − m γn−m γn−m −∞ m=1 =
=
n−1
γn γ n−m m=1 +∞ γn x − γm y × pn−m Y (ω) pm (y|Y (ω)) dy γn−m −∞
n−1
Cn−m Cm μt (n − m)μt (m)
Cn−m Cm μt (n − m)μt (m)pn (x|Y (ω))
m=1
Substituting the last identity into (A.3), we get the required result.
Q.E.D.
APPENDIX B: PROOFS FOR SECTION 4: STATIONARY DISTRIBUTIONS This appendix provides proofs of the results on the existence, stability, and monotonicity properties of the stationary cross-sectional precision measure μ. B.1. Existence of the Stationary Measure PROOF OF LEMMA 4.1: If a positive, summable sequence {μn } indeed solves (4), then, adding up the equations over n, we get that μ(N) = 1, that is, μ is indeed a probability measure. Thus, it remains to show that the equation C¯ =
∞ n=1
¯ n μ¯ n (C)C
INFORMATION PERCOLATION
1537
¯ is monotone dehas a unique solution. By construction, the function μ¯ k (C) ¯ creasing in C and ¯ = ηπk + ημ¯ k (C)
k−1
¯ μ¯ k−l (C) ¯ − Ck μ¯ k (C) ¯ C ¯ Cl Ck−l μ¯ l (C)
l=1
¯ < π1 ≤ 1. Suppose that C¯ ≥ cH . Then, adding up the above Clearly, μ¯ 1 (C) identities, we get η
n
¯ ≤ η + cH μk (C)
k−1
k=1
Cl μ¯ l − C¯
l=1
k−1
Cl μ¯ l ≤ η
l=1
Hence, for C¯ ≥ cH we have that ∞
¯ ≤ 1 μk (C)
k=1
Consequently, the function ¯ = f (C)
∞
¯ Ck μ¯ k (C)
k=1
is strictly monotone decreasing in C¯ and satisfies ¯ ≤ C ¯ f (C)
C¯ ≥ cH
¯ otherwise, we set It may happen that f (x) = +∞ for some Cmin ∈ (0 C); Cmin = 0. The function ¯ = C¯ − f (C) ¯ g(C) is continuous (by the monotone convergence theorem for infinite series (see, e.g., Yeh (2006, p. 168)) and strictly monotone increasing, and satisfies Q.E.D. g(Cmin ) ≤ 0 and g(cH ) ≥ 0. Hence, it has a unique zero. B.2. Stability of the Stationary Measure PROOF OF PROPOSITION 4.2: The ordinary differential equation for μk (t) can be written as (B.1)
μk = ηπk − ημk − Ck μk
∞ i=1
Ci μi +
k−1 l=1
Cl μl Ck−l μk−l
1538
D. DUFFIE, S. MALAMUD, AND G. MANSO
We will need to establish the right to interchange infinite summation and differentiation. We will use the following known lemma: LEMMA B.1: Let gk (t) be C 1 functions such that gk (t) and gk (0) k
k
converge for all t and |gk (t)| k
is locally bounded (in t). Then
k
gk (t) is differentiable and
gk (t) = gk (t)
k
k
We will also need the next lemma: LEMMA B.2: Suppose that f solves f = −a(t)f + b(t) where a(t) ≥ ε > 0 and b(t) = c t→∞ a(t) lim
Then f (t) = e−
t
t
0 a(s) ds
s
e 0 a(u) du b(s) ds + f0 e−
t
0 a(s) ds
0
and limt→∞ f (t) = c PROOF: The formula for the solution is well known. By l’Hôpital’s rule,
t
s
e 0 a(u) du b(s) ds lim
t→∞
0
t
e 0 a(s) ds
t
= lim
t→∞
e 0 a(s) ds b(t) t
a(t)e 0 a(s) ds
= c
Q.E.D.
The following proposition shows existence and uniqueness of the solution.
1539
INFORMATION PERCOLATION
PROPOSITION B.3: There exists a unique solution {μk (t)} to (B.1) and this solution satisfies (B.2)
∞
μk (t) = 1
k=1
for all t ≥ 0. PROOF: Let l1 (N) be the space of absolutely summable sequences {μk } with {μk }l1 (N) =
∞
|μk |
k=1
Consider the mapping F : l1 (N) → l1 (N) defined by
F({μi })
= ηπk − ημk − Ck μk k
∞ i=1
Ci μi +
k−1
Cl μl Ck−l μk−l
l=1
Then (B.1) takes the form ({μk }) = F({μk }). A direct calculation shows that ∞ k−1 (B.3) (Cl al Ck−l ak−l − Cl bl Ck−l bk−l ) k=1
l=1
≤ cH2
k−1 ∞ (|al − bl ||ak−l | + |ak−l − bk−l ||bl |) k=1 l=1
=c
2 H
{ak }l1 (N) + {bk }l1 (N) {ak − bk }l1 (N)
Thus, F({ak }) − F({bk }) l1 (N) 2 ≤ η + 2cH {ak }l1 (N) + {bk }l1 (N) {ak − bk }l1 (N) so F is locally Lipschitz continuous. By a standard existence result (Dieudonné (1960, Theorem 10.4.5)), there exists a unique solution to (B.1) for t ∈ [0 T0 ) for some T0 > 0 and this solution is locally bounded. Furthermore, [0 T0 ) can be chosen to be the maximal existence interval, such that the solution {μk } cannot be continued further. It remains to show that T0 = +∞. Because, for any t ∈ [0 T0 ), ({μk }) = F({μk })l (N) ≤ η + η{μk }l1 (N) + 2cH2 {μk }2l1 (N) l (N) 1
1
1540
D. DUFFIE, S. MALAMUD, AND G. MANSO
is locally bounded, Lemma B.1 implies that ∞ ∞ k−1 ∞ ηπk − ημk − Ck μk μk = Ci μi + Cl μl Ck−l μk−l k=1
i=1
k=1
l=1
= 0 and hence (B.2) holds. We will now show that μk (t) ≥ 0 for all t ∈ [0 T0 ]. For k = 1, we have 1
μ = ηπ1 − ημ1 − C1 μ1
∞
Ci μi
i=1
Denote a1 (t) = −η − C1
∞ i=1
Ci μi . Then we have
μ1 = ηπ1 + a1 (t)μ1 Lemma B.2 implies that μ1 ≥ 0 for all t ∈ [0 T0 ). Suppose we know that μl ≥ 0 for l ≤ k − 1. Then μk = zk (t) + ak (t)μk (t) where ak (t) = −η − Ck
∞
Ci μi
i=1
zk (t) = ηπk +
k−1
Cl μl Ck−l μk−l
l=1
By the induction hypothesis, zk (t) ≥ 0 and Lemma B.2 implies that μk ≥ 0. Thus, {μk }l1 (N) =
∞
μk = 1
k=1
so the solution to (B.1) is uniformly bounded on [0 T0 ), and can, therefore, be continued beyond T0 (Dieudonné (1960, Theorem 10.5.6)). Since [0 T0 ) is, by Q.E.D. assumption, the maximal existence interval, we have T0 = +∞. We now expand the solution μ in a special manner. Namely, denote cH − Ci = fi ≥ 0
1541
INFORMATION PERCOLATION
We can then rewrite the equation as (B.4)
μk = ηπk − (η + cH2 )μk + cH fk μk + Ck μk
∞
fi μi +
i=1
k−1
Cl μl Ck−l μk−l
l=1
Now, we will (formally) expand μk =
∞
μkj (t)
j=0
where μk0 does not depend on (the whole sequence) (fi ), μk1 is linear in (the whole sequence) (fi ), μk2 is quadratic in (fi ), and so on. The main idea is to expand so that all terms of the expansion are nonnegative. Later, we will prove that the expansion indeed converges and coincides with the unique solution to the evolution equation. Substituting the expansion into the equation, we get (B.5)
μk0 = ηπk − (η + cH2 )μk0 +
k−1
Cl μl0 Ck−l μk−l0
l=1
with the given initial conditions μk0 (0) = μk (0) for all k. Furthermore, (B.6)
k1
μ = −(η + c )μk1 + cH fk μk0 + Ck μk0 2 H
∞
fi μi0
i=1
+2
k−1
Cl μl0 Ck−l μk−l1
l=1
and then (B.7)
μkj = −(η + cH2 )μkj + cH fk μkj−1 + 2Ck
j−1
μkm
m=0
∞ i=1
fi μij−1−m + 2
k−1 j−1
Cl μlm Ck−l μk−lj−m
l=1 m=0
with initial conditions μkj (0) = 0. Equations (B.6) and (B.7) are only well defined if μi0 exists for all t and the infinite series ∞
μij (t)
i=1
converges for all t and all j
1542
D. DUFFIE, S. MALAMUD, AND G. MANSO
Thus, we can solve these linear ODEs with the help of Lemma B.2. This is done through a recursive procedure. Namely, the equation for μ10 is linear and we have μ10 = ηπ1 − (η + cH2 )μ10 ≤ ηπk − (η + cH2 )μ10 + cH f1 μ1 + C1 μ1
∞
fi μi
i=1
A comparison theorem for ODEs (Hartman (1982, Theorem 4.1, p. 26)) immediately implies that μ10 ≤ μ1 for all t. By definition, μk0 solves μk0 = ηπk − (η + cH2 )μk0 + zk0 with zk0 =
k−1
Cl μl0 Ck−l μk−l0
l=1
depending on only those μl0 with l < k. Since μ10 is nonnegative, it follows by induction that all equations for μk0 have nonnegative inhomogeneities and hence μk0 is nonnegative for each k. Suppose now that μl0 ≤ μl for all l ≤ k−1. Then zk0 =
k−1
Cl μl0 Ck−l μk−l0 ≤
l=1
k−1
Cl μl Ck−l μk−l
l=1
and (B.4) and the same comparison theorem imply μk0 ≤ μk . Thus, μk0 ≤ μk for all k. It follows that the series μk0 ≤ 1 k
converges and, therefore, equations (B.6) are well defined. Let now μ(N) = k
N
μkj
j=0
Suppose that we have shown that μkj ≥ 0 for all k, and all j ≤ N − 1, and that (B.8)
μk(N−1) ≤ μk
for all k. Equations (B.6) and (B.7) are again linear inhomogeneous and can be solved using Lemma B.2 and the nonnegativity of μkN follows. By adding
INFORMATION PERCOLATION
1543
(B.5), (B.6), and (B.7), and using the induction hypothesis (B.8), we get (N) (N) (B.9) ≤ ηπk − (η + cH2 )μ(N) μk k + cH fk μk + Ck μ(N) k
∞
fi μi
i=1
k−1
Cl μl Ck−l μk−l
l=1
≤ μk . The comparison theorem applied to (B.9) and (B.4) implies that μ(N) k Thus, we have shown by induction that μkj ≥ 0 and N
μkj ≤ μk
j=0
∞ for any N ≥ 0. The infinite series j=0 μkj (t) consists of nonnegative terms and is uniformly bounded from above. Therefore, it converges to a function μ˜ k (t). Using Lemma B.1 and adding up (B.5), (B.6), and (B.7), we get that the sequence {μ˜ k (t)} is continuously differentiable and satisfies (B.4) and μ˜ k (0) = μk (0). Since, by Proposition B.3, the solution to (B.4) is unique, we get μ˜ k = μk for all k. Thus, we have proved the following statement: THEOREM B.4: We have μk =
∞
μkj
j=0
It remains to prove that limt→∞ μk (t) exists. The strategy for this consists of two steps: Step 1. Prove that limt→∞ μkj (t) = μkj (∞) exists. Step 2. Prove that lim
t→∞
∞
μkj =
j=0
∞ j=0
lim μkj
t→∞
Equation (D.1) and Lemma B.2 directly imply the convergence of μk0 . But, the next step is tricky because of the appearance of the infinite sums of the form ∞
fi μij (t)
i=0
in equations (B.6) and (B.7). If we prove convergence of these infinite sums a subsequent application of Lemma B.2 to (B.6) and (B.7) will imply convergence of μij+1 Unfortunately, convergence of μij (t) for each i j is not enough for convergence of these infinite sums.
1544
D. DUFFIE, S. MALAMUD, AND G. MANSO
Recall that, by assumption, there exists an N such that Ci = CN for all i ≥ N. Thus, we need only show that Mj (t) =
∞
μij (t)
i=N
converges for each j. We will start with the case j = 0. Then, adding up (B.5), (B.6), and (B.7) and using Lemma B.1, we get N−1 N−1 2 μi0 = η − (η + cH ) M0 (t) + μi0 M0 (t) + i=1
+
N−1
i=1
2
Ci μi0 + CN M0 (t)
i=1
Opening the brackets, we can rewrite this equation as a Riccati equation for M0 : M0 (t) = a0 (t) + b0 (t)M0 (t) + CN2 M0 (t)2 A priori, we know that M0 stays bounded and, by (B.5) and Lemma B.2, the coefficients a0 and b0 converge to finite limits. LEMMA B.5: M0 (t) converges to a finite limit at t → ∞. To prove Lemma B.5, we will need an auxiliary lemma: LEMMA B.6: Let N(t) be the solution, for c > 0, to N (t) = a + bN(t) + cN 2 (t)
N(0) = N0
If the characteristic polynomial q(λ) = a + bλ + cλ2 has real zeros λ1 ≥ λ2 , then two situations can occur: (i) If N0 < λ1 , then limt→∞ N(t) = λ2 . (ii) If N0 > λ1 , then limt→∞ N(t) = +∞. If q(λ) does not have real zeros, then limt→∞ N(t) = +∞ for any N0 . PROOF: The stationary solutions are N = λ12 . If N0 < λ2 , then N(t) < λ2 for all t by uniqueness. Hence, N (t) = a + bN(t) + cN 2 (t) > 0, and N(t) increases and converges to a limit N(∞) ≤ λ2 . This limit should be a stationary solution, that is, N(∞) = λ2 . If N0 ∈ (λ2 λ1 ), then N(t) ∈ (λ2 λ1 ) for all t by
INFORMATION PERCOLATION
1545
uniqueness, and, therefore, N (t) < 0 and N(t) decreases to N(∞) ≥ λ2 , and we again should have N(∞) = λ2 . If N0 > λ1 , N > 0 and hence N (t) = a + bN(t) + cN 2 (t) > a + bN0 + cN02 > 0 for all t and the claim follows. If q(λ) has no real zeros, its minimum minλ∈R q(λ) = δ is strictly positive. Hence, N (t) > δ > 0 and the claim follows. Q.E.D. PROOF OF LEMMA B.5: Consider the quadratic polynomial q∞ (λ) = a0 (∞) + b0 (∞)λ + CN2 λ2 = 0 We will consider three cases: CASE 1: q∞ (λ) does not have real zeros, that is, minλ∈R q∞ (λ) = δ > 0. Then, for all sufficiently large t, (B.10)
M0 (t) = a0 (t) + b0 (t)M0 (t) + CN2 M0 (t)2 ≥ δ/2 > 0
so M0 (t) will converge to +∞, which is impossible. CASE 2: q∞ (λ) has a double zero λ∞ . Then we claim that limt→∞ M(t) = λ∞ . Indeed, suppose it is not true. Then there exists an ε > 0 such that either supt>T (M(t) − λ∞ ) > ε for any T > 0 or supt>T (λ∞ − M(t)) > ε for any T > 0. Suppose that the first case takes place. Pick a δ > 0 and choose T > 0 so large that a0 (t) ≥ a0 (∞) − δ
b0 (t) ≥ b0 (∞) − δ
for all t ≥ T . The quadratic polynomial a0 (∞) − δ + (b0 (∞) − δ)λ + CN2 λ2 has two real zeros λ1 (δ) > λ2 (δ) and, for sufficiently small δ, |λ12 (δ) − λ∞ | < ε/2. Let T0 > T be such that M0 (T0 ) > λ∞ + ε. Consider the solution N(t) to N (t) = a0 (∞) − δ + (b0 (∞) − δ)N(t) + CN2 N(t)2 N(T0 ) = M(T0 ) > λ1 (δ) By the comparison theorem for ODEs, M0 (t) ≥ N(t) for all t ≥ T0 and Lemma B.6 implies that N(t) → ∞, which is impossible. Suppose now supt>T (λ∞ − M(t)) > ε for any T > 0. Consider the same N(t) as above and choose δ so small that M(T0 ) < λ2 (δ). Then M(t) ≥ N(t) and, by Lemma B.6,
1546
D. DUFFIE, S. MALAMUD, AND G. MANSO
N(t) → λ2 (δ). For sufficiently small δ, λ2 (δ) can be made arbitrarily close to δ∞ and hence supt>T (λ∞ − M(t)) > ε cannot hold for sufficiently large T , which is a contradiction. CASE 3: q(λ) has two distinct real zeros λ1 (∞) > λ2 (∞). Then we claim that either limt→∞ M(t) = λ1 (∞) or limt→∞ M(t) = λ2 (∞). Suppose the contrary. Then there exists an ε > 0 such that supt>T |M(t) − λ1 (∞)| > ε and supt>T |M(t) − λ2 (∞)| > ε. Now, an argument completely analogous to that of Case 2 applies. Q.E.D. With this result, we go directly to (B.6) and get the convergence of μj1 from Lemma B.2. Then adding equations (B.6), we get a linear equation for M1 (t) and again get convergence from Lemma B.2. Note that we are in an even simpler situation of linear equations. Proceeding inductively, we arrive at the following statement: PROPOSITION B.7: The limit lim μkj (t) ≤ 1
t→∞
exists for any k j. The next important observation is that if we subtract an ε from c in the first quadratic term, then the measure remains bounded. This is a nontrivial issue, since the derivative could become large. LEMMA B.8: If ε is sufficiently small, then there exists a constant K > 0 such that the system k−1 ∞ μ = ηπk − ημk − (Ck − ε)μk (Ci − ε)μi + Cl μl Ck−l μk−l k
i=1
l=1
has a unique solution {μk (t)} ∈ l1 (N) and this solution satisfies ∞
μk (t) ≤ K
k=1
for all t > 0. PROOF: An argument completely analogous to that in the proof of Proposition B.3 implies that the solution exists on an interval [0 T0 ), and is unique and nonnegative. Letting M=
∞ k=1
μk (t)
INFORMATION PERCOLATION
1547
and adding up the equations, we get M ≤ η − ηM + 2εcH M 2 Consider now the solution N to the equation (B.11)
N = η − ηN + 2εcH N 2
By a standard comparison theorem for ODEs (Hartman (1982, Theorem 4.1, p. 26)), if N(0) = M(0), then M(t) ≤ N(t) for all t. Thus, we only need to show that N(t) stays bounded. The stationary points of (B.11) are η ± η2 − 8εcH η d12 = 4εcH so, for sufficiently small ε, the larger stationary point d1 is arbitrarily large. Therefore, by uniqueness, if N(0) < d1 , then N(t) < d1 for all t and we are done. Now, the same argument as in the proof of Proposition B.3 implies that T0 = +∞ and the proof is complete. Q.E.D. Since (fi ) is bounded, for any ε > 0, there exists a δ > 0 such that fi + ε ≥ (1 + δ)fi for all i. Consider now the solution (μ(ε) k ) to the equation of Lemma B.8. Then, using the same expansion μ(ε) k =
∞
μ(ε) kj
j=0
we immediately get (by direct calculation) that (1 + δ)j μkj ≤ μ(ε) kj By Lemma B.8, μ(ε) kj ≤
kj
μ(ε) kj =
μ(ε) k < K
k
and, therefore, μkj (t) ≤
K (1 + δ)j
for some (possibly very large) constant K, independent of t. Now we are ready to prove the next theorem:
1548
D. DUFFIE, S. MALAMUD, AND G. MANSO
THEOREM B.9: We have lim μk (t) =
t→∞
∞
μkj (∞)
j=0
PROOF: Take N so large that ∞
μkj (t) ≤
j=N
∞ j=N
K ε < j (1 + δ) 2
for all t. Then choose T so large that N−1
|μkj (t) − μkj (∞)| < ε/2
j=0
for all t > T . Then ∞ (μkj (t) − μkj (∞)) < ε j=0
Since ε is arbitrary, we are done. LEMMA B.10: Consider the limit Cn μn (t) C¯ = lim t→∞
n
In general, C¯ ≥
n
Cn lim μn (t) = C˜ t→∞
with equality if and only if lim μn (t) = 1 n
t→∞
Furthermore, ¯ ˜ C − C = CN 1 − lim μn (t) n
t→∞
Based on Lemma B.10, we have another lemma:
Q.E.D.
INFORMATION PERCOLATION
1549
LEMMA B.11: The limit μ∞ (n) = limt→∞ μt (n) satisfies the equation ¯ 0 = η(π − μ∞ ) + μC∞ ∗ μC∞ − μC∞ C where, in general, C¯ ≥ μC∞ (N) and μ∞ (N) = 1 − η−1 μC∞ (C¯ − μC∞ ) ≤ 1 An immediate consequence is the next lemma. LEMMA B.12: The limit distribution μ∞ is a probability measure and coincides ˜ with the unique solution to (4) if and only if C¯ = C. PROPOSITION B.13: Under the same tail condition Cn = CN for n ≥ N and the condition that (B.12)
η ≥ CN cH
we have C¯ = C˜ and, therefore, μ∞ is a probability measure that coincides with the unique solution to (4). PROOF: Recall that the equation for the limit measure is ¯ k μ∞ (k) + (B.13) −ημ∞ (k) + ηπk − CC Cl μ∞ (l)Ck−l μ∞ (k − l) = 0 l
where C¯ = lim
t→∞
Cn μn (t) ≥
n
The difference C¯ − C˜ = (1 − M)CN with M=
k
μ∞ (k)
n
˜ Cn μ∞ (n) = C
1550
D. DUFFIE, S. MALAMUD, AND G. MANSO
is nonzero if and only if there is a loss of mass, that is, M < 1. Adding (B.13) up over k, we get (B.14)
˜ 2 −ηM + η − C¯ C˜ + (C) =0 ¯ C¯ − (1 − M)CN ) + (C¯ − (1 − M)CN )2 = −ηM + η − C(
If M = 1, we get, dividing this equation by (1 − M), that M =1+
η − CN C¯
CN2
Since M ≤ 1, we immediately get that if η ≥ CN cH then there is no loss of mass, proving the result.
Q.E.D.
B.3. Monotonicity Properties of the Stationary Measure We recall that the equation for the stationary measure μ = (μk k ≥ 1) is −ημk + ηπk − Ck μk
∞
Ci μi +
i=1
k−1
Cl Ck−l μl μk−l = 0
l=1
¯ C1 Ck ) : k ≥ 1} for the measure constructed recursively We write {μ¯ k (C in the statement of Lemma 4.1 from a given C¯ and a given policy C. LEMMA B.14: For each k, the function that maps C to ¯ C1 Ck ) Ck μ¯ k = Ck μ¯ k (C ¯ is monotone increasing in Ci i = 1 k, and monotone decreasing in C. PROOF: The proof is by induction. For k = 1, there is nothing to prove. For k > 1, k−1 Ck ηπk + Ck μ¯ k = (Cl μ¯ l )(Ck−l μ¯ k−l ) ¯ k η + CC l=1
is monotone by the induction hypothesis.
Q.E.D.
INFORMATION PERCOLATION
1551
Steps Toward a Proof of Proposition 4.3 Our proof of Proposition 4.3 is based on the next series of results, Lemmas B.15–B.26, for which we use the notation ¯ = Ck μ¯ k (C ¯ C1 Ck ) νk = νk (C1 Ck C) and Zk =
Ck
η + Ck C¯ √ Note that multiplying η and Ck by the same number λ does not change the equation and, hence, does not change the stationary measure. Thus, without loss of generality, we can normalize for simplicity to the case η = 1. LEMMA B.15: The sequence νk satisfies ν1 = Z1 π1 νk = Zk πk +
k−1
νl νk−l
l=1
and (B.15)
∞
¯ = C ¯ νk (C1 Ck C)
k=1
LEMMA B.16: Differentiating (B.15) with respect to Ck , we get ∂C¯ = ∂Ck
∞ ∂νi ∂Ck i=k
1−
∞ ∂νi i=1
∂C¯
¯ We now let C(C) denote the unique solution of the equation (B.15), and for any policy C we define ¯ ξk (C) = νk (C1 Ck C(C)) LEMMA B.17: Let N − 1 < k. We have ∞ ∂ ξi ≥ 0 ∂CN−1 i=k
1552
D. DUFFIE, S. MALAMUD, AND G. MANSO
if and only if (B.16)
1−
k−1 ∂νi i=1
∂C¯
∞
2 N−1
C
i=k
∞ k−1 ∂νi ∂νi ∂νi 2
≥ CN−1 − ∂CN−1 ∂CN−1 i=k ∂C¯ i=1
PROOF: We have ∞ ∂ ξi ∂CN−1 i=k
∞ ∂νi ∂νi ∂C¯ = + ∂CN−1 ∂C¯ ∂CN−1 i=k ∞ k−1 ∞ ∞ ∞ ∂νi ∂νi ∂νi ∂νi + + 1− ∂C¯ i=k ∂CN−1 ∂C¯ i=N−1 i=k ∂CN−1 i=1 i=k = ∞ ∂νi 1− ∂C¯ i=1 ∞ k−1 ∞ k−1 ∂νi ∂νi ∂νi ∂νi + 1− ∂C¯ i=k ∂CN−1 ∂C¯ i=N−1 ∂CN−1 i=1 i=k = ∞ ∂νi 1− ∂C¯ i=1
Q.E.D.
and the claim follows.
Suppose now that we have the “flat-tail” condition that for some N, Cn = CN for all n ≥ N. We define the moment-generating function m(·) of ν by (B.17)
m(x) =
∞
νi xi
i=1
By definition, the first N − 1 coefficients νi of the power-series expansion of m(x) satisfy ν1 = π1 Z1 and then νi = πi +
i−1 l=1
νl νi−l Zi
1553
INFORMATION PERCOLATION
for i ≤ N − 1 and νi = ZN πi +
i−1
νl νi−l
l=1
for i ≥ N. For N ≥ 2, let mb (x N) =
∞
πi xi
i=N
Using Lemma B.15 and comparing the coefficients in the power-series expansions, we get the following statement: LEMMA B.18: If Cn = CN for all n ≥ N, then m(x) −
N−1
νi x = ZN m (x N) + m (x) − i
b
2
N−1
i=1
i
x
i=2
i−1
νl νi−l
l=1
Thus, 0 = ZN mb (x N) + m2 (x) −
N−1
xi
i=2
l=1
N−1
= ZN mb (x N) + m2 (x) − − m(x) + π1 Z1 x + = ZN m (x) − m(x)
νl νi−l − m(x) +
xi
νl νi−l
l=1
Zi πi +
N−1 i=1
i−1
i=2
N−1
i−1
i−1
i=2
l=1
N−1
N−1
νl νi−l xi
2
+ ZN mb (x N) + N−1
+ π1 Z1 x +
i=2
πi xi −
i=2
Zi πi +
πi +
i=2
i−1
i−1 l=1
νl νi−l xi
l=1
= ZN m (x) − m(x) + ZN mb (x 2) + π1 Z1 x N−1 i−1 i x (Zi − ZN ) πi + νl νi−l + 2
i=2
l=1
νl νi−l xi
νi xi
1554
D. DUFFIE, S. MALAMUD, AND G. MANSO
Solving this quadratic equation for m(x) and picking the branch that satisfies m(0) = 0, we arrive at the next lemma. LEMMA B.19: The moment-generating function m(·) of ν is given by √ 1 − 1 − 4ZN M(x) (B.18) m(x) = 2ZN where (B.19)
M(x) = ZN m (x 2) + π1 Z1 x + b
N−1
x (Zi − ZN ) πi + i
i=2
i−1
νl νi−l
l=1
The derivatives of the functions Zk satisfy interesting algebraic identities, summarized in the following lemma. LEMMA B.20: We have Zk < 1 for all k. Moreover, Ck2
∂Zk ∂Zk =− = Zk2 ∂Ck ∂C¯
These identities allow us to calculate derivatives is an elegant form. Let γ(x) = (1 − 4ZN M(x))−1/2 Differentiating identity (B.18) with respect to C¯ and CN−1 , and using Lemma B.20, we arrive at the following statements: LEMMA B.21: We have (B.20)
N−1 i−1 1 i ∂m(x) 1 + − x Zi (Zi − ZN ) πi + νl νi−l = − + γ(x) 2 2 i=1 ∂C¯ l=1 N−1 i−1 −∂ + xi (Zi − ZN ) νl νi−l ∂C¯ i=2
and (B.21)
N−2 ∂m(x) 2 2 πN−1 + = γ(x)xN−1 ZN−1 νl νN−1−l CN−1 ∂CN−1 l=1
Let now (B.22)
l=1
γ(x) =
∞ j=0
γj xj
INFORMATION PERCOLATION
1555
and let γj = 0 for j < 0. LEMMA B.22: We have γj ≥ 0 for all j. PROOF: By (B.19), the function M(·) has nonnegative Taylor coefficients. Thus, it suffices to show that the function that maps x to (1 − x)−1/2 has nonnegative Taylor coefficients. We have (1 − x)−1/2 = 1 +
∞
xk βk
k=1
with (−0 5)(−0 5 − 1)(−0 5 − 2) · · · (−0 5 − k + 1) k! (0 5)(0 5 + 1)(0 5 + 2) · · · (0 5 + k − 1) > 0 = k!
βk = (−1)k
Therefore, γ(x) = 1 +
∞
βk (4ZN M(x))k
k=1
also has nonnegative Taylor coefficients.
Q.E.D.
Let QN−1 =
N−2
νl νN−1−l + πN−1
l=1
Define also 1 R0 = 2
R1 = Z1 (Z1 − ZN )π1
and i−1 −∂ νl νi−l Ri = (Zi − ZN ) νi + ∂C¯
l=1
Recall that Zi > ZN if and only if Ci > CN and, therefore, Ri ≥ 0 for all i as soon as Ci ≥ CN for all i ≤ N − 1.
1556
D. DUFFIE, S. MALAMUD, AND G. MANSO
LEMMA B.23: We have 2 CN−1
∂νj 2 = ZN−1 QN−1 γj−N+1 ∂CN−1
and ∂νj−i+N−1 ∂νj −2 −1 − Ri γj−i = ZN−1 QN−1 Ri
= ∂CN−1 ∂C¯ N−1
N−1
i=0
i=0
PROOF: Identity (B.20) implies that −∂m(x) j x Ri γj−i = ∂C¯ ∞
N−1
j=1
i=0
On the other hand, by (B.17), −∂m(x) j −∂νj x =
∂C¯ ∂C¯ ∞
j=1
Comparing Taylor coefficients in the above identities, we get the required reQ.E.D. sult. The case of the derivative ∂/∂CN−1 is analogous. LEMMA B.24: We have (R0 + · · · + RN−1 )Z
−2 N−1
Q
−1 N−1
∞ ∞ ∂νj −∂νj ≥ ∂C ∂C¯ N−1 j=k j=k
and R0 γ0 +
k−1 −∂νj j=1
∂C¯
−2 −1 QN−1 ≥ (R0 + · · · + RN−1 )ZN−1
k−1
∂νj
∂CN−1 j=N−1
PROOF: By Lemma B.23, ∞ −∂νj j=k
∂C¯
=
∞ N−1 j=k i=0
Ri γj−i =
N−1 i=0
= (R0 + · · · + RN−1 )Z
∞
Ri
γj ≤
j=k−i −2 N−1
Q
−1 N−1
N−1
Ri
i=0 ∞ ∂νj ∂CN−1 j=k
∞ j=k−N+1
γj
1557
INFORMATION PERCOLATION
and R0 γ0 +
k−1 −∂νj j=1
∂C¯
= R0 γ 0 +
k−1 N−1
Ri γj−i
j=1 i=0
= R0
k−1 i=0
γj +
N−1
k−1−i
Ri
i=1
γi
j=0
k−N+1
≥ (R0 + · · · + RN−1 )
γj
j=0
= (R0 + · · · + RN−1 )Z
−2 N−1
Q
−1 N−1
k−1
∂νj
∂CN−1 j=N−1 Q.E.D.
By definition, R0 = 1/2 and γ0 = 1. Hence, we get the following lemma: LEMMA B.25: Suppose that Ci ≥ CN for all i ≤ N and Ci = CN for all i ≥ N. Then, for all k ≥ N, ∞ k−1 k−1 ∞ ∂νj ∂νj ∂νj −∂νj 1− ≥
∂CN−1 ∂CN−1 ∂C¯ ∂C¯ j=1
j=k
j=N−1
j=k
LEMMA B.26: The function λk defined by λk ((Ci )) =
∞
ξi ((Ci ))
i=k
is monotone increasing in CN−1 for all k ≤ N − 1. PROOF: By Proposition 4.4, λ1 = C¯ =
∞
ξj
j=1
is monotone increasing in CN−1 for each N − 1. By Lemma 4.1 and Propo¯ is monotone decreasing in CN−1 for all sition 4.4, ξj = μ¯ j (C1 Cj−1 C) N − 1 ≥ j. Thus, λk = C¯ −
k−1 j=1
ξj
1558
D. DUFFIE, S. MALAMUD, AND G. MANSO
is monotone increasing in CN−1 .
Q.E.D.
We are now ready to prove Proposition 4.3. PROOF OF PROPOSITION 4.3: It suffices to prove the claim for the case N = M + 1. It is known that μCN dominates μCM is the sense of first-order stochastic dominance if and only if (B.23)
∞
μCN ≥ j
j=k
∞
μCM j
j=k
for any k ≥ 1. The only difference between policies C M and C M+1 is that M+1 M CM = cH > cL = CM
By Lemma B.26, (B.23) holds for any k ≤ M. By Lemmas B.17 and B.25, this is also true for k > M. The proof of Proposition 4.3 is complete. Q.E.D. PROOF ¯ C) to (C
OF
PROPOSITION 4.4: By Lemma B.14, the function f that maps
¯ (Ci i ≥ 1)) = f (C
∞
¯ C1 Ck ) Ck μ¯ k (C
k=1
¯ Therefore, given C, the is monotone increasing in Ci and decreasing in C. unique solution C¯ to the equation (B.24)
¯ (Ci i ≥ 1)) = 0 C¯ − f (C
is monotone increasing in Ci for all i.
Q.E.D.
PROOF OF EXAMPLE 4.5 OF NONMONOTONICITY: Let Cn = 0 for n ≥ 3, as stipulated. We will check the condition ∂ν2 > 0 ∂C1 By the above, we need to check that ∂ν1 −∂νj ∂ν1 ∂ν2 > 1−
¯ ∂C ∂C1 ∂C1 ∂C¯ We have ν1 = π1 Z1
INFORMATION PERCOLATION
1559
and ν2 = (π2 + (π1 Z1 )2 )Z2 Since νi = 0 for i ≥ 2, using the properties of the function Zk , the inequality takes the form (1 + π1 Z12 )C1−2 π1 π1 2Z13 Z2 > C1−2 π1 Z12 π2 Z22 + π1 π1 2Z13 Z2 + (π1 Z1 )2 Z22 Opening the brackets, π1 π1 2Z13 Z2 > π1 Z12 (π2 Z22 + (π1 Z1 )2 Z22 ) Consider the case C1 = C2 in which Z1 = Z2 . Then, the inequality takes the form 2π1 > π2 + (π1 Z1 )2 If π2 > 2π1 , this cannot hold.
Q.E.D.
APPENDIX C: PROOFS FOR SECTION 5: OPTIMALITY PROOF OF LEMMA 5.1: Let φ be any search-effort process, and let
t
θ =− φ t
e−rs K(φs ) ds + e−rt V (Ntφ )
(t < τ)
e−rs K(φs ) ds + e−rτ u(Nτφ )
(t ≥ τ)
0
=−
τ
0
By Itô’s formula and the fact that V solves the HJB equation (6), θφ is a supermartingale, so (C.1)
V (Ni0 ) = θ0φ ≥ U(φ)
For the special case of φ∗t = Γ (Nt ), where N satisfies the stochastic differential equation associated with the specified search-effort policy function Γ , the ∗ HJB equation implies that θ∗ = θφ is actually a martingale. Thus, V (Ni0 ) = θ0∗ = E(θτ∗ ) = U(φ∗ ) It follows from (C.1) that U(φ∗ ) ≥ U(φ) for any control φ.
Q.E.D.
1560
D. DUFFIE, S. MALAMUD, AND G. MANSO
LEMMA C.1: The operator M : ∞ → ∞ , defined by ∞ η u(n) − K(c) c C (C.2) (MV )n = max + Vn+m μm c c C¯ + r + η c C¯ + r + η m=1
is a contraction, satisfying, for candidate value functions V1 and V2 , (C.3)
MV1 − MV2 ∞ ≤
cH C¯ cH C¯ + r + η
V1 − V2 ∞
In addition, M is monotonicity preserving. PROOF: The fact that M preserves monotonicity follows because the pointwise maximum of two monotone increasing functions is also monotone. To prove that M is a contraction that satisfies (C.3), we verify Blackwell’s sufficient conditions (see Stockey and Lucas (1989, Theorem 3.3, p. 54)).7 Clearly, V1 ≤ V2 implies MV1 ≤ MV2 , so M is monotone. Furthermore, for any a ≥ 0, ∞ η u(n) − K(c) c (C.4) M(V + a) = max + (Vn+m + a)μCm c c C¯ + r + η c C¯ + r + η m=1
≤ MV + a
cH C¯ cH C¯ + r + η
and so the discounting condition also holds.
Q.E.D.
The contraction mapping theorem implies the following corollary. COROLLARY C.2: The value function V is the unique fixed point of M and is a monotone increasing function. LEMMA C.3: Let
L = {V ∈ ∞ : Vn − (r + η )−1 η un is monotone decreasing} If u(·) is concave, then the operator M maps L into itself. Consequently, the unique fixed point of M also belongs to L. PROOF: Suppose that V ∈ L. Using the identity u(n) c C¯ + r + η
−
c C¯ u(n) u(n) =− r +η r + η c C¯ + r + η
7 We thank a referee for suggesting a simplification of the proof by invoking Blackwell’s theorem.
INFORMATION PERCOLATION
1561
we get Mc Vn − = =
η un r + η
∞ η un − K(c) η un c − + Vn+m μCm r + η c C¯ + r + η m=1 c C¯ + r + η
−K(c) c + ¯ ¯ cC + r + η c C + r + η ∞ η un+m η un+m − η un × + Vn+m − μCm r + η r + η m=1
Since V ∈ L and u is concave, the sequence ∞ η un+m η un+m − η un Bn = (C.5) + Vn+m − μCm r + η r + η m=1 (C.6)
=
∞
Vn+m μCm −
m=1
¯ un Cη r + η
is monotone decreasing. Since the maximum of decreasing sequences is again decreasing, the sequence (MV )n −
η un η un = max (M V ) − c n c r + η r + η
is decreasing, proving the result.
Q.E.D.
We will need the following auxiliary lemma. LEMMA C.4: If K(c) is a convex differentiable function, then c → K(c) − K (c)c is monotone decreasing for c > 0. PROOF: For any c and b in [cL cH ] with c ≥ b, using first the convexity property that K(c) − K(b) ≤ K (c)(c − b) and then the fact that the derivative of a convex function is an increasing function, we have (K(c) − K (c)c) − (K(b) − K (b)b) ≤ K (c)(c − b) − K (c)c + K (b)b = b(K (b) − K (c)) ≤ 0 the desired result.
Q.E.D.
1562
D. DUFFIE, S. MALAMUD, AND G. MANSO
PROPOSITION C.5: Suppose that the search-cost function K is convex, differentiable, and increasing. Given (μ C), any optimal search-effort policy function Γ (·) is monotone decreasing. If K(c) = κc for some κ > 0, then there is an optimal policy of the trigger form. PROOF: The optimal V solves the alternative Bellman equation ∞ η un − K(c) c C (C.7) Vn = max + Vn+m μm c c C¯ + r + η c C¯ + r + η m=1
We want to solve max f (c) = max c
c
η u(n) + cYn − K(c) c C¯ + r + η
with Yn =
∞
Vn+m μCm
m=1
Then ¯ f (c) = (Yn (r + η ) − η u(n)C)
¯ + C(K(c) − K (c)c) − (r + η )K (c) /(c C¯ + r + η )2
By Lemma C.4, the function K(c) − K (c)c is monotone decreasing and the function −(r + η )K (c) is decreasing because K(·) is convex. Therefore, the function ¯ C(K(c) − K (c)c) − (r + η )K (c) is also monotone decreasing. There are three possibilities. If the unique solution zn to ¯ ¯ C(K(z n ) − K (zn )zn ) − (r + η )K (zn ) + (Yn (r + η ) − η u(n)C) = 0
belongs to the interval [cL cH ], then f (c) is positive for c < zn and is negative for c > zn . Therefore, f (c) attains its global maximum at zn and the optimum is cn = zn . If ¯ ¯ >0 C(K(c) − K (c)c) − (r + η )K (c) + (Yn (r + η ) − η u(n)C) for all c ∈ [cL cH ], then f (c) > 0, so f is increasing and the optimum is c = cH . Finally, if ¯ ¯ <0 C(K(c) − K (c)c) − (r + η )K (c) + (Yn (r + η ) − η u(n)C)
INFORMATION PERCOLATION
1563
for all c ∈ [cL cH ], then f (c) < 0, so f is decreasing and the optimum is c = cL . By (C.5), the sequence ¯ u(n) = (r + η )Bn Yn (r + η ) − Cη is monotone decreasing. The above analysis directly implies that the optimal policy is then also decreasing. If K is linear, it follows from the above discussion that the optimum is cH if ¯ u(n) > 0 and cL if Yn (r + η ) − Cη ¯ u(n) < 0. Thus, we have a Yn (r + η ) − Cη trigger policy. Q.E.D. PROOF OF PROPOSITION 5.3: Using the fact that (r + η )Vn = sup η un − K(c) + c c∈[cL cH ]
∞
(Vn+m − Vn )Cm μm
m=1
which is a concave maximization problem over a convex set, the supremum is achieved at cL if and only if some element of the supergradient of the objective function at cL includes zero or a negative number. (See Rockafellar (1970).) This is the case provided that ∞ (Vn+m − Vn )Cm μm ≤ K (cL ) m=1
where K (cL ). By Lemma C.3, ∞ ∞ (Vn+m − Vn )Cm μm ≤ η (r + η ) (un+m − un )Cm μm m=1
m=1
≤ η (r + η )
∞ (u¯ − un )Cm μm m=1
¯ < K (cL ) for n > N
completing the proof.
Q.E.D.
APPENDIX D: PROOFS FOR SECTION 6: EQUILIBRIUM This appendix contains proofs of the results in Section 6. D.1. Monotonicity of the Value Function in Other Agents’ Trigger Level From the results of Section 5, we can restrict attention to equilibria in the form of a trigger policy C N , with trigger precision level N. For any constant c
1564
D. DUFFIE, S. MALAMUD, AND G. MANSO
in [cL cH ], we define the operator LNc , at any bounded increasing sequence g, by (LNc g)n =
1 c + η + r
2 H
× η un − K(c) + (cH2 − c C¯ N )gn + c
∞
gn+m CmN μNm
m=1
where C¯ N =
∞
CiN μNi
i=1
LEMMA D.1: Given (μN C N ), the value function V N of any given agent solves VnN = max {LNc VnN } c∈[cL cH ]
PROOF: By Corollary C.2, (D.1)
VnN ≥
∞ η un − K(c) c N N + Vn+m μCm c C¯ N + r + η c C¯ N + r + η m=1
for all c ∈ [cL cH ] and the equality is attained for some c ∈ [cL cH ]. Multiplying (D.1) by (c C¯ N + r + η ), adding (cH2 − c C¯ N )VnN to both sides of (D.1), and then dividing (D.1) by cH2 + η + r, we get VnN ≥
1 c + η + r
2 H
× η un − K(c) + (cH2 − c C¯ N )Vn + c
∞
Vn+m CmN μNm
m=1
for all c ∈ [cL cH ] and the equality is attained for some c ∈ {cL cH }. The proof is complete. Q.E.D. LEMMA D.2: The operator LNc is monotone increasing in N. That is, for any increasing sequence g, g ≥ LNc g LN+1 c PROOF: It is enough to show that if f (n) and g(n) are increasing, bounded functions with f (n) ≥ g(n) ≥ 0, then LN+1 f (n) ≥ LNc g(n). For that it suffices c
1565
INFORMATION PERCOLATION
to show that (D.2)
(c − c C¯ N+1 )f (n) + c 2 H
∞
f (n + m)CmN+1 μN+1 m
m=1
≥ (cH2 − c C¯ N )g(n) + c
∞
g(n + m)CmN μNm
m=1
Because f and g are increasing and f ≥ g, inequality (D.2) holds because ∞
N+1 m
C
N+1 m
μ
≥
m=k
∞
CmN μNm
m=k
for all k ≥ 1, based on Proposition 4.3.
Q.E.D.
PROPOSITION D.3: If N ≥ N, then V N (n) ≥ V N (n) for all n. PROOF: Let LN V = sup LNc V c∈[cL cH ]
By Lemmas D.1 and D.2,
V N = LN V N ≥ LN V N Thus, for any c ∈ [cL cH ],
V N (n) ≥ LNc V N (n)
1 η u(n) − K(c) + (cH2 − c C¯ N )V N (n) = 2 cH + η + r ∞ N N N V (n + m)cm μm +c m=1
Multiplying this inequality by cH2 + η + r, adding (C¯ N c − cH2 )V N (n), dividing by c C¯ N + r + η, and maximizing over c, we get
V N ≥ MV N where the M operator is defined in Lemma C.1, corresponding to c N . Since M is monotone, we get
V N ≥ M kV N
1566
D. DUFFIE, S. MALAMUD, AND G. MANSO
for any k ∈ N. By Lemma C.1 and Corollary C.2,
lim M k V N = V N
k→∞
Q.E.D.
and the proof is complete. D.2. Nash Equilibria We define the operator Q : ∞ → ∞ by η u(n) − K(c) ˜ = arg max (QC)(n) ¯ + r + η c Cc +
c
∞
¯ + r + η Cc
m=1
C˜
C˜
V (n + m)μ (m)
where C˜
V (n) = max c
+
η u(n) − K(c) ¯ + r + η Cc c
∞
¯ + r + η Cc
m=1
C˜
C˜
V (n + m)μ (m)
We then define the N (N) ⊂ N as
N (N) = {n ∈ N; C n ∈ Q(C N )} The following proposition is a direct consequence of Proposition 5.4 and the definition of the correspondence N . PROPOSITION D.4: Symmetric pure strategy Nash equilibria of the game are given by trigger policies with trigger precision levels that are the fixed points of the correspondence N . LEMMA D.5: The correspondence N (N) is monotone increasing in N. PROOF: From the Hamilton–Jacobi–Bellman equation (6), ∞ N N N N N (Vn+m − Vn )Cm μm (r + η )Vn = max η un + c −κ + c∈[cL cH ]
m=1
INFORMATION PERCOLATION
1567
Let first cL > 0. Then it is optimal for an agent to choose c = cH if (D.3)
−κ +
∞ N (Vn+m − VnN )CmN μNm > 0
⇔
(r + η )VnN − η un > 0
m=1
the agent is indifferent between choosing cH or cL if (D.4)
−κ +
∞ N (Vn+m − VnN )CmN μNm = 0
⇔
(r + η )VnN − η un = 0
m=1
and the agent will choose cL if the less than inequality holds in (D.3). By Lemma C.3, the set of n for which (r + η )VnN − η un = 0 is either empty or is an interval N1 ≤ n ≤ N2 . Proposition D.3 implies the required monotonicity. Let now cL = 0. Then, by the same reasoning, it is optimal for an agent to choose c = cH if and only if (r + η )VnN − η un > 0. Alternatively, (r + η )VnN − η un = 0. By Lemma C.3, the set of n for which (r + η )VnN − η un > 0 is an interval n < N1 and, hence, (r + η )VnN − η un = 0 for all n ≥ N1 . Consequently, since un is monotone increasing and concave, the sequence Zn := −κ +
∞
N (Vn+m − VnN )CmN μNm
m=1
= −κ +
∞ η (un+m − un )CmN μNm r + η m=1
is decreasing for n ≥ N1 . Therefore, the set of n for which Zn = 0 is either empty or is an interval N1 ≤ n ≤ N2 . Proposition 4.3 implies the required monotonicity. Q.E.D. PROPOSITION D.6: The correspondence N has at least one fixed point. Any ¯ fixed point is less than or equal to N. PROOF: If 0 ∈ N (0), we are done; otherwise, inf{N (0)} ≥ 1. By monotonicity, inf{N (1)} ≥ 1. Again, if 1 ∈ N (1), we are done; otherwise, continue inductively. Since there is only a finite number N¯ of possible outcomes, we must Q.E.D. arrive at some n in N (n). PROOF OF PROPOSITION 6.4: The result follows directly from Proposition D.3 and the definition of equilibrium. Q.E.D. D.3. Equilibria With Minimal Search Intensity Lemma 6.5 is an immediate consequence of the following.
1568
D. DUFFIE, S. MALAMUD, AND G. MANSO
LEMMA D.7: The operator A is a contraction on ∞ with A∞ →∞ ≤
cL2 < 1 r + η + cL2
Furthermore, A preserves positivity, monotonicity, and concavity. PROOF: We have, for any bounded sequence g, (D.5)
|A(g)n | ≤
∞ cL2 cL2 |g |μ ≤ sup gn n+m m r + η + cL2 m=1 r + η + cL2 n
which establishes the first claim. Furthermore, if g is increasing, we have gn1 +m ≥ gn2 +m for n1 ≥ n2 and for any m. Summing up these inequalities, we get the required monotonicity. Preservation of concavity is proved similarly. Q.E.D. PROOF OF THEOREM 6.6: It suffices to show that, taking the minimal-search (μ0 C 0 ) behavior as given, the minimal-effort search effort cL achieves the supremum defined by the Hamilton–Jacobi–Bellman equation at each precision n if and only if K (cL ) ≥ B, where B is defined by (8). By the previous lemma, V (n) is concave in n. Thus, the sequence ∞ (Vn+m − Vn )μ0m m=1
is monotone decreasing with n. Therefore, ∞ ∞ (V1+m − V1 )μ0m = max (Vn+m − Vn )μ0m n
m=1
m=1
We need to show that the objective function, mapping c to −K(c) + ccL
∞ (Vn+m − Vn )μ0m m=1
achieves its maximum at c = cL . Because K is convex, this objective function is decreasing on [cL cH ] and the claim follows. Q.E.D. The following lemma follows directly from the proof of Proposition D.6. LEMMA D.8: If C N ∈ Q(C 1 ) for some N ≥ 1, then the correspondence N has a fixed point n ≥ 1.
1569
INFORMATION PERCOLATION
PROOF OF PROPOSITION 6.7: By Lemma D.8, it suffices to show that Q(C 1 ) = C 0 . Suppose on the contrary that Q(C 1 ) = C 0 . Then the value function is simply Vn1 =
η u(n)
r + η
It follows from (D.3) that C1 = 0 is optimal if and only if η (u(2) − u(1))cH μ11 1 = (Vn+m − Vn1 )Cm1 μ1m < κ r + η m=1 ∞
By (D.3), we will have the inequality Vn1 ≥ Vn0 for all n ≥ 2 and a strict inequality Q.E.D. V11 > V10 , which is a contradiction. The proof is complete. APPENDIX E: PROOFS OF RESULTS IN SECTION 7: POLICY INTERVENTIONS PROOF OF LEMMA 7.1: By construction, the value function V δ associated with proportional search subsidy rate δ is monotone increasing in δ. By (D.3), the optimal trigger N is that n at which the sequence (r + η )Vnδ − η u(n) crosses zero. Hence, the optimal trigger is also monotone increasing in δ. Q.E.D. PROOF OF LEMMA 7.3: Letting VnM denote the value associated with n private signals and M public signals, we define ZnM = VnM −
η un+M
r + η
We can rewrite the HJB equation for the agent, educated with M signals at birth, in the form ∞ C (Vn+mM − VnM )μm (r + η )ZnM = sup c −κ + c∈[cL cH ]
m=1
= sup c WnM − κ + c∈[cL cH ]
∞ m=1
where WnM =
∞ η (un+M+m − un+M )μCm r + η m=1
(Zn+mM − ZnM )μCm
1570
D. DUFFIE, S. MALAMUD, AND G. MANSO
The quantity WnM is monotone decreasing with M and n by the concavity of u(·). Equivalently, ∞ c C (E.1) WnM − κ + ZnM = sup Zn+mM μm ¯ + r + η c∈[cL cH ] Cc m=1
Treating the right-hand side as the image of an operator at Z, this operator is a contraction and, by Lemma C.2, ZnM is its unique fixed point. Since WnM decreases with M, so does ZnM . By Lemma C.3, ZnM is also monotone decreasing with n. Hence, an optimal trigger policy, attaining the supremum in (E.1), is an n at which the sequence WnM − κ +
∞
Zn+mM μCm
m=1
crosses zero. Because both WnM and ZnM are decreasing with M, the trigger is also decreasing with M. Q.E.D. PROOF OF PROPOSITION 7.4: Suppose that C N is an equilibrium with M public signals, which we express as C N ∈ QM (C N ) By Lemma 7.3, we have C N1 ∈ QM−1 (C N ) for some N1 ≥ N. It follows from the algorithm at the end of Section 6 that there exists some N ≥ N with N ∈ QM−1 (C N ). The proof is completed by induction in M. Q.E.D. APPENDIX F: PROOFS OF COMPARATIVE STATICS This appendix provides proofs of the comparative statics of Proposition 8.1. PROOF OF PROPOSITION 8.1: We first study the effect of a shift in the exit intensity η . Given the market conditions (C μ), let V (η ) be the value function corresponding to intensity η and define η un ˜
Zn (η ) = (r + η ) Vn (η ) − r + η Then the argument used in the proof of Lemma 7.3 implies that ∞ 1 C (F.1) (Z˜ n+m − Z˜ n )μm Z˜ n (η ) = sup c Wn (η ) − κ + r + η m=1 c∈[cL cH ]
1571
INFORMATION PERCOLATION
where ∞ η (un+m − un )μCm Wn (η ) = r + η m=1
Since un is monotone increasing, Wn is increasing in η . By Lemma C.3, Z˜ n is monotone decreasing in n, so Z˜ n+m − Z˜ n is negative. Let η1 < η . Then, for any c ∈ [cL cH ],
(F.2)
∞ 1 ˜ Z˜ n (η ) = sup c Wn (η ) − κ + (Zn+m (η ) − Z˜ n (η ))μCm r + η m=1 c∈[cL cH ] ∞ 1 ≥ c Wn (η1 ) − κ + (Z˜ n+m (η ) − Z˜ n (η ))μCm r + η1 m=1
˜ η by Define the operator M 1 ˜ η g](n) = sup [M 1 c
c r + η1 + c C¯
1
1
(Wn (η ) − κ)(r + η ) +
∞
C m
gn+m μ
m=1
Then, multiplying (F.2) by r + η1 , adding c C¯ Z˜ n to both sides, dividing by r + ¯ and taking the supremum over c, we get that η1 + c C, ˜ η (Z(η )) ˜ ) ≥ M Z(η 1 Since Mη1 is clearly monotone increasing, by iterating we get, for any k ≥ 0, ˜ η )k (Z(η )) ˜ ) ≥ (M Z(η 1 ˜ η is a contraction and By the argument used in the proof of Lemma C.1, M 1 Z(η1 ) is its unique fixed point. Therefore, Z(η ) ≥ limk→∞ (Mη1 )k (Z(η )) = Z(η1 ). That is, Z is monotone increasing in η . Let N η be the optimal-trigger correspondence associated with exit rate η . Let also C = C N and μ = μN . It follows from the proof of Lemma D.5 that N η (N) is an interval [n1 n2 ] with the property that Zn (η ) is positive for n < n1 and is negative for n > n2 . Since Z(η ) is monotone increasing in η , so are both n1 and n2 . Monotonicity of the correspondence N combined with the same argument as in the proof of Proposition D.6 imply the required comparative static in η .
1572
D. DUFFIE, S. MALAMUD, AND G. MANSO
As for the effect of shifting the discount rate r, we rewrite the Bellman equation (6) in the form ∞ 1 (V˜n+m (r) − V˜n (r))Cm μm V˜n (r) = max η un + c −κ + c∈[cL cH ] r + η m=1 where V˜n (r) = (r + η )Vn (r). Let r1 > r. Then, since V˜n is increasing in n, we get ∞ 1 (F.3) V˜n (r) = max η un + c −κ + (V˜n+m (r) − V˜n (r))Cm μm c∈[cL cH ] r + η m=1 ∞ 1 ˜ ≥ max η un + c −κ + (Vn+m (r) − V˜n (r))Cm μm c∈[cL cH ] r1 + η m=1 An argument analogous to that used in the proof of the comparative static for η implies that V˜ (r) ≥ V˜ (r1 ). That is, V˜ (r) is decreasing in r. Therefore, Z˜ n = V˜n − η un also decreases in r and the claim follows in just the same way as it did for η . Finally, to study the impact of shifting the information-quality parameter ρ2 , we note the impact of ρ2 on the exit utility un = −
1 − ρ2
1 + ρ2 (n − 1)
Using (F.1) and following the same argument as for the parameter η , we see that it suffices to show that Wn (ρ) is monotone increasing in ρ2 . To this end, it suffices to show that un+1 (ρ) − un (ρ) is increasing in ρ2 . A direct calculation shows that the function kn (x) = −
1−x 1−x + 1 + xn 1 + x(n − 1)
is monotone decreasing for 1 − n/(n + 1) x > bn =
n n/(n + 1) − n + 1 √ It is not difficult to see that bn < b1 = 2 − 1, and the claim follows.
Q.E.D.
REFERENCES AMADOR, M., AND P.-O. WEILL (2008): “Learning From Private and Public Observations of Others’ Actions,” Working Paper, Stanford University. [1515]
INFORMATION PERCOLATION
1573
BANERJEE, A. (1992): “A Simple Model of Herd Behavior,” Quarterly Journal of Economics, 107, 797–817. [1515] BANERJEE, A., AND D. FUDENBERG (2004): “Word-of-Mouth Learning,” Games and Economic Behavior, 46, 1–22. [1515] BIKHCHANDANI, S., D. HIRSHLEIFER, AND I. WELCH (1992): “A Theory of Fads, Fashion, Custom, and Cultural Change as Informational Cascades,” Journal of Political Economy, 100, 992–1026. [1515] BLOUIN, M., AND R. SERRANO (2001): “A Decentralized Market With Common Value Uncertainty: Non-Steady States,” Review of Economic Studies, 68, 323–346. [1515] BURGUET, R., AND X. VIVES (2000): “Social Learning and Costly Information Acquisition,” Economic Theory, 15, 185–205. [1515] DIEUDONNÉ, J. (1960): Foundations of Modern Analysis. Pure Applied Mathematics, Vol. X. New York: Academic Press. [1539,1540] DUFFIE, D., AND G. MANSO (2007): “Information Percolation in Large Markets,” American Economic Review, Papers and Proceedings, 97, 203–209. [1516] DUFFIE, D., AND Y. SUN (2007): “Existence of Independent Random Matching,” Annals of Applied Probability, 17, 386–419. [1517,1518] DUFFIE, D., N. GÂRLEANU, AND L. PEDERSEN (2005): “Over-the-Counter Markets,” Econometrica, 73, 1815–1847. [1516] DUFFIE, D., G. GIROUX, AND G. MANSO (2009): “Information Percolation,” American Economic Journal: Microeconomics (forthcoming). [1516] GALE, D. (1987): “Limit Theorems for Markets With Sequential Bargaining,” Journal of Economic Theory, 43, 20–54. [1516] GROSSMAN, S. (1981): “An Introduction to the Theory of Rational Expectations Under Asymmetric Information,” Review of Economic Studies, 4, 541–559. [1515] HARTMAN, P. (1982): Ordinary Differential Equations (Second Ed.). Boston: Birkhaüser. [1542, 1547] KIYOTAKI, N., AND R. WRIGHT (1993): “A Search-Theoretic Approach to Monetary Economics,” American Economic Review, 83, 63–77. [1516] LAGOS, R., AND G. ROCHETEAU (2009): “Liquidity in Asset Markets With Search Frictions,” Econometrica, 77, 403–426. [1516] MILGROM, P. (1981): “Rational Expectations, Information Acquisition, and Competitive Bidding,” Econometrica, 50, 1089–1122. [1515] MORTENSEN, D. (1986): “Job Search and Labor Market Analysis,” in Handbook of Labor Economics, ed. by O. Ashenfelter and R. Layard. Amsterdam: Elsevier. [1516] PESENDORFER, W., AND J. SWINKELS (1997): “The Loser’s Curse and Information Aggregation in Common Value Auctions,” Econometrica, 65, 1247–1281. [1515] PISSARIDES, C. (1985): “Short-Run Equilibrium Dynamics of Unemployment Vacancies, and Real Wages,” American Economic Review, 75, 676–690. [1516] PROTTER, P. (2005): Stochastic Integration and Differential Equations. Stochastic Modelling and Applied Probabiblity, Vol. 21. Berlin: Springer-Verlag. [1525] RENY, P., AND M. PERRY (2006): “Toward a Strategic Foundation for Rational Expectations Equilibrium,” Econometrica, 74, 1231–1269. [1515] ROCKAFELLAR, R. T. (1970): Convex Analysis. Princeton Mathematical Series, Vol. 28. Princeton, NJ: Princeton University Press. [1563] RUBINSTEIN, A., AND A. WOLINSKY (1985): “Equilibrium in a Market With Sequential Bargaining,” Econometrica, 53, 1133–1150. [1516] STOCKEY, N. L., and R. E. LUCAS (1989): Recursive Methods in Economic Dynamics. CAMBRIDGE: HARVARD UNIVERSITY PRESS. [1560] SUN, Y. (2006): “The Exact Law of Large Numbers via Fubini Extension and Characterization of Insurable Risks,” Journal of Economic Theory, 126, 31–69. [1516] TREJOS, A., AND R. WRIGHT (1995): “Search, Bargaining, Money, and Prices,” Journal of Political Economy, 103, 118–141. [1516]
1574
D. DUFFIE, S. MALAMUD, AND G. MANSO
VIVES, X. (1993): “How Fast do Rational Agents Learn,” Review of Economic Studies, 60, 329–347. [1515] WEILL, P.-O. (2008): “Liquidity Premia in Dynamic Bargaining Markets,” Journal of Economic Theory, 140, 66–96. [1516] WILSON, R. (1977): “Incentive Efficiency of Double Auctions,” The Review of Economic Studies, 44, 511–518. [1515] WOLINSKY, A. (1990): “Information Revelation in a Market With Pairwise Meetings,” Econometrica, 58, 1–23. [1515] YEH, J. (2006): Real Analysis: Theory of Measure and Integration (Second Ed.). Singapore: World Scientific. [1537]
Graduate School of Business, Stanford University, Stanford, CA 94305-5015, U.S.A. and NBER;
[email protected], Ecole Polytechnique Fédérale de Lausanne, Laussane, Switzerland and Swiss Finance Institute;
[email protected], and Sloan School of Business, Massachusetts Institute of Technology, Cambridge, MA 02142, U.S.A.;
[email protected]. Manuscript received September, 2008; final revision received April, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1575–1605
AMBIGUITY AND SECOND-ORDER BELIEF BY KYOUNGWON SEO1 Anscombe and Aumann (1963) wrote a classic characterization of subjective expected utility theory. This paper employs the same domain for preference and a closely related (but weaker) set of axioms to characterize preferences that use second-order beliefs (beliefs over probability measures). Such preferences are of interest because they accommodate Ellsberg-type behavior. KEYWORDS: Ambiguity, Ellsberg paradox, second-order belief.
1. INTRODUCTION THE ELLSBERG (1961) PARADOX has raised questions about the subjective expected utility model and has stimulated development of a number of more general theories. In one version of the paradox (Ellsberg (2001, p. 151)), there is an urn known to contain 200 balls of four colors: RI , BI , RII , and BII . RI and RII denote two different shades of red; similarly, BI and BII denote two different shades of blue. The urn is known to contain 50 RII balls and 50 BII balls, but the number of RI (or BI ) balls is unknown. One ball is to be drawn from the urn. Consider the following six bets on the color of the ball that is drawn: 100 A B C D AB CD
RI $100 $0 $0 $0 $100 $0
BI $0 $100 $0 $0 $100 $0
50
50
RII $0 $0 $100 $0 $0 $100
BII $0 $0 $0 $100 $0 $100
Bet A gives $100 if the drawn ball is RI and $0 otherwise. The other bets are interpreted similarly. Many subjects rank C ∼ D A ∼ B and AB ∼ CD. Subjective expected utility (SEU) cannot accommodate this behavior. 1 I am indebted to Larry G. Epstein for his illuminating guidance and invaluable advice. I thank D. Ahn, P. Barelli, P. Ghirardato, F. Gul, I. Kopylov, B. Lipman, M. Marinacci, K. Nehring, B. Polak, and P. Wakker for helpful discussions and suggestions. I also thank audiences at the University of Rochester, Boston University, the University of Chicago, Washington University in St. Louis, Rice University, Yale University, Northwestern University, the Canadian Economic Theory Conference, the RUD’06 Workshop on Risk, Utility and Decision, and the 2006 NBER/NSF/CEME Conference on General Equilibrium. Finally, the detailed comments of the co-editor and the referees were extremely valuable.
© 2009 The Econometric Society
DOI: 10.3982/ECTA6727
1576
KYOUNGWON SEO
One explanation of this behavior is that the agent has in mind a second-order belief or a probability measure on probability measures. The agent forms a belief on the proportion of the RI balls or the type of the urn, and then translates bets into two-stage lotteries. Segal (1987) adopted this approach to accommodate the above ranking. He used anticipated utility theory (see Quiggin (1982), for example). However, he was well aware that “modelling the Ellsberg paradox as a two-stage lottery does not depend on anticipated utility theory,” but on nonreduction of two-stage lotteries. Klibanoff, Marinacci, and Mukerji (2005) (henceforth KMM) proposed and axiomatized a utility representation of preference involving the standard expectations of utilities, where nonreduction of a second-order belief is key to accommodating the Ellsberg paradox. Nau (2006) and Ergin and Gul (2009) characterized utility representations that, at least in special cases of their models, can be interpreted similarly to that of KMM. This paper provides a new axiomatization for a model of preference involving a second-order belief. An important difference from the cited models lies in what is assumed about the domain of preference. Models of preference typically model the ranking not only of bets, but also of all other acts—an act over a state space S is a (measurable) function from S into the set of outcomes. In the Ellsberg case, the natural state space is SE = {RI BI RII BII } and bets are binary acts over SE . I now use the Ellsberg setting to highlight the noted difference in assumptions about the domain. KMM assumed two subdomains and two corresponding preferences. One subdomain consists of acts on SE and a preference is given over this set of acts. For the other subdomain, they introduced another state space Δ(SE ), the set of all probability measures over SE . Each probability measure over SE corresponds to a particular number of RI balls in the Ellsberg urn. KMM called an act over Δ(SE ) a second-order act. They assumed that the preference over second-order acts is an SEU preference, which leads immediately to secondorder belief.2 Ergin and Gul permitted issue preference, called source dependence in Nau.3 They assumed two issues and their state space is a product space. In the Ellsberg context, one issue is which ball is drawn and the other issue is what color is each ball. The second issue determines the type of the urn and hence a probability measure over SE . Given preference on acts over the product state space they proved a representation involving a first-order belief over the product state space that can be interpreted as a second-order belief over the first issue. 2
See Section 6 for further discussion of the relation between the KMM representation result and the main theorem in this paper. 3 Nau allowed state dependence.
AMBIGUITY AND SECOND-ORDER BELIEF
1577
Therefore, KMM, Nau, and Ergin and Gul assume state spaces bigger than SE . They presumed that the analyst can observe more than just the ranking of acts over the color of the drawn ball—the ranking of acts over the “type of the urn” must also be observable. Similar remarks apply to their model in general (not only Ellsbergian) settings. The importance of the domain assumption can be illustrated in the context of an asset market. Consider a simple model where the asset price may go up (H) or go down (L). In this setting, a bet on H corresponds to buying the asset and a bet on L corresponds to selling the asset—decisions that are observed in many data sets. On the other hand, a second-order act (or a bet on the second issue) is a bet on the true nature of the market—the probability that the price goes up. But we do not observe bets on the true probability; that is, the payoffs of real-world securities depend on realizations of prices, and not separately on the mechanism that generates these realizations. This paper adopts a domain consisting of lotteries over acts defined over a basic state space, which is SE in the Ellsberg case. Arguably, this domain is closer to the set of choices involved in the Ellsberg paradox than are the domains of KMM, and Ergin and Gul. In addition, the domain in this paper is the same as that in Anscombe and Aumann (1963), one of the classic papers on SEU. Frequently, “the Anscombe–Aumann domain” is taken to be the set of all acts whose prizes are lotteries (see Kreps (1988), for example). Note, however, that in their paper, Anscombe–Aumann used the set of all lotteries over such acts. Thus, the choice objects in this paper and Anscombe–Aumann have three stages. The model in this paper, referred to as second-order subjective expected utility (SOSEU), has the representation4 V (P) = U(f ) dP(f ) and U(f ) = v u(f ) dμ dm(μ) Δ(S)
S
where P is a lottery over acts, f is an act, and m is a second-order belief (a probability measure on Δ(S)). Degenerate lotteries can be identified with acts and thus V induces the utility function U over acts. When v is linear, V collapses to Anscombe–Aumann’s SEU. SOSEU has axiomatic foundations different from SEU. In their characterization of SEU, Anscombe and Aumann assumed order, continuity, independence, reversal of order, and dominance. I drop reversal of order and modify dominance to characterize SOSEU. The functions u and v in SOSEU are cardinally unique. Thus it is possible to discuss connections between the properties of the functions and preference. In particular, v captures ambiguity attitude: if v is concave, then preference exhibits Ellsbergian behavior. As usual, u characterizes attitude toward risk. 4
Technical details are provided later.
1578
KYOUNGWON SEO
The interpretations of u and v are considered in detail in KMM; Nau, and Ergin and Gul have related results and discussions. The domain in this paper makes it possible to analyze attitudes toward ambiguity and two-stage lotteries at the same time. Specifically, SOSEU has the property that if the agent reduces two-stage lotteries to one-stage lotteries in the usual way, then he does not exhibit Ellsberg-type behavior. This prediction is confirmed in the experiment by Halevy (2007). He claimed that a descriptive theory of ambiguity aversion “should account, at the same time, for violation of reduction of compound objective lotteries.” The paper is organized as follows: Section 2 introduces the setup. In Section 3, Anscombe and Aumann’s axioms and theorem are presented. Section 4 motivates dropping their axiom reversal of order, and modifying dominance. This leads to the SOSEU representation theorem. Section 5 examines the connection between nonindifference to ambiguity and violation of reduction of two-stage lotteries. Section 6 discusses related literature. Proofs are contained in the Appendices. 2. THE SETUP For any topological space X, let Δ(X) be the set of all Borel probability measures on X, and let Cb (X) be the set of all bounded continuous functionals on X. Endow Δ(X) with the weak convergence topology, that is, for νn ν ∈ Δ(X), νn → ν if η dνn → η dν for every η ∈ Cb (X). If X is a separable metric space, so is Δ(X). (See Aliprantis and Border (1999, p. 482); these authors are henceforth denoted AB.) Let BX denote the Borel σ-algebra on X and denote by δx ∈ Δ(X) a point mass on X, defined by δx (A) = 0 if x ∈ / A and by δx (A) = 1 if x ∈ A. Let S = {s1 s2 s|S| } be a finite set of states. Let Z denote a set of outcomes or prizes, where Z is a separable metric space. An act f is a function from S into Δ(Z). Let H be the set of all acts endowed with the product topology. Preference is defined on Δ(H). I refer to an element in Δ(Z) as a one-stage lottery and refer to an element in Δ(Δ(Z)) as a two-stage lottery (or a compound lottery). A constant act (an act taking the same value for every s ∈ S) is viewed also as a one-stage lottery. Moreover, any act f is identified with δf . Then it is immediate that Δ(Z) ⊂ H ⊂ Δ(H) and hence Δ(Δ(Z)) ⊂ Δ(H). Therefore, the preference induces rankings on H, Δ(Z), and Δ(Δ(Z)). Typical elements in Δ(H) are denoted by P, Q, and R. I use f , g, and h ¯ and R¯ are typical elements for Δ(Δ(Z)), ¯ Q, for elements in H. In addition, P, and p, q, and r are typical elements for Δ(Z). Denote by (x1 α1 ; ; xn αn ) a lottery that gives x1 with probability α1 and so on, where x1 x2 xn can be outcomes, lotteries, or acts. A typical object P in Δ(H) is depicted in Figure 1.
AMBIGUITY AND SECOND-ORDER BELIEF
1579
FIGURE 1.—A typical element in Δ(H). The first and the last nodes are governed by objective probabilities α a1 a2 b1 , and b2 . The second node is selected according to the realized state s1 or s2 .
3. THE ANSCOMBE–AUMANN MODEL Preferences having an SEU form on Δ(H) were characterized by Anscombe and Aumann (1963) (henceforth AA). Using the notations and definitions of this paper, AA’s axioms and theorem can be restated.5 AXIOM 1—Order: is complete and transitive. AXIOM 2—Continuity: is continuous. DEFINITION 1: For f g ∈ H and α ∈ [0 1], αf ⊕(1 −α)g ∈ H is a componentwise mixture, that is, for every s ∈ S and every B ∈ BZ , (αf ⊕ (1 − α)g)(s)(B) = αf (s)(B) + (1 − α)g(s)(B). This operation is referred to as a second-stage mixture. AXIOM 3—Second-Stage Independence: For any α ∈ (0 1] and one-stage lotteries p q r ∈ Δ(Z), αp ⊕ (1 − α)r αq ⊕ (1 − α)r
⇐⇒
p q
Consider two lotteries αp ⊕ (1 − α)r and αq ⊕ (1 − α)r. Both give the same prize r with probability (1 − α). The two lotteries differ only in the αprobability event. So it is intuitive that the agent’s ranking between them depends only on the ranking between p and q, regardless of the common prize r. 5 Actually, they do not state the first four axioms—order, continuity, second-stage independence, and first-stage independence. Instead, they assume expected utility functions on Δ(H) and Δ(Z), respectively.
1580
KYOUNGWON SEO
FIGURE 2.—Examples of mixture operations: f ∈ H gives $100 if s1 is realized and $0 if s2 is realized; g ∈ H gives $0 for s1 and $100 for s2 . The second-stage mixture αf ⊕ (1 − α)g ∈ H is an act that gives the lottery ($100 α; $0 1 − α) for s1 and the lottery ($0 α; $100 1 − α) for s2 . The first-stage mixture αf + (1 − α)g ∈ Δ(H) is the lottery (f α; g 1 − α).
DEFINITION 2: For P Q ∈ Δ(H) and α ∈ [0 1], αP + (1 − α)Q ∈ Δ(H) is a lottery such that (αP + (1 − α)Q)(B) = αP(B) + (1 − α)Q(B) for B ∈ BH . This operation is called a first-stage mixture. For simplicity, I write αf + (1 − α)g instead of αδf + (1 − α)δg for any acts f and g. See Figure 2 for examples illustrating the mixture operations. AXIOM 4 —(First-Stage Independence): For any α ∈ (0 1] and lotteries P Q R ∈ Δ(H), αP + (1 − α)R αQ + (1 − α)R
⇐⇒
P Q
First-stage independence can be interpreted in a way similar to second-stage independence. AXIOM 5—Reversal of Order: For every f g ∈ H and α ∈ [0 1], αf ⊕ (1 − α)g ∼ αf + (1 − α)g. Reversal of order assumes that the agent is not concerned about whether the mixture operation is taken before or after the realization of the state. Later, I will discuss an argument against this axiom. AXIOM 6—AA Dominance: Let f g ∈ H and s ∈ S. If f (s ) = g(s ) for all s = s and f (s) g(s), then f g.
AMBIGUITY AND SECOND-ORDER BELIEF
1581
This axiom says that when two acts give the identical prizes except in one state s, the prizes in state s determine the agent’s ranking between the two acts. DEFINITION 3: An SEU representation is a bounded continuous mixture linear function u : Δ(Z) → R and a probability measure μ ∈ Δ(S) such that V AA represents on Δ(H), where V AA (P) = U AA (f ) dP(f ) and U AA (f ) = u(f ) dμ H
AA’s theorem can be restated.
S 6
THEOREM 3.1—AA (1963): Preference on Δ(H) satisfies order, continuity, second-stage independence, first-stage independence, reversal of order, and AA dominance if and only if it has an SEU representation. An SEU representation cannot accommodate Ellsberg-type behavior. Therefore, I proceed to develop a generalization of this model. 4. MAIN REPRESENTATION THEOREM Here I show that by dropping reversal of order and modifying AA dominance, one obtains a model of preference that can accommodate nonindifference to ambiguity. Consider the following example that illustrates that reversal of order is problematic given ambiguity. In the Ellsberg example described in the Introduction, let f be the act that gives $100 if the chosen ball is red (RI or RII ) and nothing otherwise; g gives $100 if the ball drawn is blue (BI or BII ) and nothing otherwise. Let p be ($100 1/2; $0 1/2). As Ellsberg predicted and later experiments confirmed, many people feel indifferent between f and g, but strictly prefer p to f and p to g. Compare 12 f + 12 g and 12 f ⊕ 12 g (see Figure 3). The first-stage mixture 12 f + 12 g gives ambiguous acts f or g. If the agent strictly prefers p to f and p to g, it is reasonable to assume that he strictly prefers p to 12 f + 12 g by the intuition of first-stage independence. On the other hand, the second-stage mixture 12 f ⊕ 12 g has no ambiguity and can be identified with p because it yields the lottery p whichever state is realized. Therefore, the agent will strictly prefer 12 f ⊕ 12 g to 1 f + 12 g. Under reversal of order, 12 f ⊕ 12 g and 12 f + 12 g must be indifferent. 2 This illustrates the intuition against adopting reversal of order.7 6 Under reversal of order, one of the two independence axioms is redundant. I leave both of them for comparison with the next section. 7 The preceding intuition translates to the present setting Gilboa and Schmeidler’s (1989) rationale for their axiom “uncertainty aversion,” namely, that “hedging” across ambiguous states can increase utility.
1582
KYOUNGWON SEO
FIGURE 3.—One ball is randomly drawn from the Ellsberg urn which contains 200 balls that are either red or blue. The exact number of red (or blue) balls is unknown. An act f is a bet on red and act g is a bet on blue. The second-stage mixture 12 f ⊕ 12 g is unambiguous, but the first-stage mixture is not.
However, one may think in a different way. For any number of blue balls in the urn, the final probability of getting $100 is 1/2 not only for 12 f ⊕ 12 g, but also for 12 f + 12 g. Hence the agent may be indifferent between 12 f ⊕ 12 g and 1 f + 12 g while preferring 12 f ⊕ 12 g to f and g. Implicit in this argument is that 2 1 f + 12 g becomes a two-stage lottery when the number of blue balls is given 2 and that the agent reduces the two-stage lottery to a one-stage lottery.8 The preceding argument supporting reversal of order is normatively appealing, but Halevy (2007) reported that most people who reduce compound lotteries are ambiguity neutral (see the next section). Since the argument to maintain reversal of order requires reduction, it may not be acceptable at a descriptive level. In this paper, I drop reversal of order and suggest a descriptive model to explain Ellsberg-type behavior. Recall that AA dominance deals only with H, not with Δ(H). Under reversal of order, stating properties on H is enough to describe properties on Δ(H). Since I drop reversal of order, AA dominance must be modified. Each f ∈ H and μ ∈ Δ(S) induces a one-stage lottery, namely Ψ (f μ) ≡ μ(s1 )f (s1 ) ⊕ μ(s2 )f (s2 ) ⊕ · · · ⊕ μ(s|S| )f (s|S| ) ∈ Δ(Z); that is, Ψ (f μ) is the onestage lottery, or constant act, obtained by “reducing” the act f using the prob8
See Ellsberg (2001, p. 230) for a similar argument by Pratt and Raiffa.
AMBIGUITY AND SECOND-ORDER BELIEF
1583
FIGURE 4.—If μ is assumed to be the true probability law, the decision maker translates P to Ψ (P μ).
ability law μ. For P ∈ Δ(H), define the two-stage lottery Ψ (P μ) ∈ Δ(Δ(Z)) by Ψ (P μ)(B) = P({f ∈ H : Ψ (f μ) ∈ B}) for each B ∈ BH . See Figure 4 for an example of Ψ (P μ), and recall that the preference on Δ(H) directly induces preferences over two-stage lotteries (including the object Ψ (P μ)) by its restriction to lotteries over constant acts. AXIOM 7—Dominance: For any P Q ∈ Δ(H) if Ψ (P μ) Ψ (Q μ) for all μ ∈ Δ(S), then P Q. To interpret dominance, consider an agent who is not certain of the true probability law over states, but who believes that there is a true law. Now suppose that Ψ (P μ) Ψ (Q μ) for every μ ∈ Δ(S), that is, for every probability law, he prefers the two-stage lottery induced by P to the one induced by Q. Then he must prefer P to Q. There is an implicit assumption behind this interpretation. When the probability law is given, one may interpret the choice object P ∈ Δ(H) as a threestage lottery which is out of the domain. Dominance implicitly assumes that the agent reduces stages two and three of the three-stage lottery: the two-stage lottery Ψ (P μ) is formed by reducing the acts in the support of P to constant acts using the probability law μ. See the end of this section for more discussion of dominance and for an example illustrating that dominance excludes some interesting preferences. It is instructive to compare dominance with AA dominance. Since the latter deals only with acts in H, extend AA dominance to lotteries over acts. Then the extended AA dominance states that P Q if Ψ (P μ) Ψ (Q μ) for all Dirac measures μ = δs , s ∈ S. Dominance posits the stronger hypothesis that
1584
KYOUNGWON SEO
Ψ(P,μ) ≿ Ψ(Q,μ) for all measures μ in Δ(S). Thus dominance is a weaker axiom than the extended AA dominance.

It is also instructive to observe that the extended AA dominance implies Kreps' reversal-of-order-style axiom (Kreps (1988, p. 107)). It states that all lotteries over Savage acts (prizes of the acts are elements in Z) that, for each state s, map naturally to the same lottery over outcomes, must be indifferent.9 The Kreps axiom is essential to SEU. As previously mentioned, dominance is a weaker axiom than the extended AA dominance, thereby allowing for a more general representation than SEU. A more complete and formal comparison of dominance and AA dominance is provided in the next lemma.

LEMMA 4.1: (i) Order, continuity, reversal of order, and AA dominance imply dominance. (ii) Dominance and second-stage independence imply AA dominance.

The main utility representation is defined as follows.

DEFINITION 4: A second-order subjective expected utility (SOSEU) representation is a probability measure m ∈ Δ(Δ(S)), a bounded continuous mixture linear function u : Δ(Z) → R, and a bounded continuous and strictly increasing function v : u(Δ(Z)) → R such that V represents ≿ on Δ(H), where
$$\text{(4.1)}\qquad V(P)=\int_H U(f)\,dP(f),\qquad U(f)=\int_{\Delta(S)} v\!\left(\int_S u(f)\,d\mu\right) dm(\mu).$$
The probability measure m is called a second-order belief.

SOSEU can accommodate nonindifference to ambiguity. When the second-order belief m is nondegenerate and v is nonlinear, the implied behavior cannot be explained by a unique (subjective) probability on S. Instead the agent behaves as though he has multiple priors on S and assigns a probability to each prior. SEU is the special case in which v is linear.10 The new representation theorem follows.

9 To see that the extended AA dominance implies the Kreps axiom, consider two lotteries P and Q over Savage acts. Assume that, for each s, the two lotteries induce the same lottery over Z. Then Ψ(P,δ_s) = Ψ(Q,δ_s) for every s ∈ S and the extended AA dominance implies that P is indifferent to Q.
10 The functional form of an SOSEU representation is similar to that of KMM (2005). Many properties of the functional form are investigated in their paper.
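Before turning to the representation theorem, here is a minimal numerical sketch of an SOSEU evaluation. The primitives are hypothetical choices, not taken from the paper: two states, a bet scaled so that u equals its winning probability, a two-point second-order belief m, and a concave v(x) = √x.

```python
# A minimal SOSEU evaluation sketch; v, m, and the acts are hypothetical choices.
import numpy as np

v = np.sqrt                    # concave v: nonlinear, so ambiguity matters
m = {0.25: 0.5, 0.75: 0.5}     # second-order belief over mu(s1)

def soseu(act):
    """U(f) = sum over priors mu of m(mu) * v( expected u of the act under mu )."""
    return sum(p * v(mu * act[0] + (1 - mu) * act[1]) for mu, p in m.items())

bet = (1.0, 0.0)               # act paying u = 1 in state s1, u = 0 in state s2
print(soseu(bet))              # 0.5*v(0.25) + 0.5*v(0.75) ≈ 0.683
print(v(0.5))                  # an objective 50-50 one-stage lottery: ≈ 0.707
```

With a nondegenerate m and a strictly concave v, the ambiguous bet is valued below the objective 50–50 lottery even though the priors average to ½ under m; no single prior on S can rationalize this pattern.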
THEOREM 4.2: Preference ≿ on Δ(H) satisfies order, continuity, second-stage independence, first-stage independence, and dominance if and only if it has an SOSEU representation.

Appendix A provides a sketch of the proof and also some examples to demonstrate the tightness of the theorem. The complete proof is given in Appendix B.

Lemma 4.1 suggests that reversal of order is the crucial difference between an SEU representation and an SOSEU representation. This is summarized in the next corollary.

COROLLARY 4.3: Preference ≿ on Δ(H) has an SEU representation if and only if it has an SOSEU representation and satisfies reversal of order.

AA assumed reversal of order. Under reversal of order, the agent does not care when the objective uncertainty is resolved and he collapses the two objective uncertainties into one. Thus Corollary 4.3 says that if the agent collapses the two objective probabilities into one, he also collapses the second-order belief (on Δ(S)) into a belief (on S).

Briefly consider uniqueness of the representation in Theorem 4.2. It is easy to show that u and v ∘ u are unique up to a positive affine transformation (see Appendix C). The second-order belief m is unique in some special cases—for example, if v(z) = exp(z), the representation has a form similar to a moment generating function and m is unique. However, m is not unique in general. For example, suppose that v is linear. Then any second-order belief that has the same first moment will generate the same behavior. Similarly, a polynomial v of degree n implies that if two second-order beliefs, m and m′, represent the same preference, they have the same moments up to nth order. See Appendix C for a characterization of the uniqueness class of measures for any given u and v.

I conclude this section with an interesting example of a preference that violates dominance. Gilboa and Schmeidler (1989) axiomatized the multiple-priors (MP) model on H. Consider two alternative extensions to Δ(H):
$$V^{\mathrm{MP1}}(P)=\min_{\mu\in C}\int_H\!\int_S u(f)\,d\mu\,dP(f),\qquad V^{\mathrm{MP2}}(P)=\int_H\min_{\mu\in C}\int_S u(f)\,d\mu\,dP(f)$$
for a closed set C ⊂ Δ(S). Both representations induce the same preference on H, but only $V^{\mathrm{MP1}}$ satisfies dominance (see the examples in Appendix A). The reason can be understood as follows: agents with either representation behave as if they were playing a game against a malevolent nature, suspecting that nature will choose the probability law μ that is most unfavorable to them. The difference between $V^{\mathrm{MP1}}$ and $V^{\mathrm{MP2}}$ lies in the agent's view of the timing
of nature's move—it is before the first randomization (corresponding to P) in $V^{\mathrm{MP1}}$ and afterward in $V^{\mathrm{MP2}}$. But when evaluating P ∈ Δ(H), an agent who satisfies dominance uses the two-stage lotteries Ψ(P,μ) as if μ, though still unknown, has already been chosen by nature. Therefore, $V^{\mathrm{MP1}}$ satisfies dominance and $V^{\mathrm{MP2}}$ does not.

5. AMBIGUITY AND COMPOUND LOTTERIES

Here I discuss the relation between ambiguity attitude and two-stage lotteries. A two-stage lottery deals only with objective probabilities; ambiguity attitude deals with situations where objective probabilities are unknown. The two may seem conceptually distinct, but in an SOSEU representation they are closely related. An axiom on compound lotteries is introduced.11

AXIOM 8—Reduction of Compound Lotteries (ROCL): For any p, q ∈ Δ(Z) and α ∈ [0, 1], αp ⊕ (1 − α)q ∼ αp + (1 − α)q.

Since p and q are one-stage lotteries, αp + (1 − α)q constitutes a two-stage lottery. Observe that αp ⊕ (1 − α)q and αp + (1 − α)q have the same final outcome distribution. Thus, under ROCL, the agent considers only the final distribution and does not care about the timing of risk resolution.

An SOSEU representation does not satisfy ROCL unless v is linear. When v is nonlinear,
$$V(\alpha p+(1-\alpha)q)=\alpha v(u(p))+(1-\alpha)v(u(q))\ \neq\ v(\alpha u(p)+(1-\alpha)u(q))=V(\alpha p\oplus(1-\alpha)q).$$
Under SOSEU, the utility of any act f is given by
$$U(f)=\int_{\Delta(S)} v\!\left(\int_S u(f)\,d\mu\right) dm(\mu),$$
which suggests the interpretation that the agent processes an act in two stages. This further suggests a connection between the evaluation of acts and of two-stage lotteries. In the following, I show that, given the other axioms, ROCL is equivalent to reversal of order and that ROCL implies neutrality to ambiguity.

11 Segal (1990) had a slightly different form of ROCL, but it is not difficult to see that the two axioms are equivalent.
LEMMA 5.1: ROCL and reversal of order are equivalent under dominance.

PROOF: Since Δ(Z) ⊂ H, it is straightforward that reversal of order implies ROCL. Conversely, assume ROCL. Then, for any μ ∈ Δ(S),
$$\Psi(\alpha f+(1-\alpha)g,\mu)=\alpha\Psi(f,\mu)+(1-\alpha)\Psi(g,\mu)\sim\alpha\Psi(f,\mu)\oplus(1-\alpha)\Psi(g,\mu)=\Psi(\alpha f\oplus(1-\alpha)g,\mu).$$
Applying dominance leads to αf + (1 − α)g ∼ αf ⊕ (1 − α)g.
Q.E.D.
COROLLARY 5.2: Preference has an SEU representation if and only if it has an SOSEU representation and satisfies ROCL. PROOF: By Lemma 5.1 and Corollary 4.3, this is straightforward.
Q.E.D.
An SOSEU representation reduces to SEU if and only if ROCL is satisfied. In particular, ROCL implies neutrality to ambiguity. This is consistent with Halevy's (2007) experimental findings.

Halevy designed the following experiment. There are three urns, each containing 10 balls which can be red or black.12 One ball is to be drawn. Urn 1 contains 5 red balls and 5 black balls. In urn 2, the proportion is unknown. For urn 3, a ticket is drawn from a bag containing 11 tickets with numbers 0 to 10 written on them; the number on the drawn ticket determines the number of red balls in urn 3. Each participant is asked to place a bet on the color of the ball drawn from each urn. Before any ball is drawn, the participant is given the option to sell each bet and is asked the minimal price at which he/she is willing to sell it. Let Vi be the reservation price for urn i, i = 1, 2, 3. Ambiguity neutrality implies V1 = V2, and ROCL implies V1 = V3. In Halevy's experiment, 18 subjects set V1 = V3 and 17 of them set V1 = V2. Moreover, out of 86 subjects who showed V1 ≠ V3, 80 showed V1 ≠ V2. Halevy concluded that “there is a very tight association between ambiguity neutrality and reduction of compound lotteries” and that “a descriptive theory that accounts for ambiguity aversion should account—at the same time—for violation of reduction of compound objective lotteries.” The domain in this paper includes both acts and two-stage lotteries, and an SOSEU representation relates ambiguity attitude to ROCL.13

12 In his experiment, there were four urns. The fourth urn is omitted here because it is not relevant to my point.
13 Klibanoff, Marinacci, and Mukerji (2005) dealt with acts, but not with compound lotteries. Segal's (1990) model has two-stage lotteries, but no acts.
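As a complement to the discussion of Halevy's design, the sketch below evaluates the three urns under an SOSEU functional. The ingredients are my hypothetical choices, not estimates from the experiment: v(x) = √x and, for urn 2, a uniform second-order belief over the possible compositions.

```python
# Halevy's three urns under a hypothetical SOSEU specification.
import numpy as np

v = np.sqrt
comps = np.arange(11) / 10.0   # possible shares of red balls: 0/10, ..., 10/10

V1 = v(0.5)                            # urn 1: known 50-50 composition
V3 = np.mean([v(k) for k in comps])    # urn 3: two-stage objective lottery
V2 = np.mean([v(k) for k in comps])    # urn 2: ambiguous, with a uniform
                                       # second-order belief over compositions
print(V1, V2, V3)                      # ≈ 0.707, 0.646, 0.646
```

A single nonlinear v generates V1 > V2 (ambiguity aversion) and V1 > V3 (failure of ROCL) together, and forces V2 = V3 under the uniform belief, in line with the association Halevy documents.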
6. RELATED LITERATURE

The idea of second-order probabilities appeared in Savage (1972, p. 58), but he implicitly assumed reduction and argued that second-order probabilities accomplish nothing more than first-order probabilities do. Segal (1987, 1990) allowed nonreduction of two-stage lotteries. In the former paper, he showed that a model of preferences that allows nonreduction of two-stage lotteries can accommodate the Ellsberg paradox. He used anticipated utility theory and thus his model can also accommodate the Allais paradox. Gärdenfors and Sahlin (1982, 1983) also suggested a second-order probability measure. They used it to determine the set of all “satisfactorily reliable” first-order beliefs and applied the maximin expected utility (MMEU) rule.

The violation of ROCL and the recursive structure of utility in the present model bring to mind the closely related model of Kreps and Porteus (1978), who provided axiomatic foundations for recursive expected utility with objective temporal lotteries. Their model not only has a functional form similar to SOSEU, but also takes a similar approach: Kreps and Porteus assumed independence at each stage and relaxed reduction. However, precise probabilities are not given in most real-world problems. Klibanoff and Ozdenoren (2007) incorporated subjective uncertainty to characterize subjective recursive expected utility, which does not deal with ambiguity. SOSEU is also defined on a domain that involves subjective uncertainty, but features second-order beliefs that can accommodate Ellsbergian behavior.

Violations of ROCL have been documented in the experimental literature. See, for example, Ronen (1971), Snowball and Brown (1979), Schoemaker (1989), Bernasconi and Loomes (1992), Bernasconi (1994), and Budescu and Fischer (2001). However, Cubitt, Starmer, and Sugden (1998) did not find significant violation of ROCL in an experiment that tested several well known accounts of the common ratio effect. Keller (1985) reported that the framing of the problems may affect the degree of the violations, and Güth, van Damme, and Weber (2005) found that the level of econometrics education may have similar effects.

Some experiments deal with ambiguous urns and two-stage risky urns at the same time. Yates and Zukowski (1976), Chow and Sarin (2002), and Halevy (2007) designed experiments involving one-stage risky urns, ambiguous urns, and two-stage risky urns, similar to urns 1, 2, and 3 of the previous section, respectively. All three papers report that one-stage risky urns are most preferred, ambiguous urns are least preferred, and two-stage risky urns are intermediate. Halevy found a tight association between ambiguity neutrality and ROCL, while this connection is not addressed in the other papers.

Kreps (1988, pp. 105–110) noted that order, continuity, and first-stage independence deliver the representation (4.1) with no further structure on the function U. He also noted that adding a reversal-of-order-style axiom and
a monotonicity assumption (p. 109, Axiom 7.16) guarantees AA's subjective expected utility model. Theorem 4.2 (suitably modified) provides a similar result: order, continuity, first-stage independence, and dominance (instead of the reversal-of-order-style and monotonicity axioms) are equivalent to the utility representation (4.1) with $U(f)=\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)$, where Û is the restriction of U to the constant acts Δ(Z).14

Finally, consider (further to the discussion in the Introduction) the relation to KMM. They additionally assumed a preference ≿₂ over the set of all (second-order) acts on Δ(S) and imposed three axioms. The natural question is how their axioms are related to those in this paper. One needs to define ≿₂ to discuss the connection to KMM. For f ∈ H, let f² : Δ(S) → Δ(Z) be the act on Δ(S) satisfying f²(μ) = Ψ(f,μ) and define15
$$\text{(6.1)}\qquad f^2 \succsim_2 g^2 \quad\text{if and only if}\quad f \succsim g.$$
Then preference ≿ over H induces ≿₂ over the subset {f² : f ∈ H} of all second-order acts. Similarly, preference over Δ(H) induces ≿₂ over Δ({f² : f ∈ H}), the set of lotteries over the second-order acts f², having the form f²(μ) = Ψ(f,μ).

Now their axioms can be considered. First, (6.1) is KMM's consistency axiom. Second, their expected utility on lotteries axiom is equivalent to second-stage independence under order and continuity. Finally, their subjective expected utility on second-order acts axiom is the counterpart of first-stage independence and dominance, again given order and continuity. One can see the last connection by observing that Theorem 4.2 may be viewed as proving that ≿₂ has the representation
$$P^2 \longmapsto \int\!\!\int \hat U\big(f^2(\mu)\big)\,dm(\mu)\,dP^2(f^2)\qquad\text{for } P^2\in\Delta(\{f^2 : f\in H\})$$
for some bounded continuous function Û on Δ(Z) and measure m ∈ Δ(Δ(S)). Thus the restriction of ≿₂ to {f² : f ∈ H} is a subjective expected utility preference.

APPENDIX A: PROOF SKETCH AND EXAMPLES

This section sketches the sufficiency proof of Theorem 4.2 and provides examples to demonstrate the tightness of the theorem.

14 See Lemma B.8, where second-stage independence does not play a role in constructing m.
15 If f and f′ induce the same second-order act, the two acts must be the same, because Ψ(f,μ) = Ψ(f′,μ) for all μ implies f = f′. Thus, ≿₂ is well defined.
Proof Sketch

First-stage independence implies that preference can be represented by $V(P)=\int_H U(f)\,dP(f)$. The key part of the proof is to construct the second-order belief m ∈ Δ(Δ(S)) satisfying
$$\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)=U(f)\qquad\text{for all } f\in H,$$
where Û is the restriction of U to the constant acts Δ(Z). For intuition about the existence of such a measure m, consider the discretized version where there are n available acts and k possible priors: Am = b, where
$$A=\begin{pmatrix}\hat U\circ\Psi(f_1,\mu_1)&\cdots&\hat U\circ\Psi(f_1,\mu_k)\\ \vdots&&\vdots\\ \hat U\circ\Psi(f_n,\mu_1)&\cdots&\hat U\circ\Psi(f_n,\mu_k)\end{pmatrix},\qquad m=\begin{pmatrix}m_1\\ \vdots\\ m_k\end{pmatrix},\qquad b=\begin{pmatrix}U(f_1)\\ \vdots\\ U(f_n)\end{pmatrix}.$$
By Farkas' lemma, Am = b has a nonnegative solution m if and only if, for all y ∈ Rⁿ,
$$A^{\mathsf T}y\ge 0\quad\Longrightarrow\quad b^{\mathsf T}y\ge 0.$$
By the infinite dimensional version of Farkas' lemma (see Theorem B.1), it suffices to show that, for all signed measures t′ on H,
$$\text{(A.1)}\qquad \int_H \hat U\circ\Psi(f,\mu)\,dt'(f)\ge 0\quad(\text{for all }\mu\in\Delta(S))$$
implies
$$\text{(A.2)}\qquad \int_H U(f)\,dt'\ge 0.$$
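The discretized condition is easy to check numerically. The sketch below uses hypothetical values for Û∘Ψ and U (they are not computed from any particular u, v, or m) and tests whether Am = b has a nonnegative solution by solving a zero-objective linear program:

```python
# Discretized Farkas feasibility check (all numbers hypothetical).
import numpy as np
from scipy.optimize import linprog

# A[i, j] plays the role of U-hat(Psi(f_i, mu_j)); b[i] plays the role of U(f_i).
A = np.array([[0.2, 0.8],
              [0.5, 0.5],
              [0.9, 0.1]])        # n = 3 acts, k = 2 candidate priors
b = np.array([0.5, 0.5, 0.5])     # consistent with the weights m = (0.5, 0.5)

res = linprog(c=np.zeros(A.shape[1]), A_eq=A, b_eq=b,
              bounds=[(0, None)] * A.shape[1])
print(res.status == 0, res.x)     # feasible: m ≈ [0.5, 0.5]
```

In the paper's proof, the fact that the resulting measure has total mass one is obtained separately via the constant acts (see the end of the proof of Lemma B.8).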
To show that, under dominance, (A.1) implies (A.2), first notice that t′ can be decomposed into αP − βQ, where P, Q ∈ Δ(H) are in the domain of objects of choice and α, β ≥ 0. Then rearranging (A.1) gives16
$$\text{(A.3)}\qquad \alpha\int_{\Delta(Z)}\hat U\,d\Psi(P,\mu)\ \ge\ \beta\int_{\Delta(Z)}\hat U\,d\Psi(Q,\mu)\qquad\text{for all }\mu\in\Delta(S).$$

16 Recall that Ψ(P,μ)(B) = P({f ∈ H : Ψ(f,μ) ∈ B}) for B ∈ B_H. Thus, by the change of variables theorem,
$$\int_{\Delta(Z)}\hat U(p)\,d\Psi(P,\mu)(p)=\int_H \hat U\circ\Psi(f,\mu)\,dP(f).$$
The same is true for Q.
Normalize U such that $\int_H U\,d\bar R=0$ for some $\bar R\in\Delta(\Delta(Z))$. Consider the case α > β ≥ 0; other cases can be proved similarly. Recall that $\bar P\mapsto\int_{\Delta(Z)}\hat U\,d\bar P$ represents preference on Δ(Δ(Z)). Then (A.3) implies
$$\Psi(P,\mu)\ \succsim\ \frac{\beta}{\alpha}\Psi(Q,\mu)+\Big(1-\frac{\beta}{\alpha}\Big)\Psi(\bar R,\mu)=\Psi\Big(\frac{\beta}{\alpha}Q+\Big(1-\frac{\beta}{\alpha}\Big)\bar R,\ \mu\Big)\qquad\text{for all }\mu\in\Delta(S).$$
Now apply dominance to get
$$P\ \succsim\ \frac{\beta}{\alpha}Q+\Big(1-\frac{\beta}{\alpha}\Big)\bar R.$$
Since $V(P)=\int_H U(f)\,dP(f)$ represents preference, (A.2) follows and thus a second-order belief m exists.

Second-stage independence is used only to derive Û = v ∘ u, where u is a mixture-linear function on Δ(Z) and v is a strictly increasing function on u(Δ(Z)). Since u is mixture linear, $U(f)=\int_{\Delta(S)} v\circ u\circ\Psi(f,\mu)\,dm(\mu)=\int_{\Delta(S)} v\big(\int_S u(f)\,d\mu\big)\,dm(\mu)$ follows.

Examples for the Tightness of Theorem 4.2

Each example satisfies all but one of the axioms characterizing an SOSEU representation.

EXAMPLE 1—All but Second-Stage Independence: Let
$$V(P)=\int_H\int_S u(f)\,d\mu\,dP(f)$$
for some fixed μ ∈ Δ(S) and a bounded continuous but non-mixture-linear u : Δ(Z) → R.

EXAMPLE 2—All but First-Stage Independence: Let
$$V(P)=\min_{\mu\in C}\int_H\int_S u(f)\,d\mu\,dP(f),$$
where u is bounded, continuous, and mixture linear, and C ⊂ Δ(S) is a closed subset. To show dominance, note that
$$\Psi(P,\mu)\succsim\Psi(Q,\mu)\quad(\text{for all }\mu\in\Delta(S))$$
$$\Rightarrow\quad \int_H\int_S u(f)\,d\mu\,dP(f)\ \ge\ \int_H\int_S u(f)\,d\mu\,dQ(f)\quad(\text{for all }\mu\in\Delta(S))$$
$$\Rightarrow\quad \min_{\mu\in C}\int_H\int_S u(f)\,d\mu\,dP(f)\ \ge\ \min_{\mu\in C}\int_H\int_S u(f)\,d\mu\,dQ(f).$$
EXAMPLE 3—All but Dominance: Modify Example 2 by taking
$$V(P)=\int_H\min_{\mu\in C}\int_S u(f)\,d\mu\,dP(f).$$
This violates (only) dominance. Let S = {1, 2}, P = ½f + ½g, Q = δ_h, u(f(1)) = 1, u(f(2)) = 2, u(g(1)) = 1, u(g(2)) = 0, u(h(1)) = 1, u(h(2)) = 1, and C = Δ(S). Then V(Ψ(P,μ)) = 1 = V(Ψ(Q,μ)) for all μ ∈ Δ(S), but V(P) = 1/2 < 1 = V(Q).
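The violation is easy to verify numerically. The following sketch discretizes C = Δ(S) on a grid and reproduces the numbers in Example 3:

```python
# Verifying the dominance violation in Example 3 (values from the text).
import numpy as np

mus = [np.array([a, 1 - a]) for a in np.linspace(0, 1, 101)]  # C = Δ(S), gridded
u_f, u_g, u_h = np.array([1., 2.]), np.array([1., 0.]), np.array([1., 1.])

# V(P) integrates the act-by-act worst case; P = ½f + ½g, Q = δ_h.
V_P = 0.5 * min(mu @ u_f for mu in mus) + 0.5 * min(mu @ u_g for mu in mus)
V_Q = min(mu @ u_h for mu in mus)

# For each mu, Ψ(P, mu) mixes the constant acts Ψ(f, mu) and Ψ(g, mu), so
# V(Ψ(P, mu)) = 0.5*(mu·u_f) + 0.5*(mu·u_g) = 1 = V(Ψ(Q, mu)) for every mu.
checks = [abs(0.5 * (mu @ u_f) + 0.5 * (mu @ u_g) - 1.0) < 1e-12 for mu in mus]
print(all(checks), V_P, V_Q)   # True, 0.5, 1.0: dominance fails
```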
APPENDIX B: PROOFS

B.1. Preliminaries

Notation and definitions follow AB (1999) and Craven and Koliha (1977). For any real vector space M, let M# be the algebraic dual of M, that is, the set of all linear functionals on M. Denote by ⟨m, m#⟩ the evaluation of m# ∈ M# at the point m ∈ M. Suppose that A : M → T is a linear map between two vector spaces M and T. The algebraic adjoint A# : T# → M# of A is the linear map satisfying
$$\langle m, A^{\#}t^{\#}\rangle=\langle Am, t^{\#}\rangle\qquad\text{for all } m\in M\text{ and } t^{\#}\in T^{\#}.$$
A dual pair ⟨M, M′⟩ is a pair of two vector spaces together with a function (m, m′) ↦ ⟨m, m′⟩ from M × M′ into R, satisfying (i) m ↦ ⟨m, m′⟩ is linear, (ii) m′ ↦ ⟨m, m′⟩ is linear, (iii) if ⟨m, m′⟩ = 0 for each m′ ∈ M′, then m = 0, and (iv) if ⟨m, m′⟩ = 0 for each m ∈ M, then m′ = 0. I will refer to (iii) and (iv) as separation properties. Given a dual pair ⟨M, M′⟩, the weak topology on M is denoted by σ(M, M′). Under σ(M, M′) a sequence mₙ ∈ M converges to m ∈ M if and only if ⟨mₙ, m′⟩ → ⟨m, m′⟩ for all m′ ∈ M′. It is well known that the topological dual of (M, σ(M, M′)) may be identified with M′. In other words, for each σ(M, M′)-continuous linear functional φ on M, there is a unique m′ ∈ M′ such that φ(m) = ⟨m, m′⟩ for all m ∈ M. The weak topology σ(M′, M) is defined symmetrically for M′. From now on, for any dual pair ⟨M, M′⟩, M and M′ are topological vector spaces equipped with the weak topologies.
Given dual pairs ⟨M, M′⟩ and ⟨T, T′⟩, the continuity of a linear mapping A : M → T can be checked by using A#: A is continuous if and only if A#(T′) ⊂ M′. The restriction A′ of A# to T′ is called the topological adjoint of A with respect to ⟨M, M′⟩ and ⟨T, T′⟩, or simply the adjoint of A.

A nonempty set K ⊂ M is called a convex cone if K + K ⊂ K and αK ⊂ K for every α ≥ 0. The polar cone K′ ⊂ M′ of the convex cone K ⊂ M is defined as K′ = {m′ : ⟨m, m′⟩ ≥ 0 for all m ∈ K}.

The main tool used in the paper is the following result from Craven and Koliha (1977, Theorem 2).

THEOREM B.1—Generalized Farkas Theorem: Let ⟨M, M′⟩ and ⟨T, T′⟩ be dual pairs, let K be a convex cone in M, and let A : M → T be a continuous linear map. Let A(K) be closed and τ ∈ T. Then the following conditions are equivalent:17
(a) The equation Am = τ has a solution m ∈ K.
(b) A′t′ ∈ K′ ⇒ ⟨τ, t′⟩ ≥ 0.

B.2. Proof of Theorem 4.2

LEMMA B.2: The map (f, μ) ↦ Ψ(f,μ) from H × Δ(S) into Δ(Z) is continuous.

PROOF: Suppose that (fₙ, μₙ) converges to (f, μ) in the product space H × Δ(S). Note that S is finite. Then, for any η ∈ C_b(Z),
$$\int_Z \eta\,d\Psi(f_n,\mu_n)=\int_Z \eta\,d\big(\mu_n(s_1)f_n(s_1)\oplus\cdots\oplus\mu_n(s_{|S|})f_n(s_{|S|})\big)=\sum_{s\in S}\mu_n(s)\int_Z \eta\,df_n(s)\to\sum_{s\in S}\mu(s)\int_Z \eta\,df(s)=\int_Z \eta\,d\Psi(f,\mu).$$
Q.E.D.
PROOF OF THEOREM 4.2—Necessity: Completeness, transitivity, and continuity are clear.

Second-Stage Independence: For p ∈ Δ(Z), V(p) = v(u(p)) because p does not depend on the probability measure m ∈ Δ(Δ(S)). Since v is strictly increasing, preference on Δ(Z) is represented by u. Thus second-stage independence is satisfied because u is mixture linear.

17 It is easy to see (a) ⇒ (b). Suppose that Am = τ, m ∈ K, and A′t′ ∈ K′. Then ⟨τ, t′⟩ = ⟨Am, t′⟩ = ⟨m, A′t′⟩ ≥ 0, because A′t′ ∈ K′ and m ∈ K.
First-Stage Independence: Let α ∈ (0, 1] and P, R ∈ Δ(H). Then it is easy to see that V(αP + (1 − α)R) = αV(P) + (1 − α)V(R). First-stage independence is clear.

Dominance: Let P be any element in Δ(H). By Lemma B.2 and continuity of v ∘ u, v[u(Ψ(f,μ))] is jointly continuous on H × Δ(S) and hence is P × m-measurable. Since v ∘ u is bounded, v[u(Ψ(f,μ))] is P × m-integrable. Then, apply the Fubini theorem (AB (1999, p. 411)) to get
$$V(P)=\int_H\int_{\Delta(S)} v\Big(\int_S u(f)\,d\mu\Big)\,dm(\mu)\,dP(f)=\int_H\int_{\Delta(S)} v\big(u(\Psi(f,\mu))\big)\,dm(\mu)\,dP(f)=\int_{\Delta(S)}\int_H v\big(u(\Psi(f,\mu))\big)\,dP(f)\,dm(\mu).$$
Note that by the change of variables theorem (AB (1999, p. 452)),
$$\int_H v\big(u(\Psi(f,\mu))\big)\,dP(f)=\int_{\Delta(Z)} v\circ u(p)\,d\Psi(P,\mu)(p)=V(\Psi(P,\mu)).$$
Thus,
$$V(P)=\int_{\Delta(S)} V(\Psi(P,\mu))\,dm(\mu).$$
Since m is a nonnegative probability measure, dominance follows, and this completes the necessity part of the proof.

PROOF OF THEOREM 4.2—Sufficiency: When P ∼ Q for all P, Q ∈ Δ(H), the representation is trivial. Thus assume that ≿ satisfies the following statement.

AXIOM 9—Nondegeneracy: P ≻ Q for some P, Q ∈ Δ(H).

Follow Lemmas B.3–B.10 to prove sufficiency.

LEMMA B.3: (i) Preference restricted to Δ(Z) is represented by a bounded continuous mixture linear function u : Δ(Z) → R. Moreover, u is unique up to positive affine transformation. (ii) Preference is represented on Δ(H) by
$$V(P)=\int_H U(f)\,dP(f)$$
for P ∈ Δ(H), where U : H → R is a bounded continuous function and is unique up to positive affine transformation.

PROOF: (i) Since H is a metric space, the mapping f ↦ δ_f from H into Δ(H) is an embedding (AB (1999, p. 480)). Moreover, H is a product space of Δ(Z)'s. Thus, the weak convergence topology on Δ(Z) coincides with the relative topology on Δ(Z) induced by Δ(H). Hence, continuity implies that the restriction of ≿ to Δ(Z) is continuous under the weak convergence topology on Δ(Z). Moreover, preference restricted to Δ(Z) satisfies order and (second-stage) independence. Therefore, (i) holds (see Grandmont (1972), for example). (ii) can be proved in a similar way. Q.E.D.

Let Û be the restriction of U to Δ(Z).

LEMMA B.4: There exist p, q ∈ Δ(Z) such that p ≻ q. Consequently, there is a one-stage lottery p such that Û(p) ≠ 0.

PROOF: Suppose that p ∼ q for all p and q in Δ(Z). This means that P̄ ∼ Q̄ for all P̄ and Q̄ in Δ(Δ(Z)), by Lemma B.3(ii). Then, for any P, Q ∈ Δ(H) and μ ∈ Δ(S), Ψ(P,μ) ∼ Ψ(Q,μ) because Ψ(P,μ), Ψ(Q,μ) ∈ Δ(Δ(Z)). By the dominance axiom, P ∼ Q for all P, Q ∈ Δ(H), contradicting nondegeneracy. Q.E.D.

To apply the generalized Farkas theorem, let
$$M=ca(\Delta(S)),\qquad M'=C_b(\Delta(S)),\qquad T=C_b(H),\qquad T'=ca(H),$$
where ca(X) denotes the set of all Borel signed measures on X having bounded variation. Both ⟨M, M′⟩ and ⟨T, T′⟩ are dual pairs with bilinear operations ⟨m, m′⟩ = ∫_{Δ(S)} m′ dm and ⟨t, t′⟩ = ∫_H t dt′ for m ∈ M, m′ ∈ M′, t ∈ T, and t′ ∈ T′ (AB (1999, p. 475)).

Let K = ca₊(Δ(S)) be the subset of M consisting of all nonnegative Borel measures on Δ(S). K is clearly a convex cone. Recall that Û is the restriction of U to Δ(Z) and define a linear mapping A from M into the set of all functionals on H by
$$\text{(B.1)}\qquad (Am)(f)=\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)\qquad\text{for } f\in H.$$
The premises of the generalized Farkas theorem will be verified.
LEMMA B.5: The mapping A is a linear mapping from M into T.

PROOF: It suffices to show that A(M) ⊂ T = C_b(H). Let m ∈ M and assume that fₙ → f for fₙ, f ∈ H. Note that Û is bounded and Û∘Ψ(fₙ,μ) → Û∘Ψ(f,μ) by Lemma B.2. By the Lebesgue dominated convergence theorem, $\int_{\Delta(S)}\hat U\circ\Psi(f_n,\mu)\,dm(\mu)\to\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)$. Hence f ↦ (Am)(f) is continuous. Boundedness of f ↦ (Am)(f) comes from boundedness of Û. Q.E.D.

LEMMA B.6: The mapping A is continuous.

PROOF: It suffices to show that A#(T′) ⊂ M′. Let t′ ∈ T′. Then A#t′ lies in M#, that is, A#t′ is a linear functional on M. Hence,
$$\langle m, A^{\#}t'\rangle=\langle Am, t'\rangle=\int_H (Am)(f)\,dt'(f)=\int_{\Delta(S)}\int_H \hat U\circ\Psi(f,\mu)\,dt'(f)\,dm(\mu)=\langle m, m'\rangle,$$
where m′ ∈ M# is defined by $m'(\mu)=\int_H \hat U\circ\Psi(f,\mu)\,dt'(f)$. The order of integration has changed in the third equality by the Fubini theorem. Since Û∘Ψ is bounded continuous, Û∘Ψ is t′ × m-integrable and the Fubini theorem can be applied.

Now, it suffices to show that m′ ∈ M′ = C_b(Δ(S)). Since Û∘Ψ is bounded, m′ is bounded. To see continuity, let μₙ → μ for μₙ, μ ∈ Δ(S). Since μ ↦ Û∘Ψ(f,μ) is continuous for each f ∈ H, it follows that Û∘Ψ(f,μₙ) → Û∘Ψ(f,μ). Observing that Û∘Ψ is bounded,
$$m'(\mu_n)=\int_H \hat U\circ\Psi(f,\mu_n)\,dt'(f)\to\int_H \hat U\circ\Psi(f,\mu)\,dt'(f)=m'(\mu)$$
by the Lebesgue dominated convergence theorem. Hence m′ ∈ M′.
Q.E.D.
LEMMA B.7: A(K) is closed.

PROOF: Suppose that θₙ = A(λₙmₙ) ∈ A(K) converges to θ ∈ T, where λₙ ∈ R₊ and mₙ ∈ Δ(Δ(S)).

Step 1: mₙ has a subsequence m_{k(n)} that converges to some m ∈ Δ(Δ(S)). Since S is finite, Δ(S) is a compact metric space and so is Δ(Δ(S)) (AB (1999, p. 482)). Hence, mₙ has a converging subsequence.

Step 2: ⟨Am_{k(n)}, t′⟩ → ⟨Am, t′⟩ for any t′ ∈ T′. By Step 1 and the continuity of A, Am_{k(n)} → Am. Thus this step is proved.
Step 3: λ_{k(n)} converges to some λ ≥ 0. By Lemma B.4, take p ∈ Δ(Z) such that Û∘Ψ(p,μ) = Û(p) ≠ 0 for any μ ∈ Δ(S). Then
$$(Am)(p)=\int_{\Delta(S)}\hat U\circ\Psi(p,\mu)\,dm(\mu)=\hat U(p)\neq 0,$$
which implies that Am ≠ 0. Therefore, by the separation property of a dual pair, ⟨Am, t̄′⟩ ≠ 0 for some t̄′ ∈ T′. Note that λ_{k(n)}⟨Am_{k(n)}, t̄′⟩ = ⟨θ_{k(n)}, t̄′⟩ → ⟨θ, t̄′⟩. Then by Step 2, it follows that λ_{k(n)} → λ ≡ ⟨θ, t̄′⟩/⟨Am, t̄′⟩. Since λₙ ≥ 0 for all n, λ ≥ 0.

Step 4: θ ∈ A(K). For all t′ ∈ T′, ⟨θ_{k(n)}, t′⟩ = λ_{k(n)}⟨Am_{k(n)}, t′⟩ → λ⟨Am, t′⟩ = ⟨A(λm), t′⟩. Moreover, by the hypothesis, ⟨θ_{k(n)}, t′⟩ → ⟨θ, t′⟩ for all t′ ∈ T′. Note that ⟨θ_{k(n)}, t′⟩ is a sequence in R and converges to at most one point. Thus, ⟨A(λm), t′⟩ = ⟨θ, t′⟩ for all t′ ∈ T′ and θ = A(λm) ∈ A(K) by the separation property of a dual pair. Q.E.D.

The following lemma uses the generalized Farkas theorem to prove the existence of a second-order belief.

LEMMA B.8: There exists m ∈ Δ(Δ(S)) such that $\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)=U(f)$ for all f ∈ H.

PROOF: It is enough to show that Am = U for some m ∈ Δ(Δ(S)), where A is defined in (B.1). First, I will prove that there exists m ∈ K = ca₊(Δ(S)) that solves Am = U. I have already shown in Lemmas B.5–B.7 that the premises of the generalized Farkas theorem are satisfied. Therefore, it suffices to show that if ⟨m, A′t′⟩ ≥ 0 for all m ∈ K, then ⟨U, t′⟩ ≥ 0. Assume that ⟨m, A′t′⟩ ≥ 0 for all m ∈ K and show that
$$\text{(B.2)}\qquad \langle U, t'\rangle\ge 0.$$
By the hypothesis, $\langle m, A't'\rangle=\langle Am, t'\rangle=\int_H (Am)\,dt'=\int_H\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)\,dt'(f)\ge 0$ for all m ∈ K. Since δ_μ ∈ K for each μ ∈ Δ(S), it follows that
$$\text{(B.3)}\qquad \int_H \hat U\circ\Psi(f,\mu)\,dt'(f)\ge 0\qquad\text{for all }\mu\in\Delta(S).$$
Let t′ = αP − βQ by the Hahn decomposition theorem, where α, β ≥ 0 and P, Q ∈ Δ(H). Let α ≥ β; the other case, α < β, can be proved similarly. If α = 0, the statement (B.2) is trivial because α = β = 0. Let α > 0. Note that (B.3) implies
$$\text{(B.4)}\qquad \int_H \hat U\circ\Psi(f,\mu)\,dP(f)\ \ge\ \gamma\int_H \hat U\circ\Psi(f,\mu)\,dQ(f)\qquad\text{for all }\mu\in\Delta(S),$$
where γ = β/α.
Recall from Lemma B.3(ii) that U is unique up to positive affine transformation. Normalize U such that $\int_H U\,d\bar R=0$ for some $\bar R\in\Delta(\Delta(Z))$. Since Û is the restriction of U, $\int_{\Delta(Z)}\hat U\,d\bar R=0$. Observe that, for all B ∈ B_H and μ ∈ Δ(S),
$$\Psi(\bar R,\mu)(B)=\bar R\big(\{f\in H:\Psi(f,\mu)\in B\}\big)=\bar R\big(\{p\in\Delta(Z):p\in B\}\big)=\bar R(B\cap\Delta(Z))=\bar R(B).$$
The second equality comes from the fact that R̄ assigns zero probability outside of Δ(Z). Thus, R̄ = Ψ(R̄,μ) and $\int_{\Delta(Z)}\hat U\,d\Psi(\bar R,\mu)=0$ for all μ ∈ Δ(S). Then, by the argument in footnote 16, (B.4) implies
$$\int_{\Delta(Z)}\hat U\,d\Psi(P,\mu)\ \ge\ \gamma\int_{\Delta(Z)}\hat U\,d\Psi(Q,\mu)+(1-\gamma)\int_{\Delta(Z)}\hat U\,d\Psi(\bar R,\mu)$$
for all μ ∈ Δ(S). Hence by Lemma B.3(ii), it follows that
$$\text{(B.5)}\qquad \Psi(P,\mu)\ \succsim\ \gamma\Psi(Q,\mu)+(1-\gamma)\Psi(\bar R,\mu)\qquad\text{for all }\mu\in\Delta(S).$$
Moreover, for any B ∈ B_H,
$$[\gamma\Psi(Q,\mu)+(1-\gamma)\Psi(\bar R,\mu)](B)=\gamma\,Q\big(\{f\in H:\Psi(f,\mu)\in B\}\big)+(1-\gamma)\,\bar R\big(\{f\in H:\Psi(f,\mu)\in B\}\big)=\big(\gamma Q+(1-\gamma)\bar R\big)\big(\{f\in H:\Psi(f,\mu)\in B\}\big)=\Psi\big(\gamma Q+(1-\gamma)\bar R,\mu\big)(B).$$
Therefore, by (B.5),
$$\Psi(P,\mu)\ \succsim\ \Psi\big(\gamma Q+(1-\gamma)\bar R,\mu\big)\qquad\text{for all }\mu\in\Delta(S).$$
By dominance, it follows that
$$P\ \succsim\ \gamma Q+(1-\gamma)\bar R.$$
Therefore, by Lemma B.3(ii),
$$\text{(B.6)}\qquad \int_H U\,dP\ \ge\ \int_H U\,d\big(\gamma Q+(1-\gamma)\bar R\big)=\gamma\int_H U\,dQ.$$
Then, by (B.6),
$$\langle U, t'\rangle=\int_H U\,dt'=\int_H U\,d[\alpha(P-\gamma Q)]\ \ge\ 0.$$
This completes the proof of (B.2).

Now, apply the generalized Farkas theorem to obtain m ∈ K = ca₊(Δ(S)) satisfying the equation Am = U or, equivalently,
$$\text{(B.7)}\qquad \int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)=U(f)\qquad\text{for each } f\in H.$$
To prove that m is a probability measure, let p ∈ Δ(Z) be such that Û(p) ≠ 0 as in Lemma B.4 and let f be the constant act giving p in every state. Since Û(p) = U(p) ≠ 0, (B.7) becomes $\int_{\Delta(S)} dm(\mu)=1$. Q.E.D.

Now, I will show a general property of utility representations.

LEMMA B.9: Let X be a connected topological space. If two bounded continuous functions u : X → R and w : X → R represent the same preference on X, then there exists a continuous and strictly increasing function v : u(X) → R such that w = v ∘ u.

PROOF: Define v on u(X) by v(y) = w(x) if u(x) = y.
Then v is well defined and strictly increasing, and w = v ∘ u. To show the continuity of v, note that X is connected and w is continuous. Hence, v(u(X)) = w(X) is connected. Since v is (strictly) increasing, it must be continuous. Q.E.D.

LEMMA B.10: There exists a bounded continuous and strictly increasing function v : u(Δ(Z)) → R such that Û = v ∘ u.

PROOF: Observe that u and Û represent the same preference on Δ(Z). By Lemma B.9, a continuous and strictly increasing function v : u(Δ(Z)) → R exists such that Û = v ∘ u. Boundedness comes from the fact that Û is bounded. Q.E.D.
Finally, by Lemma B.3(ii), $V(P)=\int_H U(f)\,dP(f)$ represents ≿ on Δ(H), and by Lemmas B.3(i), B.8, and B.10, it follows that
$$U(f)=\int_{\Delta(S)}\hat U\circ\Psi(f,\mu)\,dm(\mu)=\int_{\Delta(S)} v\circ u\circ\Psi(f,\mu)\,dm(\mu)=\int_{\Delta(S)} v\!\left(\int_S u(f)\,d\mu\right) dm(\mu).$$
This completes the sufficiency part of the proof.

B.3. Proof of Lemma 4.1

PROOF: (i) Suppose that ≿ satisfies order, continuity, reversal of order, and AA dominance. For any P ∈ Δ(H), let Π(P) be the act obtained by collapsing all the objective probabilities into Δ(Z); that is, Π is a function from Δ(H) into H such that for every B ∈ B_Z and s ∈ S, $\Pi(P)(s)(B)=\int_H f(s)(B)\,dP(f)$. For Π to be well defined, f(s)(B) must be P-integrable as a function of f.

Step 1: Π is well defined. Since Z is metrizable, the function p ↦ p(B) from Δ(Z) into R is measurable (AB (1999, p. 485)). Moreover, the function f ↦ f(s) is measurable. Thus, f ↦ f(s)(B) is measurable. Since f(s)(B) is bounded, f ↦ f(s)(B) is P-integrable.

Step 2: $\int_Z \eta(z)\,d\Pi(P)(s)(z)=\int_H\int_Z \eta(z)\,df(s)(z)\,dP(f)$ for any s ∈ S, η ∈ C_b(Z), and P ∈ Δ(H). When η is a measurable step function (i.e., η(Z) is a finite set), this is clear. For any η ∈ C_b(Z), take a sequence ηₙ of step functions such that ηₙ(z) converges to η(z) for each z ∈ Z. Then, by the Lebesgue dominated convergence theorem,
$$\int_Z \eta(z)\,d\Pi(P)(s)(z)=\lim\int_Z \eta_n(z)\,d\Pi(P)(s)(z)=\lim\int_H\int_Z \eta_n(z)\,df(s)(z)\,dP(f)=\int_H\int_Z \eta(z)\,df(s)(z)\,dP(f).$$
Step 3: Π is continuous. Fix s ∈ S. Suppose that Pₙ → P. Note that $f\mapsto\int_Z \eta(z)\,df(s)(z)$ is continuous. Then, by Step 2,
$$\int_Z \eta(z)\,d\Pi(P_n)(s)(z)=\int_H\int_Z \eta(z)\,df(s)(z)\,dP_n(f)\to\int_H\int_Z \eta(z)\,df(s)(z)\,dP(f)=\int_Z \eta(z)\,d\Pi(P)(s)(z).$$
Thus, P ↦ Π(P)(s) is continuous for every s ∈ S. Therefore, Π is continuous.

Step 4: Π(P) ∼ P for any P ∈ Δ(H). Reversal of order implies that Π(P) ∼ P when P has a finite support. Since H is metrizable, the set of all probability measures on H with finite support is dense in Δ(H) (AB (1999, p. 481)). For any P ∈ Δ(H), take Pₙ ∈ Δ(H) with finite support such that Pₙ → P. Then Π(Pₙ) ∼ Pₙ for all n. By continuity and Step 3, Π(P) = lim Π(Pₙ) ∼ lim Pₙ = P.

Step 5: Π(Ψ(P,μ)) = Ψ(Π(P),μ) for any P ∈ Δ(H) and μ ∈ Δ(S). For any B ∈ B_Z,
$$\Pi(\Psi(P,\mu))(s)(B)=\int_H f(s)(B)\,d\Psi(P,\mu)(f)=\int_{\Delta(Z)} p(B)\,d\Psi(P,\mu)(p)=\int_H \Psi(f,\mu)(B)\,dP(f)$$
$$=\int_H \big(\mu(s_1)f(s_1)\oplus\cdots\oplus\mu(s_{|S|})f(s_{|S|})\big)(B)\,dP(f)=\int_H \big(\mu(s_1)[f(s_1)(B)]+\cdots+\mu(s_{|S|})[f(s_{|S|})(B)]\big)\,dP(f)$$
$$=\mu(s_1)\int_H f(s_1)(B)\,dP(f)+\cdots+\mu(s_{|S|})\int_H f(s_{|S|})(B)\,dP(f)=\mu(s_1)[\Pi(P)(s_1)(B)]+\cdots+\mu(s_{|S|})[\Pi(P)(s_{|S|})(B)]$$
$$=\big(\mu(s_1)\Pi(P)(s_1)\oplus\cdots\oplus\mu(s_{|S|})\Pi(P)(s_{|S|})\big)(B)=\Psi(\Pi(P),\mu)(B).$$
The third equality is obtained by the change of variables theorem.

Step 6: ≿ satisfies dominance. Suppose that Ψ(P,μ) ≿ Ψ(Q,μ) for all μ ∈ Δ(S). By Steps 4 and 5, Ψ(P,μ) ∼ Π(Ψ(P,μ)) = Ψ(Π(P),μ). Therefore, Ψ(Π(P),μ) ≿ Ψ(Π(Q),μ) for all μ ∈ Δ(S). Since Ψ(Π(P),δ_s) = Π(P)(s), it follows that Π(P)(s) ≿ Π(Q)(s) for all s ∈ S. For k = 0, 1, ..., |S|,
define hₖ ∈ H by
$$h_k(s)=\begin{cases}\Pi(P)(s)&\text{if } s>k,\\ \Pi(Q)(s)&\text{if } s\le k.\end{cases}$$
Then, by AA dominance and Step 4,
$$P\sim\Pi(P)=h_0\ \succsim\ h_1\ \succsim\ \cdots\ \succsim\ h_{|S|}=\Pi(Q)\sim Q,$$
which completes the proof of (i).

(ii) Let f, g ∈ H and suppose that f(s) = g(s) for all s ≠ s′ and f(s′) ≿ g(s′) for some s′ ∈ S. By second-stage independence, Ψ(f,μ) ≿ Ψ(g,μ) for any μ ∈ Δ(S). Dominance implies f ≿ g. Q.E.D.

APPENDIX C: UNIQUENESS OF THE SOSEU REPRESENTATION

The following lemma provides some uniqueness properties.

LEMMA C.1: Suppose that P ≻ Q for some P, Q ∈ Δ(H) and let the two triples (u, v, m) and (u′, v′, m′) represent ≿ on Δ(H). Then the following statements hold:
(i) u and u′ are the same up to positive affine transformation, and so are v ∘ u and v′ ∘ u′.
(ii) $\int_{\Delta(S)}\varphi\,dm=\int_{\Delta(S)}\varphi\,dm'$ for all φ ∈ D, where
$$D=\Big\{\varphi\in C(\Delta(S)) : \exists\lambda\in ca(T)\text{ such that }\varphi(\mu)=\int_T v(\mu\cdot t)\,d\lambda(t)\text{ for all }\mu\Big\}$$
and $T=u(\Delta(Z))^{|S|}\subset R^{|S|}$.

PROOF: (i) Note that u and u′ represent the same preference on Δ(Z), and so do $\bar P\mapsto\int_{\Delta(Z)} v\circ u\,d\bar P$ and $\bar P\mapsto\int_{\Delta(Z)} v'\circ u'\,d\bar P$ on Δ(Δ(Z)).

(ii) Note that u′ = au + b for some a > 0 and b ∈ R, and v′ ∘ u′ = cv ∘ u + d for some c > 0 and d ∈ R. Thus, v′(ax + b) = cv(x) + d for any x ∈ u(Δ(Z)). Then
$$\int_{\Delta(S)} v'\Big(\int_S u'(f)\,d\mu\Big)\,dm'(\mu)=\int_{\Delta(S)} v'\Big(a\int_S u(f)\,d\mu+b\Big)\,dm'(\mu)$$
$$=\ c\int_{\Delta(S)} v\Big(\int_S u(f)\,d\mu\Big)\,dm'(\mu)+d.$$
Since $\int_{\Delta(S)} v'(\int_S u'(f)\,d\mu)\,dm'(\mu)$ and $\int_{\Delta(S)} v(\int_S u(f)\,d\mu)\,dm(\mu)$ represent the same preference,
$$\int_{\Delta(S)} v\Big(\int_S u(f)\,d\mu\Big)\,dm(\mu)=\int_{\Delta(S)} v\Big(\int_S u(f)\,d\mu\Big)\,dm'(\mu)\qquad\text{for all } f\in H.$$
Since S is finite, it follows that
$$\int_{\Delta(S)} v(\mu\cdot t)\,dm(\mu)=\int_{\Delta(S)} v(\mu\cdot t)\,dm'(\mu)\qquad\text{for all } t\in u(\Delta(Z))^{|S|}.$$
Integrating both sides gives
$$\int_{u(\Delta(Z))^{|S|}}\int_{\Delta(S)} v(\mu\cdot t)\,dm(\mu)\,d\lambda(t)=\int_{u(\Delta(Z))^{|S|}}\int_{\Delta(S)} v(\mu\cdot t)\,dm'(\mu)\,d\lambda(t)$$
for any λ ∈ ca(u(Δ(Z))^{|S|}). Observe that (μ, t) ↦ v(μ · t) is jointly continuous and bounded, and hence m × λ-integrable. By the Fubini theorem,
$$\int_{\Delta(S)}\int_T v(\mu\cdot t)\,d\lambda(t)\,dm(\mu)=\int_{\Delta(S)}\int_T v(\mu\cdot t)\,d\lambda(t)\,dm'(\mu)$$
for all λ ∈ ca(T) with T = [u(Δ(Z))]^{|S|}, which completes the proof.
Q.E.D.
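To illustrate the uniqueness class characterized by Lemma C.1, the sketch below takes a polynomial v of degree 2 and two hypothetical second-order beliefs on a two-state Δ(S) (identified with μ(s₁) ∈ [0, 1]) that share the same first and second moments; they produce identical SOSEU values for every act:

```python
# Two moment-matched second-order beliefs are behaviorally identical when v is
# a degree-2 polynomial (the beliefs and v below are hypothetical choices).
v = lambda x: x - 0.3 * x ** 2

m1 = {0.1: 0.5, 0.9: 0.5}                 # mean 0.5, second moment 0.41
m2 = {0.0: 0.32, 0.5: 0.36, 1.0: 0.32}    # same mean and same second moment

def soseu(m, t):  # integral of v(mu . t) dm(mu), with t = (u in s1, u in s2)
    return sum(p * v(mu * t[0] + (1 - mu) * t[1]) for mu, p in m.items())

for t in [(1.0, 0.0), (0.3, 0.9), (0.7, 0.2)]:
    print(abs(soseu(m1, t) - soseu(m2, t)) < 1e-9)   # True for every act
```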
By Lemma C.1, characterizing D is crucial in determining the class of measures m that represent the same preference.

REFERENCES

ALIPRANTIS, C. D., AND K. C. BORDER (1999): Infinite Dimensional Analysis. Berlin: Springer. [1578,1592,1594-1596,1600,1601]
ANSCOMBE, F. J., AND R. J. AUMANN (1963): “A Definition of Subjective Probability,” Annals of Mathematical Statistics, 34, 199–205. [1577,1579,1581]
BERNASCONI, M. (1994): “Nonlinear Preferences and Two-Stage Lotteries: Theories and Evidence,” The Economic Journal, 104, 54–70. [1588]
BERNASCONI, M., AND G. LOOMES (1992): “Failures of the Reduction Principle in an Ellsberg-Type Problem,” Theory and Decision, 32, 77–100. [1588]
BUDESCU, D., AND I. FISCHER (2001): “The Same but Different: An Empirical Investigation of the Reducibility Principle,” Journal of Behavioral Decision Making, 14, 187–206. [1588]
CHOW, C. C., AND R. SARIN (2002): “Known, Unknown and Unknowable Uncertainties,” Theory and Decision, 52, 127–138. [1588]
CRAVEN, B. D., AND J. J. KOLIHA (1977): “Generalizations of Farkas’ Theorem,” SIAM Journal on Mathematical Analysis, 8, 983–997. [1592,1593]
CUBITT, R., C. STARMER, AND R. SUGDEN (1998): “Dynamic Choice and the Common Ratio Effect: An Experimental Investigation,” Economic Journal, 108, 1362–1380. [1588]
ELLSBERG, D. (1961): “Risk, Ambiguity and the Savage Axioms,” Quarterly Journal of Economics, 75, 643–669. [1575]
——— (2001): Risk, Ambiguity and Decision. New York: Garland Publishing. [1575,1582]
ERGIN, H., AND F. GUL (2009): “A Subjective Theory of Compound Lotteries,” Journal of Economic Theory, 144, 899–929. [1576]
GÄRDENFORS, P., AND N. E. SAHLIN (1982): “Unreliable Probabilities, Risk Taking, and Decision Making,” Synthese, 53, 361–386. [1588]
——— (1983): “Decision Making With Unreliable Probabilities,” British Journal of Mathematical and Statistical Psychology, 36, 240–251. [1588]
GILBOA, I., AND D. SCHMEIDLER (1989): “Maxmin Expected Utility With Non-Unique Prior,” Journal of Mathematical Economics, 18, 141–153. [1581,1585]
GRANDMONT, J. (1972): “Continuity Properties of a von Neumann–Morgenstern Utility,” Journal of Economic Theory, 4, 45–57. [1595]
GÜTH, W., E. VAN DAMME, AND M. WEBER (2005): “Risk Aversion on Probabilities: Experimental Evidence of Deciding Between Lotteries,” Homo Oeconomicus, 22, 191–209. [1588]
HALEVY, Y. (2007): “Ellsberg Revisited: An Experimental Study,” Econometrica, 75, 503–536. [1578,1582,1587,1588]
KELLER, L. R. (1985): “Testing of the ‘Reduction of Compound Alternatives’ Principle,” Omega, 13, 349–358. [1588]
KLIBANOFF, P., AND E. OZDENOREN (2007): “Subjective Recursive Expected Utility,” Economic Theory, 30, 49–87. [1588]
KLIBANOFF, P., M. MARINACCI, AND S. MUKERJI (2005): “A Smooth Model of Decision Making Under Ambiguity,” Econometrica, 73, 1849–1892. [1576,1584,1587]
KREPS, D. M. (1988): Notes on the Theory of Choice. Boulder, CO: Westview Press. [1577,1584,1588]
KREPS, D. M., AND E. L. PORTEUS (1978): “Temporal Resolution of Uncertainty and Dynamic Choice Theory,” Econometrica, 46, 185–200. [1588]
NAU, R. F. (2006): “Uncertainty Aversion With Second-Order Utilities and Probabilities,” Management Science, 52, 136–145. [1576]
QUIGGIN, J. (1982): “A Theory of Anticipated Utility,” Journal of Economic Behavior and Organization, 3, 323–343. [1576]
RONEN, J. (1971): “Some Effects of Sequential Aggregation in Accounting on Decision-Making,” Journal of Accounting Research, 9, 307–332. [1588]
SAVAGE, L. J. (1972): The Foundations of Statistics. New York: Dover (original ed. Wiley, 1954). [1588]
SCHOEMAKER, P. (1989): “Preferences for Information on Probabilities versus Prizes—the Role of Risk-Taking Attitudes,” Journal of Risk and Uncertainty, 2, 27–60. [1588]
SEGAL, U. (1987): “The Ellsberg Paradox and Risk Aversion: An Anticipated Utility Approach,” International Economic Review, 28, 175–202. [1576,1588]
——— (1990): “Two-Stage Lotteries Without the Reduction Axiom,” Econometrica, 58, 349–377. [1586-1588]
SNOWBALL, D., AND C. BROWN (1979): “Decision Making Involving Sequential Events: Some Effects of Disaggregated Data and Dispositions Toward Risk,” Decision Sciences, 10, 527–546. [1588]
YATES, F. J., AND L. G. ZUKOWSKI (1976): “Characterization of Ambiguity in Decision Making,” Behavioral Science, 21, 19–25. [1588]
Dept. of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University, 2001 Sheridan Road, Evanston, IL 60208, U.S.A.;
[email protected]. Manuscript received September, 2006; final revision received May, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1607–1636
SOCIAL IMAGE AND THE 50–50 NORM: A THEORETICAL AND EXPERIMENTAL ANALYSIS OF AUDIENCE EFFECTS BY JAMES ANDREONI AND B. DOUGLAS BERNHEIM1 A norm of 50–50 division appears to have considerable force in a wide range of economic environments, both in the real world and in the laboratory. Even in settings where one party unilaterally determines the allocation of a prize (the dictator game), many subjects voluntarily cede exactly half to another individual. The hypothesis that people care about fairness does not by itself account for key experimental patterns. We consider an alternative explanation, which adds the hypothesis that people like to be perceived as fair. The properties of equilibria for the resulting signaling game correspond closely to laboratory observations. The theory has additional testable implications, the validity of which we confirm through new experiments. KEYWORDS: Social image, audience effects, signaling, dictator game, altruism.
1. INTRODUCTION

EQUAL DIVISION OF MONETARY REWARDS and/or costs is a widely observed behavioral norm. Fifty–fifty sharing is common in the context of joint ventures among corporations (e.g., Veugelers and Kesteloot (1996), Dasgupta and Tao (1998), and Hauswald and Hege (2003)),2 share tenancy in agriculture (e.g., De Weaver and Roumasset (2002), Agrawal (2002)), and bequests to children (e.g., Wilhelm (1996), Menchik (1980, 1988)). “Splitting the difference” is a frequent outcome of negotiation and conventional arbitration (Bloom (1986)). Business partners often divide the earnings from joint projects equally, friends split restaurant tabs equally, and the U.S. government splits the nominal burden of the payroll tax equally between employers and employees. Compliance with a 50–50 norm has also been duplicated in the laboratory. Even when one party has all the bargaining power (the dictator game), typically 20 to 30 percent of subjects voluntarily cede half of a fixed payoff to another individual (Camerer (1997)).3

Our object is to develop a theory that accounts for the 50–50 norm in the dictator game, one we hope will prove applicable more generally.4 Experimental

1 We are indebted to the following people for helpful comments: Iris Bohnet, Colin Camerer, Navin Kartik, Antonio Rangel, three anonymous referees, and seminar participants at the California Institute of Technology, NYU, and Stanford University’s SITE Workshop in Psychology and Economics. We acknowledge financial support from the National Science Foundation through grant numbers SES-0551296 (Andreoni) and SES-0452300 (Bernheim).
2 Where issues of control are critical, one also commonly sees a norm of 50-plus-1 share.
3 The frequency of equal division is considerably higher in ultimatum games; see Camerer (2003).
4 Our theory is not necessarily a good explanation for all 50–50 norms. For example, Bernheim and Severinov (2003) proposed an explanation for equal division of bequests that involves a different mechanism.
© 2009 The Econometric Society
DOI: 10.3982/ECTA7384
evidence shows that a significant fraction of the population elects precisely 50–50 division even when it is possible to give slightly less or slightly more,5 that subjects rarely cede more than 50 percent of the aggregate payoff, and that there is frequently a trough in the distribution of fractions ceded just below 50 percent (see, e.g., Forsythe, Horowitz, Savin, and Sefton (1994)). In addition, choices depend on observability: greater anonymity for the dictator leads him to behave more selfishly and weakens the norm,6 as do treatments that obscure the dictator’s role in determining the outcome or that enable him to obscure that role.7 A good theory of behavior in the dictator game must account for all these robust patterns.

The leading theories of behavior in the dictator game invoke altruism or concerns for fairness (e.g., Fehr and Schmidt (1999), Bolton and Ockenfels (2000)). One can reconcile those hypotheses with the observed distribution of choices, but only by making awkward assumptions—for example, that the utility function is fortuitously kinked, that the underlying distribution of preferences contains gaps and atoms, or that dictators are boundedly rational. Indeed, with a differentiable utility function, the fairness hypothesis cannot explain why anyone would choose equal division (see Section 2 below). Moreover, neither altruism nor a preference for fairness explains why observability and, hence, audiences play such an important role in determining the norm’s strength.

This paper explores the implications of supplementing the fairness hypothesis with an additional plausible assumption: people like to be perceived as fair. We incorporate that desire directly into the utility function; alternatively, one could depict the dictator’s preference as arising from concerns about subsequent interactions.8 Our model gives rise to a signaling game wherein the dictator’s choice affects others’ inferences about his taste for fairness. Due to an intrinsic failure of the single-crossing property, the equilibrium distribution

5 For example, according to Andreoni and Miller (2002), a significant fraction of subjects (15–30 percent) adhered to equal division regardless of the sacrifice to themselves.
6 In double-blind trials, subjects cede smaller amounts, and significantly fewer adhere to the 50–50 norm (e.g., Hoffman, McCabe, and Smith (1996)). However, when dictators and recipients face each other, adherence to the norm is far more common (Bohnet and Frey (1999)). Andreoni and Petrie (2004) and Rege and Telle (2004) also found a greater tendency to equalize payoffs when there is an audience. More generally, studies of field data confirm that an audience increases charitable giving (Soetevent (2005)). Indeed, charities can influence contributions by adjusting the coarseness of the information provided to the audience (Harbaugh (1998)).
7 See Dana, Cain, and Dawes (2006), Dana, Weber, and Kuang (2007), and Broberg, Ellingsen, and Johannesson (2007). Various papers have made a similar point in the context of the ultimatum game (Kagel, Kim, and Moser (1996), Güth, Huck, and Ockenfels (1996), and Mitzkewitz and Nagel (1993)) and the holdup problem (Ellingsen and Johannesson (2005)). However, when the recipient is sufficiently removed from the dictator, the recipient’s potential inferences about the dictator’s motives have a small effect on choices (Koch and Normann (2008)).
8 For example, experimental evidence reveals that the typical person treats others better when he believes they have good intentions; see Blount (1995), Andreoni, Brown, and Vesterlund (2002), or Falk, Fehr, and Fischbacher (2008).
of transfers replicates the choice patterns listed above: there is a pool at precisely equal division, and no one gives either more or slightly less than half of the prize. In addition, consistent with experimental findings, the size of the equal division pool depends on the observability of the dictator’s choice. Thus, while our theory does leave some experimental results unexplained (see, e.g., Oberholzer-Gee and Eichenberger (2008) or our discussion of Cherry, Frykblom, and Shogren (2002) in Section 2), it nevertheless has considerable explanatory power. We also examine an extended version of the dictator game in which (a) nature sometimes intervenes, choosing an unfavorable outcome for the recipient, and (b) the recipient cannot observe whether nature intervened. We demonstrate that the equilibrium distribution of voluntary choices includes two pools, one at equal division and one at the transfer that nature sometimes imposes. An analysis of comparative statics identifies testable implications concerning the effects of two parameters. First, a change in the transfer that nature sometimes imposes changes the location of the lower pool. Second, an increase in the probability that nature intervenes reduces the size of the equal division pool and increases the size of the lower pool. We conduct new experiments designed to test those implications. Subjects exhibit the predicted behavior to a striking degree. The most closely related paper in the existing theoretical literature is Levine (1998). In Levine’s model, the typical individual acts generously to signal his altruism so that others will act more altruistically toward him. Though Levine’s analysis of the ultimatum game involves some obvious parallels with our work, he focused on a different behavioral puzzle.9 Most importantly, his analysis does not account for the 50–50 norm.10 He explicitly addresses only one feature of the behavioral patterns discussed above—the absence of transfers exceeding 50 percent of the prize—and his explanation depends on restrictive assumptions.11 As a general matter, a desire to signal altruism (rather than fairness) accords no special status to equal division, and those who care a great deal about others’ inferences will potentially make even larger transfers. 9 With respect to the ultimatum game, Levine’s main point is that, with altruism alone, it is impossible to reconcile the relatively low frequency of selfish offers with the relatively high frequency of rejections. 10 None of the equilibria Levine describes involves pooling at equal division. He exhibits a separating equilibrium in which only a single type divides the prize equally, as well as pooling equilibria in which no type chooses equal division. He also explicitly rules out the existence of a pure pooling equilibrium in which all types choose equal division. 11 In Levine’s model, the respondent’s inferences matter to the proposer only because they affect the probability of acceptance. Given his parametric assumptions, an offer of 50 percent is accepted irrespective of inferences, so there is no benefit to a higher offer. If one assumes instead that a more favorable social image always has positive incremental value, then those who are sufficiently concerned with signaling altruism will end up transferring more than 50 percent. Rotemberg (2008) extended Levine’s analysis and applied it to the dictator game, but imposed a maximum transfer of 50 percent by assumption.
One can view this paper as providing possible microfoundations for theories of warm-glow giving (Andreoni (1989, 1990)). It also contributes to the literature that explores the behavioral implications of concerns for social image (e.g., Bernheim (1994), Ireland (1994), Bagwell and Bernheim (1996), Glazer and Konrad (1996)). Recent contributions in that general area include Ellingsen and Johannesson (2008), Tadelis (2007), and Manning (2007). Our study is also related to the theoretical literature on psychological games, in which players have preferences over the beliefs of others (as in Geanakoplos, Pearce, and Stacchetti (1989)).

With respect to the experimental literature, our work is most closely related to a small collection of papers (cited in footnote 7) that studied the effects of obscuring either a subject’s role in dividing a prize or his intended division. By comparing obscured and transparent treatments, those experiments have established that subjects act more selfishly when the outcomes that follow from selfish choices have alternative explanations. We build on that literature by focusing on a class of games for which it is possible to derive robust comparative static implications from an explicit theory of audience effects; moreover, instead of studying one obscured treatment, we test the specific implications of our theory by varying two key parameters across a collection of obscured treatments.

More broadly, the experimental literature has tended to treat audience effects as unfortunate confounds that obscure “real” motives. Yet casual observation and honest introspection strongly suggest that people care deeply about how others perceive them and that those concerns influence a wide range of decisions. Our analysis underscores both the importance and feasibility of studying audience effects with theoretical and empirical precision.

The paper proceeds as follows: Section 2 describes the model, Sections 3 and 4 provide theoretical results, Section 5 describes our experiment, and Section 6 concludes. Proofs of theorems appear in the Appendix. Other referenced appendices are available online (Andreoni and Bernheim (2009)).

2. THE MODEL

Two players—a dictator (D) and a receiver (R)—split a prize normalized to have unit value. Let x ∈ [0, 1] denote the transfer R receives; D consumes c = 1 − x. With probability 1 − p, D chooses the transfer, and with probability p, nature sets it equal to some fixed value, x₀; then the game ends. The parameters p and x₀ are common knowledge, but R cannot observe whether nature intervened. For the standard dictator game, p = 0.

Potential dictators are differentiated by a parameter t, which indicates the importance placed on fairness; its value is D’s private information. The distribution of t is atomless and has full support on the interval [0, t̄]; H denotes
the cumulative distribution function (CDF).12 We define H_s as the CDF obtained from H, conditioning on t ≥ s.

D cares about his own prize, c, and his social image, m, as perceived by some audience A, which includes R (and possibly others, such as the experimenter). Preferences over c and m correspond to a utility function F(c, m) that is unbounded in both arguments, twice continuously differentiable, strictly increasing (with, for some f > 0, F₁(c, m) > f for all c ∈ [0, 1] and m ∈ R₊), and strictly concave in c. D also cares about fairness, judged by the extent to which the outcome departs from the most fair alternative, x_F. Thus, we write D’s total payoff as
$$U(x, m, t) = F(1 - x, m) + tG(x - x_F).$$
We assume G is twice continuously differentiable, strictly concave, and reaches a maximum at zero. We follow Fehr and Schmidt (1999) and Bolton and Ockenfels (2000) in assuming the players see themselves as equally meritorious in the standard dictator game, so that x_F = ½. Experiments by Cherry, Frykblom, and Shogren (2002) suggested that a different standard may apply when dictators allocate earned wealth. While our theory does not explain the apparent variation in x_F across contexts, it can, in principle, accommodate that variation.13

Note that the dictator’s preferences over x and m violate the single-crossing property. Picture his indifference curves in the (x, m) plane. As t increases, the slope of the indifference curve through any point (x, m) declines if x < ½, but rises if x > ½. Intuitively, comparing any two dictators, if x < ½, the one who is more fair-minded incurs a smaller utility penalty when increasing the transfer, because inequality falls; however, if x > ½, that same dictator incurs a larger utility penalty when increasing the transfer, because inequality rises.

Social image m depends on A’s perception of D’s fairness. We normalize m so that if A is certain D’s type is t, then D’s social image is t. We use Φ to denote the CDF that represents A’s beliefs about D’s type and use B(Φ) to denote the associated social image.
12 Some experiments appear to produce an atom in the choice distribution at 0, though the evidence for this pattern is mixed (see, e.g., Camerer (2003)). Our model does not produce that pattern (for p = 0 or x₀ > 0) unless we assume that there is an atom in the distribution of types at t = 0. Because the type space is truncated below at 0, it may be reasonable to allow for that possibility. One could also generate a choice atom at zero with p = 0 by assuming that some individuals do not care about social image (in which case the analysis would be more similar to the case of p > 0 and x₀ = 0). In experiments, it is also possible that a choice atom at zero results from the discreteness of the choice set and/or approximate optimization.
13 If the players are asymmetric with respect to publicly observed indicia of merit, the fairness of an outcome might depend on the extent to which it departs from some other benchmark, such as x_F = 0.4. Provided the players agree on x_F, similar results would follow, except that the behavioral norm would correspond to the alternate benchmark. However, if players have different views of x_F, matters are more complex.
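A small numerical sketch helps to see the single-crossing failure. The functional forms F(c, m) = √c + m and G(d) = −d² below are hypothetical specimens satisfying the stated assumptions, not the paper's calibration:

```python
# Marginal cost of raising the transfer x, for a low and a high fairness type t.
import numpy as np

def U(x, m, t, xF=0.5):
    return np.sqrt(1.0 - x) + m + t * (-(x - xF) ** 2)

def dU_dx(x, t, h=1e-6):                  # numerical derivative in x at fixed m
    return (U(x + h, 0.0, t) - U(x - h, 0.0, t)) / (2.0 * h)

for x in (0.3, 0.7):                       # below and above equal division
    print(x, dU_dx(x, t=0.5), dU_dx(x, t=2.0))
# Below 1/2, the higher type loses less (here even gains) from raising x;
# above 1/2, the higher type loses more: indifference-curve slopes cross twice.
```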
ASSUMPTION 1: (i) B is continuous (where the set of CDFs is endowed with the weak topology). (ii) min supp(Φ) ≤ B(Φ) ≤ max supp(Φ), with strict inequalities when the support of Φ is nondegenerate. (iii) If Φ′ is “higher” than Φ″ in the sense of first-order stochastic dominance, then B(Φ′) > B(Φ″).

As an example, B might calculate the mean of t given Φ. For some purposes, we impose a modest additional requirement (also satisfied by the mean):

ASSUMPTION 2: Consider the CDFs J, K, and L, such that J(t) = λK(t) + (1 − λ)L(t). If max supp(L) ≤ B(J), then B(J) ≤ B(K), where the second inequality is strict if the first is strict or if the support of L is nondegenerate.14

The audience A forms an inference Φ about t after observing x. Even though D does not observe that inference directly, he knows A will judge him based on x and, therefore, he accounts for the effect of his decision on A’s inference. Thus, the game involves signaling. We will confine attention throughout to pure strategy equilibria. A signaling equilibrium consists of a mapping Q from types (t) to transfers (x), and a mapping P from transfers (x) to inferences (Φ). We will write the image of x under P as P_x (rather than P(x)) and use P_x(t) to denote the inferred probability that D’s type is no greater than t upon observing x. Equilibrium transfers must be optimal given the inference mapping P (for all t ∈ [0, t̄], Q(t) solves max_{x∈[0,1]} U(x, P_x, t)), and inferences must be consistent with the transfer mapping Q (for all x ∈ Q([0, t̄]) and t ∈ [0, t̄], P_x(t) = prob(t′ ≤ t | Q(t′) = x)). We will say that Q is an equilibrium action function if there exists P such that (Q, P) is a signaling equilibrium.

Like most signaling models, ours has many equilibria, with many distinct equilibrium action functions. Our analysis will focus on equilibria for which the action function Q falls within a specific set: Q₁ for the standard dictator game (p = 0) and Q₂ for the extended dictator game (p > 0), both defined below. We will ultimately justify those restrictions by invoking a standard refinement for signaling games, the D1 criterion (due to Cho and Kreps (1987)), which insists that the audience attribute any action not chosen in equilibrium to the type that would choose it for the widest range of conceivable inferences.15 Formally, let U*(t) denote the payoff to type t in a candidate equilibrium (Q, P) and, for each (x, t) ∈ [0, 1] × [0, t̄], define m_x(t)

14 It is perhaps more natural to assume that if max supp(L) ≤ B(K), then B(J) ≤ B(K), where the second inequality is strict if the first is strict or if the support of L is nondegenerate. That alternative assumption, in combination with Assumption 1, implies Assumption 2 (see Lemma 5 in Andreoni and Bernheim (2007)).
15 We apply the D1 criterion once rather than iteratively. Similar results hold for other standard criteria (e.g., divinity). We acknowledge that experimental tests have called into question the general validity of equilibrium refinements for signaling games (see, e.g., Brandts and Holt (1992, 1993, 1995)). Our theory nevertheless performs well in this instance, possibly because the focality of the 50–50 norm coordinates expectations.
SOCIAL IMAGE AND THE 50–50 NORM
1613
as the value of m that satisfies U(x m t) = U ∗ (t). Let Mx = arg mint∈[0t] mx (t) if mint∈[0t] mx (t) ≤ t and = [0 t] otherwise. The D1 criterion requires that, for all x ∈ [0 1]\Q([0 t]), Px places probability only on the set Mx . If the dictator’s type were observable, the model would not reproduce observed behavior: every type would choose a transfer strictly less than 12 and there would be no gaps or atoms in the distribution of voluntary choices, apart from an atom at x = 0 (see Andreoni and Bernheim (2007)). Henceforth, we will use x∗ (t) to denote the optimal transfer for type t when type is observable (i.e., the value of x that maximizes U(x t t)). 3. ANALYSIS OF THE STANDARD DICTATOR GAME For the standard dictator game, we will focus on equilibria involving action functions belonging to a restricted set, Q1 . To define that set, we must first describe differentiable action functions that achieve local separation of types. Consider a simpler game with p = 0 and types lying in some interval [r w] ⊆ [0 t]. In a separating equilibrium, for each type t ∈ [r w], t’s choice, denoted S(t), must be the value of x that maximizes the function U(x S −1 (x) t) over x ∈ S([r w]). Assuming S is differentiable, the solution satisfies the first-order condition dU = 0. Substituting x = S(t) into the first-order condition, we obtain dx (1)
S (t) = −
F2 (1 − S(t) t) 1 − F1 (1 − S(t) t) tG S(t) − 2
The preceding expression is a nonlinear first-order differential equation. We will be concerned with solutions with initial conditions of the form (r z) (a choice z for type r) such that z ≥ x∗ (r). For any such initial condition, (1) has a unique solution, denoted Srz (t).16 In the Appendix (Lemma 3), we prove that, for all r and z with z ≥ x∗ (r), Srz (t) is strictly increasing in t for t ≥ r, and ∗ there exists a unique type trz > r (possibly exceeding t) to which Srz (t) assigns ∗ equal division (i.e., Srz (trz ) = 12 ). Now we define Q1 . The action function Q belongs to Q1 if and only if it falls into one of the following three categories: EFFICIENT DIFFERENTIABLE SEPARATING ACTION FUNCTION: Q(t) = S00 (t) for all t ∈ [0 t], where S00 (t) ≤ 12 . CENTRAL POOLING ACTION FUNCTION: Q(t) =
1 2
for all t ∈ [0 t].
16 If z = x∗ (r), then S (r) is undefined, but the uniqueness of the solution is still guaranteed; one simply works with the inverse separating function (see Proposition 5 of Mailath (1987)).
1614
J. ANDREONI AND B. DOUGLAS BERNHEIM
BLENDED ACTION FUNCTIONS: There is some t0 ∈ (0 t) with S00 (t0 ) < 12 such that, for t ∈ [0 t0 ], we have Q(t) = S00 (t), and for t ∈ (t0 t], we have Q(t) = 12 . We will refer to equilibria that employ these types of action functions as, respectively, efficient differentiable separating equilibria, central pooling equilibria, and blended equilibria. A central pooling equilibrium requires U(0 0 0) ≤ U( 12 B(H) 0), so that the lowest type weakly prefers to be in the pool rather than choose his first-best action and receive the worst possible inference. A blended equilibrium requires U(S00 (t0 ) t0 t0 ) = U( 12 B(Ht0 ) t0 ), so that the highest type that separates is indifferent between separating and joining the pool.17 Figure 1 illustrates a blended equilibrium. Types separate up to t0 , and higher types choose equal division. An indifference curve for type t0 (It0 ) passes through both point A—the separating choice for t0 —and point B—the outcome for the pool. The indifference curve for any type t > t0 through point B (It>t0 ) is flatter than It0 to the left of B and steeper to the right. Therefore, all such types strictly prefer the pool to any point on S00 (t) below t0 .
FIGURE 1.—A blended equilibrium. 17 Remember that Ht0 is defined as the CDF obtained starting from H (the population distribution) and conditioning on t ≥ t0 . Because of t0 ’s indifference, there is an essentially identical equilibrium (differing from this one only on a set of measure zero) where t0 resolves its indifference in favor of 12 (that is, it joins the pool).
SOCIAL IMAGE AND THE 50–50 NORM
1615
The following result establishes the existence and uniqueness of equilibria within Q1 and justifies our focus on that set. THEOREM 1: Assume p = 0 and that Assumption 1 holds. Restricting attention to Q1 , there exists a unique equilibrium action function, QE . It is an efficient ∗ 18 differentiable separating function iff t ≤ t00 . Moreover, there exists an inference E E E mapping P such that (Q P ) satisfies the D1 criterion and, for any other equilibrium (Q P) satisfying that criterion, Q and QE coincide on a set of full measure. Thus, our model of behavior gives rise to a pool at equal division in the standard dictator game if and only if the population contains sufficiently fair∗ ). To appreciate why, consider the manner in which the minded people (t > t00 single-crossing property fails: a larger transfer permits a dictator who cares more about fairness to distinguish himself from one who cares less about fairness if and only if x < 12 . Thus, x = 12 serves as something of a natural boundary on chosen signals. In standard signaling environments (with single crossing), the D1 criterion isolates either separating equilibria or, if the range of potential choice is sufficiently limited, equilibria with pools at the upper boundary of the action set (Cho and Sobel (1990)). In our model, 12 is not literally a boundary, and indeed there are equilibria in which some dictators transfer more than 12 . However, there is only limited scope in equilibrium for transfers exceeding 12 (see Lemma 2 in the Appendix) and those possibilities do not survive the application of the D1 criterion. Accordingly, when t is sufficiently large, dictators who seek to distinguish themselves from those with lower values of t by giving more “run out of space” and must therefore join a pool at x = 12 .19 Note that our theory accounts for the behavioral patterns listed in the Introduction. First, provided that the some people are sufficiently fair-minded, there is a spike in the distribution of choices precisely at equal division, even if the prize is perfectly divisible. Second, no one transfers more than half the prize. Third, no one transfer slightly less than half the prize (recall that S00 (t0 ) < 12 for blended equilibria). Intuitively, if a dictator intends to divide the pie unequally, it makes no sense to divide it only slightly unequally, since negative inferences about his motives will overwhelm the tiny consumption gain. ∗ ∗ According to the general definition given above, t00 is defined by the equation S00 (t00 ) = 12 . Despite some surface similarities, the mechanism producing a central pool in this model differs from those explored in Bernheim (1994) and Bernheim and Severinov (2003). In those papers, the direction of imitation reverses when type passes some threshold; types in the middle are unable to adjust their choices to simultaneously deter imitation from the left and from the right. Here, higher types always try to deter imitation by lower types, but are simply unable to do that once x reaches 12 . The main result here is also cleaner in the following sense: in Bernheim (1994) and Bernheim and Severinov (2003), there is a range of possible equilibrium norms; here, equal division is the only possible equilibrium norm. 18 19
1616
J. ANDREONI AND B. DOUGLAS BERNHEIM
Our theory also explains why greater anonymity for the dictator leads him to behave more selfishly and weakens the 50–50 norm. Presumably, treatments with less anonymity cause dictators to attach greater importance to social im attaches more importance to social image than age. Formally, we say that U U if U(x m t) = U(x m t) + φ(m), where φ is differentiable, and φ (m) is strictly positive and bounded away from zero. The addition of the separable term φ(m) allows us to vary the importance of social image without altering the trade-off between consumption and equity. The following result tells us that an increase in the importance attached to social image increases the extent to which dictators conform to the 50–50 norm: attaches THEOREM 2: Assume p = 0 and that Assumption 1 holds. Suppose U more importance to social image than U. Let π and π denote the measures of and U, respectively (based on the equilibrium action types choosing x = 12 for U E QE ∈ Q1 ). Then functions Q π ≥ π with strict inequality when π ∈ (0 1). 4. ANALYSIS OF THE EXTENDED DICTATOR GAME Next we explore the theory’s implications for our extended version of the dictator game. With p > 0 and x0 close to zero, the distribution of voluntary choices has mass not only at 12 (if t is sufficiently large), but also at x0 . Intuitively, the potential for nature to choose x0 regardless of the dictator’s type reduces the stigma associated with voluntarily choosing x0 . Moreover, as p increases, more and more dictator types are tempted to “hide” their selfishness behind nature’s choice. That response mitigates the threat of imitation, thereby allowing higher types to reduce their gifts as well. Accordingly, the measure of types voluntarily choosing x0 grows, while the measure of types choosing 12 shrinks. We will focus on equilibria involving action functions belonging to a restricted set Q2 . To simplify notation, we define S t ≡ Stmax{x0 x∗ (t)} . The action function Q belongs to Q2 if and only if it falls into one of the following three categories: BLENDED DOUBLE-POOL ACTION FUNCTION: There is some t0 ∈ (0 t) and t1 ∈ (t0 t) with S t0 (t1 ) < 12 such that for t ∈ [0 t0 ], we have Q(t) = x0 ; for t ∈ (t0 t1 ], we have Q(t) = S t0 (t); and for t ∈ (t1 t], we have Q(t) = 12 . BLENDED SINGLE-POOL ACTION FUNCTION: There is some t0 ∈ (0 t) with x∗ (t0 ) ≥ x0 and S t0 (t) < 12 such that for t ∈ [0 t0 ], we have Q(t) = x0 , and for t ∈ (t0 t], we have Q(t) = S t0 (t). DOUBLE-POOL ACTION FUNCTION: There is some t0 ∈ (0 t) such that for t ∈ [0 t0 ], we have Q(t) = x0 , and for t ∈ (t0 t], we have Q(t) = 12 .
SOCIAL IMAGE AND THE 50–50 NORM
1617
We will refer to equilibria that employ such action functions as, respectively, blended double-pool equilibria, blended single-pool equilibria, and doublepool equilibria. In a blended double-pool equilibrium, type t0 must be indifferent between pooling at x0 and separating: p t t0 = U max{x0 x∗ (t0 )} t0 t0 U x0 B H (2) 0 t is the CDF for types transferring x0 .20 Also, type t1 must be indifferwhere H 0 ent between separating and joining the pool choosing 12 : p
1 U B(Ht1 ) t1 = U(S t0 (t1 ) t1 t1 ) 2
(3)
Finally, if x0 > 0, type 0 must weakly prefer the lower pool to his first-best action combined with the worst possible inference: p t 0 (4) U(0 0 0) ≤ U x0 B H 0 In a blended single pool equilibrium, (2) and (4) must hold. Finally, in a double-pool equilibrium, expression (4) must hold; also, type t0 must be indifferent between pooling at 12 and pooling at x0 , and must weakly prefer both to all x ∈ (x0 12 ) with revelation of its type: (5)
p t t0 = U 1 B Ht t0 ≥ U max{x0 x∗ (t0 )} t0 t0 U x0 B H 0 0 2
Figure 2 illustrates a blended double-pool equilibrium for x0 = 0. The indifference curve It0 indicates that type t0 is indifferent between the lower pool (point A) and separating with its first-best choice, x∗ (t0 ) (point B). All types between t0 and t1 choose a point on the separating function generated using point B as the initial condition. The indifference curve It1 indicates that type t1 is indifferent between separating (point C) and the upper pool at x = 12 (point D). A blended single-pool equilibrium omits the pool at 12 , and a doublepool equilibrium omits the interval with separation of types. Because (4) is required for all three types of equilibria described above, we will impose a condition on x0 and p that guarantees it: tp ) 0 (6) U(0 0 0) ≤ U x0 min B(H t∈[0t]
p 1−p tp (t ) ≡ ( Specifically, H )H(t ) + ( p+(1−p)H(t) )H(min{t t }). Note that if max{x0 p+(1−p)H(t) ∗ t0 tp ) = t0 , so that the x (t0 )} = x0 , then S (t0 ) = x0 . In that case, condition (2) simply requires B(H 0 outcome for t0 is the same as separation. 20
1618
J. ANDREONI AND B. DOUGLAS BERNHEIM
FIGURE 2.—A blended double-pool equilibrium.
tp ) is continuous in t, so the minimization is well deOne can show that B(H tp ) > 0. Therefore, for all p > 0, (6) is fined; moreover, for p > 0, mint∈[0t] B(H satisfied as long as x0 is not too large. One can also show that, for any x0 such that U(x0 B(H) 0) > U(0 0 0), (6) is satisfied for p sufficiently large. The following theorem establishes the existence and uniqueness of equilibria within Q2 and justifies our focus on that set. THEOREM 3: Assume p > 0, that Assumptions 1 and 2 hold, and that (6) is satisfied.21 Restricting attention to Q2 , there exists a unique equilibrium action function QE . If t is sufficiently large, QE is either a double-pool or blended doublepool action function. Moreover, there exists an inference mapping P E such that (QE P E ) satisfies the D1 criterion, and for any other equilibrium (Q P) satisfying that criterion, Q and QE coincide on a set of full measure. The unique equilibrium action function in Q2 has several notable properties. For voluntary choices, there is always mass at x0 . Nature’s exogenous choice of x0 induces players to “hide” their selfishness by mimicking that choice. 21 With some additional arguments, our analysis extends to arbitrary p and x0 . The possible equilibrium configurations are similar to those described in the text, except that there may also be an interval of separation involving types with t near zero who chose transfers below x0 along S00 . For some parameter values, existence may be problematic unless one slightly modifies the game, for example, by allowing the dictator to reveal his responsibility for the transfer.
SOCIAL IMAGE AND THE 50–50 NORM
1619
There is never positive mass at any other choice except 12 . As before, there is a gap in the distribution of choices just below 12 .22 In addition, one can show that both t0 and t1 are monotonically increasing in p. Consequently, as p increases, the mass at x0 grows, and the mass at x = 12 shrinks. Formally, this can be stated as follows: THEOREM 4: Assume p > 0, that Assumptions 1 and 2 hold, and that (6) is satisfied. Let π0 and π1 denote the measures of types choosing x = x0 and x = 12 , respectively (based on the equilibrium action function QE ∈ Q2 ). Then π0 is strictly increasing in p and π1 is decreasing (strictly if positive) in p.23 After circulating an earlier draft of this paper, we became aware of work by Dana, Cain, and Dawes (2006) and Broberg, Ellingsen, and Johannesson (2007), which shows that many dictators are willing to sacrifice part of the total prize to opt out of the game, provided that the decision is not revealed to recipients. Though we did not develop our theory with those experiments in mind, it provides an immediate explanation. Opting out permits the dictator to avoid negative inferences while acting selfishly. In that sense, opting out is similar (but not identical) to choosing an action that could be attributable to nature. Not surprisingly, a positive mass of dictator types takes that option in equilibrium. For details, see online Appendix A (Andreoni and Bernheim (2009)). 5. EXPERIMENTAL EVIDENCE We designed a new experiment to test the theory’s most direct implications: increasing p should increase the mass of dictators who choose any given x0 (close to zero) and reduce the mass of dictators who split the payoff equally. Thus, we examine the effects of varying both p and x0 . 5.1. Overview of the Experiment We divide subjects into pairs, with partners and roles assigned randomly. Each pair splits a $20 prize. To facilitate interpretation, we renormalize x, measuring it on a scale of 0 to 20. Thus, equal division corresponds to x = 10 rather than x = 05. Dictators, recipients, and outcomes are publicly identified at the conclusion of the experiment to heighten the effects of social image. For our 22
One can also show that a gap just above x0 definitely forms for p sufficiently close to unity and definitely does not form for p sufficiently close to zero. However, since we do not attempt to test those implications, we omit a formal demonstration for the sake of brevity. 23 One can also show that the measure of types choosing x = x0 converges to zero as p approaches zero; see Andreoni and Bernheim (2007).
1620
J. ANDREONI AND B. DOUGLAS BERNHEIM
purposes, there is no need to distinguish between intrinsic concern for an audience’s reaction and concern arising from subsequent social interaction.24 We examine choices for four values of p (0, 0.25, 0.5, and 0.75) and two values of x0 (0 and 1). Identifying the distribution of voluntary choices for eight parameter combinations requires a great deal of data.25 One possible approach is to use the strategy method: ask each dictator to identify binding choices for several games, in each case conditional on nature not intervening, and then choose one game at random to determine the outcome. Unfortunately, that approach raises two serious concerns. First, in piloting the study, we discovered that subjects tend to focus on ex ante fairness—that is, the equality of expected payoffs before nature’s move. If a dictator knows that nature’s intervention will favor him, he may compensate by choosing a strategy that favors the recipient when nature does not intervene. While that phenomenon raises some interesting questions concerning ex ante versus ex post fairness, concerns for ex ante fairness are properly viewed as confounds in the context of our current investigation. Second, the strategy method potentially introduces unintended and confounding audience effects. If a subject views the experimenter as part of the audience, the possibility that the experimenter will make inferences about the subject’s character from his strategy rather than from the outcome may influence his choices. Our theory assumes the relevant audience lacks that information. We address those concerns through the following measures. (i) We use the strategy method only to elicit choices for different games, not to elicit the subject’s complete strategy for a game. For each game, the dictator is only asked to make a choice if he has been informed that his choice will govern the outcome. Thus, within each game, each decision is made ex post rather than ex ante, so there is no risk that the experimenter will draw inferences from portions of strategies that are never executed. (ii) We modify the extended dictator game by making nature’s choice symmetric: nature intervenes with probability p, transferring x0 and 20 − x0 with equal probabilities (p/2). The symmetry neutralizes the tendency among dictators to compensate for any ex ante asymmetry in nature’s choice. Notably, this modification does not alter the theoretical results described in Section 4.26 (iii) Our procedures guarantee that no one 24 A similar statement applies to concerns involving experimenter demand effects in dictator games (see, e.g., List (2007)). Our experiment creates demand effects that mirror those present in actual social situations. Because they are the objects of our study, we do not regard them as confounds. 25 Suppose, for example, that we wish to have 30 observations of voluntary choices for each parameter combination. If each pair of subjects played one game, the experiment would require 1000 subjects and $15,000 in subject payments. 26 For the purpose of constructing an equilibrium, the mass at 20 − x0 can be ignored. It is straightforward to demonstrate that all types will prefer their equilibrium choices to that alternative, given it will be associated with the social image B(H). They prefer their equilibrium choices to the action chosen by t and must prefer that choice to 20 − x0 , because it provides more consumption, less inequality, and a better social image.
SOCIAL IMAGE AND THE 50–50 NORM
1621
can associate any dictator with his or her strategy. We make that point evident to subjects. (iv) Subjects’ instructions emphasize that everyone present in the lab will observe the outcome associated with each dictator. We thereby focus the subjects’ attention on the revelation of particular information to a particular audience. See Appendix B (online) for details concerning our experimental protocol and see Appendix D (online) for the subjects’ instructions. We examine two experimental conditions: one with x0 = 0 (“condition 0”) and one with x0 = 1 (“condition 1”). Each pair of subjects is assigned to a single condition and each dictator makes choices for all four values of p. Thus, we identify the effects of x0 from variation between subjects and the effects of p from variation within subjects. When p = 0, we should observe the same distribution of choices for both conditions, including a spike at x = 10, a 50–50 split. For p = 025, a second spike should appear, located at x = 0 for condition 0 and at x = 1 for condition 1. As we increase p to 0.50 and 0.75, the spikes at 10 should shrink and the spikes at x0 should grow. The subjects were 120 volunteers from undergraduate economic courses at the University of Wisconsin–Madison in March and April 2006. We divided the subjects into 30 pairs for each condition; unexpected attrition left 29 pairs for condition 1. Each subject maintained the same role (dictator or recipient) throughout. The closest existing parallel to our experiment is the “plausible deniability” treatment of Dana, Weber, and Kuang (2007), which differs from ours in the following ways: (a) the probability that nature intervenes depends on the dictator’s response time, (b) only two choices are available, and nature chooses both with equal probability, so that no choice is unambiguously attributable to the dictator, and (c) the effects of variations in the likelihood of intervention and the distribution of nature’s choice are not examined. 5.2. Main Findings Figures 3 and 4 show the distributions of dictators’ voluntary choices in condition 0 (x0 = 0) and condition 1 (x0 = 1), respectively. For ease of presentation, we group values of x into five categories: x = 0, x = 1, 2 ≤ x ≤ 9, x = 10, and x > 10.27 In both conditions, as in previous experiments, transfers exceeding half the prize are rare.28 27 Although subjects were permitted to choose any division of the $20 prize and were provided with hypothetical examples in which dictators chose allocations that involved fractional dollars, all chosen allocations involved whole dollars. 28 For condition 0, there were three violations of this prediction (involving two subjects) out of 139 total choices. One subject gave away $15 when p = 0. A second subject gave away $15 in one of two instances with p = 025 (but gave away $10 in the other instance) and gave away $11 when p = 075. For condition 1, there were only two violations (involving just one subject) out of 134 total choices. That subject chose x = 19 with p = 05 and 0.75. When asked to explain her choices on the postexperiment questionnaire, she indicated that she alternated between giving $1
1622
J. ANDREONI AND B. DOUGLAS BERNHEIM
FIGURE 3.—Distribution of amounts allocated to partners, condition 0.
These figures provide striking confirmation of our theory’s predictions. Look first at Figure 3 (condition 0). For p = 0, we expect a spike at x = 10. Indeed, 57 percent of dictators divided the prize equally. Consistent with results obtained from previous dictator experiments, a substantial fraction of subjects
FIGURE 4.—Distribution of amounts allocated to partners, condition 1.
and $19 to “give me and my partner equal opportunities to make the same $.” Thus, despite our precautions, she was clearly concerned with ex ante fairness. The total numbers of observations reported here exceeds the numbers reported in Table I because here we do not average duplicative choices for p = 025.
1623
SOCIAL IMAGE AND THE 50–50 NORM
(30 percent) chose x = 0.29 As we increase p, we expect the spike at x = 10 to shrink and the spike at x = 0 to grow. That is precisely what happens. Note also that no subject chose x = 1 for any value of p. Look next at Figure 4 (condition 1). Again, for p = 0, we expect a spike at x = 10. Indeed, 69 percent of dictators divided the prize equally, while 17 percent kept the entire prize (x = 0) and only 3 percent (one subject) chose x = 1. As we increase p, the spike at x = 10 once again shrinks. In this case, however, a new spike emerges at x = 1. As p increases to 0.75, the fraction of dictators choosing x = 1 rises steadily from 3 percent to 48 percent, while the fraction choosing x = 10 falls steadily from 69 percent to 34 percent. Notably, the fraction choosing x = 0 falls in this case from 17 percent to 10 percent. Once again, the effect of variations in p on the distribution of choices is dramatic, and exactly as predicted. Table I addresses the statistical significance of these effects by reporting estimates of two random-effects probit models. The specifications in the first two columns of results describe the probability of selecting x = x0 ; those in the last two columns describe the probability of selecting x = 10, equal division. The explanatory variables include indicators for p ≥ 025, p ≥ 05, p = 075, and x0 = 1 (with p ≥ 0 and x0 = 0 omitted). In all cases, we report marginal effects at mean values, including the mean of the unobserved individual heterogeneity. We pool data from both conditions; similar results hold for each condition separately. TABLE I RANDOM EFFECTS PROBIT MODELS: MARGINAL EFFECTS FOR REGRESSIONSa Probability of Choosing x = x0
p ≥ 025 p ≥ 050 p = 075 x0 = 1 Observations
0.467*** (0.110) 0.346*** (0.129) −0.002 (0.132) −0.524*** (0.179) 236
0.467*** (0.110) 0.345*** (0.113) −0.524*** (0.179) 236
Probability of Choosing x = 10b
−0.532*** (0.124) −0.175* (0.133) −0.042 (0.130) 0.224 (0.219) 236
−0.532*** (0.124) −0.196** (0.116)
0.224 (0.219) 236
a Standard errors given in parentheses. Significance: *** α < 001, ** α < 005, * α < 01, one-sided tests. b Equal division.
29 For instance, the fraction of dictators who kept the entire prize was 35 percent in Forsythe et al. (1994) and 33 percent in Bohnet and Frey (1999). In contrast to our experiment, however, no dictators kept the entire prize in Bohnet and Frey’s “two-way identification” condition. One potentially important difference is that Bohnet and Frey’s subjects were all students in the same course, whereas our subjects were drawn from all undergraduates enrolled in economics courses at the University of Wisconsin–Madison.
1624
J. ANDREONI AND B. DOUGLAS BERNHEIM
The coefficients in the first column of results imply that there is a statistically significant increase in pooling at x = x0 when p rises from 0 to 0.25 and from 0.25 to 0.5 (α < 001, one-tailed t-test), but not when p rises from 0.5 to 0.75. The significant negative coefficient for x0 = 1 may reflect the choices of a subset of subjects who are unconcerned with social image and who, therefore, transfer nothing. Dropping the insignificant p = 075 indicator has little effect on the other coefficients (second column of results). The coefficients in the third column of results imply that there is a statistically significant decline in pooling at x = 10 when p rises from 0 to 0.25 (α < 001, one-tailed t-test) and from 0.25 to 0.5 (α < 01, one-tailed t-test), but not when p rises from 0.5 to 0.75. As shown in the last column, the effect of an increase in p from 0.25 to 0.5 on pooling at x = 10 becomes even more statistically significant when we drop the insignificant p = 075 indicator (α < 005, one-tailed t-test). As an additional check on the model’s predictions, we compare choices across the two conditions for p = 0. As predicted, we find no significant difference between the two distributions (Mann–Whitney z = 0670 α < 050; Kolmogorov–Smirnov k = 013 α < 095). The higher fraction of subjects choosing x = 0 in condition 0 (30 percent versus 17 percent) and the higher fraction choosing x = 1 in condition 1 (3 percent versus 0 percent) suggest a modest anchoring effect, but that pattern is also consistent with chance (comparing choices of x = 0, we find t = 1145 α < 026). Our theory implies that, as p increases, a subject in condition 0 will not increase his gift, x. Five of 30 subjects violate that monotonicity prediction; for each, there is one violation. The same prediction holds for condition 1, with an important exception: an increase in p could induce a subject to switch from x = 0 to x = 1. We find four violations of monotonicity for condition 1, but two involve switches from x = 0 to x = 1. Thus, problematic violations of monotonicity are relatively uncommon (11.9 percent of subjects). As a further check on the validity of our main assumptions concerning preferences and to assess whether our model generates the right predictions for the right reasons, we also examined data on attitudes and motivations obtained from a questionnaire administered after subjects completed the experiment. Self-reported motivations correlated with choices in precisely the manner our theory predicts. For details, see Appendix C (online). 6. CONCLUDING COMMENTS We have proposed and tested a theory of behavior in the dictator game that is predicated on two critical assumptions: first, people are fair-minded to varying degrees; second, people like others to see them as fair. We have shown that this theory accounts for previously unexplained behavioral patterns. It also has sharp and testable ancillary implications which new experimental data confirm. Narrowly interpreted, this study enriches our understanding of behavior in the dictator game. More generally, it provides a theoretical framework that potentially accounts for the prevalence of the equal division norm in real-world
1625
SOCIAL IMAGE AND THE 50–50 NORM
settings. Though our theory may not provide the best explanation for all 50–50 norms, it nevertheless deserves serious consideration in many cases. In addition, this study underscores both the importance and the feasibility of studying audience effects, which potentially affect a wide range of real economic choices, with theoretical and empirical precision. APPENDIX LEMMA 1: In equilibrium, G(Q(t) − 12 ) is weakly increasing in t. PROOF: Consider two types, t and t with t < t . Suppose type t chooses x earning image m, while t chooses x earning image m . Let f = F(1 − x m), f = F(1 − x m ), g = G(x − 12 ), and g = G(x − 12 ). Mutual nonimitation requires f + t g ≥ f + t g and f + tg ≤ f + tg; thus, (g − g)(t − t) ≥ 0. Since Q.E.D. t − t > 0, it follows that g − g ≥ 0. LEMMA 2: Suppose Q(t) > 12 . Define x < 12 as the solution (if any) to G(x − ) = G(Q(t) − 12 ). Then for all t > t, Q(t ) ∈ {x Q(t)} if p = 0 and Q(t ) ∈ {x Q(t) x0 } if p > 0.30 1 2
PROOF: According to Lemma 1, G(Q(t ) − 12 ) ≥ G(Q(t) − 12 ). To prove this lemma, we show that the inequality cannot be strict unless p > 0 and Q(t ) = x0 . Suppose on the contrary that it is strict for some t , and either p = 0 or p > 0 and Q(t ) = x0 . Let t 0 = inf{τ | Q(τ) = Q(t )}. It follows from Lemma 1 that for all t > t 0 , Q(t ) = Q(t). Thus, B(PQ(t) ) ≤ t 0 ≤ B(PQ(t ) ). Since G is single-peaked, Q(t ) < Q(t). Thus, all types, including t, prefer Q(t ) to Q(t), a contradiction. Q.E.D. LEMMA 3: Assume z ≥ x∗ (r). (a) Srz (t) > x∗ (t) for t > r. (b) For all t > (t) > 0. (c) If Srz (t ) ≤ 12 and Srz (t ) ≤ 12 , type t ≥ 0 strictly prefers r, Srz ∗ > r such that Srz (t ∗ ) = (x m) = (Srz (t ) t ) to (Srz (t ) t ). (d) There exists trz 1 . (e) Srz (t) is increasing in z and continuous in r and z. 2 PROOF: (a) First consider the case of z > x∗ (r). Suppose the claim is false. Then, since the solution to (1) must be continuous, there is some t such that Srz (t ) = x∗ (t ) and Srz (t) > x∗ (t) for r ≤ t < t . As t approaches t from below, (tk ) increases without bound (see (1)). In contrast, given our assumptions Srz about F and G, the derivative of x∗ (t) is bounded within any neighborhood of t . But then Srz (t) − x∗ (t) must increase over some interval (t t ) (with t < t ), which contradicts Srz (t ) − x∗ (t ) = 0. 30 As a corollary, it follows that there is at most one value of x greater than equilibrium.
1 2
chosen in any
1626
J. ANDREONI AND B. DOUGLAS BERNHEIM
Now consider the case of z = x∗ (r). If U1 (x∗ (r) r r) = 0, then Srz (r) is indx∗ (t) ∗ finite (see (1)), while dt |t=r is finite. If U1 (x (r) r r) < 0 (which requires ∗ (r) > 0, while dxdt(t) |t=r = 0. In either case, Srz (t) > x∗ (t) x∗ (r) = 0), then Srz for t slightly larger than r; one then applies the argument in the previous paragraph. (b) Given (1), the claim follows directly from part (a). (c) Consider t and t with Srz (t ) and Srz (t ) ≤ 12 . Assume that t < t . Then
U(Srz (t ) t t ) − U(Srz (t ) t t )
t dU(Srz (t) t t ) dt = dt t
t 1 − F1 (1 − Srz (t) t) Srz < tG Srz (t) − (t) 2 t + F2 (1 − Srz (t) t) dt = 0 where the inequality follows from Srz (t) < 12 and where the final equality follows from (1). The argument for t < t is symmetric. (d) Assume the claim is false. Because Srz (t) is continuous, we have Srz (t) ∈ (0 12 ) for arbitrarily large t. Using the boundedness of F1 (implied by the continuous differentiability of F ) and the unboundedness of F in its second argument, we have limt→∞ [U(Srz (t) t r) − U(Srz (r) r r)] > 0, which contradicts part (c). (e) If z > z , then Srz (r) > Srz (r). Because the two trajectories are continuous and (for standard reasons) cannot intersect, we have Srz (t) > Srz (t) for all t > r. Continuity in r and z follows from standard properties of the solutions of differential equations. Q.E.D. PROOF OF THEOREM 1: ∗ Step 1A: If t > t00 , there is at most one equilibrium action function in Q1 and it must be either a central pooling or a blended equilibrium action function. We can rule out the existence of an efficient separating equilibrium: part ∗ Lemma 3(b) implies that G(S00 (t) − 12 ) is strictly decreasing in t for t > t00 , so according to Lemma 1, S00 cannot be an equilibrium action function. ∗ For t ∈ [0 t00 ], define ψ(t) as the solution to U(S00 (t) t t) = U( 12 ψ(t) t). The existence and uniqueness of a solution are trivial given our assumptions; continuity of ψ follows from continuity of S00 and U. In addition, ψ (t) = [G(S00 (t) − 12 ) − G(0)][F2 ( 12 ψ(t))]−1 , which implies that ψ(t) is strictly de∗ creasing in t on [0 t00 ). Note that we can rewrite the weak preference condition for a central pooling equilibrium as ψ(0) ≤ B(H) and rewrite the indiffer∗ ). ence condition for a blended equilibrium as ψ(t0 ) = B(Ht0 ) for t0 ∈ (0 t00
SOCIAL IMAGE AND THE 50–50 NORM
1627
First suppose ψ(0) ≤ B(H). B(Ht ) is plainly strictly increasing in t and ψ(t) ∗ ) for which ψ(t0 ) = B(Ht0 ) and, is strictly decreasing, so there is no t0 ∈ (0 t00 hence, no blended equilibrium action function; if there is an equilibrium action function in Q1 , it employs the unique central pooling action function. Next suppose ψ(0) > B(H), so there is no central pooling equilibrium. Note that ∗ ∗ ∗ ∗ ). The existence of a unique t0 ∈ (0 t ) = t00 < B(Ht00 ψ(t00 00 ) with ψ(t0 ) = B(Ht0 ) follows from the continuity and monotonicity of B(Ht ) and ψ(t) in t. Thus, there is at most one blended equilibrium action function. ∗ , there is at most one equilibrium action function in Q1 and Step 1B: If t ≤ t00 it must be an efficient differentiable separating action function. Notice that B(Ht ) = t ≤ ψ(t). Given the monotonicity of B(Ht ) and ψ(t) in t, we have B(Ht ) < ψ(t) for all t ∈ [0 t), which rules out both blended equilibria and central pooling equilibria. There is at most one efficient differentiable separating equilibrium action function because the solution to (1) with initial condition (r z) = (0 0) is unique. Step 1C: There exists an equilibrium action function QE ∈ Q1 and an inference function P E such that (QE P E ) satisfies the D1 criterion. ∗ . Let QE = S00 . Choose any inference function P E such that Suppose t ≤ t00 −1 PxE places probability only on S00 (x) for x ∈ [0 S00 (t)] (which guarantees consistency with QE ), and only on Mx (defined at the outset of Section 3) for x > S00 (t). Lemma 3(c) guarantees that, for each t, QE (t) is optimal within the set [0 S00 (t)]. Since (i) t prefers its equilibrium outcome to (S00 (t) t), (ii) S00 (t) ≥ S00 (t) > x∗ (t) (Lemma 3(a) and (b)), and (iii) B(PxE ) ≤ t (Assumption 1, part (ii)), we know that t also prefers its equilibrium outcome to all (x B(PxE )) for x > S00 (t). Thus, (QE P E ) satisfies the D1 criterion. ∗ and ψ(0) ≤ B(H). Let QE (t) = 12 for all t. Consider the Now suppose t > t00 E inference function P E such that P1/2 = H (which guarantees consistency with E E Q ) and Px places all weight on type t = 0 for each x = 12 . It is easy to verify that 0 ∈ Mx for all x = 12 and that all types t prefer ( 12 H) to (x 0). Thus, (QE P E ) satisfies the D1 criterion. ∗ and ψ(0) > B(H). Let t0 satisfy ψ(t0 ) = B(Ht0 ) Finally suppose t > t00 ∗ )). For t ∈ [0 t0 ), let (Step 1A showed that a solution exists within (0 t00 1 E E Q (t) = S00 (t), and for t ∈ [t0 t], let Q (t) = 2 . Choose any inference func−1 (x) for x ∈ [0 S00 (t0 )], tion P E such that (i) PxE places probability only on S00 E E (ii) P1/2 = Ht0 , and (iii) Px places probability only on Mx ∩ [0 t0 ] for x ∈ (S00 (t0 ) 12 ) ∪ ( 12 1]. It is easy to verify that Mx ∩ [0 t0 ] is nonempty for x ∈ (S00 (t0 ) 12 ) ∪ ( 12 1] (because for t > t0 , mx (t) > mx (t0 )), so the existence of such an inference function is guaranteed. Parts (i) and (ii) guarantee that P E is consistent with QE . It is easy to verify (based on Lemma 3(c) and a simple additional argument) that, for each t, QE (t) is optimal within the set [0 S00 (t0 )] ∪ { 12 }. For all x ∈ (S00 (t0 ) 12 ) ∪ ( 12 1], we have B(PxE ) ≤ t0 , from
1628
J. ANDREONI AND B. DOUGLAS BERNHEIM
which it follows (by another simple argument) that no type prefers (x B(PxE )) to its equilibrium outcome. Thus, (QE P E ) satisfies the D1 criterion. Step 1D: If an equilibrium (Q P) satisfies the D1 criterion, there is no pool at any action other than 12 . Suppose there is a pool that selects an action x = 12 . Select some t from the pool such that t > B(Px ). We claim that for any x sufficiently close to x with G(x − 12 ) > G(x − 12 ), B(Px ) ≥ t . Assuming x is chosen by some type in equilibrium, the claim follows from Lemma 1. Assuming x is not chosen by any type in equilibrium, it is easy to check that mx (t ) > mx (t ) for any t < t ; with x sufficiently close to x , we have mx (t ) < t, which then implies t ∈ / Mx and, hence, B(Px ) ≥ t . The lemma follows from the claim, because t would deviate at least slightly toward 12 . Step 1E: If an equilibrium (Q P) satisfies the D1 criterion, type t = 0 selects either x = 0 or x = 12 . Suppose Q(0) ∈ / {0 12 }. By Step 1D, PQ(0) places probability 1 on type 0. But then U(0 B(P0 ) 0) ≥ U(0 0 0) > U(Q(0) B(PQ(0) ) 0), which contradicts the premise that Q(0) is optimal for type 0. Step 1F: For any equilibrium (Q P) satisfying the D1 criterion, Q and QE (the unique equilibrium action functions within Q1 ) coincide on a set of full measure. Lemma 2 and Step 1D together imply Q(t) ≤ 12 for all t ∈ [0 t). Let t0 = sup{t ∈ [0 t] | Q(t) < 12 } (if the set is empty, then t0 = 0). We claim that Q(t) = S00 (t) for all t ∈ [0 t0 ). By Lemma 1, Q(t) is weakly increasing on t ∈ [0 t); hence, Q(t) < 12 for t ∈ [0 t0 ). By Step 1D, Q(t) fully separates all types in [0 t0 ) and is, therefore, strictly increasing on that set. Consider the restricted game in which the type space is [0 t0 − ε] and the dictator chooses x ∈ [0 Q(t0 − ε)] for small ε > 0. It is easy to construct another signaling model for which the type space is [0 t0 − ε], the dictator chooses x ∈ R, preferences are the same as in the original game for (x m) ∈ [0 Q(t0 − ε)] × [0 t0 − ε], and conditions (1)–(5) and (7) of Mailath (1987) are satisfied on the full domain R×[0 t0 − ε]. Theorem 2 of Mailath (1987) therefore implies that Q(t) (which we have shown achieves full separation on [0 t0 )) must satisfy (1) on [0 t0 −ε] for all ε > 0. The desired conclusion then follows from Step 1E, which ties down the initial condition, Q(0) = 0. There are now three cases to consider: (i) t0 = 0, (ii) t0 ∈ (0 t), and (iii) t0 = t. In case (i), we know that Q(t) = 12 for t ∈ (0 t] (t is included by Lemma 1). It is easy to check that if Q is an equilibrium, then so is Q∗ (t) = 12 for all t ∈ [0 t] (for the same inferences, if type 0 has an incentive to deviate from Q∗ , then some type close to zero would have an incentive to deviate from Q). In case (ii), we know that Q(t) = S00 (t) for t ∈ [0 t0 ) and Q(t) = 12 for t ∈ (t0 t]. It is easy to check that if Q is an equilibrium, then so is Q∗ (t) = Q(t) for t = t0 and Q∗ (t0 ) = S00 (t0 ). In case (iii), we know that Q(t) = S00 (t) for t ∈ [0 t). It is easy to check that if Q is an equilibrium, then so is Q∗ (t) = S00 (t) for
SOCIAL IMAGE AND THE 50–50 NORM
1629
t ∈ [0 t]. In each case, Q∗ ∈ Q1 , and Q and Q∗ coincide on a set of full measure. Q.E.D. PROOF OF THEOREM 2: First we claim that S00 (t) < S00 (t) for all t. It is (0) < S00 (0), so S00 (t) < S00 (t) for small t. If the claim easy to check that S00 is false, then since the separating functions are continuous, t = min{t > 0 | S00 (t) = S00 (t)} is well defined. It is easy to check that S00 (t ) < S00 (t ); moreover, because the slopes of the separating functions vary continuously with (t) < S00 (t) for all t ∈ [t t ]. But since t, there is some t < t such that S00 S00 (t ) < S00 (t ), we must then have S00 (t ) < S00 (t ), a contradiction. for Define ψ(t) as in Step 1A of the proof of Theorem 1, and define ψ(t) ∗ U analogously. Note that for t ∈ (0 t00 ), U(S00 (t) t t) < U(S00 (t) t t) + 1 ψ(t) t). It follows that ψ(t) < ψ(t). φ(t) = U( 12 ψ(t) t) + φ(t) < U( 2 If π = 1, then ψ(0) ≤ B(H), so ψ(0) < B(H), which implies π = 1 (see Step 1A of the proof of Theorem 1). If π ∈ (0 1), then ψ(0) > B(0) and there is a unique blended equilibrium for which t0 solves B(Ht0 ) = ψ(t0 ). In that case, either ψ(0) ≤ B(0), which implies π = 1 > π, or ψ(0) > B(0) and B(Ht0 ) > ψ(t0 ), which imply (given the monotonicity properties of B and ψ) t0 ) for t0 < t0 and, hence, π > π. Q.E.D. B(Ht0 ) = ψ( PROOF OF THEOREM 3: Step 3A: Equation (2) has a unique solution: t0∗ ∈ (0 t). Define the function ξ(t) as the solution to F(1 − x0 ξ(t)) + tG(x0 − 12 ) = F(1 − max{x0 x∗ (t)} t) + tG(max{x0 x∗ (t)} − 12 ). It is easy to check that for t ∈ [0 t], ξ(t) exists and satisfies ξ(t) ≥ t with strict inequality if x∗ (t) > x0 . tp ). Also note that ξ(0) = 0 < Note that we can rewrite (2) as ξ(t0 ) = B(H 0 0p ); furthermore, ξ(t) ≥ t > B(H) = B(H p ). Thus, by continuity, B(H) = B(H t there must exist at least one value of t0 ∈ (0 t) satisfying (2). Now suppose, contrary to the claim, that there are two solutions to (2): })−H(min{tt }) . Note that t and t with t > t . Define a CDF L(t) ≡ H(min{tt H(t )−H(t ) p ). One can check that H p (t) = λH p (t) + max supp(L) = t ≤ ξ(t ) = B(H t t t ) p ) ≤ B(H p ). (1 − λ)L(t), where λ = p+(1−p)H(t ∈ (0 1). By Assumption 2, B( H t t p+(1−p)H(t ) Next note that ξ (t) = {F2 (1 − max{x0 x∗ (t)} t) + [G(max{x0 x∗ (t)} − 12 ) − G(x0 − 12 )]}[F2 (1 − x0 ξ(t))]−1 > 0.31 Thus, t > t implies ξ(t ) > ξ(t ). Putting p ) ≤ B(H p ), which contrathese facts together, we have ξ(t ) < ξ(t ) = B(H t t dicts the supposition that t is a solution. Step 3B: A solution to expression (5) exists iff U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) ≤ U( 12 B(Ht0∗ ) t0∗ ). When it exists, it is unique and t0 ∈ (0 t0∗ ]. 31 For t such that x∗ (t) ≥ x0 , the envelope theorem allows us to ignore terms involving dx∗ (t)/dt. Thus, even when x∗ (t) = x0 , the left and right derivatives are identical.
1630
J. ANDREONI AND B. DOUGLAS BERNHEIM
We define the function ζ(t) as follows: (i) if U(x0 0 t) ≥ U( 12 B(Ht ) t), then ζ(t) = 0; (ii) if U(x0 0 t) < U( 12 B(Ht ) t), then ζ(t) solves U(x0 ζ(t) t) = U( 12 B(Ht ) t). Existence, uniqueness, and continuity of ζ(t) are easy to verify. Moreover, the equality in (5) is equivalent to the statement that ζ(t0 ) = tp ) t) − U(max{x0 x∗ (t)} t t) tp ). In Step 3A, we showed that U(x0 B(H B(H 0 ∗ exceeds zero for t < t0 , is less than zero for t > t0∗ , and equals zero at t = t0∗ . Consequently, the inequality in (5) holds iff t0 ≤ t0∗ . Therefore, (5) is equivalent tp ) for t0 ∈ [0 t ∗ ]. to the statement that ζ(t0 ) = B(H 0 0 We can rewrite the equation defining ζ(t) (when U(x0 0 t) < U( 12 B(Ht ) t)) as F(1 − x0 ζ(t)) = t(G(0) − G(x0 − 12 )) + F( 12 B(Ht )). The right-hand side of this expression is strictly increasing in t and the left-hand side is strictly increasing in ζ. Consequently, there exists t ∈ [0 t] such that ζ(t) = 0 for t ∈ [0 t) and ζ(t) is strictly increasing in t for t ≥ t. tp ) is weakly decreasing in t for t ∈ [0 t ∗ ]. Consider any Next note that B(H 0 two values, t , t ≤ t0∗ with t > t . By the argument in step 3A, t ≤ ξ(t ) ≤ p ). Defining L(t) and λ exactly as in Step 3A, we have B(H p ) ≤ B(H p ) by B(H t t t Assumption 2. Now suppose U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) ≤ U( 12 B(Ht0∗ ) t0∗ ). In that case, p∗ ) t ∗ ) = U(max{x0 x∗ (t ∗ )} t ∗ t ∗ ) ≤ U( 1 B(Ht ∗ ) t ∗ ), so ζ(t ∗ ) ≥ U(x0 B(H 0 0 0 0 0 0 t0 2 0 ∗ p∗ ) > 0, which also implies tp ) for all t < B(H t < t . Plainly, ζ(t) = 0 < B( H t, 0 t0 p ∗ so any solutions to (5) must lie in [ t t0 ]. Because ζ( t) = 0 ≤ B(Ht ) and p∗ ), continuity guarantees that a solution exists. Since ζ(t) is ζ(t ∗ ) ≥ B(H 0
t0
tp ) is weakly decreasing in t on [ strictly increasing and B(H t t0∗ ] , the solution is unique. Because ζ(0) = 0 < B(H), we can rule out t0 = t = 0. Finally suppose U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ). In that case, p∗ ) t ∗ ) = U(max{x0 x∗ (t ∗ )} t ∗ t ∗ ) > U( 1 B(Ht ∗ ) t ∗ ), so ζ(t ∗ ) < U(x0 B(H 0 0 0 0 0 0 t0 2 0 p∗ ). Given the monotonicity of ζ and B, ζ(t) < B(H tp ) for all t < t ∗ . Hence B(H t0
0
there exists no t0 satisfying (5). Step 3C: If U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) ≤ U( 12 B(Ht0∗ ) t0∗ ), there is at most one equilibrium action function in Q2 and it must be a double-pool action function. In a blended double-pool equilibrium, U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) = ∗ ∗ U(S t0 (t0∗ ) t0∗ t0∗ ) > U(S t0 (t1 ) t1 t0∗ ) > U( 12 B(Ht1 ) t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ) (where the first inequality follows from Lemma 3(c), the second from t0∗ < t1 , ∗ S t0 (t1 ) < 12 , and (3), and the third from t0∗ < t1 ), contradicting the supposition. Now consider blended single-pool equilibria. Let xm solve maxx U(x t t0∗ ). It is easy to check that xm ≤ x∗ (t). Note that U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) = ∗ ∗ U(S t0 (t0∗ ) t0∗ t0∗ ) > U(S t0 (t) t t0∗ ) > U( 12 t t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ) (where the first inequality follows from Lemma 3(c), the second from xm ≤ x∗ (t) < ∗ S t0 (t) ≤ 12 , and the third from t > B(Ht0∗ )), contradicting the supposition. Fi-
SOCIAL IMAGE AND THE 50–50 NORM
1631
nally, since the solution for (5) is unique (Step 3B), there can be at most one double-pool equilibrium action function. Step 3D: If U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ), there is at most one ∗ equilibrium action function in Q2 . If S t0 (t) > 12 , it must be a blended double∗ pool action function. If S t0 (t) ≤ 12 , it must be a blended single-pool action function. By Step 3B, (5) has no solution, so double-pool equilibria do not exist. From Step 3A, the value of t0∗ is uniquely determined. Analytically, ruling out ∗ blended single-pool equilibria (blended double-pool equilibria) when S t0 (t) > 1 1 t0∗ (S (t) ≤ 2 ) in the extended dictator game is analogous to ruling out efficient 2 separating equilibria (blended equilibria) when S00 (t) > 12 (S00 (t) ≤ 12 ) in the standard dictator game; we omit the details to conserve space. Step 3E: If (6) is satisfied, there exists an equilibrium action function QE ∈ Q2 and an inference function P E such that (QE P E ) satisfies the D1 criterion. If U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) ≤ U( 12 B(Ht0∗ ) t0∗ ), let QE be the double-pool action function for which the highest type in the pool at x0 is the t0 that solves ∗ (5); if U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ) and S t0 (t) ≤ 12 , let QE be a blended single-pool action function for which the highest type in the pool ∗ at x0 is t0∗ ; if U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) > U( 12 B(Ht0∗ ) t0∗ ) and S t0 (t) > 12 , let QE be a blended double-pool action function for which the highest type in the pool at x0 is t0∗ and the highest type in the separating region is the t1 that solves (3). In each case, one can verify that for all t ∈ [0 t], QE (t) is type t’s best choice within QE ([0 t]). For x ∈ [0 x0 ), it is easily shown (in each case) that 0 ∈ Mx and, given (6), every type t prefers its equilibrium outcome to (x 0); therefore, let PxE place all probability on t = 0. For any unchosen x > x0 , let xL be the greatest chosen action less than x, let txH be the greatest type choosing xL , and let txL be the infimum of types choosing xL . For any unchosen x > x0 with QE (t) > x, one can show (in each case) that txH ∈ Mx and every type t prefers its equilibrium outcome to (x txH ); therefore, let Px place all probability on txH . For any unchosen x > x0 with QE (t) < x, one can show (in each case) that Mx ∩ [0 txL ] is nonempty and every type t prefers its equilibrium outcome to (x txL ); therefore, let Px be any distribution over Mx ∩ [0 txL ]. Then, in each case, (QE P E ) is an equilibrium and satisfies the D1 criterion. Step 3F: If (6) is satisfied, for any equilibrium (Q P) satisfying the D1 criterion, Q and QE (the unique equilibrium action function within Q2 ) coincide on a set of full measure. One can verify that any equilibrium satisfying the D1 criterion must have the following properties: (i) no type chooses x > 12 (any type choosing x > 12 would deviate to a slightly lower transfer in light of Lemma 2 and the inferences implied by the D1 criterion); (ii) choices are weakly monotonic in type (follows from property (i) and Lemma 1), (iii) there is no pool at any action other than x0 and 12 (the proof is similar to that of Step 1D). First we claim that Q(t) ≥ x0 ∀t. We will prove that Q(0) ≥ x0 ; the claim then follows from property (ii). If Q(0) < x0 , then B(PQ(0) ) = 0 (prop-
1632
J. ANDREONI AND B. DOUGLAS BERNHEIM
erty (iii)), so U(Q(0) B(PQ(0) ) 0) ≤ U(0 0 0). Using property (ii) and part tp ), so by (6), (iii) of Assumption 1, one can show that B(Px0 ) ≥ mint∈[0t] B(H U(x0 B(Px0 ) 0) > U(Q(0) B(PQ(0) ) 0), a contradiction. Next we claim that Q(t) = x0 for some t > 0. If not, then by property (ii) we have B(H) = B(Px0 ), and for sufficiently small t > 0, B(PQ(t) ) ≤ B(H). Given that Q(t) > x0 > x∗ (t) for small t, such t would prefer (x0 B(Px0 )) to (Q(t) B(PQ(t) )), a contradiction. Next we claim that Q(t) > x0 for some t < t. If not, then B(Px0 ) = B(H) and (applying the D1 criterion) B(Px ) = t for x slightly greater than x0 , so all types could beneficially deviate to that x, a contradiction. Property (ii) and the last three claims imply that ∃t0 ∈ (0 t) such that Q(t) = x0 for t ∈ [0 t0 ) and Q(t) > x0 for t ∈ (t0 t]. Now we claim that for all t > t0 , Q(t) ∈ {S t0 (t) 12 }. The claim is obviously true if Q(t) = 12 for all t ∈ (t0 t]. By properties (i) and (ii), there is only one other possibility: ∃t1 ∈ (t0 t] such that Q(t) ∈ (x0 12 ) for t ∈ (t0 t1 ) and Q(t) = 12 for t ∈ (t1 t]. Arguing as in Step 1F, we see that ∃z ≥ x0 such that Q(t) = St0 z (t) for t ∈ (t0 t1 ). We must have z ≥ x∗ (t0 ): if not, then by equation (1), Q (t) = St0 z (t) < 0 for t close to t0 , contrary to property (ii). Thus, z ≥ max{x0 x∗ (t0 )}. Next we rule out z > max{x0 x∗ (t0 )}: in that case, for sufficiently small ε > 0, max{x0 x∗ (t0 )} + ε is not chosen by any type, and it can be shown that the D1 criterion implies B(Pmax{x0 x∗ (t0 )}+ε ) ≥ t0 , so for small η > 0, type t0 + η strictly prefers (max{x0 x∗ (t0 )} + ε, t0 ) to (Q(t0 + η), t0 + η), a contradiction. Thus, z = max{x0 x∗ (t0 )}, which establishes the claim. Thus, Q must fall into one of three categories: (a) ∃t0 ∈ (0 t) such that Q(t) = x0 for t ∈ [0 t0 ) and Q(t) = 12 for t ∈ (t0 t]; (b) ∃t0 ∈ (0 t) with S t0 (t) ≤ 12 such that Q(t) = x0 for t ∈ [0 t0 ) and Q(t) = S t0 (t) for t ∈ (t0 t]; (c) ∃t0 ∈ (0 t) and t1 ∈ (t0 t) with S t0 (t1 ) ≤ 12 such that Q(t) = x0 for t ∈ [0 t0 ), Q(t) = S t0 (t) for t ∈ (t0 t1 ), and Q(t) = 12 for t ∈ (t0 t]. If Q falls into categories (a) or (b), let Q∗ (t) = Q(t) for t = t0 and Q∗ (t0 ) = x0 ; if Q falls into category (c), let Q∗ (t) = Q(t) for t ∈ / {t0 t1 }, Q∗ (t0 ) = x0 , and Q∗ (t1 ) = S t0 (t1 ). In each case, one can show that because Q is an equilibrium action function, so is Q∗ ; also, Q∗ ∈ Q2 , and Q and Q∗ coincide on a set of full measure. Q.E.D. PROOF OF THEOREM 4: To reflect the dependence of t0∗ (defined in Step 3A) on p, we will use the notation t0∗ (p). Let t0 (p) equal t0∗ (p) when either a blended single-pool or double-pool equilibrium exists, and equal the solution to the equality in (5) when a double-pool equilibrium exists. Let t1∗ (p) equal the solution to (3) when a blended double-pool equilibrium exists, equal t when a blended single-pool equilibrium exists, and equal t0 (p) when a doublepool equilibrium exists. Regardless of which type of equilibrium prevails, types t ∈ [0 t0 (p)] choose x = x0 and types t ∈ (t1∗ (p) t] choose x = 12 . We demonstrate that t0 (p) is strictly increasing in p and t1∗ (p) is increasing in p (strictly ∗ when t1 (p) < t), which establishes the theorem.
SOCIAL IMAGE AND THE 50–50 NORM
1633
Step 4A: t0 (p) and t1∗ (p) are continuous in p. Continuity of t0∗ (p) follows from uniqueness and continuity of the functions in (2). For similar reasons, when a solution to the equality in (5) exists, it is continuous in p. Finally, it is easy to check that when U(max{x0 x∗ (t0∗ (p))} t0∗ (p) t0∗ (p)) = U( 12 B(Ht0∗ (p) ) t0∗ (p)), the solutions to (2) and the equality in (5) coincide. Thus, t0 (p) is continuous. Continuity of the solution to (3) (when it exists) follows from the observations that (i) t0∗ (p) is continuous in p, (ii) S t0 (t) is continuous for all t0 and t, and (iii) the solution to (3), when it exists, is unique. ∗ When U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) = U( 12 B(Ht0∗ ) t0∗ ), we must have S t0 (p) (t) > 12 ∗ (otherwise type t0∗ (p) would gain by deviating to Q(t) = S t0 (p) (t)), and it is easy to check that t0∗ (p) satisfies both (3) and (5). When U(max{x0 x∗ (t0∗ )} t0∗ t0∗ ) > ∗ U( 12 B(Ht0∗ ) t0∗ ) and S t0 (p) (t) = 12 , then t solves (3). Thus, t1∗ (p) is continuous. Step 4B: t0∗ (p) is strictly increasing in p. From Step 3A, t0∗ (p) satisfies p∗ ). Consider p and p < p . One can verify that H p (t) = ξ(t0∗ (p)) = B(H τ t0 (p) p (t) + (1 − λ)L(t), where λ = ( p )( p +(1−p)H(τ) ) ∈ (0 1) and L(t) = λH τ p p +(1−p )H(τ) H(min{τt}) ∗ p ). Since the support . For τ ≤ t (p ), max supp(L) = τ ≤ ξ(τ) ≤ B(H τ 0 H(τ) p ) for τ ≤ t ∗ (p ). p ) > B(H of L is nondegenerate, Assumption 2 implies B(H τ τ 0 p ) > Note that ξ(t), defined in Step 3A, is independent of p. Therefore, B(H τ ξ(τ) for τ ≤ t0∗ (p ), so t0∗ (p ) > t0∗ (p ), as claimed. t0 (p ) > Step 4C: If a double-pool equilibrium exists for p and p < p , then t0 (p ). Recall from Step 3B that, in such cases, t0 (p) satisfies ζ(t0 (p)) = p ) is weakly dep ) and that t0 (p) ≤ t0∗ (p). We have shown that B(H B(H τ t0 (p) p ) > B(H p ) for τ ≤ t ∗ (p ) creasing in τ for τ ≤ t0∗ (p) (Step 3B) and that B(H 0 τ τ (Step 4B). Note that ζ(t), as defined in Step 3B, is independent of p. From p ) > ζ(τ) for τ ≤ these observations, it follows that B(H t0 (p ), so t0 (p ) > τ t0 (p ), as desired. Step 4D: If a blended double-pool equilibrium exists for p and p < ∗ p , then t1∗ (p ) > t1∗ (p ). We know t0∗ (p ) > t0∗ (p ). Since S t0 (p ) (t0∗ (p )) > ∗ ∗ ∗ max{x0 x∗ (t0∗ (p ))} = S t0 (p ) (t0∗ (p )), we know S t0 (p ) (t) < S t0 (p ) (t) for all t > ∗ p t0 (p ) (Lemma 3(e)). Analogously to Step 1A, define ψ (t) as the solution to ∗ U(S t0 (p) (t) t t) = U( 12 ψp (t) t). We can rewrite the solution for t1∗ (p) (when a blended double-pool equilibrium exists) as ψp (t) = B(Ht ). Arguing as in Step 1A, one can show that ψp (t) is decreasing and continuous in t, while B(Ht ) is increasing and continuous in t (and independent of p). Moreover, ∗ ∗ ∗ ∗ since S t0 (p ) (t) < S t0 (p ) (t), we have U(S t0 (p ) (t) t t) < U(S t0 (p ) (t) t t), which p p p means ψ (t) > ψ (t). Thus, the value of t satisfying ψ (t) = B(Ht ) is larger for p = p than for p = p . Q.E.D. REFERENCES AGRAWAL, P. (2002): “Incentives, Risk, and Agency Costs in the Choice of Contractual Arrangements in Agriculture,” Review of Development Economics, 6, 460–477. [1607]
Dept. of Economics, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093, U.S.A. and NBER;
[email protected] and Dept. of Economics, Stanford University, Stanford, CA 94305-6072, U.S.A. and NBER;
[email protected]. Manuscript received August, 2007; final revision received April, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1637–1664
GENDER DIFFERENCES IN COMPETITION: EVIDENCE FROM A MATRILINEAL AND A PATRIARCHAL SOCIETY

BY URI GNEEZY, KENNETH L. LEONARD, AND JOHN A. LIST1

We use a controlled experiment to explore whether there are gender differences in selecting into competitive environments across two distinct societies: the Maasai in Tanzania and the Khasi in India. One unique aspect of these societies is that the Maasai represent a textbook example of a patriarchal society, whereas the Khasi are matrilineal. Similar to the extant evidence drawn from experiments executed in Western cultures, Maasai men opt to compete at roughly twice the rate of Maasai women. Interestingly, this result is reversed among the Khasi, where women choose the competitive environment more often than Khasi men, and even choose to compete weakly more often than Maasai men. These results provide insights into the underpinnings of the factors hypothesized to be determinants of the observed gender differences in selecting into competitive environments.

KEYWORDS: Gender and competition, matrilineal and patriarchal societies, field experiment.
1. INTRODUCTION

ALTHOUGH WOMEN HAVE MADE important strides in catching up with men in the workplace, a gender gap persists both in wages and in prospects for advancement. Commonly cited explanations for such disparities range from charges of discrimination to claims that women are more sensitive than men to work–family conflicts and therefore are less inclined to make career sacrifices.2 Combining results from psychology studies (see Campbell (2002) for a review) with recent findings in the experimental economics literature (e.g., Gneezy, Niederle, and Rustichini (2003), Gneezy and Rustichini (2004), Niederle and Vesterlund (2005)), an alternative explanation arises: men are more competitively inclined than women.3 A stylized finding in this literature is that men and women differ in their propensities to engage in competitive activities, with men opting to compete more often than women, even in tasks where women are more able. Such data patterns might provide insights into why we observe a higher fraction of women than men among, for example, grammar school teachers, but the reverse among CEOs.

An important puzzle in this literature relates to the underlying factors responsible for the observed differences in competitive inclinations.

1 We thank our research team for aiding in the collection of these data, especially Steffen Andersen. A co-editor was instrumental in guiding us toward a much improved manuscript, both in content and in style. Four anonymous referees also provided quite useful comments, as did many seminar participants.
2 See Altonji and Blank (1999), Blau and Kahn (1992, 2000), and Blau, Ferber, and Winkler (2002).
3 See also Vandegrift, Yavas, and Brown (2004), Gneezy and Rustichini (2005), and Datta Gupta, Poulsen, and Villeval (2005).
© 2009 The Econometric Society
DOI: 10.3982/ECTA6690
One oft-heard hypothesis is that men and women are innately different (Lawrence (2006)). For example, in discussions concerning why men considerably outnumber women in the sciences, several high profile scholars have argued that men are innately better equipped to compete (see, e.g., Baron-Cohen (2003), Lawrence (2006), and the citations in Barres (2006)). An empirical regularity consistent with this notion is the fact that substantial heterogeneity exists in the competitiveness of individuals raised in quite similar environments; see, for example, the discussion of the tendency to compete in bargaining (Shell (2006)). Nevertheless, the role of nurture, or the fact that culture might be critically linked to competitive inclinations, is also an important consideration. More than a handful of our male readership can likely recall vividly their grammar school physical education teacher scorning them with the proverbial "you're playing like a girl" rant to induce greater levels of competitive spirit. Clearly, however, the explanations might not be competing; rather, the nature–nurture interaction might be of utmost importance, either because nurture enables the expression of nature (Turkheimer (1998, 2003), Ridley (2003)) or because nature and nurture co-evolve (Boyd and Richerson (1985, 2005)).

Our goal in this study is to provide some insights into the underpinnings of the observed differences in competitiveness across men and women using a simple experimental task. One approach to lending insights into the source of such preference differences is to find two distinct societies and observe choices that provide direct insights into the competitiveness of the participants. After months of background research, we concluded that the Maasai tribe of Tanzania and the Khasi tribe in India provided interesting natural variation that permitted an exploration into the competitiveness hypothesis. As explained in greater detail below, while several other potentially important factors vary across these societies, the Maasai represent a patriarchal society, whereas the Khasi are a matrilineal and matrilocal society.

Our experimental results reveal interesting differences in competitiveness: in the patriarchal society, women are less competitive than men, a result consistent with student data drawn from Western cultures. Yet, this result reverses in the matrilineal society, where we find that women are more competitive than men. Perhaps surprisingly, Khasi women are even slightly more competitive than Maasai men, but this difference is not statistically significant at conventional levels under any of our formal statistical tests. We view these results as providing potentially useful insights into the crucial link between culture and behavioral traits that influence economic outcomes.4

4 As we discuss below, this result might be due to learning or an evolutionary process whereby the selection effects across societies generate natural differences. We argue that, in either case, culture has an influence.
Such insights might also have import within the policy community, where targeting of policies can be importantly misguided if the underlying mechanism generating the data is ill understood.

The remainder of our study proceeds as follows. The next section provides an overview of the two societies and our experimental design. We proceed to a discussion of the experimental results in Section 3. Section 4 provides various robustness tests, and Section 5 concludes.

2. SOCIETAL BACKGROUND AND EXPERIMENTAL DESIGN

We are sick of playing the roles of breeding bulls and baby-sitters.
—A Khasi man (Ahmed (1994))

Men treat us like donkeys.
—A Maasai woman (Hodgson (2001))
2.1. Brief Societal Backgrounds

The Maasai and the Khasi represent, respectively, a patriarchal and a matrilineal/matrilocal society. Originally, we attempted to find two societies in which the roles of men and women were mirror images, but this approach found little success. Indeed, the sociological literature is almost unanimous in the conclusion that truly matriarchal societies no longer exist.5 In addition, even ordinal classification of societies on any dimension is dangerous, as culture and society are not static fixtures handed down from prehistory. Certain reports of extreme female domination among the Khasi or strong male domination among the Maasai are somewhat exaggerated and subject to charges of ethnocentrism.6

The Khasi

The Khasi of Meghalaya, in northeast India, are a matrilineal society, and inheritance and clan membership always follow the female lineage through the youngest daughter. Family life is organized around the mother's house, which is headed by the grandmother who lives with her unmarried daughters, her youngest daughter (even if she is married), and her youngest daughter's children. Additionally, her unmarried, divorced, or widowed brothers and sons reside in the home. The youngest daughter never leaves and eventually becomes the head of the household; older daughters usually form separate households adjacent to their mother's household.

5 Campbell (2002) summarized as follows: "there are societies that are matrilineal and matrilocal and where women are accorded veneration and respect but there are no societies which violate the universality of patriarchy defined as 'a system of organization in which the overwhelming number of upper positions in hierarchies are occupied by males' (Goldberg, 1993, p. 14)."
6 About the Maasai in particular, there is a vigorous debate on the current and historical role of women (see Hodgson (2000, 2001) and Spencer (1965, 1994)).
Furthermore, a woman never joins the household of her husband's family, and a man usually leaves his mother's household to join his wife's household. In some cases, a man will practice duolocal marriage (in which he lives in both his mother's and his wife's households). Even in cases when a married man resides with his wife's family, he spends much, if not most, of his time in his mother's or sisters' household (Nakane (1967), Van Ham (2000)). Though Khasi women do not generally assume the roles held by men in patriarchal societies (they do not become warriors or hunters, for example), they always live in households in which they or their mothers have authority over most household decisions. On the other hand, men frequently hold roles that seem to mirror those of women in patriarchal societies. The Khasi husband dwells in a household in which he has no authority or property, is expected to work for the gain of his wife's family, and has no social roles deemed important. His role is summarized by Nakane (1967, p. 125), who provided accounts of the subservient role of Khasi men. Such status has led to the formation of a men's rights movement (Nongbri (1988), Ahmed (1994), Van Ham (2000)).

Perhaps the most important economic feature of Khasi society is that the return to unverifiable investment in the human capital of girls is retained within the household, whereas, in other cultures, only the verifiable component of investment can be retained through bride price or dowry. In other words, Khasi families can choose to raise exactly the daughter they would like to keep in their household, not the daughter most likely to be preferred by other households.

The Maasai

Age and cattle dominate the Maasai social structure. The most important distinctions between men are age-based, and almost all wealth is in cattle. The age structure prevents men from marrying until they are roughly 30 years old, and polygamy is the most common form of marriage. Therefore, the average Maasai woman is married to a much older man who typically entertains multiple wives (Spencer (2003)). The plight of women among the Maasai is such that wives are said to be less important to a man than his cattle. For example, daughters are not counted in response to the question "How many children do you have?" and a Maasai man will refer to his wife and children as "property." When their husband is absent, most Maasai women are required to seek permission from an elder male before they travel any significant distance, seek health care, or make any other important decision. Although few Maasai receive any formal education, women receive even less education than men. Their restricted roles and authority, combined with the inequality of age in marriage, noticeably influence the view that married women have of their roles in society. Of Samburu women (who are part of the larger Maa ethnic group and are very similar to the Maasai), Spencer (1965, p. 231) noted:
On the whole I found women were quite ignorant of many aspects of the total society and usually unhelpful as informants. Outside the affairs of their own family circle they often showed certain indifference. They were less inquisitive than the males and less quick to grasp situations. They found it harder to comprehend my remarks and questions. I had the impression that they had never been encouraged to show much initiative on their own, and this was a quality which they simply had not developed; any inborn tendencies to this had been baulked by the strictness of their upbringing. Their demeanor was sometimes listless and frequently sour. They often lacked the general conviviality and warmth that typified the adult males and it was only with ameliorating circumstances of middle-age that they tended to acquire it—and many never did.
Despite these stark differences, there are important similarities between the two societies. Khasi men are more important in their sisters' households than in their wives' households, and Maasai women can enjoy prestige and power in their roles as widows (if they have sons).7 Despite the fact that the Khasi elevate the importance of women and the historical evidence that they invest significantly in the human capital of their daughters,8 many important decisions in Khasi society remain the domain of men. Women do not participate in politics, civil defense, or justice, and priesthood is a male profession (Nongbri (2003)). Additionally, there is evidence that women who attempt to speak about such domains are chastised.9

2.2. Experimental Design

To provide insights into whether there are gender differences in competitive choices across these two societies, we design an experiment that is identical in the two environments. In each session we recruited the participants in advance and asked each potential subject to arrive at a central place in the village (either the school or the clinic) at a given time. This attenuated selection problems since everyone was interested in participating in the experiment after they were made aware of the pecuniary incentives involved. The experiment with the Maasai was conducted in two villages in the Arumeru district in the Arusha region of Tanzania. The experiment with the Khasi was conducted in the Meghalaya region of India.

Upon arrival at each experimental site, participants were directed into one of two groups randomly. These groups were separated for the entire experiment. Similar procedures were used across the societies to ensure comparability.

7 See Hodgson (2000, 2001) for a more nuanced discussion of the Maasai and Samburu, and Lesorogol (2003) for more evidence of the attitudes of the Samburu in an experimental context.
8 A report from the Agro-Economic Research Centre for northeast India (1969) noted the very high levels of school attendance among the Khasi, and particularly the fact that almost all girls were in school at a time when few girls from other tribes ever attended school.
9 "A woman who dares to voice her opinion on public affairs is regarded as a 'hen that crows'—a freak of nature" (Nongbri (2003, p. 187)).
For example, in a representative session among the Maasai, the actual experiment was conducted around a small house with four sides, called sides 1, 2, 3, and 4. The structure was such that each side of the house was private and could not be observed from any of the other sides. Subjects in each group were seated on two different sides of the small house: group 1 was seated on side 1 and group 2 was seated on side 2. One by one, we privately called participants—one from each group—to the experimental area. Members of group 1 were called to side 3 and members of group 2 were called to side 4. Participants did not know the identities of participants in the other group. On each of those sides was an experimenter awaiting the participants. In a second Maasai session, we were able to use four empty classrooms, similarly isolated from each other. The setup was otherwise identical. The Khasi sessions were run similarly in a classroom setting.

When a participant moved to the area where the experiment was being conducted, he/she met an experimenter who explained the task. Instructions used in the Khasi sessions are reproduced in Appendix A; the Maasai instructions are identical (both sets of original instructions are available at www.arec.umd.edu/kleonard). The instructions were translated from English to the local language (either Maasai or Khasi) and were checked by having a different person translate them back into English. The instructions were read aloud to the individual participant by the experimenter. In each session we had one male and one female experimenter to control for possible gender effects of the experimenter, and we balanced the gender of the participants to have an equal ratio of male and female participants per experimenter.

The experimental task was to toss a tennis ball into a bucket that was placed 3 meters away. Participants were informed that they had 10 chances. A successful shot meant that the tennis ball entered the bucket and stayed there. The task was chosen because it was simple to explain and implement, and no gender differences in ability were expected (as was found in a pilot experiment and reinforced in the results discussed below). Furthermore, we are aware of no other popular task in these societies that is similar to the ball games that we implemented. Indeed, the Khasi are known archers and the Maasai are known lancers, but since our task can only be completed with an underhand toss, the traditional skills do not advantage men over women. In this spirit, our data represent signals of initial competitive inclinations since the task is unfamiliar (Harrison and List (2004) denote such an approach as an artefactual field experiment).

Participants, who numbered 155 in total, were told that they were matched with a participant from the other group who was performing the same task at the same time in another area. For example, in the Maasai representative session discussed above, a group 1 member on side 3 was anonymously paired with a group 2 member on side 4, and both subjects were informed that their identities would remain anonymous. The only decision participants were asked to make concerned the manner in which they would be paid for their performance. They made this choice before performing the task, but only after they fully understood the instructions and the payment schemes.
The two options participants were asked to choose between were (a) X per successful shot, regardless of the performance of the participant from the other group with whom they were randomly matched, or (b) 3X per successful shot if they outperformed the other participant. They were told that in case they chose the second option and scored the same as the other participant, they would receive X per successful shot. We set X to equalize payments in terms of the prevailing exchange rates and, therefore, set X equal to 500 Tanzanian shillings in Tanzania and 20 rupees in India. After choosing the incentive scheme, participants completed the task and were told how the other participant performed. Then they were asked to proceed to another location where they provided personal information in an exit survey (see Appendix B for the experimental survey) and were paid their earnings in cash. As promised, participants were never given the opportunity to learn with whom they were paired.

3. RESULTS

Summary data from the postexperiment survey are presented in Table I.10 We present all information drawn from the survey, which includes queries on gender, age, years of education, income, marital status, wage earning activities, and relation to head of household. Our average subject was in the 30–40 year old age range, but the Maasai sample had slightly older subjects. Average educational attainment is roughly similar across the two groups—about 4 years of education—but is slightly higher for women (men) among the Khasi (Maasai). Income levels show similar patterns: Khasi women earn more than Khasi men, and the qualitative nature of this result is reversed among the Maasai. Considering purchasing power, the Khasi earn more than the Maasai, though if we delete one extreme Khasi outlier, the numbers are similar. Activities across the societies, marital status, and relation to head of household differences are consonant with past anthropological evidence. For example, as suggested above, the Khasi tribe is a monogamous group, whereas polygamy is practiced among the Maasai.

The differences in observable characteristics across gender, both intra- and intersociety, highlight that it is important to control for as many of these factors as possible when examining the data. For example, variables such as income might importantly influence play, and relationship to head of household might provide an indication of control over income. Even after this is done, however, there might remain a critical vector of other variables (whether gambling is condoned, wealth, etc.) that might vary between the societies other than the role of women.

10 The Maasai sample does not sum to 75 (34 women and 40 men). This is because one participant failed to complete the survey after the task. This person chose not to compete and had one success.
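As an aside on the two incentive schemes described in Section 2.2, the short sketch below computes a risk-neutral participant's expected payment per successful shot under each option; it recovers the one-third win-probability entry threshold implied by the payoff function (a benchmark the authors invoke again in Section 4.3). The function and the illustrative probabilities are our own, not part of the original study.

    def expected_rate(win_prob, tie_prob=0.0, x=1.0):
        """Expected payment per successful shot under each scheme.

        Piece rate: X per success, regardless of the opponent.
        Competition: 3X per success on a win, X on a tie, 0 on a loss.
        """
        piece_rate = x
        competition = 3.0 * x * win_prob + x * tie_prob
        return piece_rate, competition

    # Ignoring ties, competing dominates the piece rate exactly when the
    # subject's subjective probability of winning exceeds 1/3.
    for p in (0.25, 1.0 / 3.0, 0.40):
        piece, comp = expected_rate(win_prob=p)
        print(f"P(win) = {p:.2f}: piece rate = {piece:.2f}X, competition = {comp:.2f}X")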
TABLE I
PARTICIPANT CHARACTERISTICSa

                                  Khasi Mean (Std. Dev.)                   Maasai Mean (Std. Dev.)
Individual Characteristics   Pooled       Women        Men           Pooled        Women         Men

Age                         30.9 (16.1)  32.1 (16.7)  28.8 (15.0)   37.8 (13.5)   36.5 (12.1)   38.9 (14.6)
Education                    4.3 (3.6)    4.5 (3.6)    4.1 (3.5)     4.3 (3.9)     4.1 (4.4)     4.5 (3.5)
Income                      23,569       25,794       19,437        195,040       154,294       234,550
                            (76,088)     (93,429)     (20,585)      (400,538)     (341,903)     (448,855)

Activity
  Farmer                    0.60 (0.5)   0.60 (0.5)   0.61 (0.5)    0.73 (0.5)    0.53 (0.5)    0.93 (0.3)
  Student                   0.23 (0.4)   0.21 (0.1)   0.25 (0.4)    0.00 (0.0)    0.00 (0.0)    0.00 (0.0)
  Teacher                   0.05 (0.2)   0.06 (0.2)   0.04 (0.2)    0.00 (0.0)    0.00 (0.0)    0.00 (0.0)
  Housewife                 0.01 (0.1)   0.00 (0.0)   0.04 (0.2)    0.17 (0.4)    0.38 (0.5)    0.00 (0.0)
  Other                     0.05 (0.2)   0.06 (0.2)   0.04 (0.2)    0.07 (0.3)    0.06 (0.2)    0.08 (0.3)
  Unemployed                0.06 (0.2)   0.08 (0.3)   0.04 (0.2)    0.00 (0.0)    0.00 (0.0)    0.00 (0.0)

Marital status
  Single                    0.36 (0.5)   0.33 (0.5)   0.43 (0.5)    0.24 (0.4)    0.18 (0.4)    0.30 (0.5)
  Marr. (mono.)             0.44 (0.5)   0.42 (0.5)   0.46 (0.5)    0.32 (0.5)    0.38 (0.5)    0.28 (0.5)
  Marr. (poly.)             0.00 (0.0)   0.00 (0.0)   0.00 (0.0)    0.36 (0.5)    0.35 (0.5)    0.38 (0.5)
  Widowed                   0.13 (0.3)   0.17 (0.4)   0.04 (0.2)    0.01 (0.1)    0.03 (0.2)    0.00 (0.0)
  Divorced                  0.08 (0.3)   0.08 (0.3)   0.07 (0.3)    0.04 (0.2)    0.03 (0.2)    0.05 (0.2)

Relation to head of household
  HH                        0.38 (0.5)   0.39 (0.5)   0.36 (0.5)    0.53 (0.5)    0.18 (0.4)    0.85 (0.4)
  Spouse                    0.23 (0.4)   0.29 (0.5)   0.11 (0.3)    0.32 (0.5)    0.71 (0.5)    0.00 (0.0)
  Son/daughter              0.36 (0.5)   0.31 (0.5)   0.46 (0.5)    0.09 (0.3)    0.03 (0.2)    0.15 (0.4)
  Brother/sister            0.04 (0.2)   0.02 (0.1)   0.07 (0.3)    0.00 (0.0)    0.00 (0.0)    0.00 (0.0)
  Father/mother             0.00 (0.0)   0.00 (0.0)   0.00 (0.0)    0.03 (0.2)    0.06 (0.2)    0.00 (0.0)

N                           80           52           28            75            34            40

a Age denotes chronological age in years; education denotes years of education; income denotes reported yearly income (Khasi in rupees and Maasai in Tanzanian shillings); marital status denotes whether the individual is single, married (monogamous), married (polygamous), widowed, or divorced; activity denotes the wage earning activities that subjects report; relation to head of household denotes whether the individual is the household head (HH), spouse, son/daughter, brother/sister, or father/mother of the HH. The Maasai women and men observations do not sum to the total observations because we failed to obtain the gender of one participant.
Clearly, this issue is central to inference made from data gathered across any distinct groups, and it highlights that care should be taken when making inference from the data patterns observed herein. Ultimately, what is necessary to shed light on these issues is to build on our work by studying other matrilineal societies.
TABLE II
PARTICIPANT CHOICESa

                              Khasi Mean (Std. Dev.)                   Maasai Mean (Std. Dev.)
                         Pooled       Women        Men           Pooled        Women         Men

Experiment summary
  Compete               0.49 (0.5)   0.54 (0.5)   0.39 (0.5)    0.39 (0.5)    0.26 (0.5)    0.50 (0.5)
  Success               2.38 (1.5)   2.38 (1.6)   2.36 (1.4)    2.78 (1.6)    2.97 (1.7)    2.63 (1.5)
  Earnings              3.46 (3.9)   3.73 (4.2)   2.96 (3.3)    4.02 (4.3)    3.68 (4.0)    4.33 (4.5)
  N                     80           52           28            74            34            40

Those who chose to compete
  Success               2.23 (1.5)   2.25 (1.5)   2.18 (1.5)    2.69 (1.6)    2.33 (2.2)    2.85 (1.3)
  Won–loss–tie          16–14–9      13–10–5      3–4–4         14–13–2       3–6–0         11–7–2
  Earnings              4.46 (5.2)   4.75 (5.3)   3.72 (5.0)    5.86 (6.2)    5.00 (7.7)    6.25 (5.6)

Those who chose not to compete
  Success               2.51 (1.5)   2.54 (1.6)   2.47 (1.4)    2.84 (1.6)    3.20 (1.4)    2.40 (1.7)
  Won–loss–tie          18–20–3      11–11–2      7–9–1         19–18–8       9–9–7         10–9–1
  Earnings if
    choice reversed     4.95 (5.9)   5.42 (6.2)   4.29 (4.3)    5.42 (6.2)    5.60 (6.2)    5.20 (6.3)

a Compete denotes whether the individual opted to compete in the experiment; success denotes the number of successful attempts in the experiment (out of 10 balls tossed); earnings denotes the units earned during the experiment, where the units = successes if the agent chose not to compete, = successes multiplied by 3 if the agent chose to compete and won, = successes if the agent chose to compete and tied, and = 0 if the agent chose to compete and lost; earnings if choice reversed denotes the units foregone because the agent chose not to compete.
The top panel in Table II provides a summary of competitive choices, balls successfully tossed in the bucket, and earnings across gender and societies. Figure 1 complements these summary data with an ocular depiction of the observed choices. In terms of task proficiency, subjects made roughly 25 percent of their attempts, and the rates of success are similar across societies and genders within each society. More importantly for our purposes, roughly half of the Khasi subjects opted to compete, whereas only 39 percent of the Maasai chose to compete. When broken down by gender, the underlying force behind the competitiveness differences across the two societies becomes clear. In the Maasai data, the gender result that we oftentimes observe in the literature is evident: whereas 50 percent of men choose to compete, only 26 percent of women select to compete. Alternatively, as Figure 1 highlights, Khasi women choose to compete more often than Khasi men—whereas 54 percent of Khasi women choose to compete, only 39 percent of Khasi men select the competitive incentive scheme. Perhaps even more surprisingly, the Khasi women select the competitive environment more often than the Maasai men (54% versus 50%).
FIGURE 1.—A summary of competitive choices across gender in the two societies.
Although the raw data summary provides some evidence that behavior varies across the two societies, there has been no attempt to control for observables—such as age, education, and income—that might influence behavior. To rectify this situation, we use the individual observations to estimate a regression model in which we regressed the individual choice to compete on a dummy variable for society, a dummy variable for gender, their interaction, the observables collected from our survey detailed in Table I, and the gender of the experimenter. Due to the dichotomous nature of the regressand, we present estimates from a probit model. Empirical results from several specifications are contained in Table III.

The leftmost regression models in Table III pool the Khasi and Maasai data, and provide a sense of the data patterns across the two societies. The rightmost columns split the data by society, permitting the controls to have a heterogeneous effect in the two societies. Specification 1 (S1) can be considered the parsimonious specification, including only variables that provide the unconditional effect of gender on competition and a control for the gender of the experimenter (male exp. = 1 for a male experimenter, = 0 for a female experimenter). Specification 2 (S2) adds to S1 the individual level variables—age, education, and income—that might be most expected to influence competitive tendencies. Specification 3 (S3) augments S2 by including the full set of controls—work activities, marital status, and relationship to head of household.11

11 Given that many of the cells are not well populated for these other controls (see Table I), we (i) made the activity variable binary (farmer or nonfarmer), (ii) split the marital status variable to also be binary (single or married, where single includes divorced and widowed), and (iii) split the relation to head of household variable to be binary (either head of household or spouse). In the pooled regression models these distinctions never matter; these changes are necessary to yield parameter estimates for the models with Khasi or Maasai data only (rightmost columns). We also experimented with higher order age terms, but they were never significant.
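For readers who want to see the mechanics, here is a minimal sketch of a pooled specification in the spirit of S1: a probit of the compete indicator on a gender dummy, a society dummy, their interaction, and the experimenter's gender, with marginal effects evaluated at the sample means, matching the "partial derivatives computed at the sample means" described in the Table III footnote. The data frame below is a synthetic stand-in (the column names and simulated coefficients are ours, not the authors' data). Note that the implied effect for Khasi women combines the female and interaction terms (e.g., −0.25 + 0.39 ≈ 0.14, the roughly 15 percent figure discussed below).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 154  # pooled sample size reported for S1 in Table III

    # Synthetic stand-in data, loosely shaped like the design.
    df = pd.DataFrame({
        "female": rng.integers(0, 2, size=n),
        "khasi": rng.integers(0, 2, size=n),
        "male_exp": rng.integers(0, 2, size=n),
    })
    latent = -0.6 * df["female"] - 0.2 * df["khasi"] + 1.0 * df["female"] * df["khasi"]
    df["compete"] = (latent + rng.normal(size=n) > 0).astype(int)

    # Probit of compete on female, khasi, their interaction, and the
    # experimenter's gender; marginal effects at the sample means.
    res = smf.probit("compete ~ female * khasi + male_exp", data=df).fit(disp=0)
    print(res.get_margeff(at="mean").summary())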
TABLE III
REGRESSION RESULTSa

                        Pooled Data                           Khasi                             Maasai
               S1        S2         S3          S1        S2         S3          S1        S2         S3

Female        −0.25     −0.29      −0.32        0.15      0.24       0.24       −0.24     −0.29      −0.27
              (0.12)    (0.13)     (0.15)      (0.11)    (0.13)     (0.13)      (0.12)    (0.12)     (0.18)
Khasi         −0.11     −0.14      −0.15        —         —          —           —         —          —
              (0.12)    (0.13)     (0.14)
Khasi×female   0.39      0.43       0.46        —         —          —           —         —          —
              (0.17)    (0.17)     (0.19)
Male exp.      0.007    −0.02      −0.03        0.08      0.19       0.18       −0.07     −0.16      −0.21
              (0.08)    (0.08)     (0.08)      (0.11)    (0.12)     (0.12)      (0.12)    (0.12)     (0.13)
Constant      −0.003    −0.03      −0.09       −0.14     −0.36      −0.34        0.03      0.14      −0.03
              (0.09)    (0.17)     (0.20)      (0.11)    (0.20)     (0.27)      (0.09)    (0.26)     (0.31)
Age            —         0.002      0.002       —        −0.003     −0.002       —         0.001      0.002
                        (0.003)    (0.003)               (0.004)    (0.005)               (0.005)    (0.005)
Education      —         0.005      0.009       —         0.003     −0.006       —         0.003     −0.004
                        (0.01)     (0.01)                (0.02)     (0.02)                (0.02)     (0.02)
Income         —        −0.2e–6    −0.2e–6      —         0.1e–4     0.1e–4      —        −0.3e–6    −0.3e–6
                        (0.2e–6)   (0.2e–6)              (0.4e–5)   (0.4e–5)              (0.2e–6)   (0.2e–6)
Other controls No        No         Yes         No        No         Yes         No        No         Yes
Chi squared    7.3 (4)   9.8 (7)    12.6 (10)   2.0 (2)   11.4 (5)   11.9 (8)    4.7 (2)   9.3 (5)    12.9 (8)
N              154       151        151         80        80         80          74        71         71

a The dependent variable is "compete," and it takes on a value of 1 if the participant opted to compete and 0 otherwise. Standard errors are given in parentheses. Estimates are partial derivatives computed at the sample means from probit models. Variables are as defined in the Table I footnote. Male exp. = 1 if the experimenter was male, = 0 otherwise. Other controls include all of the other variables defined in Table I.
Regardless of which specification is preferred, empirical results suggest that females (males) compete more often than males (females) in the Khasi (Maasai) society. These data patterns are observed in the pooled data models in the leftmost columns, where both the female variable and the Khasi×female interaction are significant at conventional levels. These results suggest, for example, that among the Maasai, women are roughly 25–32 percent less likely to compete than men. For the Khasi, women are roughly 15 percent more likely to compete than Khasi men. In the pooled data, all of the other control variables, including the gender of the experimenter, are not significant at conventional levels. Models that use only Khasi data, presented in the middle columns of Table III, reveal that the observed gender differences are marginally significant.
In S1, the differences are not significant at conventional levels, suggesting that unconditionally there is no strong evidence that Khasi females compete more than Khasi males. Yet, in the two models that include controls to condition on observables, the female coefficient of 0.24 is significant at the p < 0.07 level. These estimates suggest that upon properly controlling for observable differences across subjects, Khasi females are 24 percent more likely to compete than Khasi men. In the robustness tests discussed below, we will find that in most of the empirical specifications this result strengthens.

In the rightmost columns, the specifications using the Maasai data show effects of gender that are opposite to the Khasi data—in two of three models the female coefficient is negative and significant at the p < 0.05 level, with the full-blown model causing the estimate to be measured imprecisely. Among the Maasai, men are found to be 24–29 percent more likely to compete than women. In the robustness tests discussed below, the Maasai results become less statistically significant in certain models.

Concerning impacts of the other regressors in the society-specific models, in the Khasi data we observe some evidence of an experimenter effect—in this case, both male and female subjects are about 18 percent more likely to compete when the experimenter is a male, an effect that is only marginally significant. In addition, participants with higher incomes opted to compete slightly more often, though again the effect is only marginally significant. Interestingly, the only control variable that approaches statistical significance in the Maasai data is the gender of the experimenter. In this case, subjects tend to compete less when the experimenter is a male, and this effect approaches statistical significance in S3.12

4. ROBUSTNESS TESTS

4.1. Group Composition

One aspect of the experimental design on which we chose to remain neutral was the identity of the subject's potential competitor. This choice was in the spirit of the recent literature that begins with an exploration of the underlying subject preferences and leaves the opponent's gender ambiguous (see, e.g., Gneezy, Niederle, and Rustichini (2003)). While in and of itself this choice does not present an inferential problem for our purposes, what is potentially troubling is the fact that our samples are unbalanced across societies: 52 of 80 Khasi subjects are female, whereas only 34 of 74 Maasai subjects are female. If subjects deduced the gender distribution of potential competitors, then our preferred interpretation might be compromised.

12 We interacted gender of the experimenter with subject gender and this variable was never significant in either society.
For example, if women are more likely to compete against other women regardless of whether they are from a matrilineal society, then we might be simply observing a consequence of the subject pool rather than a fundamental preference for competition. Considering the literature on gender and self-identity (Cross and Madson (1997)), this is an important consideration.

As previously mentioned, in each society we executed the treatments in sessions, whereby the subjects in each session were split into one of two groups randomly. These two groups were separated for the entire experiment. Similar procedures were used across the societies to ensure that subjects were unaware of the identity of potential competitors. Importantly for our purposes, in each society subjects were lined up to participate and were called one by one to participate. Whether, and how, subjects deduced the gender composition of potential competitors is unknown, but it is plausible that subjects made inferences about the gender distribution in the experiment from what they observed in their own surroundings.

Since we do not know exactly what subjects observed within their own group, we use a broad array of sensitivity checks to model empirically the effect of group composition. This is possible because we have data on the exact order in which subjects completed the experiment within their session. We proceed by exploring "nearest neighbor" variables, systematically enlarging the neighborhood over which they are computed and including each as an active control variable in the regression model estimated above (S3); the construction is sketched below. A first empirical augmentation simply includes a variable that depicts the gender of the subject standing immediately in front of the person (where male = 1). The next empirical specification uses the arithmetic average of the gender identity of the directly adjacent subjects.13 A third model uses the arithmetic average of the gender identity of the four nearest subjects. A fourth model expands this variable to be the average of the gender identity of the eight nearest subjects. A fifth and final model is entirely exhaustive: the arithmetic average of the gender identity of all others in the group.

Table IV contains summary empirical results from estimation of these models. The columns in Table IV represent the various specifications of the group composition variable. The three panels of Table IV present the results for the pooled data, the Khasi data, and the Maasai data, in a manner consonant with Table III. Although all of the controls of S3 are included, we present only the results of interest. Most importantly, in the pooled data all of the previously discussed empirical results hold across every model, suggesting that we are finding evidence of competitive preference differences across gender, and not merely observing a consequence of the subject pool. When we split the data by society, the results become stronger in certain specifications.

13 This variable equals 1 for those subjects who are standing in line between two men, 0.5 for those subjects standing in line between one man and one woman, and 0 for those subjects who are standing in line between two women. Subjects at the front and end of each line have only one adjacent neighbor.
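To make the construction concrete, here is a small sketch of how these regressors can be computed from the order in which subjects stood in line. The radius-based reading of "adjacent 2/4/8" (one, two, or four places on each side, with end-of-line subjects simply having fewer neighbors, as footnote 13 describes) and the coding of the first subject's "in front" value are our own assumptions for illustration.

    def adjacent_composition(genders, radius):
        """Average gender (male = 1) of all subjects within `radius`
        places in line; ends of the line have fewer neighbors."""
        out = []
        for i in range(len(genders)):
            window = [g for j, g in enumerate(genders)
                      if j != i and abs(j - i) <= radius]
            out.append(sum(window) / len(window))
        return out

    def in_front(genders):
        """Gender of the subject immediately in front; the first
        subject in line has no one in front (coded None here)."""
        return [None] + list(genders[:-1])

    line = [1, 0, 0, 1, 0, 1]  # hypothetical ordering, male = 1
    print(in_front(line))                        # "In Front"
    print(adjacent_composition(line, radius=1))  # "Adjacent 2"
    print(adjacent_composition(line, radius=2))  # "Adjacent 4"
    print(adjacent_composition(line, radius=4))  # "Adjacent 8"
    print(adjacent_composition(line, radius=len(line)))  # "Group"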
TABLE IV
GROUP COMPOSITION ROBUSTNESS TESTSa

                                          Specification
                     In Front      Adjacent 2    Adjacent 4    Adjacent 8    Group

Pooled data
  Female            −0.38 (0.16)  −0.42 (0.16)  −0.41 (0.16)  −0.43 (0.16)  −0.38 (0.16)
  Khasi             −0.25 (0.16)  −0.28 (0.16)  −0.25 (0.15)  −0.28 (0.16)  −0.23 (0.16)
  Khasi×female       0.60 (0.22)   0.65 (0.23)   0.56 (0.20)   0.58 (0.20)   0.53 (0.20)
  Group Composition −0.16 (0.10)  −0.19 (0.12)  −0.28 (0.17)  −0.35 (0.21)  −0.23 (0.25)
  Other controls    Yes           Yes           Yes           Yes           Yes
  N                 141           151           151           151           151
  Chi squared       13.8 (11)     15.2 (11)     15.4 (11)     15.5 (11)     13.5 (11)

Khasi data
  Female             0.36 (0.15)   0.34 (0.15)   0.24 (0.14)   0.25 (0.14)   —
  Group Composition −0.28 (0.15)  −0.25 (0.18)  −0.68 (0.36)  −0.95 (0.51)   —
  Other controls    Yes           Yes           Yes           Yes            —
  N                 78            80            80            80             —
  Chi squared       16.1 (9)      13.9 (9)      15.6 (9)      15.6 (9)       —

Maasai data
  Female            −0.25 (0.20)  −0.27 (0.20)  −0.27 (0.19)  −0.31 (0.20)  −0.34 (0.20)
  Group Composition  0.09 (0.15)  −0.008 (0.17) −0.007 (0.21) −0.12 (0.25)  −0.20 (0.27)
  Other controls    Yes           Yes           Yes           Yes           Yes
  N                 63            71            71            71            71
  Chi squared       14.2 (9)      12.9 (9)      12.9 (9)      13.2 (9)      13.5 (9)

a The dependent variable is "compete," and it takes on a value of 1 if the participant opted to compete and 0 otherwise. Each column represents a unique model that uses a different group composition regressor. "In front" is a variable that depicts the gender of the subject standing immediately in front of the person (where male = 1); "adjacent n" uses the arithmetic average of the gender identity of the n nearest subjects. "Group" is entirely exhaustive—the arithmetic average of the gender identity of all others in the group—this model is not estimable using the Khasi data alone because each group had an identical composition. Standard errors are given in parentheses. Estimates are partial derivatives computed at the sample means from probit models. Variables are defined in the Table I footnote. "Other controls" include all of the other variables defined in Table I.
For example, in the second column of the middle panel of Table IV (specification "In Front"), we find that in the Khasi data, women are 36 percent more likely to compete than men, a coefficient that is significant at the p < 0.01 level. Although in this same empirical specification the Maasai gender result is not close to being significant at conventional levels, it gains marginal significance as the group composition variable becomes more encompassing.
Concerning the effects of the group composition variable, which are only marginally significant, we find a negative correlation in the pooled data. Yet, in the Khasi data, we observe a more consistent negative effect that gains statistical significance in some of the specifications, particularly in the models that allow a broader scope of peer effects. The inference from these models is that as the proportion of males in one's group increases, the probability of choosing the competitive option decreases. The effect is also found to be negative in most of the specifications using the Maasai sample, but the t-ratios on these coefficients never reach unity. Overall, if subjects were making inferences about potential competitors based on the mix they observed, then the effects reported in Table IV are consonant with the notion that women are more likely to compete against other women, especially in matrilineal societies.

4.2. An Exploration Into Who Competes

Recall that our design was chosen to explore initial competitive inclinations, rather than observe choices in games that were commonplace. In doing so, we aim to capture insights into the primitive competitive preferences among agents rather than the preferences bundled with stereotypes about the task, societal expectations, and the like. In this manner, it is interesting to examine the success rates among those who chose to compete versus those who chose not to compete. Since the experimental game is like no other game or task in which the agents in either society typically participate, we do not have strong priors on whether those who are more efficient at the task will choose to compete. This is buttressed by results in the literature which find that, even in those cases where subjects have just executed the task and received performance feedback, those who perform well are not significantly more likely to compete or to perform better if they do choose to compete (Niederle and Vesterlund (2007); see also Datta Gupta, Poulsen, and Villeval (2005)). The expected positive relationship between task proficiency and selection into the competitive setting is further muddled if one considers results in Gneezy, Niederle, and Rustichini (2003), which suggest the competitive environment itself might induce differences in task proficiency. Together, the literature teaches us that any effort to deduce selection is quite difficult, even in experienced tasks. Our experimental game therefore represents a particularly demanding task in which to find a positive correlation between the competitive choice and success rates.

The raw statistics in the middle and bottom portions of Table II paint a picture consonant with the literature—we find no evidence of any significant correlation between task proficiency and the decision to compete. What is interesting in the data is that Khasi women and Maasai men who chose to compete (i) earned the highest amount of money in their respective societies and (ii) were most likely to win the competition. For instance, the Khasi women won 13 times and lost 10, and their win rate represents the highest of any of the four Khasi groups.
Compared to Maasai women, Khasi women are more likely to select correctly, perhaps because Khasi women have a more accurate sense of their relative abilities.14 Similar data patterns are observed among Maasai men, where those who chose to compete won 11 times and lost 7, a win rate that exceeded all the other observed win rates. Furthermore, again we find that Maasai men seem to have a better understanding of relative ability than Khasi men. Clearly, however, these data patterns should be considered as only suggestive, as more work is necessary to further our understanding of the sources of such differences.

4.3. Risk Aversion

One aspect of the results in Table II, and of the broader results reported in this study, that should be considered more carefully is whether risk aversion is playing an important role in individual choices. We should stress that the manner in which we use the term "competitiveness" in this study is meant to be a catchall phrase that might be due to deeper underlying preferences, such as risk aversion. Nevertheless, risk aversion might explain the data patterns observed in Table II if the more able participants also happen to be the most risk averse. In addition, the fact that a large portion of subjects did not choose to compete—even though, with our payoff function, they should enter the competition if they believe that they would win with at least 33 percent probability—hints at some level of risk aversion. This possibility is particularly relevant in our experiments since the stakes might well be considered large (several days' wages).

To lend insights into these issues, we conducted parallel risk aversion experiments to explore whether the competitive differences might be driven by heterogeneous risk postures across gender groups. To operationalize a simple procedure that measures the propensity to take risks, we made use of a standard risk game (Gneezy and Potters (1997), Haigh and List (2005)) and followed the procedures in these studies as closely as possible.15 Appendix C contains the experimental instructions. Briefly, the risk experiment has subjects play a one-shot game in which they are endowed with 100 units (40 rupees for the Khasi and 1000 shillings for the Maasai). The subject must decide what portion of this endowment [0, 100] he or she desires to bet in a lottery that returns three times the bet with one-half probability and nothing with one-half probability.

14 We thank an anonymous referee for urging us to proceed in this direction.
15 We also conducted a standard investment game (see, e.g., Fehr and List (2004)): we find no differences in propensities to invest across gender in either society. These results are available upon request. Additional tables are available as Supplemental Material (Gneezy, Leonard, and List (2009)).
As illustrated in the experimental instructions contained in Appendix C, subjects were made aware of the probabilities, the payoffs, and the fact that the lottery would be played directly after choices were made. Subjects were therefore aware of the fact that they could earn anywhere between 0 and 300 units from this task. Last, subjects were informed that monies earned would be paid in private at the end of the experiment.

A few noteworthy items should be mentioned before proceeding to the results discussion. First, we chose the stakes to overlap with the stakes over which the ball tossing game would be played.16 Second, experimental subjects for the risk aversion task were again drawn randomly from the two societal populations, but the subject pool has no overlap with the subject pool that played the ball tossing game. This was done to avoid contamination effects while still providing a glimpse of gender differences in risk preferences. Third, beyond using these data to dig deeper into the underlying mechanism at work in this environment, these data might be interesting in their own right considering the recent findings in Henrich and McElreath (2002). They reported that there are no systematic differences in the risk preferences of men and women in two traditional societies, including the Sangu, who live just south of the Maasai in Tanzania.

Table V presents the summary choices, split by gender across the two societies. In short, we report results consonant with Henrich and McElreath (2002): although the Khasi and Maasai appear to have different risk preferences, there are no gender differences observed in either society.17 Both male and female Khasi risk approximately 85 percent of their total endowment, whereas among the Maasai the average bet represents approximately 60 percent of the total endowment. A two-sample t-test cannot reject the hypothesis that the gambled amount for Khasi (Maasai) women equals the gambled amount for Khasi (Maasai) men, with p = 0.74 (p = 0.92).

16 In each group, the initial amount is equivalent to the payment for two successes if the participant chose piece rate, and the maximum payoff is equivalent to the payment for two successes if the participant chose competition and won.
17 We should highlight that in a document titled "Internet Enhancements for 'Are Peasants Risk-Averse Decision-Makers?'," Henrich (2002a) expands on the results from Henrich and McElreath (2002) by showing that there are some differences between men and women in a pastoral Sangu village, but not between men and women in an agricultural Sangu village. Among the predominantly pastoral Maasai, we find no such difference.
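Since a risk-neutral subject values a bet of b at (100 − b) + 0.5(3b) = 100 + b/2, betting the entire endowment is the risk-neutral benchmark, and interior average bets like those in Table V point to risk aversion. The sketch below makes this explicit and adds a CRRA illustration; the utility function and its curvature parameter are our own illustrative assumptions, not anything estimated in the paper.

    import math

    def expected_payoff(bet, endowment=100.0):
        """Risk-neutral value: keep endowment - bet, win 3*bet w.p. 1/2."""
        return (endowment - bet) + 0.5 * (3.0 * bet)

    def crra_expected_utility(bet, endowment=100.0, gamma=2.0):
        """Expected CRRA utility; gamma is an assumed curvature."""
        def u(c):
            return math.log(c) if gamma == 1 else c ** (1 - gamma) / (1 - gamma)
        return 0.5 * u(endowment - bet) + 0.5 * u(endowment + 2.0 * bet)

    print(expected_payoff(100))  # 150.0: a risk-neutral subject bets it all
    best = max(range(0, 100), key=crra_expected_utility)
    print(best)  # an interior optimal bet under the assumed curvature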
TABLE V
RAW DATA SUMMARY FOR THE RISK AVERSION GAME^a

Average Bet (Standard Deviation)

                   Khasi Women    Khasi Men    Maasai Women    Maasai Men
Proportion bet     86.5 (3.3)     85.0 (4.0)   60.7 (4.1)      61.3 (4.2)

^a The amount in each cell is the average (standard deviation) amount bet out of 100. A two-sample t-test (assuming equal variances) fails to reject the hypothesis that the bets for Khasi women differ from the bets for Khasi men (p-value 0.74) and fails to reject the hypothesis that the bets for Maasai women differ from the bets for Maasai men (p-value 0.92).
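Both benchmarks invoked in this section follow directly from the payoff rules stated in the text and in Appendices A and C (the first calculation abstracts from ties in the tournament). With a piece rate of $w$ per success and a tournament rate of $3w$ per success paid only upon winning, a participant with $s$ successes who assigns probability $p$ to winning prefers the tournament when
\[
p \cdot 3w \cdot s \;\geq\; w \cdot s \quad\Longleftrightarrow\quad p \geq \tfrac{1}{3},
\]
which is the 33 percent entry threshold cited above. Similarly, in the risk game, a bet of $b$ from the 100-unit endowment yields expected earnings
\[
(100 - b) + \tfrac{1}{2}\cdot 3b \;=\; 100 + \tfrac{b}{2},
\]
which is strictly increasing in $b$, so a risk-neutral subject would bet the entire endowment; the average bets of roughly 85 (Khasi) and 60 (Maasai) in Table V are therefore consistent with some degree of risk aversion in both societies.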
To the best of our knowledge, this is the first demonstration of such a reversal. In this section, we explore three possible explanations for this result: nature, nurture, and the co-evolution of nature and nurture.

In an extreme sense, one can consider the nature hypothesis to be one whereby women are inherently less competitive than men due to innate differences.18 Our data rule out this simplistic view of gender and competition. However, the fact that Khasi women have different preferences than other women does not prove that their behavior is not genetically determined. It is possible that some feature of their environment caused the Khasi to follow a different evolutionary path and, therefore, to have different psychological profiles from other women. Were this true, matriliny and matrilocal marriage may be outcomes, not causes, of female competitiveness. However, the evolution of behavioral characteristics is thought to take place on a scale that would rule out such a process. Evolutionary psychologists maintain that the human mind was formed by 1 million years of common evolution that ended only 12,000 years ago, and it is therefore impossible that systematic psychological differences across populations of humans can be caused by evolution (Daly and Wilson (1983), Campbell (2002)). At the opposite end of the scientific spectrum, there is the view of human behavior as a "tabula rasa" (the mind as a blank slate).
18 This, of course, does not suggest that all women are less competitive than men, but rather that there exists a difference in the distribution of types. A large body of literature in evolutionary biology and sociobiology documents differences in competitiveness between males and females across a myriad of species. Such differences in competitiveness are said to arise because of differences in the cost of reproduction and the level of investment in offspring. Because the costs associated with raising offspring are higher for females than for males, females will increase the fitness of their genes by ensuring the survival of fewer children. Males, on the other hand, increase their fitness by competing for access to the most fertile or fit females. Thus, for women, reproductive success is partially independent of the success of other females, whereas for men, reproductive success comes at the cost of the success of other males. We direct the interested reader to Knight (2002) or Tregenza and Wedell (2002) for recent overviews. The debate is a classic in the field (see Darwin (1871), Bateman (1948), and Trivers (1972)).
Thus, the nurture hypothesis is that competitiveness differences are not due to biological or evolutionary reasons, but rather to culture.19 Gender socialization begins at the moment we are born, with the simple question "Is it a boy or a girl?" (Gleitman, Fridlund, and Reisberg (2000, p. 499)), and societal gender roles are taught to us by, or learned through the imitation of, family, peers, and the media.20 This view of human behavior has suffered a setback over the past few decades as evidence of the genetic origins of abnormal and normal behavior accumulates (Ilies, Arvey, and Bouchard (2006)). For example, many personality traits have been shown to be highly heritable in twin studies (Turkheimer (2004), Loehlin (1993)). Notwithstanding this evidence that many behavioral traits are genetically determined, it is possible that preferences for competition are not and, therefore, that features of the culture or environment determine the degree to which females avoid or seek competition.

Recent literature reminds us that subtle differences in culture can lead to large differences in behavior. Genetic endowments can strongly determine behavior within an environment, but may do a poor job of explaining differences across environments. For example, IQ (which many believe to be genetically determined) has increased at a rate of 0.30 points per year for at least the last 100 years (Flynn (2007, p. 112)). One reasonable interpretation of this fact is that genetic endowment leads to valid IQ differences within one cohort but not across cohorts facing widely differing environments. Madsen (1967) and Shapira and Madsen (1969) found differences in attitudes about cooperation versus competition between rural and urban children from similar cultures in Mexico, and urban and kibbutz children in Israel. Eisenberg and Mussen (1989) suggested that "in poor agricultural communities, children must cooperate in working with other members of their families to raise enough food for the family's survival." There is an extensive literature on the methods by which children learn behavior (Whiting (1963), Eisenberg and Mussen (1989), Harris (1998)). Thus, any number of subtle influences on children or adults can cause differences in attitudes to competition even if the behavior is broadly framed by genetic endowment.
19 An entertaining twist highlighting the power of this argument can be found in the 1988 movie Twins, which starred Arnold Schwarzenegger, a physically perfect and innocent man, and his twin, Danny DeVito, a short, overweight small-time crook. Such differences are suggested to have occurred because Schwarzenegger was raised in a pristine environment, whereas DeVito spent his childhood on the streets.
20 Socialization and gender socialization begin in early childhood (Martin, Wood, and Little (1990)), are taught by family, and are reinforced by culture (Burn (1996), Basow (1980), Crespi (2003)), teachers (Eisenberg and Mussen (1989)), and peers (Harris (1998)). Nothing in this process requires a view of culture as something forced on children. Indeed, the human species is particularly adept at social learning and many see enculturation as a process of voluntary imitation of successful individuals (Henrich and Gil-White (2001)). The socialization basis of gender differences is not limited to young ages; see, for example, Riley Bowles, Babcock, and Lai (2004).
The underlying cultural differences between the Khasi and the Maasai offer room for speculation on environmental factors. Khasi girls are raised in the same household where they spend their whole lives (matrilocal) and, as heads of households, they enjoy significant authority over important decisions (matriliny). Thus, the fact that women can be raised exclusively for the benefit of their mothers' and grandmothers' households may mean that innate competitiveness need not be discouraged, or even that competitiveness is actively encouraged. In addition, if pastoral groups such as the Maasai have different attitudes toward competition than agricultural groups such as the Khasi, this could help to explain the differences between Maasai men and Khasi men.21

Boyd and Richerson (2005) argued that social learning is a more natural form of cultural transmission than explicit training or socialization. Individuals may choose to copy successful individuals as much as, if not more than, common individuals, and certainly do not need to be told that they should imitate. Whereas Maasai men are explicitly indoctrinated during the transition to manhood, Khasi women are not indoctrinated, but may choose to imitate the behavior of older women in their households or successful women in their social circles: prestige-based learning. Henrich and Gil-White (2001) suggested that freely conferred deference (prestige) is an adaptation that allows potential imitators proximity to individuals who represent models of successful behavior. This model of social learning highlights the facts that individuals may choose whom to imitate, that access and proximity improve the fidelity of social learning, and that those who are imitated gain status from being imitated. In this vein, the Khasi institutions of matrilocal residence and matrilineal inheritance may perform a similar role in cultural transmission. The fact that women live in (or next to) their maternal grandmothers' residences for their whole lives allows access and proximity (though only to mothers, aunts, great-aunts, and grandmothers). In addition, Khasi women are in a position to pass on accumulated wealth, and if competitiveness is differentially rewarded, women who learn competitiveness from their mothers will benefit both from their own efforts and from those of their mothers. Furthermore, female heads of households, even if they do not gain status by being imitated by their daughters, have an incentive to encourage success in their daughters. Unlike families in other societies, the household can gain directly from the long-term successes of their daughters.

However, we asked men and women in both cultures to perform unfamiliar tasks in an isolated setting where learning and imitation were not possible. Faced with a new task, Khasi men and women make a choice for themselves, based on their own preferences. Would women who are imitating competitive women choose to be competitive in an experimental setting? Does imitation lead to preferences for competition or simply preferences for activities that

21 Henrich and McElreath (2002) suggested that risk preferences could vary between agricultural and pastoral groups, opening the possibility that competitive preferences may also differ.
happen to be competitive? Henrich and McElreath (2002) suggested that men and women in Sangu and Mapuche cultures are risk-loving as revealed by their preferences in an experimental setting, but that imitation of successful men and women causes them to make choices in their everyday lives that reflect risk aversion. Imitated behavior is generally context-specific, whereas the experiment, if properly done, lacks cultural cues. If this view of the experimental setting is correct, Khasi women are not imitating the successful strategies of women when they choose competition; they are displaying their true preferences for competition.

Our data—and the fact that the experimental task reveals true, not imitated, behavior—are also consistent with a series of models that are intermediate between those of nature and nurture. In such models, both biology and society play a role in forming preferences, not just observed behaviors, and the impact of current societal features may be less important than the impact of past societal features. Those Khasi institutions that favor the transmission of a behavior through social learning also favor the transmission of genetically inherited characteristics, such as innate competitiveness. Many scholars suggest that the view of the human mind as undifferentiated across cultures fails to take into account the possibility that culture and genetics can interact. Mathematical modeling of gene-culture co-evolution suggests that when a particular genetic characteristic favors the transmission of a particular cultural feature, and that cultural feature also increases the fecundity of the genetic characteristic, evolution can occur at a much faster pace (Feldman and Cavalli-Sforza (1976), Cavalli-Sforza and Feldman (1981), Boyd and Richerson (1985, 2005), Laland and Brown (2002), Mesoudi and Laland (2007)).

The possibility of gene-culture co-evolution leads us to focus primarily on the institutions of matrilocal residence and matrilineal inheritance. First, as we have noted above, matrilocal residence creates a particular relationship between mothers, grandmothers, and daughters that benefits all three women. Second, matrilineal inheritance can reinforce any genetic tendency to competition by passing on both wealth and genetic disposition to daughters. In an environment of high childhood mortality, wealth can greatly increase the probability of survival.22 In addition, there is evidence that genetic inheritance is different in cultures that practice matrilocal marriage. Oota, Settheetham-Ishida, Tiwawech, Ishida, and Stoneking (2001) found significant variation in Y-chromosome features, but less variation in maternal DNA for matrilocal tribes in Thailand, and found the opposite for patrilocal groups in the same location. In other words, the cultural choice to displace men or women from their maternal homes, by itself, alters the process of genetic inheritance. However, the very process that would favor genes linked to competitiveness (if they exist) would also favor competitiveness learned from the imitation of
22 Indian census data from 1891–1911 suggest that only 50 percent of girls survived to the age of 15 (Mari Bhat (1989)). In addition, Pritchett and Summers (1996) estimated the short-term elasticity of child mortality with respect to income at about −0.2.
successful women. A model in which competitiveness improves the "evolutionary fitness" of the institutions of matriliny and matrilocal marriage, and the institutions of matriliny and matrilocal marriage increase the "evolutionary fitness" of competitiveness does not require biological evolution of DNA. Girls who imitate the behavior of successful competitive women are more likely to survive childhood and will inherit greater wealth if those women are also their mothers or grandmothers. In turn, their wealth and success make them more preferable as models for younger girls (likely their daughters and nieces) and more likely to have surviving children. This process is subtly different from that outlined under the nurture hypothesis. If competitiveness has evolved (biologically or socially) over time, it is not necessary that Khasi families teach their daughters to be competitive. Rather, the prevalence of competitiveness in the society could increase over time due to the superior fitness of this personality trait within this institutional environment, whether it is learned through imitation or is inherited genetically. In addition, this view suggests that current cultural features might be less important than past cultural features in explaining current preferences; evolution of socially learned behavior is not instantaneous.

5. CONCLUDING REMARKS

The link between gender and competition has been shown in several recent experimental studies. The importance of these results should not be understated: in both a positive and normative sense, these insights have the potential to explain important puzzles in economics and in social science more generally. In this study we use an experimental task to explore whether there are gender differences in selecting into competitive environments across two distinct societies: the Maasai in Tanzania and the Khasi in India. The societies are unique in that the Maasai represent an example of a patriarchal society, whereas the Khasi are matrilineal. We observe some interesting data patterns. For example, Maasai men compete at roughly twice the rate of Maasai women, evidence that is consistent with data from Western societies that use different tasks and smaller relative stake levels. Yet, this data pattern is reversed among the Khasi, where women choose the competitive environment more often than Khasi men.

We interpret these results as potentially providing insights into the underlying sources of the observed gender differences. We should, however, caution the reader that even though we find suggestive results, care should be taken when making inference from the data patterns observed herein because several important factors vary across the two societies. In addition, we have sampled a limited number of villages. We suspect that our results will not be a universal truth amongst all matrilineal villages; rather, other important factors will interact with matriliny to produce the data patterns observed herein. More research is certainly warranted.

Viewed through the lens of extant models, our results might have import within the policy community. For example, policy-makers often are searching
for efficient means to reduce the gender gap. If the difference in reaction to competition is based primarily on nature, then some might advocate, for example, reducing the competitiveness of the education system and labor markets to provide women with more chances to succeed. If the difference is based on nurture, or an interaction between nature and nurture, on the other hand, public policy might be to target socialization and education at early ages as well as later in life to eliminate this asymmetric treatment of men and women with respect to competitiveness. Our study suggests that there might be some value in this second avenue. We trust that future research will refine this insight and more thoroughly explore the sources of gender preference differences.

APPENDIX A: EXPERIMENTAL PROTOCOL (KHASI SESSIONS)

Welcome to this study of decision-making. The experiment will take about 15 minutes. The instructions are simple, and if you follow them carefully, you can earn a considerable amount of money. All the money you earn is yours to keep and will be paid to you, in cash, immediately after the experiment ends. In addition to any earnings you might have in this task, you will be paid 20 rupees to participate.

The task that we ask you to perform today is throwing this ball into this bucket from this line. (Show them the ball, bucket, and line.) You will have 10 tries. We now ask you to choose one of two options according to which you will be paid in the experiment.

OPTION 1: If you choose this option, you will get 20 rupees for each time you get the ball in the bucket in your 10 tries. So if you succeed 1 time, then you will get 20 rupees. If you succeed 2 times, then you will get 40 rupees. If you succeed 3 times, you will get 60 rupees, and so on.

OPTION 2: If you choose this option, you will receive a reward only if you succeed more times than the person who is playing in the next room. If you succeed more than this person, you will be paid 60 rupees for every time you succeed. So if you succeed 1 time, then you will get 60 rupees. If you succeed 2 times, then you will get 120 rupees. If you succeed 3 times, you will get 180 rupees and so on. But you will only receive a reward if you are better than the person in the next room. If you both succeed the same number of times, you will both get 20 rupees for each success.

We now ask you to choose how you want to be paid: according to Option 1 or Option 2. Now you may play.

Record both their ID number and their choice. Allow the participant to toss the balls and record the result on the back of his/her ID card. You can record the result of each toss with a check mark and
X (check mark for success and X for failure). At the end of the 10 tosses, write the total number of successes on the back of the card and the money value of each toss (based on his/her choice). Also write down whether he/she succeeded more than his/her opponent with Y or N. For example: ✓✓✓X✓XX✓✓✓ 7×20 Y. You do not need to write the total payment on the card. Tell the participant he/she must go to the person who will fill out an exit survey. Once he/she has filled out this survey, he/she should take the card and the survey to the "cashier" and he/she will receive payment.

If they ask you what to do: Tell them that you cannot give them advice about what to choose and offer to read the script to them again.

APPENDIX B: INDIVIDUAL CHARACTERISTICS SURVEY (USED WITH KHASI AND MAASAI)
APPENDIX C: EXPERIMENTAL INSTRUCTIONS FOR RISK AVERSION GAME (KHASI SESSIONS)

Welcome to this study of decision-making. The experiment will take about 15 minutes. The instructions are simple, and if you follow them carefully, you can earn a considerable amount of money. All the money you earn is yours to keep and will be paid to you, in cash, immediately after the experiment ends. In addition to any earnings you might have in this task, you will be paid 20 rupees to participate.

At the beginning of this experiment you will receive 40 rupees. You are asked to choose the portion of this amount (between 0 and 40) that you wish to invest in a risky option. The rest of the money will be accumulated in your total balance. The risky investment: there is an equal chance that the investment will fail or succeed. If the investment fails, you lose the amount you invested. If the investment succeeds, you receive 3 times the amount invested. How do we determine if you win? After you have chosen how much you wish to invest, you will toss a coin to determine whether you win or lose. If the coin comes up heads, you win 3 times the amount you chose to invest. If the coin comes up tails, you lose the amount invested.

Examples
1. If you choose to invest nothing, you will get the 40 rupees for sure. That is, the coin flip would not affect your profits.
2. If you choose to invest all of the 40 rupees, then if the coin comes up heads, you win 120 rupees, and if the coin comes up tails, you win nothing and end up with 0.
3. If you choose to invest 20, then if the coin comes up heads, you win 80 (20 + 3 × 20), and if the coin lands on tails, you win 20.

Do you have any questions? Ask them how much they would like to invest.

REFERENCES

AHMED, S. Z. (1994): "What Do Men Want?" The New York Times, February 15th, A21. [1639,1640]
ALTONJI, J. G., AND R. BLANK (1999): "Race and Gender in the Labor Market," in Handbook of Labor Economics, Vol. 3c, ed. by O. Ashenfelter and D. Card. Amsterdam: Elsevier, 3144–3259. [1637]
AGRO-ECONOMIC RESEARCH CENTER FOR NORTH EAST INDIA (1969): Rural Life in the Assam Hills: Case Studies of Four Villages. Studies in Rural Changes—Assam Series. Calcutta: K. L. Mukhopadhyaya. [1641]
BARON-COHEN, S. (2003): The Essential Difference: Men, Women, and the Extreme Male Brain. London: Allan Lane. [1638]
BARRES, B. A. (2006): "Does Gender Matter?" Nature, 442, 133–136. [1638]
BASOW, S. A. (1980): Sex-Role Stereotypes: Traditions and Alternatives. Monterey, CA: Brooks/Cole Publishing Company. [1655]
BATEMAN, A. J. (1948): "Intra-Sexual Selection in Drosophila," Heredity, 2, 349–368. [1654]
BLAU, F. D., AND L. M. KAHN (1992): "The Gender Earnings Gap: Learning From International Comparisons," American Economic Review, 82, 533–538. [1637]
(2000): "Gender Differences in Pay," Journal of Economic Perspectives, 14, 75–99. [1637]
BLAU, F. D., M. FERBER, AND A. WINKLER (2002): The Economics of Women, Men and Work (Fourth Ed.). Englewood Cliffs, NJ: Prentice Hall. [1637]
BOYD, R., AND P. J. RICHERSON (1985): Culture and the Evolutionary Process. Chicago: Chicago University Press. [1638,1657]
(2005): The Origin and Evolution of Culture. Oxford, U.K.: Oxford University Press. [1638,1656,1657]
BURN, S. M. (1996): The Social Psychology of Gender. New York: McGraw-Hill. [1655]
CAMPBELL, A. (2002): A Mind of Her Own: The Evolutionary Psychology of Women. Oxford, U.K.: Oxford University Press. [1637,1639,1654]
CAVALLI-SFORZA, L. L., AND M. W. FELDMAN (1981): Cultural Transmission and Evolution: A Quantitative Approach. Princeton, NJ: Princeton University Press. [1657]
CRESPI, I. (2003): "Gender Socialization Within the Family: A Study on Adolescents and Their Parents in Great Britain," available at http://www.iser.essex.ac.uk/files/conferences/bhps/2003/docs/pdf/papers/crespi.pdf. [1655]
CROSS, S. E., AND L. MADSON (1997): "Models of Self: Self-Construals and Gender," Psychological Bulletin, 122, 5–37. [1649]
DALY, M., AND M. WILSON (1983): Sex, Evolution and Behavior. Belmont, CA: Wadsworth. [1654]
DARWIN, C. (1871): The Descent of Man, and Selection in Relation to Sex. London: John Murray. [1654]
DATTA GUPTA, N., A. POULSEN, AND M. C. VILLEVAL (2005): "Male and Female Competitive Behavior—Experimental Evidence," available at http://ideas.repec.org/p/iza/izadps/dp1833.html. [1637,1651]
EISENBERG, N., AND P. H. MUSSEN (1989): The Roots of Prosocial Behavior in Children. Cambridge: Cambridge University Press. [1655]
FEHR, E., AND J. A. LIST (2004): "The Hidden Costs and Returns of Incentives—Trust and Trustworthiness Among CEOs," Journal of the European Economic Association, 2, 743–771. [1652]
FELDMAN, M. W., AND L. L. CAVALLI-SFORZA (1976): "Cultural and Biological Evolutionary Process, Selection for a Trait Under Complex Transmission," Theoretical Population Biology, 9, 238–259. [1657]
FLYNN, J. R. (2007): What Is Intelligence?: Beyond the Flynn Effect. Cambridge: Cambridge University Press. [1655]
GLEITMAN, H., A. J. FRIDLUND, AND D. REISBERG (2000): Basic Psychology. New York: Norton & Company. [1655]
GNEEZY, U., AND J. POTTERS (1997): "An Experiment on Risk Taking and Evaluation Periods," Quarterly Journal of Economics, 112, 631–645. [1652]
GNEEZY, U., AND A. RUSTICHINI (2004): "Gender and Competition at a Young Age," American Economic Review Papers and Proceedings, May, 377–381. [1637]
(2005): "Executives versus Teachers: Gender, Competition and Selection," Working Paper, University of California, San Diego. [1637]
GNEEZY, U., K. L. LEONARD, AND J. A. LIST (2009): "Supplement to 'Gender Differences in Competition: Evidence From a Matrilineal and a Patriarchal Society'," Econometrica Supplemental Material, 77, http://www.econometricsociety.org/ecta/Supmat/6690_Tables.zip. [1652]
GNEEZY, U., M. NIEDERLE, AND A. RUSTICHINI (2003): "Performance in Competitive Environments: Gender Differences," Quarterly Journal of Economics, 118, 1049–1074. [1637,1648,1651]
GOLDBERG, S. (1993): Why Men Rule: A Theory of Male Dominance. Peru, IL: Open Court. [1639]
HAIGH, M., AND J. A. LIST (2005): "Do Professional Traders Exhibit Myopic Loss Aversion?
An Experimental Analysis,” Journal of Finance, 60, 523–535. [1652] HARRIS, J. R. (1998): The Nurture Assumption: Why Children Turn Out the Way They Do. New York: Free Press. [1655]
HARRISON, G., AND J. A. LIST (2004): "Field Experiments," Journal of Economic Literature, XLII, 1013–1059. [1642]
HENRICH, J., AND F. J. GIL-WHITE (2001): "The Evolution of Prestige: Freely Conferred Deference as a Mechanism for Enhancing the Benefits of Cultural Transmission," Evolution and Human Behavior, 22, 165–196. [1655,1656]
HENRICH, J., AND R. MCELREATH (2002): "Are Peasants Risk-Averse Decision Makers?" Current Anthropology, 42, 172–181. [1653,1656,1657]
(2002a): "Internet Enhancements for 'Are Peasants Risk-Averse Decision-Makers'," available at www.psych.ubc.ca/~henrich/Website/Papers/riskenhancements.pdf. [1653]
HODGSON, D. L. (2000): "Gender, Culture and the Myth of the Patriarchal Pastoralist," in Rethinking Pastoralism in Africa, ed. by D. L. Hodgson. London: James Currey. [1639,1641]
(2001): Once Intrepid Warriors: Gender, Ethnicity, and the Cultural Politics of Maasai Development. Bloomington, IN: Indiana University Press. [1639,1641]
ILIES, R., R. D. ARVEY, AND T. J. BOUCHARD (2006): "Darwinism, Behavioral Genetics, and Organizational Behavior: A Review and Agenda for Future Research," Journal of Organizational Behavior, 27, 121–141. [1655]
KNIGHT, J. (2002): "Sexual Stereotypes," Nature, 415, 254–256. [1654]
LALAND, K. N., AND G. BROWN (2002): Sense and Nonsense: Evolutionary Perspectives on Human Behavior. Oxford, U.K.: Oxford University Press. [1657]
LAWRENCE, P. A. (2006): "Men, Women, and the Ghosts in Science," PLOS Biology, 4, 13–15. [1638]
LESOROGOL, C. K. (2003): "Transforming Institutions Among Pastoralists: Inequality and Land Privatization," American Anthropologist, 105, 531–541. [1641]
LOEHLIN, J. (1993): "Nature, Nurture and Conservatism in the Australian Twin Study," Behavior Genetics, 23, 287–290. [1655]
MADSEN, M. C. (1967): "Cooperative and Competitive Motivation of Children in Three Mexican Subcultures," Psychological Reports, 20, 1307–1320. [1655]
MARI BHAT, P. N. (1989): "Mortality and Fertility in India, 1881–1961: A Reassessment," in India's Historical Demography: Studies in Famine, Disease and Society, ed. by T. Dyson. London: Curzon, 73–118. [1657]
MARTIN, C. L., C. H. WOOD, AND J. K. LITTLE (1990): "The Development of Gender Stereotype Components," Child Development, 61, 1891–1904. [1655]
MESOUDI, A., AND K. N. LALAND (2007): "Culturally Transmitted Paternity Beliefs and the Evolution of Human Mating Behaviour," Proceedings of the Royal Society of London, Ser. B, 274, 1273–1278. [1657]
NAKANE, C. (1967): Garo and Khasi: A Comparative Study in Matrilineal Systems. Paris: Mouton & Co. [1640]
NIEDERLE, M., AND L. VESTERLUND (2005): "Do Women Shy Away From Competition? Do Men Compete too Much?" available at http://www.stanford.edu/~niederle/Women.Competition.pdf. [1637]
(2007): "Do Women Shy Away From Competition? Do Men Compete too Much?" Quarterly Journal of Economics, 122, 1067–1101. [1651]
NONGBRI, T. (1988): "Gender and the Khasi Family Structure: The Meghalaya Succession to Self-Acquired Property Act, 1984," Sociological Bulletin, 7, 71–82. [1640]
(2003): Development, Ethnicity and Gender. Jaipur/New Delhi: Rawat Publications. [1641]
OOTA, H., W. SETTHEETHAM-ISHIDA, D. TIWAWECH, T. ISHIDA, AND M. STONEKING (2001): "Human mtDNA and Y-chromosome Variation Is Correlated With Matrilocal versus Patrilocal Residence," Nature Genetics, 29, 20–21. [1657]
PRITCHETT, L., AND L. H. SUMMERS (1996): "Wealthier Is Healthier," The Journal of Human Resources, 31, 841–866. [1657]
RIDLEY, M. (2003): Nature via Nurture. New York: Harper Collins. [1638]
RILEY BOWLES, H., L. BABCOCK, AND L. LAI (2004): "It Depends Who Is Asking and Who You Ask: Social Incentives for Sex Differences in the Propensity to Initiate Negotiation," available at http://www.ksg.harvard.edu/wappp/research/Bowles_Babcock_Lai.pdf. [1655]
SHAPIRA, A., AND M. C. MADSEN (1969): "Cooperative and Competitive Behavior of Kibbutz and Urban Children in Israel," Child Development, 40, 609–617. [1655]
SHELL, R. (2006): Bargaining for Advantage: Negotiation Strategies for Reasonable People (Second Ed.). New York: Penguin Books. [1638]
SPENCER, P. (1965): The Samburu: A Study of Gerontocracy in a Nomadic Tribe. Berkeley, CA: University of California Press. [1639,1640]
(1994): "Becoming Maasai, Being in Time," in Being Maasai: Ethnicity and Identity in East Africa, ed. by T. Spear and R. Waller. London: James Currey. [1639]
(2003): Time, Space and the Unknown: Maasai Configurations of Power and Providence. London: James Currey. [1640]
TREGENZA, T., AND N. WEDELL (2002): "Polyandrous Females Avoid Costs of Inbreeding," Nature, 415, 71–73. [1654]
TRIVERS, R. L. (1972): "Parental Investment and Sexual Selection," in Sexual Selection and the Descent of Man, ed. by B. Campbell. Chicago: Aldine, 136–177. [1654]
TURKHEIMER, E. (1998): "Heritability and Biological Explanation," Psychological Review, 105, 782–791. [1638]
(2004): "Spinach and Ice Cream: Why Social Science Is so Difficult," in Behavior Genetics Principles: Perspectives in Development, Personality, and Psychopathology, ed. by L. F. DiLalla. Washington, DC: American Psychological Association. [1655]
TURKHEIMER, E., A. HALEY, M. WALDRON, B. D'ONOFRIO, AND I. I. GOTTESMAN (2003): "Socioeconomic Status Modifies Heritability of IQ in Young Children," Psychological Science, 14, 623–628. [1638]
VANDEGRIFT, D., A. YAVAS, AND P. BROWN (2004): "Men, Women and Competition: An Experimental Test of Labor Market Behavior," Mimeo. [1637]
VAN HAM, P. (2000): The Seven Sisters of India: Tribal Worlds Between Tibet and Burma. Munich, London, and New York: Prestel Publishers. [1640]
WHITING, B. B. (1963): Six Cultures: Studies of Child Rearing. New York: Wiley. [1655]
Rady School of Management, University of California–San Diego, Otterson Hall, 9500 Gilman Dr., La Jolla, CA 92093-0553, U.S.A.;
[email protected], University of Maryland, 2200 Symons Hall, College Park, MD 20783, U.S.A.;
[email protected], and Dept. of Economics, University of Chicago, 1126 East 59th Street, Chicago, IL 60637, U.S.A., NBER, and Dept. of Economics, CentER, PO Box 90153, 5000 LE Tilburg, The Netherlands;
[email protected]. Manuscript received September, 2006; final revision received December, 2008.
Econometrica, Vol. 77, No. 5 (September, 2009), 1665–1682
NOTES AND COMMENTS

INFERENCE IN DYNAMIC DISCRETE CHOICE MODELS WITH SERIALLY CORRELATED UNOBSERVED STATE VARIABLES

BY ANDRIY NORETS1

This paper develops a method for inference in dynamic discrete choice models with serially correlated unobserved state variables. Estimation of these models involves computing high-dimensional integrals that are present in the solution to the dynamic program and in the likelihood function. First, the paper proposes a Bayesian Markov chain Monte Carlo estimation procedure that can handle the problem of multidimensional integration in the likelihood function. Second, the paper presents an efficient algorithm for solving the dynamic program suitable for use in conjunction with the proposed estimation procedure.

KEYWORDS: Dynamic discrete choice models, Bayesian estimation, MCMC, nearest neighbors, random grids.
1. INTRODUCTION

DYNAMIC DISCRETE CHOICE MODELS (DDCMs) describe the behavior of a forward-looking economic agent who chooses between several alternatives repeatedly over time. Estimation of the deep structural parameters of these models is a theoretically appealing and promising area in empirical economics. One important feature of DDCMs that was often assumed away in the literature due to computational difficulties is serial correlation in unobserved state variables. Ability, productivity, health status, taste idiosyncrasies, and many other unobservables are, however, likely to be persistent over time. This paper develops a computationally attractive method for inference in DDCMs with serially correlated unobservables.

Advances in simulation methods and computing speed over the last two decades made the Bayesian approach to statistical inference practical. Bayesian methods are now applied to many problems in statistics and econometrics that are difficult to tackle by the classical approach. Static discrete choice models and, more generally, models with latent variables, are one of those areas where the Bayesian approach was particularly fruitful; see for example Albert and Chib (1993), McCulloch and Rossi (1994), and Geweke, Keane, and Runkle (1994). Similarly to the static case, the likelihood function for a DDCM can be thought of as an integral over latent variables (the unobserved state

1 I am grateful to John Geweke, my dissertation advisor, for guidance and encouragement through the work on this project. I thank Charles Whiteman, Luke Tierney, and Jeffrey Rosenthal for helpful suggestions. I also thank Latchezar Popov for reading and discussing some of the proofs in this paper. Comments by Elena Pastorino, Yaroslav Litus, Marianna Kudlyak, a co-editor, and anonymous referees helped to improve the manuscript. I acknowledge financial support from the Economics Department graduate fellowship and the Seashore Dissertation fellowship at the University of Iowa. All remaining errors are mine.
© 2009 The Econometric Society
DOI: 10.3982/ECTA7292
variables). If the unobservables are serially correlated, computing this integral is very hard. A Markov chain Monte Carlo (MCMC) algorithm is employed in this paper to handle this issue.

An important obstacle for Bayesian estimation of DDCMs is the computational burden of solving the dynamic program (DP) at each iteration of the estimation procedure. Imai, Jain, and Ching (2005), from now on IJC, were the first to attack this problem and consider application of Bayesian methods for estimation of DDCMs. Their method uses an MCMC algorithm that solves the DP and estimates the parameters at the same time. The Bellman equation is iterated only once for each draw of the parameters. To obtain the approximations of the expected value functions for the current MCMC draw of the parameters, the authors used kernel smoothing over the approximations of the value functions from the previous MCMC iterations.

This paper extends the work of IJC in several dimensions. IJC employed MCMC to "solve the DP problem and estimate the parameters simultaneously" rather than handle more flexible specifications for unobservables. Their theory does not apply to a Gibbs sampler that includes blocks for simulating unobservables. In contrast, I develop an algorithm that applies MCMC to handle serially correlated unobservables and possibly other interesting forms of heterogeneity that would lead to hard integration problems in computing the likelihood function. Second, the algorithm developed in this paper can be applied to more general DDCMs: models with infinite state space and random state transitions (IJC's algorithm works for finite state space and deterministic transitions for all state variables except independent and identically distributed (i.i.d.) errors). I achieve this more general applicability of the algorithm in part by using nearest neighbors instead of the kernel smoothing used by IJC. Also, in addition to approximating the value function in the parameter space, an algorithm for solving the DP has to deal with an integration problem for computing the expectations of the value functions. My prescriptions for handling this integration problem differ from IJC's. Finally, this paper develops theory that justifies statistical inference made on the basis of the algorithm's output. In the Bayesian framework, most inference exercises involve computing posterior expectations of some functions. IJC showed that the last draw from their algorithm will converge in distribution to the posterior. I show that sample averages from my algorithm can be used to approximate posterior expectations, and this is exactly how MCMC output is used in practice.

The proposed method was experimentally evaluated on two different DDCMs: a binary choice model of optimal bus engine replacement (Rust (1987)) and a model of medical care use and work absence (Gilleskie (1998)). Experiments are excluded from this paper for brevity. They can be found in Norets (2007, 2008). In summary, experiments demonstrate that ignoring serial correlation in unobservables of DDCMs can lead to serious misspecification errors and that the proposed method for handling serially correlated unobservables is feasible, accurate, and reliable.
The paper is organized as follows. Section 2 describes setup and estimation of a general DDCM. The algorithm for solving the DP and corresponding convergence results are presented in Sections 3 and 4. Proofs of the theoretical results can be found in the Supplemental Material (Norets (2009b)).

2. SETUP AND ESTIMATION OF DDCMS

Eckstein and Wolpin (1989), Rust (1994), and Aguirregabiria and Mira (2007) surveyed the literature on estimation of DDCMs. Below, I introduce a general model setup and emphasize possible advantages of the Bayesian approach to the estimation of these models, especially in treating the time dependence in unobservables. I also briefly discuss the most relevant previous research. Under weak regularity conditions (see, e.g., Rust (1994)), a DDCM can be described by the Bellman equation

(1)   V(s_t; \theta) = \max_{d_t \in D} V(s_t, d_t; \theta),
where V(s_t, d_t; \theta) = u(s_t, d_t; \theta) + \beta E\{V(s_{t+1}; \theta) \mid s_t, d_t; \theta\} is an alternative-specific value function, u(s_t, d_t; \theta) is a per-period utility function, s_t \in S is a vector of state variables, d_t is a control from a finite set D, \theta \in \Theta is a vector of parameters, \beta is a time discount factor, and V(s_t; \theta) is a value function or lifetime utility of the agent. The state variables are assumed to evolve according to a controlled first order Markov process with a transition law denoted by f(s_{t+1} \mid s_t, d_t; \theta) for t \geq 1; the distribution of the initial state is denoted by f(s_1 \mid \theta). This formulation embraces a finite horizon case if time t is included in the vector of the state variables.

In estimable DDCMs, some state variables, denoted here by y_t, are assumed to be unobserved by econometricians. The observed states are denoted by x_t. All the state variables s_t = (x_t, y_t) are known to the agent at time t. Examples of the unobserved state variables include taste idiosyncrasy, health status, ability, and returns to patents. The unobservables play an important role in the estimation. The likelihood function is a product of integrals over the unobservables:

(2)   p(x, d \mid \theta) = \prod_{i=1}^{I} \int p(y_{T_i i}, x_{T_i i}, d_{T_i i}, \ldots, y_{1i}, x_{1i}, d_{1i} \mid \theta) \, d(y_{T_i i} \cdots y_{1i}),

where (x, y, d) = \{x_{ti}, y_{ti}, d_{ti}\}_{t=1}^{T_i}, i \in \{1, \ldots, I\}, I is the number of the observed individuals, T_i is the number of time periods individual i is observed, and

p(y_{T_i i}, x_{T_i i}, d_{T_i i}, \ldots, y_{1i}, x_{1i}, d_{1i} \mid \theta) = \prod_{t=1}^{T_i} p(d_{ti} \mid y_{ti}, x_{ti}; \theta) f(x_{ti}, y_{ti} \mid x_{t-1,i}, y_{t-1,i}, d_{t-1,i}; \theta),
f(\cdot \mid \cdot; \theta) is the state transition density, \{x_{0i}, y_{0i}, d_{0i}\} = \emptyset, and p(d_{ti} \mid y_{ti}, x_{ti}; \theta) is the choice probability conditional on all state variables. In general, evaluation of the likelihood function in (2) involves computing multidimensional integrals of an order equal to T_i times the number of components in y_t, which becomes very difficult for large T_i and/or multidimensional unobservables y_t. That is why in previous literature the unobservables were often assumed to be i.i.d. In a series of papers, Rust developed a dynamic multinomial logit model, where he assumed that the utility function of the agents is additively separable in the unobservables and that the unobservables are extreme value i.i.d. In this case, the integration in (2) can be performed analytically. Pakes (1986) used Monte Carlo simulations to approximate the likelihood function in a model of binary choice with a serially correlated one-dimensional unobservable. More recently, several authors estimated models with particular forms of serial correlation in unobservables by adopting the method of Keane and Wolpin (1994), which uses Monte Carlo simulations to compute the likelihood and interpolating regressions to speed up the solution to the DP.2 Even for DDCMs with special forms of serial correlation that reduce the dimension of integration in (2), estimation is still very hard.

In this paper, I propose a computationally attractive Bayesian approach to estimation of DDCMs with serial correlation in unobservables. In the Bayesian framework, the high-dimensional integration over y_t for each parameter value can be circumvented by employing Gibbs sampling and data augmentation. In models with latent variables, the Gibbs sampler typically has two types of blocks: (a) parameters conditional on other parameters, latent variables, and the data; (b) latent variables conditional on other latent variables, parameters, and the data (this step is called data augmentation). Draws from this Gibbs sampler form a Markov chain with the stationary distribution equal to the joint distribution of the parameters and the latent variables conditional on the data. The densities for both types of blocks are proportional to the joint density of the data, the latent variables, and the parameters. Therefore, to construct the Gibbs sampler, we need to be able to evaluate the joint density of the data, the latent variables, and the parameters. For a textbook treatment of these ideas, see Chapter 6 in Geweke (2005). It is straightforward to obtain an analytical expression for the joint density of the data, the latent variables, and the parameters under the parameterization

2 For example, Erdem and Keane (1996) estimated a model in which consumer perceptions of products are modelled by a sum of a parameter and an i.i.d. component, and thus are serially correlated. Consumer product usage requirements are modelled similarly in Erdem, Imai, and Keane (2003). In Sullivan (2006), a job match-specific wage draw persists for the duration of a match. In Keane and Wolpin (2006), women draw from husbands' earnings distribution and the draw stays fixed for the duration of the match. It is also common to allow for serial correlation in unobservables induced by latent types (see, for example, Keane and Wolpin (1997)). I thank an anonymous referee for bringing these references to my attention.
of the Gibbs sampler in which the unobserved state variables are directly used as the latent variables in the sampler:

(3)   p(\theta, x, d, y) = p(\theta) \prod_{i=1}^{I} \prod_{t=1}^{T_i} p(d_{ti} \mid x_{ti}, y_{ti}; \theta) \, f(x_{ti}, y_{ti} \mid x_{t-1,i}, y_{t-1,i}, d_{t-1,i}; \theta),

where p(d_{ti} \mid x_{ti}, y_{ti}; \theta) = 1_{\{V(y_{ti}, x_{ti}, d_{ti}; \theta) \geq V(y_{ti}, x_{ti}, d; \theta)\ \forall d \in D\}}(y_{ti}, x_{ti}, d_{ti}; \theta) is an indicator function and p(\theta) is a prior density for the parameters. In this Gibbs sampler, the conditional density of a parameter given the data, the rest of the parameters, and the latent variables will be proportional to (3). Since (3) includes a product of indicator functions p(d_{ti} \mid y_{ti}, x_{ti}; \theta), in this Gibbs sampler, the distributions for parameter blocks will be truncated to a region defined by inequality constraints that are nonlinear in \theta:

(4)   V(y_{ti}, x_{ti}, d_{ti}; \theta) \geq V(y_{ti}, x_{ti}, d; \theta) \quad \forall d \in D,\ \forall t \in \{1, \ldots, T_i\},\ \forall i \in \{1, \ldots, I\}.
For realistic sample sizes, the number of these constraints is very large (there is one constraint for every alternative, time period, and individual) and the algorithm is impractical; for example, parameter draws from an acceptance sampling algorithm never got accepted in experiments with a sample size of more than 100 observations. The same situation occurs under the parameterization in which u_{tdi} = u(y_{ti}, x_{ti}, d_{ti}; \theta) are used as the latent variables in the sampler instead of some or all of the components of y_{ti}.

The complicated truncation region (4) in drawing the parameter blocks could be avoided if we use V_{ti} = \{V_{tdi} = V(s_{ti}, d; \theta), d \in D\} as latent variables in the sampler. Under this parameterization, the joint density of the data, the latent variables, and the parameters (needed for construction of the Gibbs sampler) does not have a convenient analytical form because V_{tdi} depends on other unobservables through the expected value function, which can only be approximated numerically. In general, even evaluation of a kernel of this distribution is not easy. However, under some reasonable assumptions on the unobservables, a feasible Gibbs sampler can be constructed. In particular, let us assume that the unobserved part of the state vector includes some components that do not affect the distribution of the future state. Let us denote them by \nu_t and denote the other (possibly serially correlated) components by \varepsilon_t; so, y_t = (\nu_t, \varepsilon_t). This assumption means that the transition law f(x_{t+1}, \nu_{t+1}, \varepsilon_{t+1} \mid x_t, \varepsilon_t, d; \theta) and thus the expected value function E\{V(s_{t+1}; \theta) \mid s_t, d; \theta\} do not depend on \nu_t. The presence of \nu_t is well justified in an estimable model. If the support of these unobservables is sufficiently large and if they enter the utility function in a particular way, then the econometric model will be consistent with any possible sequence of observed choices (specification for unobservables is then
called saturated (Rust (1994, p. 3102))). If, in contrast, all the unobservables do affect the expected value function E\{V(s_{t+1}; \theta) \mid s_t, d; \theta\}, then the desirable saturation property might not hold or be very difficult to establish.

Since the expected value function E\{V(s_{t+1}; \theta) \mid s_t, d; \theta\} does not depend on \nu_t, the alternative-specific value functions V_{ti} = \{u(\nu_{ti}, \varepsilon_{ti}, x_{ti}, d; \theta) + \beta E[V(s_{t+1}; \theta) \mid \varepsilon_{ti}, x_{ti}, d; \theta], d \in D\} will depend on \nu_{ti} only through u(\nu_{ti}, \varepsilon_{ti}, x_{ti}, d; \theta). The per-period utility u(\cdot) and the distribution for \nu_{ti} can be specified in such a way that p(V_{ti} \mid \theta, x_{ti}, \varepsilon_{ti}) has a convenient analytical expression (or at least a quickly computable density kernel). In this case, a marginal conditional decomposition of the joint distribution of the data, the parameters, and the latent variables will consist of parts with analytical or easily computable expressions. Construction of the Gibbs sampler in this case is illustrated by the following example.

EXAMPLE 1—A Model of Optimal Bus Engine Replacement (Rust (1987)): In this model, a maintenance superintendent of a transportation company decides every time period whether to replace an engine for each bus in the company's fleet. The observed state variable is bus mileage x_t since the last engine replacement. The per-period utility is the negative of per-period costs. If the engine is not replaced at time t, then u(x_t, \varepsilon_t, \nu_t, d_t = 1; \alpha) = \alpha_1 x_t + \varepsilon_t; otherwise, u(x_t, \varepsilon_t, \nu_t, d_t = 2; \alpha) = \alpha_2 + \nu_t, where \varepsilon_t and \nu_t are the unobserved state variables, \alpha_1 is the negative of per-period maintenance costs per unit of mileage, and \alpha_2 is the negative of the costs of engine replacement. The bus mileage is discretized into M = 90 intervals, X = \{1, \ldots, M\}. The change in the mileage (x_{t+1} - x_t) evolves according to a multinomial distribution on \{0, 1, 2\} with parameters \eta = (\eta_1, \eta_2, \eta_3).

Rust assumed that \varepsilon_t and \nu_t are extreme value i.i.d. Under this assumption, the integrals over y_t = (\varepsilon_t, \nu_t) in the Bellman equation (1) and in the likelihood function (2) can be computed analytically. Rust used the maximum likelihood method to estimate the model. Since the expression for the likelihood function involves the expected value functions, Rust's algorithm solves the DP numerically on each iteration of the estimation procedure. Rust's assumptions on unobservables considerably reduce computational burden. However, it is reasonable to expect that engine-specific maintenance costs represented by \varepsilon_t are serially correlated. Thus, one could assume \nu_t is i.i.d. N(0, h_\nu^{-1}) truncated to an interval [-\bar{\nu}, \bar{\nu}], \varepsilon_t is N(\rho\varepsilon_{t-1}, h_\varepsilon^{-1}) truncated to E = [-\bar{\varepsilon}, \bar{\varepsilon}], and \varepsilon_0 = 0. When \varepsilon_t is serially correlated, the dimension of integration in the likelihood function can exceed 200 for Rust's data. It would be very hard to compute these integrals on each iteration of an estimation procedure. The Gibbs sampler with data augmentation described below can handle this problem.

Each bus/engine i is observed for T_i time periods: \{x_{ti}, d_{ti}\}_{t=1}^{T_i} for i = 1, \ldots, I. When the engine is replaced, the state is reinitialized: x_{t-1} = 1, \varepsilon_{t-1} = 0. Therefore, a bus with a replaced engine can be treated as a separate
observation. The parameters are \theta = (\alpha, \eta, \rho, h_\varepsilon); h_\nu is fixed for normalization. The latent variables are \{V_{ti}, \varepsilon_{ti}\}_{t=1}^{T_i}, i = 1, \ldots, I, where V_{ti} = x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} - \nu_{ti} + F_{ti}(\theta, \varepsilon_{ti}) and F_{ti}(\theta, \varepsilon) = \beta(E[V(x', \nu', \varepsilon'; \theta) \mid \varepsilon, x_{ti}, d_{ti} = 1; \theta] - E[V(x', \nu', \varepsilon'; \theta) \mid \varepsilon, x_{ti}, d_{ti} = 2; \theta]). A compact space for parameters (required by the theory in the following sections) \Theta is defined as \alpha_i \in [-\bar{\alpha}, \bar{\alpha}], \rho \in [-\bar{\rho}, \bar{\rho}], h_\varepsilon \in [h_l, h_r], and \eta belongs to a two-dimensional simplex. The joint distribution of the data, the parameters, and the latent variables is
(5)   p(\theta; \{x_{ti}, d_{ti}; V_{ti}, \varepsilon_{ti}\}_{t=1}^{T_i}; i = 1, \ldots, I) = p(\theta) \prod_{i=1}^{I} \prod_{t=1}^{T_i} p(d_{ti} \mid V_{ti}) \, p(V_{ti} \mid x_{ti}, \varepsilon_{ti}; \theta) \, p(x_{ti} \mid x_{t-1,i}, d_{t-1,i}; \eta) \, p(\varepsilon_{ti} \mid \varepsilon_{t-1,i}, \rho, h_\varepsilon),

where p(\theta) is a prior, p(x_{ti} \mid x_{t-1,i}, d_{t-1,i}; \eta) = \eta_{x_{ti} - x_{t-1,i} + 1}, p(d_{ti} \mid V_{ti}) = 1_{\{d_{ti} = 1, V_{ti} \geq 0 \text{ or } d_{ti} = 2, V_{ti} < 0\}},

p(\varepsilon_{ti} \mid \varepsilon_{t-1,i}, \theta) = \frac{h_\varepsilon^{1/2} \exp\{-0.5 h_\varepsilon (\varepsilon_{ti} - \rho\varepsilon_{t-1,i})^2\}}{\sqrt{2\pi}\,[\Phi([\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5}) - \Phi([-\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5})]} \, 1_E(\varepsilon_{ti}),

and

(6)   p(V_{ti} \mid x_{ti}, \varepsilon_{ti}; \theta) = \exp\{-0.5 h_\nu (V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})])^2\} \cdot 1_{[-\bar{\nu}, \bar{\nu}]}(V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})])

(7)   \times \frac{h_\nu^{0.5}}{\sqrt{2\pi}\,[\Phi(\bar{\nu} h_\nu^{0.5}) - \Phi(-\bar{\nu} h_\nu^{0.5})]},

where \Phi denotes the standard normal cdf.
Densities for Gibbs sampler blocks will be proportional to the joint distribution in (5). In this Gibbs sampler the observed choice optimality constraints do not involve parameters and affect only blocks for simulating Vti | · · ·, which will have a normal truncated distribution proportional to (6) and (7), and also truncated to R+ if dti = 1 or to R− otherwise. Efficient algorithms for simulating from truncated normal distributions are readily available; see, for example, Geweke (1991).
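For concreteness, the following sketch shows how a draw of V_{ti} from the full conditional implied by (6)-(7) could be implemented. This is an illustrative stand-in, not the paper's implementation: the function and argument names are invented for this example, and scipy's generic truncated normal sampler is used in place of the specialized routines of Geweke (1991).

```python
import numpy as np
from scipy.stats import truncnorm

def draw_V(mu, h_nu, nu_bar, d, rng):
    """Draw V_ti from a normal with mean mu = x*alpha1 - alpha2 + eps + F
    and precision h_nu, truncated to [mu - nu_bar, mu + nu_bar] and further
    restricted to R+ if d == 1 (keep engine) or R- if d == 2 (replace)."""
    sd = 1.0 / np.sqrt(h_nu)
    lo, hi = mu - nu_bar, mu + nu_bar
    if d == 1:
        lo = max(lo, 0.0)
    else:
        hi = min(hi, 0.0)
    if lo >= hi:  # cannot occur if nu_bar is large enough for the data
        raise ValueError("empty truncation region")
    a, b = (lo - mu) / sd, (hi - mu) / sd  # bounds in standardized units
    return truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)

# e.g., draw_V(mu=0.3, h_nu=1.0, nu_bar=5.0, d=1, rng=np.random.default_rng(0))
```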
The density for \varepsilon_{ti} \mid \cdots is

(8)   p(\varepsilon_{ti} \mid \cdots) \propto \frac{\exp\{-0.5 h_\nu (V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})])^2\}}{\Phi([\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5}) - \Phi([-\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5})} \cdot 1_{[-\bar{\nu}, \bar{\nu}]}(V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})]) \cdot \exp\{-0.5 h_\varepsilon (\varepsilon_{t+1,i} - \rho\varepsilon_{ti})^2 - 0.5 h_\varepsilon (\varepsilon_{ti} - \rho\varepsilon_{t-1,i})^2\} \cdot 1_E(\varepsilon_{ti}).
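As the next paragraph explains, only a kernel of (8) needs to be evaluated, so a Metropolis-within-Gibbs step can target it. Below is a minimal sketch of such a step (mine, not the paper's implementation): `log_kernel_eps` is a hypothetical function returning the log of the kernel in (8) with the approximated F_{ti} plugged in, and a truncated normal random-walk proposal is used here as a simple alternative to the proposal proportional to (8) suggested in the text.

```python
import numpy as np
from scipy.stats import truncnorm

def mh_update_eps(eps_old, eps_bar, step_sd, log_kernel_eps, rng):
    """One Metropolis-within-Gibbs step for eps_ti on E = [-eps_bar, eps_bar]."""
    a, b = (-eps_bar - eps_old) / step_sd, (eps_bar - eps_old) / step_sd
    prop = truncnorm.rvs(a, b, loc=eps_old, scale=step_sd, random_state=rng)
    # Proposal densities are asymmetric because the truncation bounds move
    # with the current point, so both directions enter the acceptance ratio.
    q_fwd = truncnorm.logpdf(prop, a, b, loc=eps_old, scale=step_sd)
    a2, b2 = (-eps_bar - prop) / step_sd, (eps_bar - prop) / step_sd
    q_bwd = truncnorm.logpdf(eps_old, a2, b2, loc=prop, scale=step_sd)
    log_alpha = log_kernel_eps(prop) - log_kernel_eps(eps_old) + q_bwd - q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return prop  # accept the proposed value
    return eps_old   # reject: keep the current value
```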
Direct simulation from \varepsilon_{ti} \mid \cdots could be difficult. However, the kernel of this density can be evaluated numerically (approximations to F_{ti}(\theta, \varepsilon_{ti}) are discussed in the next section). Therefore, a Metropolis-within-Gibbs3 algorithm can be used for this Gibbs sampler block. A convenient transition density for this Metropolis-within-Gibbs step is a truncated normal density proportional to (8).

Assuming a normal prior N(\underline{\rho}, h_\rho^{-1}) truncated to [-\bar{\rho}, \bar{\rho}],

(9)   p(\rho \mid \cdots) \propto \exp\Big\{-0.5 h_\nu \sum_{i,t} (V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})])^2\Big\} \cdot \prod_{i,t} \frac{1}{\Phi([\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5}) - \Phi([-\bar{\varepsilon} - \rho\varepsilon_{t-1,i}] h_\varepsilon^{0.5})} \cdot \prod_{i,t} 1_{[-\bar{\nu}, \bar{\nu}]}(V_{ti} - [x_{ti}\alpha_1 - \alpha_2 + \varepsilon_{ti} + F_{ti}(\theta, \varepsilon_{ti})]) \cdot \exp\{-0.5 \hat{h}_\rho (\rho - \hat{\rho})^2\} \cdot 1_{[-\bar{\rho}, \bar{\rho}]}(\rho),

where \hat{h}_\rho = h_\rho + h_\varepsilon \sum_i \sum_{t=2}^{T_i} \varepsilon_{t-1,i}^2 and \hat{\rho} = \hat{h}_\rho^{-1}(h_\rho \underline{\rho} + h_\varepsilon \sum_i \sum_{t=2}^{T_i} \varepsilon_{ti}\varepsilon_{t-1,i}). A Metropolis-within-Gibbs algorithm with truncated normal transition density proportional to (9) can be used for this Gibbs sampler block. Blocks for other parameters can be constructed in a similar way; see Norets (2007).

3 To produce draws from some target distribution, the Metropolis or Metropolis–Hastings MCMC algorithm only needs values of a kernel of the target density. The draws are simulated from a transition density and they are accepted with probability that depends on the values of the target density kernel and the transition density. For more details, see, for example, Chib and Greenberg (1995).

The Gibbs sampler presented in this example can be generalized and applied to different models with other interesting forms of heterogeneity such as individual-specific parameters. Also, components of \nu_t do not have to enter the utility function linearly. The essential requirement is the ability to evaluate a kernel of p(V_{ti} \mid \theta, x_{ti}, \varepsilon_{ti}) quickly. The Gibbs sampler outlined above requires computing the expected value functions for each new parameter draw \theta^m from
the MCMC iteration m and each observation in the sample. The following section describes how the approximations of the expected value functions can be efficiently obtained.

3. ALGORITHM FOR SOLVING THE DP

For a discussion of methods for solving the DP for a given parameter vector \theta, see Rust (1996). Below, I introduce a method of solving the DP suitable for use in conjunction with the Bayesian estimation of a general DDCM. This method uses an idea from Imai, Jain, and Ching (2005): to iterate the Bellman equation only once at each step of the estimation procedure and use information from previous steps to approximate the expectations in the Bellman equation. However, the way the previous information is used differs for the two methods. A detailed comparison is given in Section 3.2.

3.1. Algorithm Description

In contrast to conventional value function iteration, this algorithm iterates the Bellman equation only once for each parameter draw. First, I will describe how the DP solving algorithm works and then how the output of the DP solving algorithm is used to approximate the expected value functions in the Gibbs sampler. The DP solving algorithm takes a sequence of parameter draws \theta^m, m = 1, 2, \ldots, as an input from the Gibbs sampler, where m denotes the Gibbs sampler iteration. For each \theta^m, the algorithm generates random states s^{mj} \in S, j = 1, \ldots, \hat{N}(m). At each random state, the approximations of the value functions \hat{V}^m(s^{mj}; \theta^m) are computed by iterating the Bellman equation once. At this one iteration of the Bellman equation, the future expected value functions are computed by importance sampling over value functions \hat{V}^k(s^{kj}; \theta^k) from previous iterations k < m.

The random states s^{mj} are generated from a density g(\cdot) > 0 on S. This density g(\cdot) is used as an importance sampling source density in approximating the expected value functions. The collection of the random states \{s^{mj}\}_{j=1}^{\hat{N}(m)} will be referred to below as the random grid. (Rust (1997) showed that value function iteration on random grids from a uniform distribution breaks the curse of dimensionality for DDCMs.) The number of points in the random grid at iteration m is denoted by \hat{N}(m) and will be referred to below as the size of the random grid (at iteration m).

For each point in the current random grid s^{mj}, j = 1, \ldots, \hat{N}(m), the approximation of the value function \hat{V}^m(s^{mj}; \theta^m) is computed according to

(10)   \hat{V}^m(s; \theta) = \max_{d \in D}\big\{u(s, d; \theta) + \beta\hat{E}^{(m)}[V(s'; \theta) \mid s, d; \theta]\big\}.
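Schematically, one step of the DP solving algorithm per Gibbs iteration might look as follows. This is an illustrative sketch, not the author's code: `g_sample` and `utility` are assumed helper functions, and `expected_value_approx` is the importance-sampling routine sketched after equation (13) below.

```python
def dp_step(theta_m, history, grid_size, g_sample, utility, beta, D, **ev_kwargs):
    """One Bellman iteration (10) on a fresh random grid for draw theta_m.

    history: list of (theta_k, states_k, values_k) from earlier iterations.
    g_sample(): draws one random state from the grid density g(.).
    utility(s, d, theta): per-period utility u(s, d; theta)."""
    states = [g_sample() for _ in range(grid_size)]  # random grid, size N-hat(m)
    values = [
        max(utility(s, d, theta_m)
            + beta * expected_value_approx(theta_m, s, d, history, **ev_kwargs)
            for d in D)
        for s in states
    ]
    # The algorithm keeps only the last N(m) history entries ("forgetting");
    # that truncation is omitted here for brevity.
    history.append((theta_m, states, values))
    return states, values
```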
1674
ANDRIY NORETS
Not all of the previously computed value functions V k (skj ; θk ), k < m, are used in importance sampling for computing Eˆ (m) [V (s ; θ)|s d; θ] in (10). To converge, the algorithm has to forget the remote past. Thus, at each iteration m, I keep track only of the history of length N(m): {θk ; skj V k (skj ; θk ) m−1 ˆ ˜ j = 1 N(k)} k=m−N(m) . In this history, I find N(m) closest to θ parameter draws. Only the value functions corresponding to these nearest neighbors are used in importance sampling. Formally, let {k1 kN(m) } be the ˜ iteration numbers of the nearest neighbors of θ in the current history: k1 = arg mini∈{m−N(m)m−1} θ − θi and (11)
kj =
arg min i∈{m−N(m)m−1}\{k1 kj−1 }
θ − θi
˜ j = 2 N(m)
If the arg min returns a multivalued result, I use the lexicographic order for (θi − θ) to decide which θi is chosen first. If the result of the lexicographic selection is also multivalued, θi = θj , then I choose θi over θj if i > j. This particular way to resolve the multivaluedness of the arg min might seem irrelevant for implementing the method in practice; however, it is used in the proof of the measurability of the supremum of the approximation error, which is necessary for the uniform convergence results. A reasonable choice for the norm in (11) would be θ = θT H θ θ, where H θ is the prior precision for the parameters. Importance sampling is performed as (12)
Eˆ (m) [V (s ; θ)|s d; θ] ˆ i) ˜ N(k N(m)
=
V ki (ski j ; θki ) · f (ski j | s d; θ)/g(ski j ) i=1
j=1
ˆ r) ˜ N(k N(m)
r=1
f (skr q | s d; θ)/g(skr q )
q=1
ˆ i) ˜ N(k N(m)
(13)
=
i=1
V ki (ski j ; θki )Wki jm (s d θ)
j=1
The target density for importance sampling is the state transition density f (·|s d; θ). The source density is the density g(·) from which the random grid on the state space is generated. In general, g(·) should give reasonably high probabilities to all parts of the state space that are likely under f (·|s d; θ) with reasonable values of the parameter θ. To reduce the variance of the approximation of expectations produced by importance sampling, one should make g(·) relatively high for the states that result in large absolute values for value functions (g(s ) that minimizes the variance of the importance sampling approximation to the expectation is proportional to |V (s ; θ)f (s |s d; θ)|).
INFERENCE IN DDCMS WITH CORRELATED UNOBSERVABLES
1675
Section 3.3 formally presents the assumptions on model primitives and reˆ ˜ strictions on g(·), N(m), N(m), and N(m) that are sufficient for algorithm convergence. After V m (smj ; θm ) are computed from (10) and (12), they can be used in a formula similar to (12) to obtain the approximations of the expectations m E[V (st+1 ; θm )|xti m ti d; θ ] on iteration m of the Gibbs sampler. 3.2. Comparison With Imai, Jain, and Ching (2005) An algorithm for solving the DP has to deal with an integration problem for computing the expectations of the value functions in addition to approximating the value function in the parameter space. My prescriptions for handling this integration problem differ from IJC’s. IJC used kernel smoothing over all N(m) previously computed value functions to approximate the expected value functions. They also generated only one new state at each iteraˆ tion, N(m) = 1 ∀m. For a finite observed state space, deterministic transitions for the observed states, and i.i.d. unobservables IJC proved convergence of their DP solution approximations. To handle compact state space and random ˆ state transitions I introduce growing random grids: N(m) increases with m. A fixed random grid size that works for IJC’s i.i.d. errors does not seem to be enough for general random transitions. When the size of the random grid grows, the nearest neighbor (NN) algorithm that I use to approximate value functions in the parameter space is computationally much more efficient than the kernel smoothing used by IJC. The computational advantage of using the NN algorithm in this case stems from the fact that importance sampling over the random grids has to be performed only for a few nearest neighbors and not for the whole tracked history of length N(m). The convergence results I obtain are also stronger. IJC proved uniform convergence in probability for their DP solution approximations. For the NN algorithm, I establish complete uniform convergence, which implies uniform a.s. convergence. Furthermore, the NN algorithm easily accommodates more than one iteration of the Bellman equation for each parameter draw to improve the approximation precision in practice. Overall, the nearest neighbors method is not just a substitute for kernel smoothing that might work better in higher dimensions (see, e.g., Scott (1992, pp. 189–190)), but an essential part of the algorithm that, in conjunction with random grids, makes it computationally efficient and applicable to more general model specifications. 3.3. Theoretical Results The following assumptions on the model primitives and the algorithm parameters are made: ASSUMPTION 1: Θ ⊂ RJΘ and S ⊂ RJS are compact, and β ∈ (0 1) is known.
1676
ANDRIY NORETS
The assumption of compactness of the parameter space is standard in econometrics. Fixing β is also a usual practice in the literature on estimation of DDCMs. ASSUMPTION 2: u(s d; θ) is continuous in (θ s) (and thus bounded on compacts). ASSUMPTION 3: f (s |s d; θ) is continuous in (θ s s ) and g(s) is continuous in s. Discrete state variables can be accommodated by defining densities with respect to the counting measure. Assumptions 1–3 imply continuity of V (s; θ) in (θ s) (see Proposition 4 in the Supplemental Material (Norets (2009b)) or Norets (2009a) for more general results). ASSUMPTION 4: The density of the state transition f (·|·) and the importance sampling density g(·) are bounded above and away from zero, which gives inf f (s |s d; θ)/g(s ) ≥ f > 0
θs sd
and
sup f (s |s d; θ)/g(s ) ≤ f < ∞
θs sd
Assumption 4 can be relaxed. The support of the transition density can be allowed to depend on the decision d and the discrete state variables if they take a finite number of values. Deterministic transitions for discrete state variables and, in some cases, for continuous state variables (e.g., setting t = 0 when dt = 2 in Rust’s engine replacement model) can be accommodated. Corollaries 1 and 2 below describe changes in the DP solving algorithm required to relax Assumption 4. ˆ ASSUMPTION 5: ∃δˆ > 0 such that P(θm+1 ∈ A|ωm ) ≥ δλ(A) for any Borel m measurable A ⊂ Θ, any m, and any feasible history ω = {ω1 ωm }, where λ is the Lebesgue measure. The history includes all the parameter and latent variable draws from the Gibbs sampler and all the random grids from the DP solving ˆ algorithm: ωt = {θt V t t ; stj j = 1 N(t)}. Assumption 5 means that at each iteration of the algorithm, the parameter draw can get into any part of Θ. This assumption should be verified for each specific DDCM and the corresponding parameterization of the Gibbs sampler. The assumption is only a little stronger than standard conditions for convergence of the Gibbs sampler; see Corollary 4.5.1 in Geweke (2005). Since a careful practitioner of MCMC would have to establish convergence of the Gibbs sampler, a verification of Assumption 5 should not require much extra effort.
INFERENCE IN DDCMS WITH CORRELATED UNOBSERVABLES
1677
˜ = [t γ2 ], ASSUMPTION 6: Let 1 > γ0 > γ1 > γ2 ≥ 0 and N(t) = [t γ1 ], N(t) γ −γ ˆ ˆ = 1, where [x] is the integer part of x. N(t) = [t 1 2 ], and N(0) Multiplying the functions of t in Assumption 6 by positive constants will not affect any of the theoretical results below. THEOREM 1: Under Assumptions 1–6, the approximation to the expected value function in (12) converges uniformly and completely to the exact value: that is, the following statements hold: (i) supsθd |Eˆ (t) [V (s ; θ) | s d; θ] − E[V (s ; θ) | s d; θ]| is measurable.
∞ (ii) For any ˜ > 0 there exists a sequence {zt } such that t=0 zt < ∞ and P sup Eˆ (t) [V (s ; θ) | s d; θ] − E[V (s ; θ) | s d; θ] > ˜ ≤ zt sθd
COROLLARY 1: Let the state space be a product of a finite set Sf and a bounded rectangle Sc ∈ RJSc , S = Sf × Sc . Let f (sf sc |sf sc ; θ) be the state transition density with respect to the product of the counting measure on Sf and the Lebesgue measure on Sc . Assume for any sf ∈ Sf and d ∈ D, we can define S(sf d) ⊂ S such that f (sf sc |sf sc d; θ) > 0 for any (sf sc ) ∈ / S(sf d) S(sf d) and any sc ∈ Sc and f (sf sc |sf sc d; θ) = 0 for any (sf sc ) ∈ and any sc ∈ Sc . For each sf ∈ Sf and d ∈ D, let density gsf d (·) be such that infθ∈Θ(sf sc )∈S(sf d)sc ∈Sc f (sf sc |sf sc d; θ)/gsf d (sf sc ) ≥ f > 0 and supθ∈Θ(s sc )∈S(sf d)sc ∈Sc f (sf sc |sf sc d; θ)/gsf d (sf sc ) ≤ f < ∞. In the DP solvf ing algorithm, generate the random grid over the state space for each discrete state mj sf ∈ Sf and decision d ∈ D: ssf d ∼ gsf d (·), and use these grids to compute the approximations of the expectations E(V (s ; θ)|sf sc d; θ). Then the conclusions of Theorem 1 hold. COROLLARY 2: If the transition for the discrete states is independent from the other states, then a more efficient alternative would also work. Let us denote the transition probability for the discrete states by f (sf |sf d; θ). Suppose that for f (sc |sc d; θ) and some g(·) defined on Sc , Assumption 4 holds and the random grid scmj is generated only on Sc from g(·). Consider the following approximation of the expectations, Eˆ (m) [V (s ; θ)|sf sc d; θ], in the DP solving algorithm: (14) f (sf |sf d; θ) sf ∈Sf (sf d) ˆ i) ˜ N(k N(m)
×
V ki (sf ski j ; θki )f (ski j | s d; θ)/g(ski j ) i=1
j=1
ˆ r) ˜ N(k N(m)
r=1
q=1
f (skr q | s d; θ)/g(skr q )
1678
ANDRIY NORETS
where Sf (sf d) denotes the set of possible future discrete states given the current state sf and decision d. Then the conclusions of Theorem 1 hold. 4. CONVERGENCE OF POSTERIOR EXPECTATIONS In Bayesian analysis, most inference exercises involve computing posterior expectations of some functions. For example, the posterior mean and the posterior standard deviation of a parameter and the posterior probability that a parameter belongs to a set can all be expressed in terms of posterior expectations. More importantly, the answers to the policy questions that DDCMs address also take this form. Using the uniform complete convergence of the approximations of the expected value functions, I prove the complete convergence of the approximated posterior expectations under mild assumptions on a kernel of the posterior distribution. ASSUMPTION 7: Assume that ti , θ, and νtki have compact supports E, Θ, and [−ν ν] correspondingly, where νtki denotes the kth component of νti . Let the joint posterior distribution of the parameters and the latent variables be proportional to a product of a continuous function and indicator functions, (15)
p(θ V ; F|d x) ∝ r(θ V ; F(θ )) · 1Θ (θ) 1E (ti )p(dti |Vti ) · it
·
1[−νν] qk (θ Vti ti Fti (θ ti ))
itk
where r(θ V ; F) and qk (θ Vti ti Fti ) are continuous in (θ V F), F = {Ftdi ∀i t d} stands for a vector of the expected value functions, and Fti are the corresponding subvectors. Also assume that the level curves of qk (θ Vti ti Fti ) corresponding to ν and −ν have zero Lebesgue measure, (16)
λ[(θ V ) : qk (θ Vti ti Fti ) = ν] = λ[(θ V ) : qk (θ Vti ti Fti ) = −ν] = 0
This assumption is likely to be satisfied for most models formulated on a bounded state space, in which distributions are truncated to bounded regions required by the theory. The kernel of the joint distribution for the engine replacement example from Section 2 has the form in (15). Condition (16) is also easy to verify. In Rust’s model, qd (θ Vti ti Fti ) = u(xti d) + tdi + Ftdi (θ ti ) − Vtdi = ν defines a continuous function Vtdi = u(xti d) + tdi + Ftdi (θ ti ) − ν. Since the Lebesgue measure of the graph of a continuous function is zero, (16) will be satisfied.
INFERENCE IN DDCMS WITH CORRELATED UNOBSERVABLES
1679
THEOREM 2: Let h(θ V ) be a bounded function. Under Assumptions 1–7, the expectation of h(θ V ) with respect to the approximated posterior that uses the DP solution approximations Fˆ n from step n of the DP solving algorithm converges completely (and thus a.s.) to the true posterior expectation of h(θ V ) as
∞n → ∞. In particular, for any ε > 0, there exists a sequence {zn } such that n=0 zn < ∞ and the probability P h(θ V )p(θ V ; F|d x) d(θ V ) n − h(θ V )p(θ V ; Fˆ |d x) d(θ V ) > ε is bounded above by zn . One way to apply Theorem 2 is to stop the DP solving algorithm at an iteration m and run the Gibbs sampler for extra n iterations using the DP solution Eˆ (m) [V (s ; θ)|s d; θ] from iteration m. If the Gibbs sampler is uniformly ergodic (see Tierney (1994)) for any fixed approximation Eˆ (m) [V (s ; θ)|s d; θ], then for any δ > 0 and ε > 0 there exist m and N such that for all n ≥ N, m+n 1 P h(θi V i i ) n i=m+1 − h(θ V )p(θ V ; F|d x) d(θ V ) > ε ≤ δ If we do not stop the DP solving algorithm and run it together with the Gibbs sampler, then the stochastic process for (θi V i i ) will not be a Markov chain. In this case, results from the adaptive MCMC literature (e.g., Roberts and Rosenthal (2006)) can be adapted to prove laws of large numbers and convergence in distribution for (θi V i i ). THEOREM 3: Let us define the following two conditions. (a) The MCMC algorithm that uses the exact DP solutions is uniformly ergodic: for any ε > 0 there is N such that N P ((θ V ) ·; F) − P(·; F|d x) ≤ ε for any (θ V ), where P N (· ·) is the Markov transition kernel implied by N iterations of the MCMC algorithm, P(·; F|d x) is the posterior probability measure, and · is the bounded variation norm.
1680
ANDRIY NORETS
(b) The transition kernel that uses the approximate DP solutions converges uniformly in probability to the transition kernel that uses the exact solutions P sup P((θ V ) ·; F) − P((θ V ) ·; F n ) → 0
θV
as n → ∞
Conditions (a) and (b) imply the following two results: (i) The MCMC algorithm that uses the approximate DP solutions is ergodic: for any (θ0 V 0 0 ) and any ε > 0 there exists M such that for any i ≥ M, sup P((θi V i i ) ∈ A|θ0 V 0 0 ) − P(A; F|d x) ≤ ε A
(ii) A weak law of large numbers (WLLN) holds: for any (θ0 V 0 0 ) and a bounded function h(·), n
P
h(θ V ) n → i
i
i
h(θ V )p(θ V ; F|d x) d(θ V )
i=1
Norets (2007) showed how to establish condition (a) for the MCMC algorithm used for inference in the engine replacement example. A verification of condition (b) involves arguments and assumptions similar to those employed in the statement and proof of Theorem 2. (Theorem 2 implies strong convergence for the approximated posterior probability, while here we need a similar result for the approximated Markov transition probability.) 5. CONCLUSION This paper presents a feasible method for Bayesian inference in dynamic discrete choice models with serially correlated unobserved state variables. I construct the Gibbs sampler, employing data augmentation and Metropolis steps, that can successfully handle multidimensional integration in the likelihood function of these models. The computational burden of solving the DP at each iteration of the estimation algorithm can be reduced by efficient use of the information obtained on previous iterations. Serially correlated unobservables are not the only possible source of intractable integrals in the likelihood function of DDCMs. The Gibbs sampler algorithm can be extended to allow for other interesting features in DDCMs such as individual-specific coefficients, missing data, macroshocks, and cohort effects. The proposed theoretical framework is flexible and leaves room for experimentation. For details on implementation and experiments, the interested reader is referred to Norets (2007, 2008). Overall, combined with efficient DP solution strategies, standard computational tools of Bayesian analysis seem to be very promising in making more elaborate DDCMs estimable.
INFERENCE IN DDCMS WITH CORRELATED UNOBSERVABLES
1681
REFERENCES AGUIRREGABIRIA, V., AND P. MIRA (2007): “Dynamic Discrete Choice Structural Models: A Survey,” Working Paper 297, Department of Economics, University of Totonto. [1667] ALBERT, J. H., AND S. CHIB (1993): “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669–679. [1665] CHIB, S., AND E. GREENBERG (1995): “Understanding the Metropolis–Hastings Algorithm,” The American Statistician, 49, 327–335. [1672] ECKSTEIN, Z., AND K. WOLPIN (1989): “The Specification and Estimation of Dynamic Stochastic Discrete Choice Models: A Survey,” Journal of Human Resources, 24, 562–598. [1667] ERDEM, T., AND M. KEANE (1996): “Decision-Making Under Uncertainty: Capturing Dynamic Brand Choice Processes in Turbulent Consumer Goods Markets,” Marketing Science, 15, 1–20. [1668] ERDEM, T., S. IMAI, AND M. KEANE (2003): “A Model of Consumer Brand and Quantity Choice Dynamics Under Price Uncertainty,” Quantitative Marketing and Economics, 1, 5–64. [1668] GEWEKE, J. (1991): “Efficient Simulation From the Multivariate Normal and Student-t Distributions Subject to Linear Constraints and the Evaluation of Constraint Probabilities,” in Computing Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface, ed. by E. M. Keramidas. Fairfax: Interface Foundation of North America, 571–578. [1671] (2005): Contemporary Bayesian Econometrics and Statistics. Hoboken, NJ: WileyInterscience. [1668,1676] GEWEKE, J., M. KEANE, AND D. RUNKLE (1994): “Alternative Computational Approaches to Inference in the Multinomial Probit Model,” The Review of Economics and Statistics, 76, 609–632. [1665] GILLESKIE, D. (1998): “A Dynamic Stochastic Model of Medical Care Use and Work Absence,” Econometrica, 6, 1–45. [1666] IMAI, S., N. JAIN, AND A. CHING (2005): “Bayesian Estimation of Dynamic Discrete Choice Models,” Unpublished Manuscript, Queens University. [1666,1673,1675] KEANE, M., AND K. WOLPIN (1994): “The Solution and Estimation of Discrete Choice Dynamic Programming Models by Simulation and Interpolation: Monte Carlo Evidence,” Review of Economics and Statistics, 76, 648–672. [1668] (1997): “The Career Decisions of Young Men,” Journal of Political Economy. 105, 417–522. [1668] (2006): “The Role of Labor and Marriage Markets, Preference Heterogeneity and the Welfare System in the Life Cycle Decisions of Black, Hispanic and White Women.” Working Paper 06-004, PIER. Available at http://ssrn./com/abstract=889550. [1668] MCCULLOCH, R., AND P. ROSSI (1994): “An Exact Likelihood Analysis of the Multinomial Probit Model,” Journal of Econometrics, 64, 207–240. [1665] NORETS, A. (2007): “Bayesian Inference in Dynamic Discrete Choice Models,” Ph.D. Dissertation, The University of Iowa. [1666,1672,1680] (2008): “Implementation of Bayesian Inference in Dynamic Discrete Choice Models,” Unpublished Manuscript, Princeton University. [1666,1680] (2009a): “Continuity and Differentiability of Expected Value Functions in Dynamic Discrete Choice Models,” Unpublished Manuscript, Princeton University. [1676] (2009b): “Supplement to ‘Inference in Dynamic Discrete Choice Models With Serially Correlated Unobserved State Variables’,” Econometrica Supplemental Material, 77, http://www.econometricsociety.org/ecta/Supmat/7292 proofs.pdf. [1667,1676] PAKES, A. (1986): “Patents as Options: Some Estimates of the Value of Holding European Patent Stocks,” Econometrica, 54, 755–784. [1668] ROBERTS, G., AND J. 
ROSENTHAL (2006): “Coupling and Ergodicity of Adaptive MCMC,” available at http://www.probability.ca/jeff/research.html. [1679] RUST, J. (1987): “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,” Econometrica, 55, 999–1033. [1666,1670]
1682
ANDRIY NORETS
(1994): “Structural Estimation of Markov Decision Processes,” in Handbook of Econometrics, ed. by R. Engle and D. McFadden. Amsterdam: North-Holland. [1667,1670] (1996): “Numerical Dynamic Programming in Economics,” in Handbook of Computational Economics, ed. by H. Amman, D. Kendrick, and J. Rust. Amsterdam: North-Holland. Available at http://gemini.econ.umd.edu/jrust/sdp/ndp.pdf. [1673] (1997): “Using Randomization to Break the Curse of Dimensionality,” Econometrica, 65, 487–516. [1673] SCOTT, D. (1992): Multivariate Density Estimation. New York: Wiley-Interscience. [1675] SULLIVAN, P. (2006): “A Dynamic Analysis of Educational Attainment, Occupational Choices, and Job Search,” Paper 861, MPRA, University Library of Munich, Germany. [1668] TIERNEY, L. (1994): “Markov Chains for Exploring Posterior Distributions,” The Annals of Statistics, 22, 1758–1762. [1679]
Dept. of Economics, Princeton University, 313 Fisher Hall, Princeton, NJ 08544, U.S.A.;
[email protected]. Manuscript received July, 2007; final revision received March, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1683–1701
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES BY KEISUKE HIRANO AND JACK R. PORTER1 This paper develops asymptotic optimality theory for statistical treatment rules in smooth parametric and semiparametric models. Manski (2000, 2002, 2004) and Dehejia (2005) have argued that the problem of choosing treatments to maximize social welfare is distinct from the point estimation and hypothesis testing problems usually considered in the treatment effects literature, and advocate formal analysis of decision procedures that map empirical data into treatment choices. We develop large-sample approximations to statistical treatment assignment problems using the limits of experiments framework. We then consider some different loss functions and derive treatment assignment rules that are asymptotically optimal under average and minmax risk criteria. KEYWORDS: Statistical decision theory, treatment assignment, minmax, minmax regret, Bayes rules, semiparametric models.
1. INTRODUCTION ONE MAJOR GOAL of treatment evaluation in the social and medical sciences is to provide guidance on how to assign individuals to treatments. For example, a number of studies have examined the problem of “profiling” individuals to identify those likely to benefit from a social program.2 Manski (2000, 2002, 2004) and Dehejia (2005) suggested placing the problem within a decisiontheoretic framework and specifying a loss function that quantifies the consequences of choosing different treatments under different states of nature. Schlag (2006) and Stoye (2006) derived exact minmax-regret rules for randomized experiments with a discrete covariate and a bounded continuous outcome. Despite these important results, it is difficult to obtain exact optimality results in many empirically relevant settings, in the same way that it is difficult to obtain exactly optimal estimators or hypothesis tests. In this paper, we develop large sample results to compare treatment rules and show how to construct approximately optimal procedures from efficient estimates of treatment effect parameters. The data could come from a randomized experiment or an observational data source, and we allow for unrestricted outcome and covariate distributions (including continuously distributed covariates). The key requirements are a local asymptotic normality condition and a welfare contrast parameter that is point-identified. When social welfare contrasts are point-identified, there will typically exist many treatment rules that 1
We are grateful to Gary Chamberlain, Michael Jansson, Charles Manski, Ping Yu, numerous seminar participants, a co-editor, and the referees for comments. Porter thanks the National Science Foundation for research support under grant SES-0438123. 2 See, for example, Worden (1993), O’Leary, Decker, and Wandner (1998, 2005), Berger, Blacck, and Smith (2001), and Black, Smith, Berger, and Noel (2003). © 2009 The Econometric Society
DOI: 10.3982/ECTA6630
1684
K. HIRANO AND J. R. PORTER
are consistent, in the sense that they assign the “better” treatment with probability approaching 1. Our goal here is to make finer comparisons among rules and to base these comparisons on risk rather than conventional statistical criteria that are not tightly connected to the underlying decision problem. We first study regular parametric models, using a local parametrization so that the problem of determining whether to assign the treatment does not become trivial as the sample size increases. Using Le Cam’s limits of experiments framework (see Le Cam (1986)), we show that the treatment assignment problem is asymptotically equivalent to a simpler problem, in which one observes a single draw from a shifted Gaussian distribution and must decide whether a linear function of the mean vector is greater than zero. We solve the approximate version of the decision problem and then construct a sequence of decision rules in the original problem that asymptotically matches the solution. The forms of optimal rules will depend on the loss function and the way in which risk (expected loss) is aggregated over the parameter space—for example, we could work with average (Bayes) risk or minmax risk. We consider some specific loss functions and show that some simple rules based on efficient parameter estimates are asymptotically optimal under average and minmax risk criteria. We then extend the results to a semiparametric setting, where the welfare gain of the treatment can be expressed as a regular functional of the unknown distribution. The analysis in this setting mirrors the parametric setting, but involves a Gaussian sequence model instead of a finite-dimensional Gaussian model. We obtain both minmax and average risk optimality results; to define average risk in the semiparametric case, we propose a class of prior weightings on the tangent space. We illustrate our results by showing that Manski’s conditional empirical success rules are asymptotically optimal under certain symmetric loss functions according to both average risk and minmax risk efficiency criteria. All proofs are contained in the online Supplemental Material (Hirano and Porter (2009)). 2. STATISTICAL TREATMENT ASSIGNMENT PROBLEM Following Manski (2000, 2002, 2004), we consider a social planner, who assigns individuals to different treatments based on their observed background variables. Suppose that a randomly drawn individual has covariates X with probability distribution FX on a space X . The set of possible treatment values is T = {0 1}. The planner observes X = x and assigns the individual to treatment 1 according to a treatment rule δ(x) = Pr(T = 1|X = x) Let Y0 and Y1 denote potential outcomes for the individual, with conditional probability distributions F0 (·|x) and F1 (·|x) on the same space Y . Given a rule δ, the outcome distribution conditional on X = x is Fδ (·|x) = δ(x)F1 (·|x) + (1 − δ(x))F0 (·|x)
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1685
For a given outcome distribution F , let the social welfare be a functional W (F). We define W0 (x) = W (F0 (·|x))
W1 (x) = W (F1 (·|x))
One optimal rule is then δ∗ (x) = 1(W1 (x) > W0 (x)). Of course, this rule is generally infeasible since F0 and F1 (and hence W0 and W1 ) are unknown. We suppose that F0 and F1 belong to families of distributions indexed by a parameter θ ∈ Θ, where the parameter space could be finite dimensional or infinite dimensional. Let w0 (θ x) and w1 (θ x) denote the values for W0 (x) and W1 (x) under θ. It will be convenient to work with the welfare contrast g(θ x) = w1 (θ x) − w0 (θ x) We assume that w0 and g are continuously differentiable in θ for FX -almost all x.3 Suppose we have some data that are informative about θ, such as data from a randomized experiment or an observational study. For simplicity, we assume that the data Z n = (Z1 Zn ) are independent and identically distributed (i.i.d.) with Zi ∼ Pθ on some space Z .4 We will consider below a sequence of experiments En = {Pθn θ ∈ Θ} as the sample size grows. EXAMPLE 2.1: Dehejia (2005) used data from a randomized evaluation comparing the Greater Avenues for Independence (GAIN) program to the standard Aid to Families with Dependent Children (AFDC) program for welfare recipients in Alameda County, California. The outcome of interest is individual earnings in various quarters after the program. Since many welfare recipients had zero earnings, Dehejia used a Tobit model Yi = max{0 α1 Xi + α2 Ti + α3 Xi · Ti + i } iid
where Ti = 1 denotes receipt of the experimental program and i |Xi Ti ∼ N(0 σ 2 ). Dehejia computed posterior distributions based on observation of the n experimental subjects and then produced predictive distributions for a hypothetical (n + 1)th subject to assess different treatment assignment rules. In our notation, θ = (α1 α2 α3 σ) and Z n = {(Ti Xi Yi ) : i = 1 n}. A randomized statistical treatment rule is a mapping δ : Z n × X → [0 1]. We interpret it as the probability of assigning a (future) individual with covariate X to treatment, given past data Z n : δ(z n x) = Pr(T = 1|Z n = z n X = x) 3 For a discussion of the relationship between the net social welfare and traditional measures of effects of treatments, such as the average treatment effect, see Dehejia (2003). 4 The i.i.d. assumption could be weakened to allow for dependent data satisfying local asymptotic normality, but at the cost of complicating the arguments below.
1686
K. HIRANO AND J. R. PORTER
Let L(δ θ x) be some loss function, which specifies penalties for using the rule δ when the true parameter is θ and the future individual’s covariate is X = x. We will discuss some specific choices for loss below. Given a loss L, the risk of a rule δ(z n x) under θ is n R(δ θ x) = Eθ L(δ(Z x) θ x) = L(δ(z n x) θ x) dPθn (z n ) We evaluate risk pointwise in x. In principle, we could integrate the risk over the marginal distribution of X, but this pointwise form fits most naturally with our local asymptotic approximations. The risk of a decision rule can vary with θ and, typically, there will not exist a rule that uniformly dominates all other rules unless one restricts the class of rules substantially. There are two classic ways to define an ordering over risk functions: one can average the risk of a rule with respect to some prior measure Π on Θ, obtaining a Bayes risk: R(δ θ x) dΠ(θ) = L(δ(z n x) θ x) dPθn (z n ) dΠ(θ) Alternatively, one can focus on worst-case risk: sup R(δ θ x) = sup L(δ(z n x) θ x) dPθn (z n ) θ∈Θ
θ∈Θ
3. REGULAR PARAMETRIC MODELS 3.1. Limit Experiment In this section we consider regular parametric models, where the likelihood is smooth in a finite-dimensional parameter. To develop asymptotic approximations, we adopt a local parametrization, as is standard in the literature on efficiency of estimators and test statistics. The local parametrization is used to derive an asymptotic description of the treatment assignment problem using the limits of experiments framework (Le Cam (1986)). Although this framework is typically applied to study point estimation and hypothesis testing, it applies much more broadly to general statistical decision problems. In regular parametric models, a simple Gaussian shift model provides an approximation to the original decision problem. We first reparametrize the model in terms of local alternatives.5 The idea is to consider values for θ such that g(θ x) is “close” to 0 and there is a nontrivial 5 Alternatively, we could use large-deviations asymptotics, in analogy with Bahadur efficiency of hypothesis tests. Manski (2004) used finite-sample large-deviations results to bound the risk properties of certain types of treatment assignment rules in a binary-outcome randomized experiment. Puhalskii and Spokoiny (1998) developed a large-deviations version of asymptotic statistical decision theory and applied it to estimation and hypothesis testing.
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1687
difficulty distinguishing between the effects of the two treatments as sample size grows. In our setting, for a given value of x, we center the localization around θ0 such that (3.1)
g(θ0 x) = 0
and consider parameter sequences of the form θ0 + √hn for h ∈ Rk . This is the same localization device used in local asymptotic power calculations and efficiency bounds, although here the centering value θ0 is tied to a particular covariate value x. Equation (3.1) is not the only case of interest, but for establishing asymptotic optimality, it is the key case to focus on. For combinations of (θ0 x) such that g(θ0 x) = 0, the √ treatment that is better at θ0 will be better for all local alternatives θ0 + h/ n asymptotically, and many rules, including the rules we will propose below, will select the better treatment with probability approaching 1. Our localization around a centering value satisfying (3.1) ensures that we are looking at the hardest cases, where it is difficult to determine the best treatment even with large sample sizes. To simplify the notation, we will suppress the dependence on x in the remainder of the analysis, writing g(θ x) as g(θ) and similarly for other quantities. All results should be interpreted as being stated for a fixed x.6 Let Θ be an open subset of Rk and suppose θ0 ∈ Θ satisfies Equation (3.1). We assume that the sequence of experiments En = {Pθn θ ∈ Θ} satisfies differentiability in quadratic mean (DQM) at θ0 : there exists a function s : Z → Rk such that 2 1 1/2 1/2 1/2 dPθ0 +h (z) − dPθ0 (z) − h s(z) dPθ0 (z) 2 = o( h 2 ) as
h → 0
The function s is the score function associated with the statistical model E1 and can usually be calculated as the derivative of the log-likelihood function. Let I0 = Eθ0 [ss ]. The DQM assumption implies that the log-likelihood ratios of the original model converge weakly to the log-likelihood ratios of a multivariate normal experiment and is the basis for the following result, which specializes Theorems 7.2 and 15.1 of Van der Vaart (1998). PROPOSITION 3.1: Let Θ be an open subset of Rk and suppose θ0 ∈ Θ. Let En = {Pθn θ ∈ Θ} satisfy DQM with I0 nonsingular. Consider a sequence of treatment assignment rules δn (z n ) in the experiments En and let βn (h) = Eθ0 +h/√n [δn (Z n )]. Suppose βn (h) → β(h) for every h. Then there exists a function δ : Rk → [0 1] 6 We will reintroduce the dependence on x in the last example of the paper, where we examine a specific treatment assignment rule based on conditional sample averages.
1688
K. HIRANO AND J. R. PORTER
such that for every h ∈ Rk , β(h) = δ(Δ) dN(Δ|h I0−1 ) where N(Δ|h I0−1 ) is the multivariate normal distribution with mean h and variance I0−1 . Proposition 3.1 shows that any converging sequence of treatment rules in the original problem is matched by some treatment rule in a simpler experiment where Δ has a shifted normal distribution with known variance. In this sense, the N(h I0−1 ) model is a “limit experiment” for the original problem. 3.2. Loss Functions Having obtained an asymptotic version of the statistical experiment, we need to complete the specification of the decision problem by choosing a loss function, and then examine the limiting forms of loss, risk, and Bayes risk. Generally, we will need to normalize, or modify somewhat, the loss function in the original problem so that limiting versions of risk and Bayes risk are well defined and lead to useful comparisons of treatment rules. A key component of the loss and risk functions that we will consider is the welfare contrast g(θ). Let g˙√be the vector√of partial derivatives of g at θ0 . Then, since g(θ0 ) = 0, we have ng(θ0 + h/ n) → g˙ h as n → ∞. 3.2.1. Asymmetric Losses We consider two loss functions that penalize differently for two types of errors—assigning to treatment 1 when treatment 0 is better and vice versa. The first loss has been used in the literature on hypothesis testing and gives fixed penalties for the two types of errors: Hypothesis Testing Loss: (1 − δ) if g(θ) > 0, H L (δ θ) = K · δ if g(θ) ≤ 0, where K > 0. For a rule δ, the loss can be written h h H = 1 g θ0 + √ >0 L δ θ0 + √ n n h ≤0 + δ K · 1 g θ0 + √ n h >0 − 1 g θ0 + √ n
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1689
This converges as n → ∞ for values of h such that g˙ h = 0. Due to the discontinuity in the loss LH , the case g˙ h = 0 presents a problem for taking limits, but we can define a lower bound limit as ˙ h > 0) + δ[K · 1(g˙ h < 0) − 1(g˙ h > 0)] LH ∞ (δ h) = 1(g For a converging sequence of rules δn with βn (h) and β(h) as defined in Proposition 3.1, we can define risk and its limiting lower bound as h h H = 1 g θ0 + √ >0 Rn δn θ0 + √ n n h ≤0 + βn (h) K · 1 g θ0 + √ n h >0 − 1 g θ0 + √ n ˙ h > 0) + β(h)[K · 1(g˙ h < 0) − 1(g˙ h > 0)] RH ∞ (δ h) = 1(g For Bayes risk, let Π be a prior on the parameter space Θ, with a Lebesgue density π(θ) that is positive and continuous at θ0 . Define h h H H Bn (δn Π) = Rn δn θ0 + √ π θ0 + √ dh n n Since the prior is smooth around θ0 and places mass zero on h such that g˙ h = 0, its limit is H B∞ (δ) = π(θ0 ) RH ∞ (δ h) dh Asymmetric Welfare Regret: Tetenov (2007) proposed a loss that extends loss H by penalizing based on the amount of welfare lost by choosing the worse treatment: g(θ)(1 − δ) if g(θ) > 0, T L (δ θ) = −Kδg(θ) if g(θ) ≤ 0, with K > 0. The case K = 1 corresponds to “welfare regret,” which we will discuss further below. The loss can be written as h h h T = g θ0 + √ 1 g θ0 + √ > 0 (1 − δ) L δ θ0 + √ n n n h h ≤ 0 δ − K · g θ0 + √ 1 g θ0 + √ n n
1690
K. HIRANO AND J. R. PORTER
Here, the case g˙ h = 0√ does not present a problem for taking limits, but we do need to normalize by n so that the limit is nondegenerate: √ T h nL δ θ0 + √ n → LT∞ (δ h) = (g˙ h)1(g˙ h > 0)(1 − δ) − K · (g˙ h)1(g˙ h ≤ 0)δ We then define risk and Bayes risk analogously to loss H. 3.2.2. Welfare Loss Since w0 (θ) + δg(θ) is social welfare, it is natural to use its negative as a loss function: LW (δ θ) = −w0 (θ) − δ · g(θ) To keep the second term on the right nondegenerate when sample size in√ creases, we would typically scale the loss by n. However, this will lead the first term, involving w0 , to diverge.7 To obtain nondegenerate limits, we could recenter welfare loss as LR (δ θ) = LW (δ θ) − −w0 (θ) − 1(g(θ) > 0)g(θ) = g(θ) 1(g(θ) > 0) − δ This subtracts off the loss associated with the infeasible optimal rule δ∗ = 1(g(θ) > 0) and leads to what is called regret loss.8 Then √ R h → LR∞ (δ h) = g˙ h[1(g˙ h > 0) − δ] nL δ θ0 + √ n Note that regret loss is equal to Tetenov’s loss with K = 1, so we will not need to treat regret loss separately in the formal results to follow. The recentering does not affect Bayes rules. Consider a rule δn that minimizes BnW (δn Π) h h W n n n √ = L δn (z ) θ0 + √ dPθ0 +h/ n (z )π θ0 + √ dh n n Clearly, adding any function f (θ) of the parameter to the loss function does not change the solution. Hence the minimizer of BnW will also minimize Bayes 7
This is the motivation for Assumption 1 in the Supplemental Material. Savage (1951) suggested this type of centering in a discussion of the minmax criterion. The label “regret” is standard in the decision theory literature. 8
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1691
risk for regret loss. However, the recentering does affect the minmax solution. In this sense, only certain combinations of loss function and optimality criterion have nontrivial asymptotic approximations. 3.3. Asymptotic Optimality We have approximated the sequence of statistical experiments by a Gaussian one and we have approximated loss functions by certain asymptotic versions. If we can find an optimal rule (according to some criterion) in this limiting version of the decision problem, then it will serve as a benchmark for how well any sequence of decision rules can perform in the original problem. Typically, the solution will also suggest the form of a sequence of decision rules that asymptotically match the optimal rule. We develop results for two standard optimality concepts: average and minmax risk. 3.3.1. Average Risk Optimality First, consider the average (Bayes) risk criterion in the limiting Gaussian model. When the limiting loss function is given by L∞ (δ h), we wish to find the rule δ that minimizes B∞ (δ) = R∞ (δ h) dh = L∞ (δ(Δ) h) dN(Δ|h I0−1 ) dh Directly minimizing this expression would involve searching over the space of decision rules, but the problem can be simplified by reversing the order of integration and noting that, for each Δ, the solution will minimize L∞ (δ(Δ) h) exp(−(h − Δ) I0 (h − Δ)/2) dh which is equivalent to minimizing posterior expected risk, where the posterior has h|Δ ∼ N(Δ I0−1 ). ˙ In the Supplemental Material, we show that the average Let σg2 = g˙ I0−1 g. risk optimal rule for hypothesis testing loss LH ∞ is K g˙ Δ HB HB −1 >c δ(Δ) = 1 where c = Φ σg 1+K For asymmetric welfare regret loss, average risk is minimized by the rule (K − 1)φ(c) g˙ Δ TB >c where c TB solves c = δ(Δ) = 1 σg Φ(c) + KΦ(−c) T For both LH ∞ and L∞ , the cutoff is equal to 0 when K = 1.
1692
K. HIRANO AND J. R. PORTER
Both optimal rules in the limiting Gaussian model have a simple cutoff form, which suggests how to construct rules in the original problem that are asymptotically equivalent. Let θˆ n be an estimator in the original sequence of models that is best regular: √ √ h n(θˆ n − θ0 − h/ n) N(0 I0−1 )
(3.2)
∀h ∈ Rk
h
where denotes convergence in distribution under the sequence of probability measures Pθn0 +h/√n . Both the maximum likelihood estimator and the Bayesian posterior mean would usually satisfy this condition. If we also have a consistent estimator σˆ g of σg , then the feasible decision rule (3.3)
δ
HB n
√ ng(θˆ n ) HB (Z ) = 1 >c σˆ g n
will have limiting distributions that match 1(g˙ Δ/σg > c HB ) for every h. For Tetenov’s loss, we define δTB analogously. These two decision rule are asympn totically optimal for average risk: THEOREM 3.2: Suppose the conditions of Proposition 3.1 are satisfied, g(θ0 ) = 0, g(θ) is differentiable at θ0 , and the prior measure Π admits a density π with respect to Lebesgue measure that is continuous and positive at θ0 . Suppose θˆ n p is a best regular estimator satisfying Equation (3.2) and σˆ g → σg under θ0 . Then H lim BnH (δHB n Π) = inf lim inf Bn (δn Π) δn ∈D
n→∞
n→∞
and lim
n→∞
√
√ T nBnT (δTB nBn (δn Π) n Π) = inf lim inf δn ∈D
n→∞
where D denotes the set of all sequences of decision rules that converge in the sense of Proposition 3.1. REMARK: This result focuses on the case where g(θ0 ) = 0. When g(θ0 ) = 0, one treatment is always preferred under all local parameters, so any rule choosing the appropriate treatment with probability approaching 1 will be asymptotically optimal. In particular, the rules given in Theorem 3.2 remain optimal when g(θ0 ) = 0 under suitable rate normalizations for the Bayes risk. Similar remarks apply to the other asymptotic optimality results to follow. Theorem 3.2 shows that a simple rule, which replaces θ by an efficient estimator θˆ n and replaces σg by a consistent estimator, is approximately Bayes
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1693
optimal. If the posterior distribution is tractable, we could also solve the finitesample Bayes problem directly. Not surprisingly, this will also be asymptotically optimal: = arg minδ Bnj (δ Π). If the arg min COROLLARY 3.3: For j = H T , let δjBayes n jBayes be any rule such that Bnj (δjBayes Π) ≤ does not exist for any n, let δn n j jB jB jBayes Bn (δn Π). Then Theorem 3.2 also holds with δn replaced by δn . 3.3.2. Minmax Optimality Next, consider the minmax criterion in the limiting Gaussian model. We wish to solve the functional minimization problem infδ suph R∞ (δ h) over the class of all decision rules, a difficult task in general. However, the structure of our problem can be used to simplify the solution. We consider “slices” of the parameter space constructed in the following way: fix an h0 such that g˙ h0 = 0 and for any b ∈ R, define h1 (b h0 ) = h0 +
b ˙ I0−1 g g˙ I g˙ −1 0
Along each slice, the quantity b = g˙ h1 gives the welfare contrast. It turns out that for many loss functions of interest, rules of the form δc = 1(g˙ Δ > c) for c ∈ R form an essential complete class on each slice, so that it is sufficient to search among cutoff rules to solve the minmax problem along a slice. Furthermore, when the loss function only depends on g˙ h, the same cutoff value c solves the minmax problem along each slice and leads to a minmax rule over the entire parameter space. THEOREM 3.4: Suppose that Δ ∼ N(h I0−1 ) for h ∈ Rk , and consider a decision problem with action space {0 1} and loss L(a h) such that for all h with g˙ h = 0, [L(1 h) − L(0 h)](g˙ h) < 0 ˜ (i) For any randomized decision rule δ(Δ) and any fixed h0 ∈ Rk , there exists a rule of the form δc (Δ) = 1(g˙ Δ > c) which is at least as good as δ˜ on the subspace {h1 (b h0 ) : b ∈ R}. (ii) Additionally, suppose L(a h) depends on h only through g˙ h.9 If a minmax decision rule exists, then δc∗ (Δ) is minmax for some c ∗ . Moreover, the optimal value c ∗ can be obtained by solving infc supb Eh1 (b0) L(δc h1 (b 0)). The condition [L(h 1) − L(h 0)](g˙ h) < 0 requires that the loss impose T greater penalties for incorrect assignment, and is satisfied by losses LH ∞ and L∞ . 9
There exists a function Lg such that L(a h) = Lg (a g˙ h) for all a, h.
1694
K. HIRANO AND J. R. PORTER
Part (i) of this result is a mild extension of the essential complete class theorem of Karlin and Rubin (1956). Part (ii) provides a simple method for constructing a minmax rule. Using (ii), the minmax rule for loss LH ∞ is derived in the Supplemental Material and is a cutoff rule 1(g˙ Δ/σg > c HM ), where c HM = c HB . In the case of loss LT∞ , Tetenov (2007) provided a solution in the scalar normal case with known variance, that extends to our multivariate setting in light of Theorem 3.4(ii). The minmax rule for loss LT∞ is a cutoff rule where the cutoff c TM is the solution to sup(−K · b · Φ(b − c TM )) = sup b · Φ(c TM − b) b≤0
b>0
As with the Bayes criterion, if either loss is symmetric (K = 1), then the optimal cutoff is zero. For general K, the minmax criterion and the Bayes risk criterion lead to the same optimal decision rules for loss LH ∞ , but interestingly not for loss LT∞ . These limit experiment solutions lead to the following asymptotic minmax result: THEOREM 3.5: Suppose the conditions of Proposition 3.1 are satisfied, g(θ0 ) = 0, and g(θ) is differentiable at θ0 . Suppose θˆ n is a best regular estimap tor satisfying Equation (3.2) and σˆ g → σg under θ0 . Let δHM and δTM be defined n n analogously to Equation (3.3). Then these rules are locally asymptotically minmax: h HM δ θ + sup lim inf sup RH √ 0 n n n→∞ h∈J n J h H = inf sup lim inf sup Rn δn θ0 + √ n→∞ h∈J δn ∈D J n and
h sup lim inf sup nR δ θ0 + √ n→∞ h∈J n J √ h = inf sup lim inf sup nRTn δn θ0 + √ n→∞ h∈J δn ∈D J n √
T n
TM n
where the outer supremum is over all finite subsets J of Rk . REMARK: Although the √ average and minmax asymptotic risk comparisons allow h, and hence θ0 + h/ n, to take on arbitrary values, this reparametrization has important consequences. The localization reduces the statistical information to have an approximately Gaussian form (a type of asymptotic sufficiency) and leads to limiting risk functions that depend on the parameter only
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1695
through g˙ h. These two effects lead to the simplification of the general decision problem, with the trade-off that only certain combinations of loss function and optimality notion lead to nontrivial comparisons of decision procedures. 4. SEMIPARAMETRIC MODELS 4.1. Gaussian Sequence Limit Experiment Empirical studies of treatment effects often use nonparametric or semiparametric specifications to allow for more flexibility in the modeling of treatment effects. In this section, we extend the results from the previous section to models with an infinite-dimensional parameter space. Suppose Z n consists of an i.i.d. sample of size n drawn from a probability measure P ∈ P , where P is the set of probability measures defined by the underlying semiparametric model. In some cases, the set P will include all distributions satisfying certain weak conditions (so that the model is nonparametric); in other cases, the form of the semiparametric model may restrict the feasible distributions in P . We fix P0 ∈ P , and define local alternatives to P0 in a standard way following Van der Vaart (1991a). Consider subsets of the form {Pth : t ∈ (0 η)} ⊂ P , where η > 0 and h is a real-valued measurable function on Z satisfying 2 1 1 1/2 (4.1) dPth − dP01/2 − hdP01/2 → 0 as t ↓ 0 t 2 Each subset {Pth : t ∈ (0 η)} is then a smooth one-dimensional submodel (or path) of P . Given P0 , the collection of such paths will be denoted P (P0 ). The function h is the score function associated with the submodel and satisfies h dP0 = 0 and h2 dP0 < ∞. For fixed t and h, Pt/√nh is a sequence of measures that approaches P0 as n → ∞. It will be enough to consider the sequences P1/√nh , so we can view each h as a local parameter, in analogy with the parametric case. Let the tangent set T (P0 ) ⊂ L2 (P0 ) be the set of (equivalence classes of) functions h satisfying Equation (4.1). We will assume that T (P0 ) is a separable linear space, so that T (P0 ) is a separable Hilbert space with the usual inner product and norm for L2 (P0 ). Let φ1 φ2 denote any orthonormal basis of T (P0 ). We identify T (P0 ) with an l2 space in the usual way,
through the isomorphism h → (h1 h2 ) with hj = h φj , so that h(·) = j hj φj (·). Again, we use g to denote the difference in social welfare W1 (x) − W0 (x). For a probability measure P ∈ P , we denote this welfare contrast by g(P x), or g(P) for short. We assume that there exists a continuous linear map g˙ : T (P0 ) → R such that (4.2)
1 ˙ (g(Pth ) − g(P0 )) → g(h) as t ↓ 0 t
1696
K. HIRANO AND J. R. PORTER
for every path in P (P0 ).10 This implies √ √ ˙ n g P1/ nh − g(P0 ) → g(h) ˙ can be associated with By the Riesz representation theorem, the functional g(·) ˙ ˙ h for all h ∈ T (P0 ). (This parallels = g an element g˙ ∈ T (P0 ) such that g(h) ˙ 2 = g ˙ g ˙ > 0. the notation g˙ h in the parametric case.) Assume g Van der Vaart (1991a) showed that an asymptotic representation theorem similar to the parametric case holds, where the shifted multivariate Gaussian limit experiment is replaced by an infinite (shifted) Gaussian sequence. This leads to the following result for treatment rules: n√ : h ∈ T (P0 )} satisfy Equation (4.1). ConPROPOSITION 4.1: Let En = {P1/ nh n (z ) in the experiments En and let βn (h) = sider a sequence of treatment rules δ n √ δn dP1/ nh . Suppose βn (h) → β(h) for every h. Then there exists a function δ such that β(h) = Eh [δ(Δ1 Δ2 )], where (Δ1 Δ2 ) is a sequence of indepenh
dent random variables with Δj ∼ N(hj 1). 4.2. Semiparametric Optimality We consider loss functions that are analogs of losses H and T in the parametric case, where the parameter θ is replaced by P ∈ P . We denote these by LH and LT as before, in a slight abuse of notation. Then the limiting versions of ˙ h replacing g˙ h. the loss functions will have the same form as before, with g Defining and working with average risk is more complicated in infinitedimensional models. In the parametric case, simple conditions ensure that a prior measure behaves locally like Lebesgue measure. In infinite product spaces, however, there is no natural analog of Lebesgue measure, and the asymptotic properties of Bayes procedures can be quite sensitive to the choice of prior and the specific model at hand (see, for example, Diaconis and Freedman (1986)). Instead of working with some fixed prior on the space P , we define a prior on the tangent space and compare procedures by their average risk with respect to this prior.11 It is useful to choose the orthonormal basis φ1 φ2 so that the wel˙ g ˙ and let ˙ h is attached to the leading term. Let φ1 = g/ fare contrast g φ2 φ3 be an orthonormal basis for the orthocomplement of the space spanned by φ1 . We can view the Gaussian sequence Δ1 Δ2 in Proposition 4.1 as being defined relative to this choice of orthonormal basis. In partic˙ h/ g ˙ 1) and Δ2 Δ3 have distributions that do ular, under h, Δ1 ∼ N(g ˙ h. not depend on the value of g 10 Van der Vaart (1991b) provided a thorough discussion of this differentiability notion, which is related to Hadamard differentiability. 11 In a different setting, Andrews and Ploberger (1994) used priors on local parameters to define local average power optimality of tests.
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1697
Define a prior measure for (h1 h2 ) by Π = λ × ρ, where λ is Lebesgue measure on the real line and ρ is some finite or σ-finite measure on l2 . Since Π is a finite product of σ-finite measures on separable spaces, it is well defined. Define Δ = (Δ1 Δ2 ), let R∞ be the associated risk function, and let B∞ (δ(Δ) Π) = R∞ (δ(Δ) h) dΠ(h) ˙ h, so we can Suppose that the loss function only depends on h through g ˙ h) (see the supposition of write (with slight abuse of notation) L∞ (δ(Δ) g Theorem 3.4(ii)). By interchanging the order of integration, it follows that the Bayes rule can be obtained by minimizing, for each Δ, ˜ σg2 ) L∞ (δ(Δ) u) dN(u|g =
˜ σg2 ) L∞ (0 u) dN(u|g ˜ σg2 ) (L∞ (1 u) − L∞ (0 u)) dN(u|g
+ δ(Δ)
˙ 1 and σg2 = g ˙ 2 . The optimal rule does not depend on where g˜ = g Δ Δ2 Δ3 or on ρ. From here, the analysis is essentially the same as in the parametric case. The optimal Bayes rules in the Gaussian sequence model for T LH ∞ and L∞ can be written 1(Δ1 > c HB )
1(Δ1 > c TB )
where c HB and c TB are the same constants as in the parametric case. To construct asymptotic approximation results, define
n n √ √ P RH δ = LH δn (z n ) P1/√nh dP1/ n 1/ nh n nh (z ) B (δn Π) = H n
√ RH n δn P1/ nh dΠ(h)
and similarly for loss LT . Then we have the following theorem, which shows how to construct optimal rules from a best regular estimator: THEOREM 4.2: Suppose the conditions for Proposition 4.1 are satisfied, p ˙ under P0 , and gˆ n (Z n ) satisfies: g(P0 ) = 0, g satisfies Equation (4.2), σˆ g → g (4.3)
h √
˙ 2) n gˆ n (Z n ) − g P1/√nh N(0 g
1698
K. HIRANO AND J. R. PORTER h
for all h ∈ T (P0 ), where denotes convergence in distribution under P1/√nh . Let √ √ ngˆ n ngˆ n HB n HB TB n TB δn (Z ) = 1 >c >c δn (Z ) = 1 σˆ g σˆ g Then H lim BnH (δHB n Π) = inf lim inf Bn (δn Π) δn ∈D
n→∞
n→∞
and lim
n→∞
√
√ T nBnT (δTB nBn (δn Π) n Π) = inf lim inf δn ∈D
n→∞
where D denotes the set of all sequences of decision rules that converge in the sense of Proposition 4.1. REMARK: Theorem 4.2 has a different interpretation than the parametric Bayes result in Theorem 3.2. Here, the prior Π is a weighting on the tangent space of P0 , so its influence does not disappear as the sample size grows large. We use it here to compare different rules which are derived from asymptotic considerations. In general, a sequence of Bayes rules derived from some fixed prior on P will not be optimal under our criterion. The slicing argument used for the minmax analysis of the parametric case also extends to the infinite-dimensional case. Our choice of basis proves con˙ h = ˙ h depends on h only through its first term h1 : g venient, because g ˙ 1. g h ind
THEOREM 4.3: Let Δ = (Δ1 Δ2 ) have Δj ∼ N(hj 1) under h = (h1 ˙ h = 0, h2 ) ∈ T (P0 ). Let the action space be {0 1} and for all h such that g let the loss L(a h) satisfy ˙ h < 0 [L(1 h) − L(0 h)]g ˜ (i) Then for any randomized decision rule δ(Δ) and (0 h2 h3 ) ∈ T (P0 ), there exists a rule of the form δc (Δ) = 1(Δ1 > c) which is at least as good as δ˜ on the one-dimensional subspace {(b h2 h3 ) : b ∈ R}. (ii) Additionally, suppose L(a (h1 h2 )) only depends on (h1 h2 h3 ) ˙ h.12 If a minmax decision rule exists, then δc∗ (Δ) is minmax for through g ∗ some c . Moreover, the optimal value c ∗ can be obtained by solving infc supb Eh=(b00) L(δc h). 12 This supposition is equivalent to assuming that L(a (h1 h2 )) does not depend on h2 h3 for a = 0 1.
ASYMPTOTICS FOR STATISTICAL TREATMENT RULES
1699
Theorem 4.3 leads to the optimal rules 1(Δ1 > c HM ) and 1(Δ1 > c TM ), where the constants are the same as in the parametric case. If we can match the distribution of Δ1 asymptotically, we can obtain asymptotic minmax optimality: THEOREM 4.4: Suppose the conditions for Proposition 4.1 are satisfied, p ˙ under P0 , and gˆ n is a best regular g(P0 ) = 0, g satisfies Equation (4.2), σˆ g → g estimator for g(P) satisfying Equation (4.3). Let √ √ ngˆ n ngˆ n HM n HM TM n TM δn (Z ) = 1 >c >c δn (Z ) = 1 σˆ g σˆ g Then
HM
√ sup lim inf sup RH P1/√nh = inf sup lim inf sup RH n δn n δn P1/ nh n→∞
J
δn ∈D
h∈J
J
n→∞
h∈J
and
√ P1/√nh sup lim inf sup nRTn δTM n J
n→∞
h∈J
√ = inf sup lim inf sup nRTn δn P1/√nh δn ∈D
n→∞
J
h∈J
where the outer supremum is over finite subsets J of T (P0 ). To close, we illustrate how this result applies to Manski’s conditional empirical success rule. EXAMPLE 4.5: Suppose that W0 (x) = 0 and that we observe a random sample (Xi Yi ), i = 1 n, where Xi has a finitely supported distribution and Yi |Xi has conditional distribution F1 (y|x). The social welfare contrast is the functional g(x F1 ) = w(y) dF1 (y|x). The conditional distribution function F1 is unknown and the set of possible cumulative distribution functions P is the largest set satisfying sup E |w(Y )|2 |X = x < ∞ F1 ∈P
The conditional empirical success rule of Manski (2004) can be expressed as $\hat{\delta}_n(x) = 1(\hat{g}_n(x) > 0)$, where

$$\hat{g}_n(x) := \frac{\sum_{i=1}^n w(Y_i)\, 1(X_i = x)}{\sum_{i=1}^n 1(X_i = x)}.$$
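A minimal implementation sketch of this rule follows (ours, not the authors'; the data-generating process and the choice $w(y) = y$ are hypothetical):

```python
import numpy as np

# Sketch of the conditional empirical success rule with hypothetical data.
# For each support point x, g_hat_n(x) is the cell average of w(Y), and the
# rule treats iff g_hat_n(x) > 0 (recall W_0(x) = 0 in this example).

def ces_rule(X, Y, w=lambda y: y):
    rules = {}
    for x in np.unique(X):
        cell = (X == x)
        g_hat = w(Y[cell]).mean()  # sum_i w(Y_i) 1(X_i = x) / sum_i 1(X_i = x)
        rules[x] = int(g_hat > 0)  # delta_hat_n(x) = 1(g_hat_n(x) > 0)
    return rules

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=600)                # finitely supported covariate
Y = rng.normal(loc=0.3 * (X - 1.0), scale=1.0)  # outcomes under treatment 1
print(ces_rule(X, Y))  # e.g., {0: 0, 1: 0 or 1, 2: 1}: treat where the cell mean is positive
```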
The estimator $\hat{g}_n(x)$ is an asymptotically efficient estimator of $g(x, F_1)$ (Bickel, Klaassen, Ritov, and Wellner (1993, pp. 67–68)). Therefore, $\hat{\delta}_n$ is asymptotically minmax and asymptotically Bayes optimal for both losses $H$ and $T$ under $K = 1$. This result extends easily to the case where $W_0(x)$ is not known; then $\hat{g}_n(x)$ would be a difference of conditional mean estimates for outcomes under treatments 1 and 0.

REFERENCES

ANDREWS, D. W. K., AND W. PLOBERGER (1994): "Optimal Tests When a Nuisance Parameter Is Present Only Under an Alternative," Econometrica, 62 (6), 1383–1414. [1696]
BERGER, M. C., D. BLACK, AND J. SMITH (2001): "Evaluating Profiling as a Means of Allocating Government Services," in Econometric Evaluation of Labour Market Policies, ed. by M. Lechner and F. Pfeiffer. Heidelberg: Physica-Verlag, 59–84. [1683]
BICKEL, P. J., C. A. KLAASSEN, Y. RITOV, AND J. A. WELLNER (1993): Efficient and Adaptive Estimation for Semiparametric Models. New York: Springer-Verlag. [1700]
BLACK, D., J. SMITH, M. BERGER, AND B. NOEL (2003): "Is the Threat of Training More Effective Than Training Itself? Experimental Evidence From the UI System," American Economic Review, 93 (4), 1313–1327. [1683]
DEHEJIA, R. (2003): "When Is ATE Enough? Risk Aversion and Inequality Aversion in Evaluating Training Programs," Working Paper, Columbia University. [1685]
——— (2005): "Program Evaluation as a Decision Problem," Journal of Econometrics, 125, 141–173. [1683,1685]
DIACONIS, P., AND D. FREEDMAN (1986): "On the Consistency of Bayes Estimates," The Annals of Statistics, 14 (1), 1–26. [1696]
HIRANO, K., AND J. PORTER (2009): "Supplement to 'Asymptotics for Statistical Treatment Rules'," Econometrica Supplemental Material, 77, http://www.econometricsociety.org/ecta/Supmat/6630_Proofs.pdf. [1684]
KARLIN, S., AND H. RUBIN (1956): "The Theory of Decision Procedures for Distributions With Monotone Likelihood Ratio," Annals of Mathematical Statistics, 27, 272–299. [1694]
LE CAM, L. (1986): Asymptotic Methods in Statistical Decision Theory. New York: Springer-Verlag. [1684,1686]
MANSKI, C. F. (2000): "Identification Problems and Decisions Under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice," Journal of Econometrics, 95, 415–442. [1683,1684]
——— (2002): "Treatment Choice Under Ambiguity Induced by Inferential Problems," Journal of Statistical Planning and Inference, 105, 67–82. [1683,1684]
——— (2004): "Statistical Treatment Rules for Heterogeneous Populations," Econometrica, 72 (4), 1221–1246. [1683,1684,1686,1699]
O'LEARY, C. J., P. T. DECKER, AND S. A. WANDNER (1998): "Reemployment Bonuses and Profiling," Discussion Paper 98-51, W. E. Upjohn Institute for Employment Research. [1683]
——— (2005): "Cost-Effectiveness of Targeted Reemployment Bonuses," The Journal of Human Resources, 40 (1), 270–279. [1683]
PUHALSKII, A., AND V. SPOKOINY (1998): "On Large Deviation Efficiency in Statistical Inference," Bernoulli, 4 (2), 203–272. [1686]
SAVAGE, L. (1951): "The Theory of Statistical Decision," Journal of the American Statistical Association, 46, 55–67. [1690]
SCHLAG, K. (2006): "Eleven-Tests Needed for a Recommendation," Working Paper ECO 2006-2, EUI. [1683]
STOYE, J. (2006): "Minimax Regret Treatment Choice With Finite Samples," Working Paper, New York University. [1683]
TETENOV, A. (2007): "Statistical Treatment Choice Based on Asymmetric Minmax Regret Criteria," Working Paper, Northwestern University. [1689,1694]
VAN DER VAART, A. W. (1991a): "An Asymptotic Representation Theorem," International Statistical Review, 59, 99–121. [1695,1696]
——— (1991b): "On Differentiable Functionals," The Annals of Statistics, 19, 178–204. [1696]
——— (1998): Asymptotic Statistics. New York: Cambridge University Press. [1687]
WORDEN, K. (1993): "Profiling Dislocated Workers for Early Referral to Reemployment Services," Unpublished Manuscript, U.S. Department of Labor. [1683]
Dept. of Economics, University of Arizona, 401 McClelland Hall, 1130 East Helen Street, Tucson, AZ 85721-0108, U.S.A.;
[email protected] and Dept. of Economics, University of Wisconsin, 6448 Social Science Building, 1180 Observatory Drive, Madison, WI 53706-1393, U.S.A.;
[email protected]. Manuscript received August, 2006; final revision received August, 2008.
Econometrica, Vol. 77, No. 5 (September, 2009), 1703–1704
CORRIGENDUM TO "BOOTSTRAP ALGORITHMS FOR TESTING AND DETERMINING THE COINTEGRATION RANK IN VAR MODELS"

BY ANDERS RYGH SWENSEN1

THE CLAIM on page 1712 of my paper (Swensen (2006)) that the eigenvalues of the matrices $\Phi_j$, $j = 0, 1, \ldots, r-1$, must be of modulus less than 1 because the eigenvalues of $\Phi = \Phi_r$ have this property is false, as the following example shows. Let the vector autoregression (VAR) be given by

$$\Delta X_t = \begin{pmatrix} -0.4 \\ -0.4 \end{pmatrix} (1 \;\; 1)\, X_{t-1} + \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 1.4 \end{pmatrix} \Delta X_{t-1} + \varepsilon_t.$$

Then the matrix

$$\Phi = \begin{pmatrix} 0.2 & 0.5 & 1.4 \\ -0.4 & 0.5 & 0.0 \\ -0.4 & 0.0 & 1.4 \end{pmatrix}$$

has eigenvalues with modulus 0.93, 0.87, and 0.87, while the matrix

$$\Phi_0 = \Gamma_1 = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 1.4 \end{pmatrix}$$

of course has eigenvalues 0.5 and 1.4.

This affects the conclusions of Proposition 2, Corollary 1, Lemma 2, and Lemma 4. To make the conclusions of these enunciations correct, the following assumption must be included in addition to Assumption 1.

ASSUMPTION 2: The eigenvalues of the matrices $\Phi_j$, $j = 0, 1, \ldots, r-1$, defined in (A.9) must have modulus less than 1.

It may be worth pointing out that the model considered for the numerical simulations reported in Section 4 is covered by Assumption 2. There $r = 1$ and the eigenvalues of $\Phi_0$ equal zero. Another example is obtained by choosing $\Gamma_1$ in the example above as any matrix having eigenvalues with modulus less than one. Assumption 2 does, therefore, not imply that Proposition 2 is vacuous.

Hence two natural tasks arise: first, to determine how restrictive the additional condition is and, second, to formulate a bootstrap algorithm that is consistent only under Assumption 1 of the paper. For some thoughts on the second option, see Remarks 3.2 and 4.7 in Cavaliere, Rahbek, and Taylor (2008).
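The counterexample is easy to verify numerically. The sketch below is our check, not part of the corrigendum; in particular, the block construction of $\Phi$ from $\alpha$, $\beta$, and $\Gamma_1$ is our reading of the matrix displayed above.

```python
import numpy as np

# Check of the corrigendum's counterexample (our code). With
# alpha = (-0.4, -0.4)', beta = (1, 1)', Gamma_1 = diag(0.5, 1.4), the
# companion-type matrix Phi has the block form below, which reproduces the
# 3x3 matrix printed in the text.

alpha = np.array([[-0.4], [-0.4]])
beta = np.array([[1.0], [1.0]])
Gamma1 = np.diag([0.5, 1.4])

Phi = np.block([[1.0 + beta.T @ alpha, beta.T @ Gamma1],
                [alpha,                Gamma1]])
print(Phi)                                       # [[0.2 0.5 1.4] [-0.4 0.5 0.0] [-0.4 0.0 1.4]]
print(np.sort(np.abs(np.linalg.eigvals(Phi))))   # ~ [0.87, 0.87, 0.93], all < 1
print(np.linalg.eigvals(Gamma1))                 # Phi_0 = Gamma_1: 0.5 and 1.4 > 1
```

The output confirms the point: $\Phi$ is stable while $\Phi_0 = \Gamma_1$ has an eigenvalue exceeding 1.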
1 Giuseppe Cavaliere and Anders Rahbek kindly made the author aware of the mistake.
© 2009 The Econometric Society
DOI: 10.3982/ECTA8201
REFERENCES
CAVALIERE, G., A. RAHBEK, AND A. M. R. TAYLOR (2008): "Testing for Co-Integration in Vector Autoregressions With Non-Stationary Volatility," CREATES Report RP2008-50, Aarhus University. [1703]
SWENSEN, A. R. (2006): "Bootstrap Algorithms for Testing and Determining the Cointegration Rank in VAR Models," Econometrica, 74, 1699–1714. [1703]
Dept. of Mathematics, University of Oslo, P.O. Box 1053 Blindern, NO-0316 Oslo, Norway;
[email protected]. Manuscript received October, 2008; final revision received January, 2009.
Econometrica, Vol. 77, No. 5 (September, 2009), 1705–1708
ANNOUNCEMENTS

2009 LATIN AMERICAN MEETING
THE 2009 LATIN AMERICAN MEETINGS will be held jointly with the Latin American and Caribbean Economic Association in Buenos Aires, Argentina, from October 1 to 3, 2009. The Meetings will be hosted by Universidad Torcuato Di Tella (UTDT). The Annual Meetings of these two academic associations will be run in parallel, under a single local organization. By registering for LAMES 2009, participants will be welcome to attend all sessions of both meetings.

Andrés Neumeyer (UTDT) is the conference chairman. The LAMES Program Committee is chaired by Emilio Espino (UTDT). The LACEA Program Committee is chaired by Sebastián Galiani (Washington University in St. Louis).

Plenary Speakers:
Ernesto Zedillo, Former President of Mexico, Yale University
Roger Myerson, Nobel Laureate, University of Chicago, LAMES Presidential Address
Mauricio Cardenas, Brookings Institution, LACEA Presidential Address
Daron Acemoglu, MIT
Guido Imbens, Harvard University
John Moore, University of Edinburgh

Invited Speakers:
Fernando Alvarez, University of Chicago
Jere Behrman, University of Pennsylvania
Abhijit Banerjee, MIT
Pablo Beker, University of Warwick
Samuel Berlinski, University College London
Richard Blundell, University College London
Gustavo Bobonis, University of Toronto
Michele Boldrin, Washington University in St. Louis
Maristella Botticini, Boston University
Francois Bourguignon, Paris School of Economics
Francisco Buera, Northwestern University
Guillermo Calvo, Columbia University
Matias Cattaneo, University of Michigan
V. V. Chari, University of Minnesota
Satyajit Chatterjee, Federal Reserve Bank of Philadelphia
Lawrence Christiano, Northwestern University
Ernesto Dal Bo, University of California, Berkeley
David DeJong, University of Pittsburgh
José de Gregorio, Banco Central de Chile
Augusto de la Torre, The World Bank
Rafael Di Tella, Harvard University
Juan Dubra, Universidad de Montevideo
Esther Duflo, MIT
Jonathan Eaton, New York University
Huberto Ennis, Universidad Carlos III de Madrid
Martin Eichenbaum, Northwestern University
Raquel Fernandez, New York University
Sergio Firpo, Fundacao Getulio Vargas Sao Paulo
Paul Gertler, University of California, Berkeley
Edward Glaeser, Harvard University
Ricardo Hausmann, Harvard University
Christian Hellwig, UCLA
Bo Honoré, Princeton University
Hugo Hopenhayn, UCLA
Boyan Jovanovic, New York University
Dean Karlan, Yale University
Pat Kehoe, University of Minnesota
Tim Kehoe, University of Minnesota
Felix Kubler, Swiss Finance Institute
Victor Lavy, Hebrew University
David Levine, Washington University in St. Louis
Santiago Levy, Inter-American Development Bank
Rodolfo Manuelli, Washington University
Rosa Matzkin, UCLA
Enrique Mendoza, University of Maryland
Dilip Mookherjee, Boston University
John Nye, George Mason University
Rohini Pande, Harvard University
Fabrizio Perri, University of Minnesota
Andrew Postlewaite, University of Pennsylvania
Martin Redrado, Banco Central de la Republica Argentina
Carmen Reinhart, University of Maryland
Rafael Repullo, CEMFI
James Robinson, Harvard University
Esteban Rossi-Hansberg, Princeton University
Ernesto Schargrodsky, Universidad Di Tella
Karl Schmedders, Kellogg School of Management, Northwestern University
Paolo Siconolfi, Columbia University
Michèle Tertilt, Stanford University
Miguel Urquiola, Columbia University
Martin Uribe, Columbia University
Andres Velasco, Ministerio de Hacienda, Chile
John Wallis, University of Maryland
Chuck Whiteman, University of Iowa
Stanley Zin, Carnegie Mellon University

Further information can be found at the conference website at http://www.lacealames2009.utdt.edu or by email at [email protected].

2010 NORTH AMERICAN WINTER MEETING
THE 2010 NORTH AMERICAN WINTER MEETING of the Econometric Society will be held in Atlanta, GA, from January 3 to 5, 2010, as part of the annual meeting of the Allied Social Science Associations. The program will consist of contributed and invited papers. The program committee will be chaired by Dirk Bergemann of Yale University. This year we are pleased to invite submissions of entire sessions in addition to individual papers.

Program Committee:
Dirk Bergemann, Yale University, Chair
Marco Battaglini, Princeton University (Political Economy)
Roland Benabou, Princeton University (Behavioral Economics)
Markus Brunnermeier, Princeton University (Financial Economics)
Xiaohong Chen, Yale University (Theoretical Econometrics, Time Series)
Liran Einav, Stanford University (Industrial Organization)
Luis Garicano, University of Chicago (Organization, Law and Economics)
John Geanakoplos, Yale University (General Equilibrium Theory, Mathematical Economics)
Mike Golosov, MIT (Macroeconomics)
Pierre-Olivier Gourinchas, University of California, Berkeley (International Finance)
Igal Hendel, Northwestern University (Empirical Microeconomics)
Johannes Hoerner, Yale University (Game Theory)
Han Hong, Stanford University (Applied Econometrics)
Wojciech Kopczuk, Columbia University (Public Economics)
Martin Lettau, University of California, Berkeley (Finance)
Enrico Moretti, University of California, Berkeley (Labor)
Muriel Niederle, Stanford University (Experimental Game Theory, Market Design)
Luigi Pistaferri, Stanford University (Labor)
Esteban Rossi-Hansberg, Princeton University (International Trade)
Marciano Siniscalchi, Northwestern University (Decision Theory)
Robert Townsend, Massachusetts Institute of Technology (Development Economics)
Aleh Tsyvinski, Yale University (Macroeconomics, Public Finance)
Harald Uhlig, University of Chicago (Macroeconomics, Computational Finance)
Ricky Vohra, Northwestern University (Auctions, Mechanism Design)
Econometrica, Vol. 77, No. 5 (September, 2009), 1709
FORTHCOMING PAPERS

THE FOLLOWING MANUSCRIPTS, in addition to those listed in previous issues, have been accepted for publication in forthcoming issues of Econometrica.

ARMSTRONG, MARK, AND JOHN VICKERS: "A Model of Delegated Project Choice."
CALDENTEY, RENÉ, AND ENNIO STACCHETTI: "Insider Trading With a Random Deadline."
CARNEIRO, PEDRO, JAMES J. HECKMAN, AND EDWARD VYTLACIL: "Evaluating Marginal Policy Changes and the Average Effect of Treatment for Individuals at the Margin."
CHASSANG, SYLVAIN: "Fear of Miscoordination and the Robustness of Cooperation in Dynamic Global Games With Exit."
CHE, YEON-KOO, AND FUHITO KOJIMA: "Asymptotic Equivalence of Probabilistic Serial and Random Priority Mechanisms."
CITANNA, ALESSANDRO, AND PAOLO SICONOLFI: "Recursive Equilibrium in Stochastic OLG Economies."
GENTZKOW, MATTHEW, AND JESSE M. SHAPIRO: "What Drives Media Slant? Evidence From U.S. Daily Newspapers."
GUVENEN, FATIH: "A Parsimonious Macroeconomic Model for Asset Pricing."
KAMADA, YUICHIRO: "Strongly Consistent Self-Confirming Equilibrium."
LEVITT, STEVEN D., JOHN A. LIST, AND DAVID H. REILEY: "What Happens in the Field Stays in the Field: Exploring Whether Professionals Play Minimax in Laboratory Experiments."
PESENDORFER, MARTIN, AND PHILIPP SCHMIDT-DENGLER: "Sequential Estimation of Dynamic Discrete Games: A Comment."
SANNIKOV, YULIY, AND ANDRZEJ SKRZYPACZ: "The Role of Information in Repeated Games With Frequent Actions."
WINSCHEL, VIKTOR, AND MARKUS KRÄTZIG: "Solving, Estimating and Selecting Nonlinear Dynamic Models Without the Curse of Dimensionality."
© 2009 The Econometric Society
DOI: 10.3982/ECTA775FORTH