Advances in COMPUTERS
VOLUME 12
Contributors to This Volume
JAMES P. ANDERSON
R. D. BERGERON
G. M. FERRERO diROCCAFERRERA
J. D. GANNON
HARRY B. LINCOLN
JUDITH M. S. PREWITT
DAVID C. ROBERTS
D. P. SHECTER
F. W. TOMPA
A. VAN DAM
Advances in COMPUTERS

EDITED BY
MORRIS RUBINOFF
University of Pennsylvania and Pennsylvania Research Associates, Inc.
Philadelphia, Pennsylvania

VOLUME 12

ACADEMIC PRESS
New York and London, 1972
COPYRIGHT © 1972, BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED. NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.
ACADEMIC PRESS, INC.
111 Fifth Avenue,
New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD.
24/28 Oval Road, London NW1
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 59-15761
PRINTED IN THE UNITED STATES OF AMERICA
Contents

CONTRIBUTORS . . . ix
PREFACE . . . xi

Information Security in a Multi-User Computer Environment
James P. Anderson

1. The Computer Security Problem . . . 2
2. Techniques of System Access Control . . . 10
3. Computer Characteristics Supporting Information Security . . . 13
4. Operating System Functions Relating to Information Security . . . 18
5. Problems of File Protection . . . 24
6. Techniques of File Protection . . . 28
7. Techniques of Security Assurance . . . 30
8. Communications Problems . . . 33
9. Summary . . . 35
References . . . 35

Managers, Deterministic Models, and Computers
G. M. Ferrero diRoccaferrera

1. Introduction . . . 37
2. The System Approach . . . 40
3. Management Systems . . . 43
4. Management Science . . . 46
5. When and How Managers Have to Implement Management Science Models . . . 50
6. Will Computers Eliminate Managerial Decision Making? . . . 63
References . . . 71

Uses of the Computer in Music Composition and Research
Harry B. Lincoln

1. Introduction . . . 73
2. Composition of Music by Computer . . . 74
3. Music Research Using the Computer . . . 88
4. Automated Music Typography for Composition and Research . . . 107
References . . . 110

File Organization Techniques
David C. Roberts

1. Introduction . . . 115
2. Survey of File Organizations . . . 116
3. Random File Structures . . . 130
4. List File Structures . . . 143
5. Tree File Structures . . . 151
6. Implementation of File Structures . . . 160
References . . . 166

Systems Programming Languages
R. D. Bergeron, J. D. Gannon, D. P. Shecter, F. W. Tompa, and A. van Dam

1. Introduction . . . 176
2. Criteria for a Systems Programming Language . . . 180
3. Specific Constructs . . . 192
4. Reviews of Several Systems Programming Languages . . . 196
5. Extensibility and Systems Programming . . . 235
6. Language for Systems Development . . . 239
References . . . 283

Parametric and Nonparametric Recognition by Computer: An Application to Leukocyte Image Processing
Judith M. S. Prewitt

1. Introduction . . . 285
2. Image Articulation . . . 301
3. Image Description . . . 321
4. Discrimination: Logic and Rationale . . . 327
5. Linear Logic and the Evaluation of Performance: Logic and Rationale . . . 338
6. Feature Selection: Logic and Rationale . . . 347
7. Experimental Results: Parameter Variability . . . 364
8. Experimental Results: Parametric Discrimination Using Decision Theory . . . 368
9. Nonparametric Pattern Detection: A Cytotaxonomy for Leukocytes . . . 383
10. The Inverse Problem: Human Visual Discrimination Using Computer-Oriented Image Properties . . . 393
11. Perspectives on Automatic Microimage Analysis . . . 400
12. Summary and Prospectus . . . 404
References . . . 409

AUTHOR INDEX . . . 415
SUBJECT INDEX . . . 422
CONTENTS OF PREVIOUS VOLUMES . . . 432
Contributors to Volume 12

Numbers in parentheses indicate the pages on which the authors' contributions begin.
JAMES P. ANDERSON, James P. Anderson & Co., Fort Washington, Pennsylvania (1)
R. D. BERGERON, Department of Computer and Information Sciences, Brown University, Providence, Rhode Island (175)
G. M. FERRERO diROCCAFERRERA, Quantitative Methods Department, School of Management, Syracuse University, Syracuse, New York (37)
J. D. GANNON, Department of Computer and Information Sciences, Brown University, Providence, Rhode Island (175)
HARRY B. LINCOLN, Department of Music, State University of New York at Binghamton, Binghamton, New York (73)
JUDITH M. S. PREWITT, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Maryland (286)
DAVID C. ROBERTS, Informatics, Inc., Rockville, Maryland (115)
D. P. SHECTER, Department of Computer and Information Sciences, Brown University, Providence, Rhode Island (175)
F. W. TOMPA, Department of Computer and Information Sciences, Brown University, Providence, Rhode Island (175)
A. VAN DAM, Department of Computer and Information Sciences, Brown University, Providence, Rhode Island (175)
Preface
The past year and a half has witnessed a remarkable revolution in semiconductor electronics, and with it have come commensurately remarkable advances in computers. The success of large-scale semiconductor integration (LSI) has opened the door to the ultimate in low-cost miniaturization, so that all the sequential logic for a desk calculator can be fitted onto a single silicon chip in a conventional transistor case, at tremendous cost reduction. The result has been a rapid growth of the minicomputer market, with complete processing units in the $2000 category, and with corresponding price reductions in the data processing units of larger computer systems.

Meanwhile, the computer specialists have been busily engaged in producing advances in computers in the systems and applications areas. Computer analysts and programmers are becoming increasingly skilled in prescribing and implementing higher-level languages and complex applications programs, in an ever-widening universe of markets.

The current volume of Advances in Computers reflects these directions of computer growth. Professor van Dam and his coauthors provide a comprehensive analysis of systems programming languages and present criteria for such languages, including PL/I, AED, BLISS, and PL360. They examine the elements of extensibility and describe a language for systems development. A companion article by Dr. David Roberts reviews file organization techniques, covering a broad range of basic file organizations and comparing their relative effectiveness in different data storage application environments.

Professor Harry Lincoln, Professor diRoccaferrera, and Dr. Judith Prewitt relate the computer to the areas of music, management, and image processing, respectively. Professor Lincoln describes how music research is being accomplished with the aid of computers, including music representation, theory, style analysis, and ethnomusicology. In the same article, he presents the background and future of music composition by computer. Professor diRoccaferrera examines the reasons for the explosive growth of computer application to management problems. The underlying general systems approach is identified, management systems are characterized, and the use of models in management science is presented. The article then addresses itself to the "when and how" of implementing management science models and the persistent question of whether computers will displace or replace human managers. Dr. Prewitt discusses image recognition from her many years of research in the recognition of microscopic components of blood, especially leukocytes. The entire spectrum of the recognition problem is covered, and the progress achieved to date is presented in narrative text, a large number of illustrations, and in the underlying mathematics of image recognition.

Dr. James Anderson addresses himself to the important problems of information security in a multi-user environment, a question of growing relevance in time-sharing and remote-terminal computer access. After an enumeration and discussion of the many elements entering into the computer security problem, he delineates characteristics which support systems for rendering data secure. He then presents techniques for file protection in particular and security assurance in general. Complete security requires that the communications links between computers and their remote terminals be protected as well; the vulnerability to wiretapping, and control measures that may be adopted, are discussed in depth.

It is with deep regret that I bring to your attention the withdrawal of Dr. Franz L. Alt as Coeditor of this series, due to the pressures of his regular duties. Dr. Alt initiated Advances in Computers 13 years ago, and it was only when the third volume was being prepared that I was privileged to join him in this exciting activity. In his characteristically modest and pleasant way, Dr. Alt continually refused to accept the designation of Senior Editor, insisting that we both be designated as editors of equal standing. But he was in fact the primary planner and decision maker in every volume to date, and the Senior Editor in deed, if not in name. I shall miss him.
MORRIS RUBINOFF
Information Security in a Multi-User Computer Environment
JAMES P. ANDERSON

James P. Anderson & Co.
Fort Washington, Pennsylvania
1. The Computer Security Problem . . . 2
   1.1 Introduction . . . 2
   1.2 Technical Threats to Information . . . 2
   1.3 Equipment Protection . . . 4
   1.4 Back-up Data . . . 5
   1.5 Basic Assumptions of Information Protection . . . 6
   1.6 Types of Multi-User Systems . . . 7
   1.7 Relation to Privacy Issue . . . 9
   1.8 General Requirements for Secure Operation of Multi-User Systems . . . 9
2. Techniques of System Access Control . . . 10
   2.1 Basic Methods . . . 10
   2.2 Considerations in Design of Passwords . . . 11
   2.3 Password Distribution . . . 12
   2.4 Other Methods of User Authentication . . . 13
3. Computer Characteristics Supporting Information Security . . . 13
   3.1 Hardware Facilities for Multiprogramming . . . 13
   3.2 Program (User) Isolation Mechanisms . . . 14
   3.3 Two State Operation . . . 15
   3.4 I/O Characteristics . . . 16
   3.5 Virtual Machines . . . 17
4. Operating System Functions Relating to Information Security . . . 18
   4.1 Recognition of Authorized Users . . . 18
   4.2 Control of Access to Programs and Data . . . 18
   4.3 Common Services . . . 19
   4.4 Output Routing . . . 20
   4.5 Sources of Security Problems in Contemporary Operating Systems . . . 21
   4.6 Security Relationship of Operating System to Hardware System . . . 23
5. Problems of File Protection . . . 24
   5.1 Basic Problems . . . 24
   5.2 Models for Shared Information Processing . . . 24
   5.3 Models for Hierarchical Access Control . . . 27
6. Techniques of File Protection . . . 28
   6.1 OS/360 . . . 28
   6.2 File Encryption . . . 29
7. Techniques of Security Assurance . . . 30
   7.1 Pseudo-User . . . 30
   7.2 Audit Trails . . . 31
   7.3 Validation of Programs . . . 32
8. Communications Problems . . . 33
   8.1 Vulnerability to Wiretap . . . 33
   8.2 Wiretap Countermeasures . . . 33
9. Summary . . . 35
References . . . 35
1. The Computer Security Problem

1.1 Introduction
The problem of "computer security" ranges from the physical protection of computer equipment to the physical and logical protection of computer-based data. Since the techniques of physical security are reasonably well known and relatively simple to apply, this article stresses the problems and prospects for protecting data or information in multi-user computer environments.

The problem of information protection in computer systems has only recently received significant attention. There are several reasons for this.

(a) Until recently, there were fewer systems and less information in computer-based systems.
(b) Earlier systems were mono-programmed, and therefore simple physical security techniques were (and are) sufficient to protect the data.
(c) Growth of on-line storage of large data bases has concentrated significant data resources in one place.
(d) Growth of multiprogrammed operation (time-shared, multi-access, etc.) permits ready access to large sets of data.
(e) Development of on-line and batch data retrieval systems to exploit the collection of data has made the data easier to access and manipulate, and increased the number of people who could access it.

All of these factors have converged in the past few years, making it possible and profitable to penetrate systems for purposes of recovery or manipulation of information.

1.2 Technical Threats to Information
Various writers have categorized the threats to on-line data in time-shared systems. Petersen and Turn [18a] distinguish between accidental and deliberately induced disclosure, with the latter further broken down into passive and active methods. The former is wiretapping; the latter,
all of the perversions of normal access procedures that give a user unauthorized access. In another paper, Ware [23] elaborates on accidental disclosures that can occur, citing such events as failure of computer circuits for bounds registers or memory protection, failure of the system/user mode instruction separation, and the like. He further cites the hazard (to information) that exists in the radiation of electromagnetic energy in high-speed electronic circuits (thus facilitating eavesdropping by third parties), and the vulnerability of systems to various forms of manipulation by operators and other operations staff personnel. It is interesting to note that nearly all of the writers on this subject take as an initial condition the assumption of the reliability of the operations staff, a reflection of the seriousness of this threat and the extreme difficulty (or near impossibility) of providing adequate protection to a system from this source of penetration.

A number of writers discuss the problem of "software failure" as a threat to information security. This unfortunate choice of terms is misleading, since it conveys to the uninformed the incorrect notion that software is something that wears out, and does not properly reflect the very real danger of incorrect or incomplete design of the operating system and its components, both to information security and to proper operation of a system. The term is often used to convey the fact that programs will not operate properly in the face of hardware malfunction or failure, while ignoring or weakly implying the effects of incomplete design. While the problem is not limited to the operating system alone, we note that the possibility of incomplete design is one of the major problems of information security in multi-user systems. One of the reasons for this condition is that the designers of operating systems have heretofore been concerned with providing protection to the operating system from the effects of user programming errors that arise in a benign environment. As we will see, the problem of information security has received only rudimentary and incidental attention.

Finally, a number of writers note the wiretapping vulnerability of the communications between a remote terminal and the resource-sharing computer. Because of the other vulnerabilities of systems, this problem has not been of major importance to date. A related problem, that of unauthorized access to dial-up time-shared systems (because of loss of user credentials), has been noted in the trade press.

The focus of this article is on the threat to information posed by programmers who can gain access to a multi-user system and exploit known or suspected weaknesses in the operating system. This is not to minimize the seriousness of the other threats in specific situations; rather, it directs attention to the major source of security problems of multi-user systems.
The essence of the computer security problem becomes clear when one considers that programs and data of different users share primary storage simultaneously in multi-user systems that rely on programming to maintain their separation. Furthermore, the situation is aggravated by the fact that the user of a resource-sharing system must often program the system to accomplish his work. In this environment, it is necessary to prove that a given system is proof against attack (i.e., hostile penetration), and that it will not commit unanticipated disclosure. Of the two, it is easier to demonstrate that a system is proof against attacks of various kinds than it is to prove that a system will not commit unanticipated disclosure. The reason for this is that the former concerns demonstrating that the design is sufficient, while the latter involves proving that something (e.g., a hardware failure) will not happen.

Because of the wide diversity of possible systems and environments for their use, no single set of measures can be specified that will assure the security of a resource-sharing system. Each system must be examined in its setting and a specific set of security techniques applied. Because of the evolving nature of nearly all resource-sharing systems (new applications, different user services, etc.), the process of security assurance of a system is a continuing one.

1.3 Equipment Protection
Clearly related to the problem of information protection is the protection of the vehicle for processing information: the computer system and its subsystems. Unless one is sure of the physical integrity of the hardware of a system, many of the other measures that could be taken are meaningless. Further, the principal objective of a penetrator may be to deny the legitimate users (owners) of the equipment the use of that equipment. The problem of sabotage is not the only reason for providing equipment protection. A skilled penetrator could induce modifications to the hardware that would circumvent the hardware aids built into the systems to maintain program separation, thus exposing the information to unauthorized disclosure.

The major emphasis in the literature is on fire protection, perhaps the largest single hazard to computer systems. To aid in planning for the physical protection of computer equipment against fire hazards, the National Fire Protection Association publishes standards applicable to data processing installations [16]. Along with fire protection, the next major problem is to provide proper controls to limit physical access to a computer operation to the operations staff only.
Computer rooms and data processing centers should be located in interior portions of a building, with single entrances equipped with locked doors. (Multiple emergency exits can be provided with alarms on the doors to signal when the door is opened.) Access to the computer room can be limited by use of magnetic card locks or multiple push-button locks, of which there are a number available commercially [3].

1.4 Back-up Data
The information in files must be recognized as an asset requiring protection. As a consequence, operators of data processing centers must take prudent action to prevent the loss of this data. This form of protection is common practice in most businesses, since the data is often critical to the continued operation of a business. The frequency of taking back-up copies is a function of the activity of the file, and how frequently it changes. If the transaction load is small, it may suffice to take a back-up copy of a file once a week (or month) and save the transactions in the interim. When sufficient change has taken place, a copy of the current file can be taken, and the accumulated transactions discarded. When the transaction rate is high, the "old master" can be saved (with the transactions) until a new run is made. This is practical mostly for tape files. The purpose of saving transactions is to minimize the time needed to reconstruct the file in the event it is destroyed. With files that are updated in place (direct access files), the saving of transactions may be the only feasible way to reconstruct the file.

In order to determine the magnitude of a security program, it is desirable to categorize the data (files) of a company as to the degree of importance it has. As an initial estimate, one can place data in the following categories:

(a) Vital: cannot operate a business without it.
(b) Essential: very difficult to operate without it, or proprietary.
(c) Important: difficult to operate without it, or costs increase without it.
(d) Convenient: simplifies some jobs.

Examples of vital data might include a grocery inventory file for a supermarket chain, the file of sold and available space for a reservation service, the list of stocks held for each customer in a brokerage house, etc.; in fact, any data or information that is the "product" of the business. Essential information includes that required to be reported by regulatory or taxing authorities, payroll files, accounts payable, accounts receivable, and other accounting files. Important data might include files supporting
production scheduling, personnel files, and files containing summary data from other accounting and business systems. Convenient data is what is left after all other data is categorized. In some installations a large amount of convenient data will be found, and this may be a good point at which to review whether continued production of such data is justified.

Factors to be considered in categorizing data include:

(a) What role the data plays in a business.
(b) What its replacement cost is.
(c) What its reconstruction cost is.
(d) What the value of the data is to someone else.
(e) What alternatives there are in an emergency.

For the most part these factors are self-explanatory. The difference between (b) and (c) is that it may be possible to reconstruct lost data from back-up files or current source documents; the replacement costs are the total costs to redevelop a file from scratch if all data (and back-up) is lost. While data back-up is not the primary focus of information protection, it is important to the overall security afforded a system, and constitutes additional files that require the same degree of protection as the primary files.

The use of "counter service" for programmers and other "local" users of a system cuts down the traffic in a computer room, and further provides a degree of control on the distribution of reports and acceptance of jobs. With modern operating systems, there is no valid reason for any programmers to be present at the computer during the execution of their jobs. Supplementing a single access for the computer room proper, it is highly desirable to control access to a data processing center in order to minimize the risk of unauthorized persons obtaining physical access to the computer. Techniques for such control include the use of guards, picture badges (color coded to indicate type of access: computer room, keypunch areas, job counter, etc.), closed circuit TV, and the like.

While only slightly challenging in its own right, equipment protection is the foundation upon which information protection rests. Without assurance of the integrity of the equipment, the balance of the efforts is wasted. In this, all writers are in agreement.

1.5 Basic Assumptions of Information Protection
There are two basic assumptions universally held in considering computer-based information protection: (a) that the physical environment of the computer system is secure, and (b) that the integrity of the operations personnel is above reproach. The importance of these assumptions is obvious, for if the physical
environment of the computer system is not secure, it is possible to steal or otherwise appropriate any information contained on any transportable medium. While it can be argued that such theft would be detected, the housekeeping practices in many computer installations are sufficiently lax that as long as a tape or disc pack were replaced with another, the theft would probably be put down as an administrative error.

The operations personnel (including systems programmers) constitute a special hazard because of the nature of the access to files and the system required by their jobs. It is quite obvious that an operator can copy any file, list it, or even manipulate information, all with little or no chance of detection in almost any computer installation. In addition to the operators, systems programming personnel responsible for the maintenance of the operating system and the development of applications programs know about security controls installed in systems, and often can create programs operating with such privilege as to circumvent whatever controls may be present in a system. Peters [17] notes that not all operations personnel require clearance to the highest level needed by a system, provided there is at least one individual on every shift who is cleared and is fully aware of what is required to operate a system appropriately and securely in any circumstance. This observation is important in that, depending on the installation, the cost of clearing all personnel could be prohibitive. Comber [4], in discussing information privacy (see Section 1.7), states the case thus: "Despite all that has been said heretofore, the 'key' to security information rests with individuals who have access to the data system."

1.6 Types of Multi-User Systems
In considering the problems of information security, the conventional classification of systems into "time-sharing," batch, and remote batch is of little value because it does not suggest significantly enough the degree or scope of the problem. The distinction between single-user-at-a-time systems and multi-user systems is more meaningful. From a security viewpoint, we wish to classify systems as to the degree of direct user control permitted by the system. Differentiating among systems as to the degree of direct user control is merely a recognition of the fact that the security threat to a time-shared system is a function of the direct user control possible in a system. Clearly, if a user (at a terminal) cannot exercise direct control over the program(s) he is executing, he is less likely to be able to cause improper operation of the program than a user who has a high degree of direct control.
One can identify a number of points along a spectrum of direct control.

(a) A system in which only specific "canned" programs may be used from a terminal. Airline reservations systems, or the various query systems, are examples. The user "control" of the programs is limited to supplying parameters.
(b) Systems providing interpretive computing for terminal users. Systems providing BASIC or JOSS type languages are examples. The principal distinction is that although the user may specify in some detail both the functions to be performed and the sequencing desired, he is barred from direct control of the hardware (i.e., from writing instructions that are directly executed by the machine) by the fact that the operations and/or the sequencing between steps is interpreted by another program standing between the user and the hardware of the central processor. Further, interpretive systems isolate users from awareness of memory allocation functions.
(c) Systems that use only approved compilers to produce running code. An outstanding example of this kind of system is the Burroughs B5500, which presents the machine to the users only in terms of the Algol, Fortran, and Cobol compilers.
(d) Systems which permit users at a terminal to write in the machine language of the system and/or exercise direct debugging control at the machine language level. Examples of this kind of use abound, although the machine language is most frequently assembly language for the obvious reasons.

In practical use, most time-shared systems offer a range of use embracing nearly all of the cases cited above. Depending on the particular circumstances of the using installation(s), an increasing security problem is generally present as the options permitting more direct user control are selected.

From still another security viewpoint, systems can be classified by the type of information they contain, and whether or not the information requires selective sharing. Examples of the selective sharing problem range from the simple case of payroll data that is available on a per-person basis to a payroll clerk, but only in aggregate to cost analysis clerks in production, to a complete management information system that might contain sales and production status data for a product line that requires both hierarchical access (e.g., a sales manager can access all data concerning his territory, while an individual salesman would be restricted just to data concerning his sales and customers) and disjoint access (e.g., a product manager can access all sales data regarding his product lines, but not data on products that are the responsibility of a different product manager).
It is this problem of information sharing in MIS and data management systems that has given rise to a number of proposals and schemes to control the shareability of information, discussed in Sections 5 and 6.

1.7 Relation to Privacy Issue
Much has been written about privacy in relation to computer systems [9, 10, 19b]. Ware [24] makes the distinction between security and privacy on the basis of whether the information in question arises in the military or defense environment, or in industrial or non-defense governmental environments. Comber [4] takes a more sociological view, relating privacy to the notion of personal information. While such distinctions are important in an overall sense, it is sufficient to note that the issue of privacy relates to disclosure policy regardless of the kind of data or the environment it arises in. Information security techniques are but one of a number of techniques used to aid in the management of a disclosure policy. Perhaps the best known and most formally developed disclosure policy is the scheme of national classifications applied to defense information. In this light, there is no essential difference in objectives between the government official who classifies a document SECRET and the corporate manager who marks a document COMPANY CONFIDENTIAL. They are both using a technique to maintain the privacy of information by labeling the information with an indicator of a particular disclosure policy. The privacy issue often arises in situations involving computer-based information systems because the disclosure policy is not well defined, and because there are often no information security measures applied (or planned) to the data processing systems involved.

1.8 General Requirements for Secure Operation of Multi-User Systems
The very wide variability in the environment, equipment, user populations, and information contained in multi-user systems precludes a simple specification of the administrative and technical measures necessary to provide an adequate level of protection to information in all systems. In spite of this, we can state broadly the minimum requirements that must be met by such systems [20]. These include:

(a) A physically secure environment for the computer, and other physical elements of the system including terminals where present.
(b) Control of access to the system.
(c) An adequate method of internally isolating individual programs (users) simultaneously resident on the system.
(d) An adequate method of limiting access to programs and data files.
With the exception of the physically secure environment for the physical elements of the system, the balance of this paper will examine these requirements and the techniques used to meet them.

2. Techniques of System Access Control

2.1 Basic Methods
After physical security measures, access control techniques are the first line of defense against unauthorized use of a computer facility and the possibility of misappropriation of information contained in a system. By "access," we mean the ability to gain use of the system; we are not dealing with the question of authority to access file data at this point. For many computer systems, "access control" is exercised by a control clerk who accepts jobs for processing on the system and distributes output to a user community. The control is administrative in nature, and may involve personal recognition of the person submitting the job by the clerk before a job is accepted. Whenever the clerk is in doubt about the authority of an individual to use the system, he may verify this authority with an appropriate level of management in his organization.

Where remote access to the information system is provided, as in various forms of time-sharing systems, other methods of access control must be applied. There are basically two techniques used to control remote access to a system. The first of these, called terminal control, limits remote access to a system by limiting physical access to the terminals connected to that system. This method is appropriate to various forms of systems dedicated to a certain function or dedicated to a given user population, as might be experienced in various in-house systems, or dedicated systems such as airline reservation or stock quotation systems. Terminal control is often used on systems where the terminal is straight-wired (i.e., it is a point-to-point permanent circuit) to the system. Where the group having legitimate access to a given terminal is homogeneous in function, it is possible from a security viewpoint to dispense with individual identification of users, letting the terminal stand as a surrogate for any user who can gain physical access to that terminal. In dedicated systems, adoption of such a policy can dramatically reduce the administrative burden of security maintenance. Adding or deleting a user is simply a matter of granting or revoking authority to use a terminal. The use of this scheme also simplifies the maintenance of file access authority, since the authority is associated with the terminal, which in turn stands for a group of users. The disadvantage of the scheme is that if a given user has
file access authorities not in common with the group represented by a given terminal, he must use a different terminal to exercise the additional authority. Another disadvantage is that the terminal control scheme by itself does not usually provide a fine enough grain of file access authority: any file access authority granted a terminal can be exercised by all members of the group represented by that terminal. This will be discussed further in Section 5.

The second method, and by far the most common, is to identify a user to the system by placing his identity (user-ID, e.g., social security number, badge number, name) on an access list maintained in the computer, and require user identification to be supplied as part of a log-on procedure. Since it is necessary to use this identification in such things as accounting reports, which may be accessible to a wide variety of people, nearly all remote access systems provide for a "secret" password to accompany the user identification [17]. The principal advantage of identifying individual users is that it permits highly selective file access authorities to be applied on a per-user basis. There is no special advantage over terminal control as a mechanism for controlling initial access to a system. The adoption of user identification as the system access control mechanism places all of the burden of properly identifying a prospective user on the central system, in contrast to the distribution of authority found in terminal-controlled access. Of course, for many systems, such as commercial time-sharing offerings, there is no choice but to identify individual users.
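The flavor of the user-identification approach can be suggested with a minimal sketch. The access list, user-IDs, and passwords below are invented for the example; a real system would of course keep this information in protected system storage.

```python
# Illustrative sketch of log-on control by user identification (Section 2.1).
# All names and values here are hypothetical.

access_list = {
    "123-45-6789": "QZKMWPT",   # user-ID (e.g., social security number) -> secret password
    "BADGE-0417": "XRLNVBA",    # user-ID (e.g., badge number)
}

def log_on(user_id: str, password: str) -> bool:
    """Accept a remote user only if the ID is on the access list and the password matches."""
    expected = access_list.get(user_id)
    return expected is not None and password == expected

if __name__ == "__main__":
    print(log_on("123-45-6789", "QZKMWPT"))   # True: authorized user
    print(log_on("123-45-6789", "GUESS"))     # False: wrong password
    print(log_on("999-99-9999", "QZKMWPT"))   # False: not on the access list
```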
2.2 Considerations in Design of Passwords

In order to be effective, passwords should be generated as a random string of letters or numbers in order to minimize the possibility of their being guessed. Furthermore, they should be long enough to discourage systematic testing. The determination of the random password length required to provide a given degree of protection against systematic testing is given below. It is assumed that the tests are carried out at the maximum line transmission rate (such as would result by replacing a terminal with another computer). The password size is determined by solving the following inequality for S:

(R/E) × 4.39 × 10^4 × (M/P) ≤ A^S
where R is the transmission rate of the line (characters/min), E is the number of characters exchanged in a log-on attempt, P is the probability that a proper password will be found (as a decimal fraction), M is the
period over which systematic testing is to take place (in months of 24 hour/day operation), A is the size of the "alphabet" from which the password is made up (e.g., 26, 36), and S is the length of the password in characters. As an example, we can determine the password size drawn from the standard alphabet that would give a probability of no more than 1/1000 (.001) of recovery after 3 months of systematic testing under the conditions noted above. The line speed is 300 characters/minute, and 100 characters are exchanged in a log-on attempt. Using our expression we get
(300/100) × 4.39 × 10^4 × (3/.001) ≤ 26^S
3.951 × 10^8 ≤ 26^S
26^S = 3.089 × 10^8   for S = 6
26^S = 8.03 × 10^9    for S = 7
Under these circumstances, we might reasonably choose S = 7 and be close enough. If the probability were made 1/10,000 (.0001), the next larger size (S = 8) would have to be chosen. In fact, it is the probability (of getting a "hit") that affects the value of the expression most; the other terms rarely contribute more than a factor of 10.
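The same calculation can be expressed as a small routine. The function below is an illustrative sketch, not part of the original treatment; the parameter names follow the expression above.

```python
# Minimum password length S satisfying (R/E) * 4.39e4 * (M/P) <= A**S (Section 2.2).
import math

def min_password_length(R, E, M, P, A):
    """R: line speed (characters/minute); E: characters per log-on attempt;
    M: months of continuous testing; P: acceptable probability of a hit;
    A: alphabet size. Returns the smallest integer S meeting the inequality."""
    attempts_bound = (R / E) * 4.39e4 * (M / P)
    return math.ceil(math.log(attempts_bound, A))

if __name__ == "__main__":
    # The worked example: 300 char/min, 100 characters per attempt,
    # 3 months of testing, probability 1/1000, 26-letter alphabet.
    print(min_password_length(300, 100, 3, 0.001, 26))   # -> 7
```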
2.3 Password Distribution

The problem of distributing passwords is one of the most vexing barriers
to frequently changing the password. Because of the "secret" nature of the password, users are encouraged to memorize them. If the frequency of password change is sufficiently low, this imposes little burden on a user, while if frequent changes (up to a change for each log-on) are made, the user will become confused or end up writing down the password. The reason passwords are kept secret is because they are re-used, perhaps for extended periods of time. If, however, the password were changed each time it was used, there would be no more risk in writing down the password than in carrying a key to a locked room. The principal risk would be loss or theft, which if reported promptly could be eliminated by locking out that password. If one accepts the foregoing, then it would be feasible for the system to supply users with the next password to be used as part of the log-on procedure. This password could be carried about by the user as required. To further minimize the risk of carrying a password openly, the system could be constructed to generate, say, 10 passwords each time, only one of which would be the correct one to use for the next log-on. The position in the
list of the correct next password would be fixed for a given user, but varied between users. In keeping with common practice, and to allow for typing errors yet prevent systematic testing of each password on a list, the log-on procedure can be designed to lock out the user after three tries (or as many as the installation desires) if any but the correct password were used.

Key switches can be used to deny casual access to a system from remote terminals fitted with the key switch. The key switch, however, does not provide unique user identification, and devolves into a technique for operating a terminal control system without requiring the terminal to be in a constantly attended office. The magnetic card reader/writer is interesting since it permits the system to transmit back the password to be used at the next log-on in a most convenient fashion. Its disadvantage is the cost of a reader/writer, which effectively precludes its use in systems with a large number of remote-access stations.
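The one-time password list can be sketched as follows. The list size of 10, the per-user secret position, and the lockout threshold of three are taken from the discussion above; everything else (names, password format) is invented for the illustration.

```python
# Sketch of the one-time password distribution scheme of Section 2.3.
import random
import string

LIST_SIZE = 10
MAX_TRIES = 3

def make_password(length=7, alphabet=string.ascii_uppercase):
    return "".join(random.choice(alphabet) for _ in range(length))

class OneTimePasswordIssuer:
    def __init__(self, secret_position):
        self.secret_position = secret_position   # fixed for this user, varied between users
        self.next_password = None
        self.failed_tries = 0
        self.locked_out = False

    def issue_list(self):
        """Return the candidate list handed to the user at the end of a successful log-on."""
        candidates = [make_password() for _ in range(LIST_SIZE)]
        self.next_password = candidates[self.secret_position]
        return candidates

    def check(self, offered):
        """Validate the offered password; lock out the user after MAX_TRIES failures."""
        if self.locked_out:
            return False
        if offered == self.next_password:
            self.failed_tries = 0
            return True
        self.failed_tries += 1
        if self.failed_tries >= MAX_TRIES:
            self.locked_out = True
        return False

if __name__ == "__main__":
    issuer = OneTimePasswordIssuer(secret_position=4)
    shown = issuer.issue_list()            # the user carries this list; only position 4 is valid
    print(issuer.check("WRONGPW"))         # False
    print(issuer.check(shown[4]))          # True
```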
Other methods of unique positive identification of users have been available, proposed, or are being actively worked on. These include devices for accurately measuring some physical characteristic (e.g., the geometry of the hand, identification of speech patterns, fingerprint reading). In addition, key switches and magnetic card reader/writers have also been mentioned. Of the former group, the problems are not of measuring physical characteristics or speech patterns, but of abstracting the measurements into a unique representation for an individual. Even if these problems are solved, the cost of the measurement equipment is generally too high to apply to most systems, particularly as the identification is no more precise than that afforded by the password technique. 3. Computer Characteristics Suppotting Information Security 3.1 Hardware Facilities for Multiprogramming
The basis of multi-user information systems is found in the hardware of the computer used for the system. It is also the foundation of adequate security controls incorporated in such a system. Multi-user (multi-access, etc.) systems use the techniques of multiprogramming to distribute the real resources of a computer system among a number of simultaneous users. Thus, we find computers with specific hardware features to support multiprogramming. These include base registers to permit relocation of
14
JAMES P. ANDERSON
program and data segments and to facilitate dynamic linking of program segments: a comprehensive interrupt system that provides redirection of control based on external events, the concept of privileged operationsissuing 1/0 instructions, setting base and time registers, etc., and memory protection schemes. Not surprisingly, because of the only recent awareness of its importance, there is no special hardware features to support information security per se. However, on inspection, we find that many of the hardware facilities for multiprogramming are useful in providing the foundation for information security as well. The fact that many of the hardware facilities to support multiprogramming were motivated from considering the effects of undebugged programs is fortunate from a security point of view. I n the balance of this section, we will consider some security objectives/requirements, and some of the hardware facilities that can be used in meeting the objective. I n the next section, we will present the rest of the framework by considering operating systems functions that support information security. Hardware feature
Use in support of multiprogramming.
Memory protect
Prevent undebugged programs Prevents users from accessing from damaging other users, information not his, or manipulating the operating sysor operating system. tem to recover system information.
Interrupt system
Redirect control on external events.
Detects attempts to execute “illegal” instructions, violate memory protect, etc.
Privileged instructions
Prevents accidental damage t o file data or to operating system.
Prevents user from “seizing” the operating system, or extending his memory bounds. Protects file data from misappropriation.
Use in support of multi-user information security. ~~
3.2 Program (User) Isolation Mechanisms
The security purpose of isolating programs, one from the other, is to prevent a hostile program from misappropriating data from main memory or manipulating parts of the operating system to cause i t to do the misappropriation on behalf of the hostile program, whereas the emphasis on program isolation in ordinary multiprogramming has been to limit the damage an undebugged program could cause. The netj effect is that the memory protect hardware in many contemporary computers provides
INFORMATION SECURITY IN A MULTI-USER COMPUTER
15
write protection (i.e., it will act to constrain all data transfers from registers to memory to fall within the memory assigned t o that program), but does not provide read protection. Where the hardware is available for read protection, it is often the case that the manufacturer-supplied operating system does not take advantage of it. The principal isolation mechanisms are segment length checks, bounds registers, and storage locks. The segment length check operates with a base register and tests the relative address developed (after indexing, if any) against the length of the segment being referenced. If the relative zddress is less than or equal to the segment length, the base is applied, and the reference completed. If the relative address is greater than the segment length, an interrupt is generated. Bounds registers operate on a similar basis, except that the check is made on the real addresses rather than relative addresses. Storage locks are bit patterns applied to blocks of real storage allocated to a particular program. References to storage for reading or writing must be accompanied by the correct bit pattern, or a reference out-of-bounds interrupt is generated. One bit pattern (usually all zeros) is reserved for use by the operating system, and permits free reading or writing to all of the main memory. From a security viewpoint, any of these specific storage protection mechanisms can provide the necessary information protection, although the specific technique chosen has profound effect upon other aspects of the operating system design, and can affect the efficiency of the system. For example, use of bounds registers will require a program and its data to occupy contigious locations, while the length check and storage lock technique permit the operating system complete flexibility in locating program and data areas of a single program (although the latter commits a fixed size unit of storage for each allocation). Related to the problem of user isolation is the problem of leaving data behind in core storage (or on temporary work files) after a job is completed. When a new job is allocated space previously used, it may be possible to recover information left behind by the previous user unless the operating system clears this space (by overwriting with zeros for example) upon deallocation. While simple to do for core storage, such a procedure becomes quite time-consuming for external files, and may require a special job continuously resident to overwrite released file space. 3.3 Two State Operations The concept of two state computer operation is already present in most contemporary computers and is important in providing information security [ l 7 ] .The two states are known variously as supervisor-user states, master mode-slave mode, privileged-nonprivileged, etc. The principal dis-
16
JAMES P. ANDERSON
tinction is that the privileged state is permitted to execute the hardware 1/0 instructions, to set base registers, and to load and/or manipulate memory protection registers. Again, from an information security viewpoint, this arrangement is quite satisfactory and provides the essential safeguards. It is interesting to note, however, that the design of modern operating systems has introduced the concept of partially privileged program. These partially privileged programs exist outside of a user’s program and often perform an essential service, as in utility programs. The reasons such programs are given the status of partially privileged is that they may require access to the users’ programs data space, and for purposes of efficiency to execute some privileged instructions. Present designs rely on programming checks built into the principal operating system to achieve the necessary separation between itself and partially privileged programs. While the concept of partially privileged programs is merely an extension of the privileged part of an operating system for most systems, the methods used to control the partial privilege can be of considerable importance if the basic manufacturer-supplied system is used as the foundation for constructing dedicated applications systems.
3.4 1/0 Characteristics The establishment of independent 1/0 machines (Channels) was a big step forward in the design of efficient computers. Upon instruction from a central processor, the channel will cause data to be read or written from or to any portion of the main (execution) memory independent of the rest of the system. It has been a characteristic of most systems that the memory protect mechanisms is suspended for channels. Since the 1/0 instructions themselves are privileged in most modern systems, this would appear to be no problem since the 1/0 can only be executed by the executive, which can check to make sure the area of memory affected is the proper one. The problem arises because many 1/0 operations are data sensitive-(e.g., the amount of data being read in is determined by the length of the physical record recorded on tape). If a read operation is commanded, and the address to which the data is to be read is within the area set aside for the user program, the operating system may issue the command, while the data being read in can exceed the nominal user8 space by an arbitrary amount. While the exploitation of such an arrangement may not be easily accomplished, the danger exists that i t is possible to read data in such a way as to replace all or part of the operating system with one’s own version, enabling a hit and run penetration to take place.
INFORMATION SECURITY IN A MULTI-USER COMPUTER
17
Of the various memory protect schemes in use, the storage lock technique will prevent overwriting of storage not identified with the user in question by including the users storage key in 1/0 commands to the channels. It is interesting to note that the IBM System 360 incorporates such checks in the hardware. 3.5 Virtual Machines A recent development in systems structure has been the concept of a virtual machine [6] (or sometimes virtual computer, or computer with virtual memory). The principal characteristic of these systems is the fact that they generalize the base address register concept found in earlier machines by providing a table of base registers for various independent segments of programs. Addresses that exist or are developed (using index registers, etc.) in the users address space are translated into real memory addresses through the use of the table of base registers. Since the table of base registers is held in computer words, often much larger than the address size, there is room for ancillary information about the segment represented by the table entry. Most commonly, the additional information is the length of the segment (used for memory protect purposes), and an indication of the types of references permitted to the segment in question. This latter information is typically: READ (AS DATA) READ OR WRITE WRITE ONLY READ (FOR EXECUTION ONLY).
Since these (or similar) access authorities can be applied to each independent segment of a program, it is possible to develop systems with very positive and precise control of the types of referencing of programs and data permitted in the system. It also permits the operating system itself to be in the “users address space,” since the access restrictions of “READ FOR EXECUTION ONLY” can be applied to the code. By incorporating the operating system into the users address space, it is possible to create systems conceptually equivalent to the single user a t a time systems of the past. Operating systems based on the virtual machine concept isolate the management of the real resources of the system (memory, peripherals, etc.) from the rest of the operating system which is constrained to operate in the area(s) allocated a given user. Even if the user breaks out of the normal bounds of the code he writes, or spoofs the operating system to read or write into areas of storage allo-
18
JAMES P. ANDERSON
cated to him (but not generated directly from code he writes), the operating system itself is constrained to operate within his allocated space. While it would still be possible for the user to alter the operating system code (accidentally or deliberately), it is feasible to validate the operating system each time before i t assigns itself to the next task. The validation could take the form of hash totals of the code, compared against values maintained in the real resource management code (see Section 7). 4. Operating System Functions Relating to Information Security
4.1 Recognition of Authorized Users
This function involves both proper identification of the user, and in those systems using dial-up facilities, often includes recognition of terminals as well. I n the latter case, this is accomplished by equiping terminals with a “HERE IS” function that responds upon interrogation by the computer. Incorporation of this automatic response upon interrogation provides the system with an additional level of confidence that it is in communication with an authorized user, although its principal value is more in the realm of being able to collect rentals for terminals supplied by the vendor of the multi-user service. Identification of terminals is more important in defense related systems where the terminal (by being located in a relatively open or accessible area) may not be authorized to transmit or receive certain levels of government classified data, while a user who may be authorized to access these levels of information can use that terminal. I n order to prevent such a user from activating programs and/or transmitting and receiving data a t classified levels higher than the terminal is permitted to handle, it is necessary to apply the concept of information access authority to the terminal population as well as the user population. 4.2 Control of Access to Programs and Data
Once an operating system accepts a user (and his “terminal” in some cases) as authentic, it is then the operating system responsibility to exercise the controls provided to limit access to programs and data in the system. The controls are part of the administration of the file system and are quite specific t o a given system. As a simple example, many time-shared BASIC systems maintain a catalog of the saved files of a user. When the user requests a file in the OLD-NEW sequence, the name is checked against those in his catalog (or in some cases, the public catalog), and only if it is a file he owns, or can access, is he permitted to continue. In essence,
INFORMATION SECURITY IN A MULTI-USER COMPUTER
19
the controls limit access to data or programs either in the public catalog, or the user’s private catalog. Access to programs and data becomes a problem only when the simple dichotomy of public data files and user private files is left, and one builds hierarchical file sharing arrangements, or where some limits on access within a single file are required. These controls are often complex, and may have to be executed for every reference to data from a file in a given case. Some of the techniques available to exercise this control are discussed in Section 5 . 4.3 Common Services
In a sense, modern operating systems have evolved from the first centralized service, IOCS, by progressively adding more functions to assist users in managing the resources of their systems. There is information TABLE1
SUPERVISORY SERVICECALLSIN GECOS I11 Symbol GEINOS GEROAD GEFADD GERELS GESNAP GELAPS GEFINI GEBORT GEMORE GEFCON GRFILS GESETS GERETS GEENDC GERELC GESPEC GETIME GECALL GESAVE GERSTR GEMREL GESYOT GECHEK GEROUT GEROLL
Meaning Input/output initiation Roadblock Physical file address request Component release Snapshot dump (Elapsed) time request Terminal transfer to monitor Aborting of programs Request for additional memory or peripherals File control block request File switching request Set switch request Reset switch request Terminate courtesy call Relinquish control Special interrupt courtesy call request Date and time-of-day request System loader Write file in system format Read file in system format Release memory Write on SYSOUT Check point Output to remote terminal Reinitiate or rollback program
20
JAMES P. ANDERSON
security benefits as well as hazards from this centralization of services. The benefits are those arising from separation of a user from the real facilities of the system, and providing programs to mediate his actions. The hazards are those associated with incomplete design, and in modern systems, from the fact that the common services are the principal internal interface between the operating system and user program, and are the primary points of attack available to anyone attempting to penetrate the system. Examination of the major operating system service functions made available to user programs through calls on the supervisor indicate the number of potential interface points between user programs and the supervisory services. Typical of these functions are the list in Table I, available under GECOS I11 for the GE-600 series systems. The degree of risk in such services is intimately bound up in the overall design of the operating system, and whether or not the system is able to validate the parameters to the common service calls. It is beyond the scope of this paper to attempt to treat this in detail; however, it is possible to illustrate the point. It is well known that OS/360 uses locations within the users address space to store addresses of privileged operating system routines. At some points in its execution, it uses these addresses to transfer control to the routines in supervisor state. As a consequence, it is possible for a user to replace the addresses with ones pointing to code he has written, and force the sequence of actions that cause the operating system to transfer control to these addresses in supervisor state. This gives him complete control of the system for whatever purpose he wishes. Contrast this with GECOS 111, where all data necessary to support the execution of the users programs is contained in tables outside of the users address space. This prevents the user from directly manipulating this information, although it may still be possible in some operating systems to supply calls on common services with incorrect 'data, and obtain control indirectly. 4.4 Output Routing
4.4 Output Routing

This function is concerned with directing data back to a user at a remote location. For many commercial time-sharing systems, there is little or no problem with this function. It impacts security only if the output is misrouted. While not a frequent problem, it can be serious in some environments. The bulk of the potential errors that could cause misrouting of data lie outside of the operating system itself. The problems arise when electrical transients in the system modify bits of the device (or terminal) address in a channel, or as the channel control word is transmitted from main (computer) memory to the channel itself. Further, failures of common carrier switching networks can cause information to be misrouted or lost.
If the user environment demands protection against this kind of error, the use of straight-wire (point-to-point) connections between the central and remote sites will provide some assurance against the misroutes that occur due to the common carrier. There is still the potential for misroutes from the channels and communications interfaces, although the relative frequency of these errors is low. With the interface to communications lines being handled by dedicated processors in many systems, it is feasible to incorporate "echo" checks between the main processor and the communications interface. This at least assures that the terminal address is transmitted properly between the systems. Within the communications interface processor, it is only possible to check that the current address being transmitted to is correct by copying the address used for the transmission and comparing it with the one used to set the register. Clearly, there is still room for errors; however, experience on a number of large-scale, heavily used systems indicates this is not frequent in the systems themselves.
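The "echo" check can be illustrated with a small sketch (in Python, with an invented message interface standing in for the hardware channel): the main processor sends the terminal address to the communications processor, compares the echoed copy, and withholds the output if they disagree.

# Sketch of an address "echo" check between the main processor and a
# communications interface processor.  The two queues stand in for the
# hardware channel; the message format is invented for the illustration.

from queue import Queue

to_comm, from_comm = Queue(), Queue()

def comm_processor():
    """Communications side: echo the terminal address before accepting data."""
    address = to_comm.get()
    from_comm.put(address)           # echo the address back unchanged

def send_output(terminal_address, data):
    """Main-processor side: transmit only if the echoed address matches."""
    to_comm.put(terminal_address)
    comm_processor()                 # in a real system this runs concurrently
    echoed = from_comm.get()
    if echoed != terminal_address:
        raise IOError("echo check failed; output withheld to avoid misrouting")
    return (terminal_address, data)  # stand-in for the actual transmission

print(send_output(terminal_address=0o1743, data="REPORT LINE 1"))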
4.5 Sources of Security Problems in Contemporary Operating Systems

4.5.1 Design for Benign Environment

The fact that the designers of operating systems are only peripherally aware of potential malevolent penetration of their systems has to be the major source of security problems in contemporary operating systems. The designers do concentrate on providing protection from program bugs of various kinds, but give little thought to the effects of a planned penetration. Thus, we see OS/360 with I/O control blocks located within a programmer's address space, where he can manipulate them (e.g., supply different data set names than those supplied in the DCB or JCL cards). Another symptom of design for benign environments is not checking that parameters passed to system service calls are legitimate for that user; for example, issuing I/O commands with addresses outside of the user's address space. This kind of action is not expected; consequently, it may be overlooked by the operating system designer.

4.5.2 Incomplete Program Design
A simple example of incomplete program design with security consequences is found in one of the contemporary time-sharing systems. It is possible to attempt a log-on with either the user-ID or the password incorrect and, by depressing the BREAK key on the teletype, interrupt the error message to the effect "Password Incorrect ... Try Again," resetting the log-on attempts counter to zero. The effect of this design error is to permit
anybody with enough patience to systematically try passwords until one
is discovered that works. Since the number of log-on attempts never reaches the threshold for disconnecting the user (usually 3 on many systems), the system is not aware that penetration is being attempted. This particular design omission could be exploited by replacing the user/terminal combination with a minicomputer programmed to try all possible (reasonable) passwords. Incomplete design is generally what is being referred to by the phrase "software failure," and arises from the designers and programmers of the operating system not being able to anticipate extreme or unreasonable situations. Unfortunately, from the security viewpoint, modern operating systems are too complex to permit simple a priori determinations that they will operate properly under all conditions.
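The flaw can be stated concretely. The routine below is a hypothetical log-on sequence (user IDs, passwords, and the three-try threshold are invented): its failure counter is held by the system and is reset only by a successful log-on, so interrupting the error message gains the penetrator nothing. In the flawed system described above, the equivalent counter was reset to zero whenever the error message was cut short.

# Hypothetical log-on routine illustrating a failure counter that survives
# interruption of the error message.  Names and the threshold are invented.

MAX_ATTEMPTS = 3
PASSWORDS = {"SMITH": "X7Q2"}            # stand-in for the system's password file
failed_attempts = {}                      # kept by the system, per user ID

def log_on(user_id, password):
    failures = failed_attempts.get(user_id, 0)
    if failures >= MAX_ATTEMPTS:
        return "LINE DISCONNECTED"
    if PASSWORDS.get(user_id) == password:
        failed_attempts[user_id] = 0      # reset only on a successful log-on
        return "LOG-ON ACCEPTED"
    failed_attempts[user_id] = failures + 1   # counts even if the message is interrupted
    return "PASSWORD INCORRECT ... TRY AGAIN"

for guess in ("AAAA", "BBBB", "CCCC", "X7Q2"):
    print(log_on("SMITH", guess))         # the fourth try is refused, not checked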
4.5.3 Problems of Program Identity

Related to the problems of validating parameters in service calls is the problem of identifying the program that made the call as a user program or a privileged program. Since service calls are often nested and embedded in the operating system itself, it is difficult for the service call to ascertain whether it is being invoked in a chain that originates in the supervisor or in a user program. It needs this information to validate parameters appropriately, since what is legal for the supervisory code is most often not legal for a user program. Complicating the process is the existence of programs (other service functions, for example) that, while part of the supervisor, have only limited privileges, though many more than an ordinary user's program. In a subsequent section, we will survey a number of protection proposals that attempt to deal with this problem.

4.5.4 Inability to Have Completeness of Design
The final source of a lingering doubt about the security of an operating system design is that even one that appears to be complete cannot be proved to be complete for all of the pathological cases that could occur. This doubt is reinforced by the experience of the computing community with the bulk of programs that are never fully debugged, if only because of the extreme cost of exercising all important cases and the virtual impossibility of even enumerating all possible pathological cases. This unhappy state of affairs is compounded by the special difficulty of accurately reproducing an exact sequence of events in a real-time system. The only relevant work reported to date is that of Dijkstra [6], who has constructed a multiprogramming environment where each component
of the operating system is constructed to be self-contained and to operate correctly for all cases. The components are then able to communicate freely and without possibility of interference, and it is claimed by the author that since each step of the construction process (mainly components) is proved to operate correctly, this constitutes a "proof" that the system as a whole will operate correctly for all cases. This work is an interesting and important step toward the general ability to prove that complex programs (operating systems) are complete; however, it appears that it will be a number of years before the results are assimilated into the general use environment. It should be noted that it is because of just this point that there is reluctance on the part of the government to permit more than the most highly restricted operation of multi-user systems with classified defense information. While there are a large number of other considerations that apply to this situation as well, it is well beyond the scope of this article to attempt to treat them.

4.6 Security Relationship of Operating System to Hardware System
In general, operating systems base their security procedures on the two major hardware features affecting security: the memory protection scheme and two-state operation (i.e., privileged instructions) [15]. These particular hardware features establish the foundation upon which the operating system designers erect their controls. The other major hardware system attribute that affects the design of the entire operating system (and the security provisions required as well) is the number of base registers and the form of addressing provided by the system, and whether or not they contain control information recognized and acted on by the hardware itself. To be sure, there are only a few systems with this attribute; however, they are nearly all distinguished by being some form of virtual system. Note that it is specifically intended to include machines such as the Burroughs 5500, 6700, etc., that provide an arbitrary number of base registers as descriptors, as well as the more obvious forms such as the GE-645 used in the MULTICS system. The important characteristic is hardware acting on control information implicitly. The importance of such facilities is that they provide positive control for all user references to data and programs. If the base registers (descriptors, or segment tables in contemporary literature) are maintained outside the user's address space, or the user is otherwise effectively prevented from manipulating them, they provide an adequate mechanism for protected information sharing and, in general, expand the modes of use of multi-user systems.
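As an illustration of what such hardware does on every reference, the following sketch checks both the segment bounds and the access bits held in a descriptor that the user program cannot itself modify. The descriptor fields, access codes, and octal values are invented; real machines (the Burroughs descriptors or the GE-645 segment words, for instance) differ in detail.

# Illustrative model of descriptor-based reference checking.  The field names
# and access codes are invented for the example.

from collections import namedtuple

Descriptor = namedtuple("Descriptor", "base length access")   # access: set of modes

def reference(descriptor, offset, mode):
    """Map (descriptor, offset) to a real address, enforcing bounds and access."""
    if mode not in descriptor.access:
        raise PermissionError(f"{mode} not permitted by descriptor")
    if not (0 <= offset < descriptor.length):
        raise PermissionError("reference outside segment bounds")
    return descriptor.base + offset

data_seg = Descriptor(base=0o20000, length=0o400, access={"READ", "WRITE"})
code_seg = Descriptor(base=0o30000, length=0o1000, access={"EXECUTE"})

print(oct(reference(data_seg, 0o17, "READ")))       # permitted
try:
    reference(code_seg, 0o5, "WRITE")                # execute-only segment
except PermissionError as err:
    print("trapped:", err)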
5. Problems of File Protection
5.1 Basic Problems
There are basically two file protection problems (excluding the problems of physical protection). The first arises in connection with computer utilities and is concerned with methods of precisely controlling the sharing of information, and more specifically, of programs. The problem is complicated by the notion of implied sharing. As an example, if a user B is sharing some programs owned by user A, and then authorizes user C to share his program that uses some of user A's programs, how is the sharing between B and C controlled so that C does not have access to the programs of A and B, only to their results? Basically, the question being addressed is how communication can be established between two users' programs such that only the results of the shared program are available to the sharer. The second problem arises in environments where data is classified according to external criteria (e.g., in files of defense information) and is more concerned with establishing a logically consistent method of determining the security label to be associated with file access requests in order to permit an intelligent determination of the validity of the request. This problem is complicated by the fact that users, programs, terminals, files, and executions all can be considered to have such labels, and that the security label of some objects (executions and some files) can change during the execution of a program, or during the execution of a job. In addition, in the environments where this problem is important, there is considerable attention paid to the derivation and proper transfer of security labels to files and printed material.

5.2 Models For Shared Information Processing
The issues involved in this problem are how authorizations to use a file or a program are accomplished and the general framework in which programs are created and executed. Most of the workers involved with this problem have assumed or required the existence of a file system consisting of a collection of files and a directory associating a user with his files, or in exceptional cases, a directory associating a file with its users. Assuming the first form, the authorization mechanism must permit a file owner to designate the users with whom he wishes to share a file and those privileges the sharer is permitted with respect to the file. A commonly used mechanism is to associate with each shared file in a user’s directory a list of other users who may access the file and for what purpose (i.e., READ, WRITE, APPEND, etc.). A sharer, in order to establish a connection to the shared file, creates his name for
the file and equates it to the file being shared. The sharer's reference to the file name he created is interpreted as an indirect reference to the owner's directory, from which the type(s) of access permitted are checked before completing the reference. A number of variants on this scheme can occur to make the process more efficient. For example, the directory search can take place at binding time (assuming pre-execution binding), a name substitution made, and a transfer of access flags made to the sharer's file control block. However, these are implementation and application dependent and will not be discussed further here. In one model [7], actual system commands are provided to permit designating sharers of files. Other authorization models exist; these include use of passwords associated with each file in the (protected part of the) system to act as locks. An owner authorizes sharing of his file(s) by providing the sharer with the password for the file. As Friedman [7] notes, however, this is less than satisfactory because it permits the sharer unrestricted access to the file for any purpose. The method of actually controlling authorized sharing in nearly all utility-oriented systems is based on the use of indirect references to the shared objects through descriptors. It is characteristic of most systems designed for information utilities, or for large populations of on-line users, that they provide some form of virtual memory system [5]. The objects (e.g., programs, data, files) occupying the virtual memory are represented by descriptors, collected into one place, managed by the system, and acting to map a virtual address into a real address. The mapping is often aided by hardware in the system, but this is merely a technique for improving execution efficiency and is not fundamental to the concept. Since descriptors are maintained by the system (necessarily, since they deal with real resources), they are in a special segment designated READ-ONLY to a process. Descriptors are used to control sharing in a variety of ways. Basically, each descriptor representing a program, data set, file, etc., contains control information in addition to the address in real memory where the object is located. The basic control information of security interest is the type of access permitted to the object: READ, READ-WRITE, EXECUTE, APPEND, etc. Since the operating system is the only program permitted to create and manipulate these descriptors, the necessary mechanism to provide controlled sharing of other users' programs and files appears to be established. This would be the case if only one user at a time were permitted to gain access to an object. However, in the multiple user environment, a given object could be in use by a large number of users, perhaps with different access privileges. In general, this case is handled within the same framework as for the single user; since each user's process is represented by a descriptor table (segment) unique to that user, the descriptor referring to such an object can have the access control information set to the appropriate value for that user. The actual checking on access type is accomplished on modern systems in hardware as a descriptor is referenced. Assuming that the processor distinguishes at least four logical states, Table II indicates possible interpretations of the access control data in the descriptor. The ERROR state is assumed to be sufficiently severe to force the abortion of the offending process.

TABLE II
INTERPRETATION OF ACCESS CONTROLS IN DESCRIPTORS AS A FUNCTION OF THE PROGRAM STATE

                      Access control in descriptor
Program state         READ     WRITE or APPEND     EXECUTE     READ-WRITE
User-READ             OK       Error               Error       OK
User-WRITE            Error    OK                  Error       OK
User-EXECUTE          Error    Error               OK          Error
Supervisor (any)      OK       OK                  OK          OK

Within this general framework, a number of secondary problems emerge. Graham treats protection as a disjoint series of rings, and discusses the problems of changing control from one protection level (viewed as concentric circles or rings) to another in a safe manner [8]. To provide protection in both a downward (from a superior routine to an inferior routine) as well as an upward direction, he proposes a model that augments the descriptor for a segment with ring bounds that permit free access as long as the element being transferred to is within the bounds, but invoke special software whenever the bounds are exceeded in either direction. In general, the special software validates the address being referred to regardless of the direction of the reference. In this way, the mechanism protects a process from the operating system as much as the other way round. Vanderbilt has created a model that extends that of Graham to include cases that arise when a user sharing an object authorizes others to use the process he creates [21]. In his model, he introduces the notion of access privileges as a function of the activation level of the process, and in effect makes copies of the descriptor segment for each activation level encountered in order to provide the precise control needed. He distinguishes
the problems that arise from direct access to a shared procedure and adopts as part of the model the policy that direct sharing of procedures is permitted only for procedures authorized to the borrower by their owner, while only indirect sharing is permitted for those procedures owned by a third party and authorized to, and used by, an owner in constructing a procedure that is (to be) shared with others. In the latter case, a borrower can only effect indirect access to procedures borrowed by the owner of a shared procedure.
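The owner-directory authorization mechanism described at the beginning of this section can be sketched as follows (file names, user names, and privilege sets are invented): the sharer's own name for a file resolves indirectly to the owner's directory entry, where the permitted access types are checked before the reference completes.

# Sketch of owner-directory access checking for shared files.  The directory
# layout, user names, and privilege names are invented for the illustration.

owner_directory = {
    # file name -> (owner, {sharer: set of permitted access types})
    "PAYROLL": ("ADAMS", {"BAKER": {"READ"}, "CLARK": {"READ", "APPEND"}}),
}

sharer_links = {
    # (sharer, sharer's own name for the file) -> (owner's directory, file name)
    ("BAKER", "PAY"): (owner_directory, "PAYROLL"),
}

def access(user, local_name, mode):
    """Resolve the sharer's name indirectly and check the owner's access list."""
    directory, file_name = sharer_links[(user, local_name)]
    owner, sharers = directory[file_name]
    if user != owner and mode not in sharers.get(user, set()):
        raise PermissionError(f"{mode} on {file_name} not authorized for {user}")
    return f"{mode} granted on {file_name}"

print(access("BAKER", "PAY", "READ"))        # permitted by ADAMS's list
try:
    access("BAKER", "PAY", "WRITE")           # not in the access list
except PermissionError as err:
    print("refused:", err)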
5.3 Models For Hierarchical Access Control
The only available paper that deals with this subject in a formal manner is that of Weissman [27]. In it, the author defines security objects (files, users, terminals, and jobs) and security properties associated with the objects. The properties are Authority (a hierarchical set of security jurisdictions, i.e., classification), Categories (a mutually exclusive set of security jurisdictions, a formalism of the need-to-know policy), and Franchise (clearance). The balance of the paper is devoted to developing a set-theoretic statement of the policy adopted in the ADEPT-50 system:

(a) A user is granted access to the system only if he is a member of the set of users known to the system.
(b) A user is granted access to a terminal only if he is cleared to do so.
(c) The clearance of a job is determined from the clearance of the terminal and the clearance of the user.
(d) Access is granted to a file if the clearance and need-to-know properties of the file and the user are authorized (cleared) to the job.

The model treats all file accesses as events and maintains a running determination of the classification and need-to-know level of the job based on events throughout its execution. This information, known as a high water mark, is most useful in determining the derived classification and need-to-know for new files created during job execution and for labeling output. The only drawback with this model is that classification and need-to-know can change in only one direction, upward (to higher levels), depending on the files used in the application. Two relatively infrequent, but nonetheless important, cases are not treated by the model: the case where individual data items are themselves not classified, or carry a low classification, but when aggregated (collected into a file or report) may acquire a higher classification; and the case where a program
transforms a classified file into an unclassified file (perhaps by extracting data known to be unclassified for a report). The latter case arises principally because the classification is applied to too large a unit (the file) and would disappear if fields could be individually classified. The former case cannot be handled within the framework of Weissman's model as it stands, since it is a value judgment as to when (or if) a particular aggregation requires a higher classification than the source material. This could be handled by providing the concept of security declarations in programs that would override the running classification and need-to-know property if specific conditions were encountered during execution of the job. The conditions might be of the form, "If the number of records placed in temporary file F1 is greater than 100, advance the classification to the next highest level," or, in general, "IF (condition) THEN (statement of security labeling)."
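A minimal sketch of the high-water-mark computation, together with one security declaration of the kind just suggested, is given below. The classification levels, the file events, and the aggregation rule are invented examples and are not the ADEPT-50 implementation.

# Sketch of a high-water-mark computation for a job, in the spirit of the
# model described above.  Levels and the sample declaration are invented.

LEVELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET"]

def higher(a, b):
    """Return the higher of two classification levels."""
    return a if LEVELS.index(a) >= LEVELS.index(b) else b

def job_high_water_mark(file_accesses, declarations=()):
    """Raise the job's running classification on every file event."""
    mark = "UNCLASSIFIED"
    records_in_f1 = 0
    for name, level, records in file_accesses:
        mark = higher(mark, level)                  # classification moves upward only
        if name == "F1":
            records_in_f1 += records
    for threshold, forced_level in declarations:     # e.g., aggregation rule for file F1
        if records_in_f1 > threshold:
            mark = higher(mark, forced_level)
    return mark

accesses = [("MASTER", "CONFIDENTIAL", 40), ("F1", "UNCLASSIFIED", 150)]
print(job_high_water_mark(accesses, declarations=[(100, "SECRET")]))   # -> SECRET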
6. Techniques of File Protection
6.1 OS/360
The previous section has given various models for shared information control proposed or developed in the context of information-utility-oriented systems. We can contrast these with the protection provided on System/360 as typical of the type of protection the manufacturers have considered adequate. A user may create a protected data set and apply a password of up to 8 characters to it. When the data set is opened, the password must be supplied by the operator in order to continue processing. Failure to supply the correct password causes the job referring to that data set to be aborted. On 360/DOS systems, the security parameter merely informs the operator that a "secured" file is being accessed, to which he replies YES (to permit processing to continue) or NO to abort the job. Unfortunately, this scheme is rather gross, since there is no way for the operator to distinguish the owner of the data set from other "casual" users. This means anyone attempting to open the data set will probably succeed, assuming that the operator is the principal custodian of the password. On certain time-shared versions of 360 systems, the application of passwords to files is the principal protection mechanism for those files. As Lampson [12] points out, this is quite unsatisfactory for a variety of reasons including: (a) the large number of passwords that have to be remembered, and (b) the fact that any access is total.
6.2 File Encryption

This method of file protection has been persistently pursued by several writers as a solution to file protection problems. The bulk of the papers have dealt with cryptographic methods ranging from very simple schemes [23] to reasonably sophisticated techniques [2, 19c]. The applicable techniques are in most cases derived from communications practice and transferred bodily to the computer environment, with little or no justification (except for a claimed secrecy) for their use. VanTassel cites some of the properties of files that must be accounted for in adapting a cryptographic transform and further enumerates a list of the properties the transform must have [22]. The principal benefit obtained from this degree of elaboration is protection of files should other safeguards fail. Carroll and McLelland cite the advantage of encipherment in general as a useful countermeasure to passive infiltration (wire tapping), between-lines entry, piggy-back techniques, trap-doors, and theft [2]. However, they point out that information in core is vulnerable to dumps and system entry of various kinds. The principal problems with using file encryption are those associated with the maintenance of the cryptographic "keys" for different files, and the requirement that the user remember the keys. The technique suffers from much the same problem as that cited by Lampson for password-protected files [12]. The difficulty with "key" management is illustrated by considering the problems attending the encipherment of an indexed sequential file. If one assumes that the method of encipherment is known (an assumption that would have to be made in most multi-user environments, since the method could be obtained from memory dumps, and in any case might be a common system service available to all users), then only the "key" would be protecting the file. If the file were enciphered using a single key (as might be the case for a sequential file), then it would not be possible to use the file in a "random" manner. If each record is enciphered with a different key, then the problem is to associate the specific key with each record. One method of doing this is to use some form of auto-key (self-keying) for such files. This method (which is not restricted to random access files) eliminates the position sensitivity of most methods derived from communications practice, and does not interfere with common file operations regardless of organization or access methods. It must be concluded that file encipherment is effective only in specific instances, particularly where the integrity of the key(s) can be maintained, and where the cryptographic technique is itself reasonably complex. For
both a complete survey of cryptographic methods and a sobering review of the "invincibility" of even highly sophisticated techniques, the reader is referred to Kahn [11].
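The record-keying problem for a randomly accessed file can be made concrete with a toy sketch: each record's keystream is derived from the file key and the record's own number, so any record can be deciphered directly without processing its predecessors. The cipher below (an exclusive-OR against a hashed keystream) is merely a placeholder for a genuine cryptographic transform and offers no real protection; the file key and record contents are invented.

# Toy illustration of per-record keying for a randomly accessed file.

import hashlib

def keystream(file_key, record_id, length):
    """Derive a record-specific keystream from the file key and record number."""
    material = b""
    counter = 0
    while len(material) < length:
        block = f"{file_key}:{record_id}:{counter}".encode()
        material += hashlib.sha256(block).digest()
        counter += 1
    return material[:length]

def transform(file_key, record_id, data):
    """Encipher or decipher one record (XOR is its own inverse)."""
    stream = keystream(file_key, record_id, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

record = b"DOE JOHN  RATE 4.75"
sealed = transform("FILEKEY-1972", 17, record)         # written to the file
print(transform("FILEKEY-1972", 17, sealed))            # record 17 read back directly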
7. Techniques of Security Assurance

7.1 Pseudo-User

The concept of security assurance on multi-user systems is analogous to a watchman making his rounds to check on the proper functioning of the security measures protecting his domain. In a similar sense, we can increase the confidence in the security of a multi-user system by checking the proper functioning of the security measures taken in the system. Since the security measures in multi-user systems are based on hardware features such as those discussed in Section 3, it is here that tests can be applied. Assuming that the rest of the system and the additional software security measures are properly designed, this kind of check is to guard against an undetected failure of the memory protection scheme or of the two-state operation that is the base of the security mechanisms for multi-user systems. The technique consists of establishing a permanent pseudo-user program that attempts to violate memory bounds and to execute instructions reserved for the supervisor state. Since the first of these conditions should be reported to the supervisor by an appropriate interrupt, while the second may NOP or report as the first (depending on the machine involved), it is only necessary to establish a convention by which the fact that the interrupt is being forced by the pseudo-user is conveyed to the operating system, in order not to invoke the abort or dump procedures that would be appropriate for a normal user program. The frequency of running the pseudo-user hardware test can vary depending on external considerations, but it should probably be run every minute or less. Depending on the kind of system, it is also possible to check on the proper operation of various parts of the system software. At a minimum, one can extend the pseudo-user concept to check the log-on sequence for proper rejection of invalid user numbers or an incorrect password. While these checks are not a guarantee that the system is error free, failure of the system to properly reject an invalid log-on (or to accept a valid log-on) is often evidence of a hardware malfunction that may have invalidated all other internal security measures taken. The proper course of action to take in the event of an error detected by the pseudo-user program must be a function of the kind of system involved and the type of material contained in the system.
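A sketch of the pseudo-user check follows, with the hardware reduced to a simulated interface (class and routine names are invented): the test attempts an out-of-bounds reference and a privileged operation, and reports failure only if the expected traps do not occur.

# Sketch of a pseudo-user check of the protection hardware.  The "machine"
# here is simulated; on a real system these would be genuine out-of-bounds
# references and privileged instructions issued by a permanent test job.

class ProtectionTrap(Exception):
    pass

class SimulatedMachine:
    def user_reference(self, addr, base=0o4000, limit=0o7777):
        if not (base <= addr <= limit):
            raise ProtectionTrap("memory bounds violation")    # expected behaviour

    def user_privileged_op(self):
        raise ProtectionTrap("privileged instruction in user state")

def pseudo_user_check(machine):
    """Return True only if both protection mechanisms still trap as expected."""
    for attempt in (lambda: machine.user_reference(0o100),     # outside bounds
                    machine.user_privileged_op):
        try:
            attempt()
            return False           # no trap: the protection hardware has failed
        except ProtectionTrap:
            pass                   # trap occurred, as a healthy system requires
    return True

print("protection hardware OK:", pseudo_user_check(SimulatedMachine()))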
7.2 Audit Trails
Audit trails are an important part of a security assurance scheme. While most multi-user systems have provisions for logging important events occurring during the running of a system, these are primarily for the purpose of accounting or, occasionally, for monitoring system performance. While a security audit trail may make use of the same logging program, the purpose of the security audit trail requires different information to be recorded. The security audit trail is not to be confused with the type of information required by conventional auditing practice; for a discussion of those requirements, see Wasserman's article [26]. Rather, it is concerned with recording who was doing what in a multi-user system. Its purpose is to detect patterns of abnormal activity that may be the first clue of attempted penetration of a system, and to provide a sufficient record to determine the extent of a penetration that may be discovered by external means.

7.2.1 What Should Be Logged
The following list is by no means exhaustive, but is presented as an indication of the kinds of data that should be maintained in a security audit trail. Each entry should include, in addition to the item(s) indicated, the identification of the user involved and the date and time when the event took place.

1. Log-on (successful): terminal ID, user ID.
2. Log-on (unsuccessful): the password/user ID supplied, terminal ID.
3. Programs called (includes utilities, compilers, etc.).
4. All files referred to (includes volume serial numbers, file names, etc.).
5. All apparent program "errors" (out-of-bounds references, illegal supervisor calls, array bounds errors, etc.).
For multi-user batch systems that permit essentially unrestricted programming, the job control cards that are used to specify temporary files should also be recorded.
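As an illustration of the record keeping called for in the list above, each security-relevant event might be written as one time-stamped, user-identified entry. The record layout and event names below are invented; any equivalent format would serve.

# Hypothetical audit-trail writer for the events listed above.

import json, datetime

def audit(log, user_id, event, **details):
    """Append one time-stamped, user-identified entry to the audit trail."""
    entry = {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "user": user_id,
        "event": event,
        **details,
    }
    log.append(json.dumps(entry))

trail = []
audit(trail, "SMITH", "LOGON_OK", terminal="TTY7")
audit(trail, "JONES", "LOGON_FAIL", terminal="TTY3", supplied_id="JONES")
audit(trail, "SMITH", "FILE_REFERENCE", file="PAYROLL", volume="VOL123", mode="READ")
audit(trail, "SMITH", "PROGRAM_ERROR", kind="out-of-bounds reference")
print("\n".join(trail))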
7.2.2 What to Report

Clearly, with conscientious attention to security audit trails, more data will be available than can reasonably be assimilated by simple inspection. It is, therefore, necessary to provide a data reduction program to summarize significant security events. As a minimum, all abnormal events (incorrect log-ons, execution time errors, requests for large amounts of file space,
etc.) should be reported for each user. The other data can be maintained in raw form, or summarized onto files to assist the process of determining the extent of a penetration that may be discovered later. The length of time to maintain such data must be a function of the kind of installation, and of whether or not there exist statutory requirements that must also be met.
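A data reduction pass of the kind just described might, as a sketch, count abnormal events per user over such a trail. The choice of which event types count as abnormal is only an example, and the record format assumed is the one used in the sketch of the preceding subsection.

# Sketch of the data-reduction step: summarize abnormal events per user.

import json
from collections import Counter

ABNORMAL = {"LOGON_FAIL", "PROGRAM_ERROR", "LARGE_FILE_REQUEST"}

def summarize(trail):
    """Count abnormal events per user for the daily security report."""
    counts = Counter()
    for line in trail:
        entry = json.loads(line)
        if entry["event"] in ABNORMAL:
            counts[entry["user"]] += 1
    return counts

sample_trail = [
    json.dumps({"user": "JONES", "event": "LOGON_FAIL"}),
    json.dumps({"user": "JONES", "event": "LOGON_FAIL"}),
    json.dumps({"user": "SMITH", "event": "FILE_REFERENCE"}),
    json.dumps({"user": "SMITH", "event": "PROGRAM_ERROR"}),
]
print(summarize(sample_trail))     # Counter({'JONES': 2, 'SMITH': 1})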
7.3 Validation of Programs

There is no known mechanical method of certifying that a program will operate correctly under all conditions of data and in all contexts of its execution. This means that the introduction of new systems programs must proceed cautiously and with human checking. Many times this is satisfactory if done; however, as a practical matter, there are few installations that can (or will) take the time to examine manufacturer-supplied changes to an operating system, compilers, or utilities to see that they do not unintentionally (or otherwise) create a gap in the security built into a system. This is perhaps the biggest barrier to the simple creation of secure multi-user systems. The only recourse an installation has is to invest the effort to understand and analyze the operating system it is using for its potential effect on system security. A related problem, not yet prominent relative to other security concerns, is that of assuring that a program (or, more generally, part of the operating system) has not been changed in an unauthorized fashion. On systems that provide unlimited programming, this may prove to be a particularly sensitive problem, because a user who has determined how (if possible) to bypass the memory protection system could have modified the system after initial start-up in an undetected fashion, achieved a penetration, and returned the system to its original state before terminating his job. In a less exotic way, it is often important to validate the current copy of the operating system as it is being loaded or, more importantly, as it is being reinstated as part of an automatic recovery after a crash. Other parts of a multi-user system that might benefit from periodic validation include the file directory(s), user access lists, and the like. The simplest validation technique is the use of hash totals for various segments of the operating system. These can be taken over any span of code or tables used by the system. The major problem with hash totals is the significant overhead incurred in covering the entire system frequently. To overcome this, it may be sufficient to take hash totals only on those parts of the system (code and tables) that are particularly sensitive to changes. For example, a user status table that contains pointers to system code, and possibly an indication of the user's privileges, could be hashed as part of every system change, and tested
before the data is used. Sensitive code could be sampled and tested every few minutes or so. In order to prevent manipulation of the hash totals themselves, it would be possible to start the hash totals with an initial value (different for each day, supplied by an operator) that would have to be recovered before the hash totals could be made to look correct.
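The hash-total procedure, including the operator-supplied daily starting value, can be sketched as follows; the hashing function, the table contents, and the seed format are invented stand-ins rather than any particular system's checksum.

# Sketch of hash totals over a sensitive operating-system table, seeded with a
# daily value supplied by the operator so that a penetrator cannot simply
# recompute a matching total.

import hashlib

def hash_total(daily_seed, region_bytes):
    """Compute a keyed total over one span of code or tables."""
    return hashlib.sha256(daily_seed + region_bytes).hexdigest()

user_status_table = b"\x00\x17\x02PRIV=NO  PTR=013742"
daily_seed = b"SEED-TUESDAY-0415"                       # entered by the operator

recorded = hash_total(daily_seed, user_status_table)    # taken when the table changes

# ... later, before the table is used again:
if hash_total(daily_seed, user_status_table) != recorded:
    raise RuntimeError("user status table has been altered")
print("user status table verified")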
8. Communications Problems

8.1 Vulnerability to Wiretap

One of the lesser known aspects of the Civil War is that it was the first large-scale conflict directed in the main by message communications from the War Department. This was possible because of the rapid spread of the telegraph network after its invention in 1840. As a consequence, both the North and the South used wiretapping to gain intelligence on the disposition of forces and planned operations. Thus, we see that wiretapping has been with us ever since the invention of modern communications. In connection with multi-user remote access systems, it has not been a major problem, primarily because it is easier and less risky to obtain information by other means, including use of the penetration techniques outlined above. However, considering the relative ease with which it is possible to use various forms of eavesdropping devices, it can be expected to become more of a problem, particularly as the internal (programming) and external security controls are strengthened in the products provided by the manufacturers. In addition to the recovery of information flowing between a terminal and a computer, Carroll and McLelland [2] cite the problems of "between-lines" entry (using a system on an open communication line that has an inactive terminal) and "piggy-back" entry ("capturing" the communications channel after log-on and feeding the legitimate user false messages) as additional hazards of wiretapping.

8.2 Wiretap Countermeasures
Within a facility, one can rely on physical security techniques such as personnel movement controls and guards to deny a potential penetrator the opportunity to tap lines. A more expensive solution is to use armored cable or various forms of cable with anti-intrusion alarms (often cables pressurized and fitted to alarm if the pressure drops, as it would if someone broke into the cable). Other techniques involve placing the cables between the terminals and the computer in such a way as to be continuously visible, and using guards to monitor the cables.
The more common case of terminals located some distance from the computer forces one to rely on cryptographic equipment to protect the terminal-computer communications [1, 19c]. Until recently, all of the commercially available cryptographic equipment was designed for off-line use for message communications or voice scrambling. Currently, there are two products on the market designed expressly for remote terminal to computer communications. They are discussed below.

8.2.1 Datacoder DC-110
This device is located at the terminal site for protecting the transmission and storage of data maintained in files in a remote-access multi-user system [18b]. No provision is made for a matching device at the computer site, with the consequence that the file is handled internally in encrypted form. The data sheets on the device suggest that numeric-only fields in a record be maintained in the clear in order to permit computation on them, while alphanumeric data be encrypted. The device recognizes certain character combinations as control characters that switch the device between plain and encrypting (decrypting) operation. As it is constituted, the device is primarily useful only for securing noncomputational (text-only) files that are merely stored by the computer for later retrieval (on some indicator that is not enciphered). Although it is possible to leave numeric data in the record unenciphered for computational purposes, there is not much security in enciphering just the alphanumeric data. The examples of use in the description of the device show a payroll file with the names of the employees encrypted, but with their social security numbers in the clear (along with their pay rates, etc.).

8.2.2 Data Sequestor JJC-3
This device can be used to provide encrypted communication between the terminal and the computer, as well as in a manner similar to the DC-110 [18c]. In the latter use, it is supported by programs that replace the JJC-3s at the computer end and simulate the data scrambler logic for each user. Each user's key is stored in a secure internal file, enabling the system to handle each encrypted line in a different key. This device appears to have a better concept of use with computers than the DC-110, particularly the provision of a program to interface to many different encrypted lines, although with the program and the file of user keys being maintained in the system, the protection afforded an encrypted file at the system is limited to theft of media. In contrast, the DC-110 can be used by any user without pre-arrangement with the system,
and protects the information from all kinds of intrusions as well, since the key is maintained only at the terminal end of the link. Thus the JJC-3 is oriented principally to protecting the transmission and storage of information from outside parties, while the DC-110 is oriented to protecting the information from the operators of the system as well.

9. Summary
This article has presented the problems and issues of information protection in multi-user systems. The emphasis has been on the threats inherent in programming and using a system, rather than on the various forms of internal and external penetration of the "smash-and-grab" variety, or the subversion of employees having legitimate access to a system. Not discussed at any length were the topics of administrative controls, physical security techniques, personnel screening, and many other aspects of security that are required in greater or lesser degree to devise a secure system. One is often confronted with the question: Is it possible to build a completely secure system? The answer must be unqualifiedly negative. However, it is possible to build an adequately secure system for a particular operational environment. As we noted above, there is no single set of measures that can be taken to obtain "instant security." Information security is a problem of providing sufficient barriers and controls to force a prospective penetrator into attacks that carry a high risk of detection and/or have a very large work factor. The work factor, if high enough, will be sufficient to deter all but the most dedicated penetrator, and for him it may make any eventual success a hollow victory.

REFERENCES

1. Baran, P., On distributed communications: IX. Security, secrecy and tamper-free communications. Memo. RM-3765-PR. Rand Corp., Santa Monica, California, August, 1964.
2. Carroll, J. M., and McLelland, P. M., Fast "infinite-key" privacy transformation for resource-sharing systems. Proc. AFIPS Fall Jt. Computer Conf., Houston, Texas, 1970, pp. 223-230.
3. Chu, A. L. C., Computer security, the corporate Achilles heel. Business Automation 18(3), 32-38 (1971).
4. Comber, E. V., Management of confidential information. Proc. AFIPS Fall Jt. Computer Conf., Las Vegas, Nevada, 1969, pp. 135-143 (1969).
5. Denning, P. J., Virtual memory. Computing Surveys 2(3), 153-189 (1970).
6. Dijkstra, E. W., The structure of the "THE" multiprogramming system. Commun. ACM 11(5), 341-346 (1968).
7. Friedman, T. D., The authorization problem in shared files. IBM Syst. J. 9(4), 258-280 (1970).
8. Graham, R. M., Protection in an information processing utility. Commun. ACM 11(5), 365-369 (1968).
9. Harrison, A., The problem of privacy in the computer age: An annotated bibliography. Memo. RM-5495-PR/RC. Rand Corp., Santa Monica, California, December, 1967.
10. Hoffman, L. J., Computers and privacy: A survey. Computing Surveys 1(3), 143-155 (1969).
11. Kahn, D., The Codebreakers. Macmillan, New York, 1967.
12. Lampson, B. W., Dynamic protection structures. Proc. AFIPS Fall Jt. Computer Conf., Las Vegas, Nevada, 1969, pp. 27-38 (1969).
13. McKeeman, W. M., Data protection by self-aware computing systems. CEP Rep. Vol. 2, No. 6. Computer Evolution Project, Appl. Sci., Univ. of California, Santa Cruz, California, June, 1970.
14. Mintz, H. K., Safeguarding computer information. Software Age 4(5), 23-25 (1970).
15. Molho, L. M., Hardware aspects of secure computing. Proc. AFIPS Spring Jt. Computer Conf., Atlantic City, New Jersey, 1970, 36, 135-141 (1970).
16. NFPA Pamphlet 75. Nat. Fire Prev. Ass., 60 Batterymarch Street, Boston, Massachusetts 02110.
17. Peters, B., Security considerations in a multiprogrammed computer system. Proc. AFIPS Spring Jt. Computer Conf., Atlantic City, New Jersey, 1967, 30, 283-286 (1967).
18a. Petersen, H. E., and Turn, R., Systems implications of information privacy. Proc. AFIPS Spring Jt. Computer Conf., Atlantic City, New Jersey, 1967, 30, 291-300 (1967).
18b. Product Description Sheet, DATACODER, Model DC-110. DATOTEK, Inc., 8220 Westchester, Dallas, Texas 75225.
18c. Product Description Sheet, DATA SEQUESTOR, Model JJC-3. Ground Data Corp., 4014 N.E. 5th Terrace, Fort Lauderdale, Florida 33308.
19a. Security in Communications, Excerpts from 15th annual seminar. Industrial Security 14(2), 38-43 (1970).
19b. Security and Privacy Considerations in Criminal History Information Systems, Project SEARCH Tech. Rep. No. 2. California Crime Technol. Res. Found., Sacramento, California, July, 1970.
19c. Skatrud, R. O., A consideration of the application of cryptographic techniques to data processing. Proc. AFIPS Fall Jt. Computer Conf., 1970, 36, 111-117 (1970).
20. Taylor, R. L., and Feingold, R. S., Computer data protection. Industrial Security 14(4), 20-29 (1970).
21. Vanderbilt, D. H., Controlled information sharing in a computer utility. MAC TR-67. Project MAC, Mass. Inst. of Technol., Cambridge, Massachusetts, October, 1969.
22. VanTassel, D., Advanced cryptographic techniques for computers. Commun. ACM 12(12), 664-665 (1969).
23. VanTassel, D., Cryptographic techniques for computers. Proc. AFIPS Spring Jt. Computer Conf., Boston, Massachusetts, 1969, 34, 367-372 (1969).
24. Ware, W. H., Security and privacy: Similarities and differences. Proc. AFIPS Spring Jt. Computer Conf., Atlantic City, New Jersey, 1967, 30, 287-290 (1967).
25. Ware, W. H., Security and privacy in computer systems. Proc. AFIPS Spring Jt. Computer Conf., Atlantic City, New Jersey, 1967, 30, 279-282 (1967).
26. Wasserman, J. J., Plugging the leaks in computer security. Harvard Business Review, Sept./Oct., pp. 119-129 (1969).
27. Weissman, C., Security controls in the ADEPT-50 time-sharing system. Proc. AFIPS Fall Jt. Computer Conf., Las Vegas, Nevada, 1969, pp. 119-133 (1969).
Managers, Deterministic Models, and Computers
G. M. FERRERO diROCCAFERRERA
School of Management, Syracuse University, Syracuse, New York
1. Introduction  37
   1.1 Premises  37
   1.2 Generalities on Management Science  38
   1.3 Manager's Duties  38
   1.4 The Resources Utilized in a Managerial Decision  39
2. The System Approach  40
   2.1 Characteristics  40
   2.2 Components of a System  41
   2.3 Performance of an Operative System  42
3. Management Systems  43
   3.1 Premises  43
   3.2 Information Systems  44
   3.3 Data Needed by Managers  45
4. Management Science  46
   4.1 Development of a Deterministic Model  46
   4.2 Model Building  48
   4.3 Computer Utilization of Models  50
5. When and How Managers Have to Implement Management Science Models  50
   5.1 Problem Solution Interpretation  50
   5.2 Types of Problems Solved  53
   5.3 Considerations on Priorities in Solving Problems  59
   5.4 Utilization of Computers for Solving Managerial Problems  62
6. Will Computers Eliminate Managerial Decision Making?  63
   6.1 The Present Expansion of Computer Utilization  63
   6.2 The Future Utilization of Computers  67
   6.3 The Interference of Computers in Decision Making  68
References  71
1. Introduction

1.1 Premises
Managers, as is well known, are decision makers responsible for the achievement of some established enterprise's objectives. They have to
allocate the available limited resources for conducting their business in the best possible way. In order to reach the firm's objectives in a financially profitable manner, managers have to perform certain actions which lead to the attainment of the desired goals [6]. Modern times continuously impose the updating of the methods used to reach the firm's goals. These methods have been defined, systematized, and disseminated for the purpose of helping executives in their daily decision-making responsibility. Scholars and managers recognize that there are some basic patterns of environmental reaction, and that knowledge of these patterns conditions managerial actions. By examining and discussing these points, scholars and managers have defined a series of "principles" and "techniques" which, when properly applied, can substantially help executives in the duty of optimizing their actions.

1.2 Generalities on Management Science
The discipline which provides managers with the needed guidance to fulfill their purposes is known as "Management Science." The exact birth date of this methodology cannot be established with certainty, since it is the result of an evolution of thoughts, analysis of facts, comparison of results in various circumstances, synthesis of findings, utilization of feedback knowledge, and deductive reasoning. Analysts can trace its beginning back to the Industrial Revolution. In reality, even if there is logical justification for relating the beginning of the definition of managerial rules to the Industrial Revolution, scholars prefer to indicate that the development of "principles" started at the beginning of the twentieth century [ l y ]. There is no doubt that Frederick W. Taylor (1856-1915) and Henry Fayol (1841-1925) are considered the two major initiators of and contributors to modern Management Science. Both Taylor and Fayol discussed in their published works many of the fundamental rules and appropriate methods used to increase the effectiveness of the jobs performed by workers (Taylor), as well as by executives (Fayol). The basic concept of separating the "doers" from the "thinkers" emerged at that time. Thinkers, or managers, are those who have the responsibility of instructing others (i.e., subordinates) to perform actions in accordance with sets of well-defined schemes of reference [6].

1.3 Manager's Duties
Managers, when pursuing the task of reaching the enterprise's objectives, have certain responsibilities. They must (1) conceive plans to be implemented for the achievement of the enterprise's goals; (2) organize working structures capable of providing the needed means to materialize the plans;
(3) staff the system by allocating the right men to a given job; (4) schedule all working personnel, enabling them to complete the assigned tasks within a fixed time; and (5) control all the interrelated actions to be performed [2]. The three basic functions considered an integral part of management responsibilities are (1) problem analysis, which helps the executive to plan what must be done with the operative system; (2) communication with others, which is necessary for detecting and understanding the problems to be solved, as well as for the assignment of jobs and tasks to be performed under the designed conditions; and (3) decision-making activity, which encompasses the process made by managers in defining the course of action which they and their subordinates must take to fulfill the enterprise's goals. Managers, in order to compete in today's highly antagonistic market, have to make decisions quickly and accurately. In order to succeed under these trying conditions, they must be well informed about the real situations upon which decisions depend [15]. They must have a sufficient amount of data to attain a complete understanding of the facts which have determined the problems and the need for their solutions. Digital computers are playing a conspicuous part in the modern trend of greatly reducing the time which elapses between the detection of a problem and the implementation of its solution. (A discussion of this point will be provided later.)

1.4 The Resources Utilized in a Managerial Decision
Managers have to decide how to allocate limited resources in an organized fashion to reach the predefined objectives. This decision depends on the type of problem to be solved [28]. It is recognized by scholars and managers that when the solution to a problem has to be found, there are eight essential resources to be considered. They are referred to as "The 8 M's," since the key words all start with an M. They are as follows.

(1) Men. Managers must appoint or relocate labor forces and executives within the system configuration in accordance with the long-range plans, or with the specific situation under consideration.

(2) Material. Raw and auxiliary materials employed in the production of goods or services have to be allocated as the necessity occurs, i.e., for the normal activity as well as for finding solutions to particular problems.

(3) Machinery. Every enterprise utilizes some sort of mechanical, electric, or electronic device. This utilization may change as the internal or external situation alters. Managers may choose to use more automated machines and less manpower, or vice versa.
(4) Money. Decisions on how to spend money depend on the company's budgetary restrictions. Money is typically a limited resource which determines the selection and utilization of the other seven M's.

(5) Methods. Managers have to decide how to implement plans. The choice of which technique to use, what degree of automation has to be used in the production of goods (or services), and how to implement controls depends on other factors such as availability of money, personnel skill, and the accomplishment schedule.

(6) Market. This element is the demand for the product (or service) manufactured by the enterprise. Markets can be influenced by advertising campaigns or by an appropriate price policy.

(7) Moments. Managers have to decide when products must be available to the market, how long a production cycle must be, how the distribution of the finished products has to be performed in terms of schedule, and so forth. Moreover, the collection of pertinent data must be promptly timed.

(8) Messages. Information has to be available at the right time and in its best possible form. An appropriate and reliable network of communication channels must be established for the purpose of instructing managers about problems, the course of action needed, and the outcome of their decisions.
Scholars and managers agree that, regardless of the nature of the managerial problems to be solved, these eight resources are always present, although there are cases in which one (or more) assumes the leading role.

2. The System Approach

2.1 Characteristics
The system approach is one of the fundamental methods for the interpretation of facts detected in business. This approach requires that any responsible manager, when solving an enterprise's problems, be concerned not only with the restricted area where the problem may have been detected, but with the entire firm as a unique entity. Managers must consider the direct and indirect effects their decisions may have on the performance of all the activities as planned, organized, directed, and controlled [6]. The notion of optimization and suboptimization of the objectives and subobjectives must be present in any decision-making process [IW]. To fulfill this requirement, management science, and in particular operations research techniques, considers the enterprise as a unitary operative system, even if it can be conceptually divided into subsystems. The idea behind the formulation of quantitatively expressed models is that
solutions to problems depend not only on the restricted area where the problem took place, but on all the adjacent regions of activity as well, up to and including the entire organization and even its external sphere of influence [IS]. This constitutes the so-called "total system approach." System analysis seeks to expand the investigation of facts and data in order to insure a better and more complete understanding of the domain in which problems are detected and solved. By applying the system orientation, the manager is forced to examine the various relationships existing among the variables under scrutiny [23]. In so doing, the behavior of each variable may indicate its interdependence with remote (or not apparent) causes [25].

2.2 Components of a System
It is generally accepted by scholars, analysts, and managers that the following five parts are always traceable in any physical system [16]:

(1) Input. Some, and often all, of the eight significant resources (i.e., the 8 M's) constitute what can be considered the needed raw material which will be utilized in implementing what managers have planned and organized.

(2) Output. Every operative system has to provide an output, outcome, or payoff as a result of its activity.

(3) Processor. The core of a system is the processor, which forms the operative nucleus. In a business system, for example, the management, the production and assembly facilities, the distribution department, the personnel, machinery, methods, and all types of controls can be regarded as the processor. In a computing center, the memory or storage, the arithmetic and logic unit, the control unit, the operators, and the computer routines and programs form part of the complex processor of the system. In computer science this term may refer to the hardware (the central processor) or the software (the compilers, assemblers, or the language processor) [11].

(4) Control. Any configuration of significant operative resources must be placed under control to check whether or not the aim of reaching the well-defined objectives is being achieved. The means assigned to perform this control are an integral part of the system.

(5) Feedback. In conjunction with the cybernetic system concept, the objective of a control is to determine when and how a system needs an adjustment in case some of the utilized limited resources are not harmoniously working together. The consequent action performed when the control detects the need for a modification in the organization of the system is called "feedback." This action can be executed by means of an automatic device or by the intervention of a knowledgeable person or group of persons [S].
FIG. 1. Transitions in the levels of stability of a system. [Schematic: the degree of stability of the system efficiency plotted over time, with an upper limit UL and a lower limit LL bounding the satisfactory zone, and positive and negative transitions between steady-state levels.]
2.3 Performance of an Operative System
There are many ways to classify systems, but one frequently considered in management science discussions is whether the system is in a “steady state” or stable condition, meaning that the complex is operating without the detection of difficulties. The fact that all the external and internal activities of the system are balanced (i.e., in equilibrium) does not necessarily mean that the system performance is satisfactory. Figure 1 shows two levels of steady states, one higher and one lower. The second situation does not provide the complete fulfillment of the firm’s objectives because it is below the lower limit (LL) of the accepted bound representing the satisfactory zone. It could be the case that the system is slowly deteriorating, passing from one degree of stability to a lower one without clearly showing the transformation. The reliability of the information obtained through the
network system is, in this instance, very important. Managers must have prompt and complete knowledge of the passage from one system performance level to another. This variation is called the "transition period." The change may require a long time (e.g., the decay occurring in a physical system), or a few seconds (e.g., the change resulting from a dynamite explosion or a nuclear reaction). Moreover, the transition period is said to be "negative" when the system passes from a higher level of stability to a lower one, and "positive" in the opposite case. The positive transition must be planned and actuated if the system is to resume the satisfactory efficiency level which, customarily, is defined within two limits (i.e., UL and LL of Fig. 1). This positive transition phase is characterized by the emission of orders stemming from the decisions made in view of the modifications to be performed to the system structure in order to solve the problem. The responsible executive must study the conditions of the enterprise through the operative information system, which provides the appropriate data for evaluation. Reliability, completeness, timeliness, and clearness of the received and utilized data are paramount characteristics of the management information system [6]. Managers are eager to minimize the negative transition period and also any steady-state period during which the system is below the satisfactory level. In addition, the time involved in the positive transition must be minimized. In order to achieve these goals, managers have to rely on the soundness of the information which indicates that the system is sliding down in efficiency. Corrective actions can be taken immediately, as soon as a deflection in performance is detected. The satisfactory steady-state period must, consequently, be maximized.

3. Management Systems
3.1 Premises
Managerial decisions are made through the existing channels of communication and are implemented by orders given to the appropriate subordinates. Control of the executed orders, as well as detection of the feedback reactions, is established by using the firm's information network [,%?,$]. Managers learn about the existence of problems by means of, e.g., a verbal message, a written paper, a formal report, a printout from a tabulator, or a hard copy obtained from a computer. When an executive is "informed," he can take the action he thinks appropriate for modifying, if necessary, the flow of activities of one or more of the available factors of production or limited resources (i.e., the 8 M's).
3.2 Information Systems

Information systems are set up in conjunction with operative networks in order to support managers in their decision-making tasks. Knowledge of facts is the blood of an organization, and the channels of communication are its arteries and veins. Information is generated at a source point and used at a receptive point. In between there are various means utilized to move the message from one person to another, e.g., oral statements, writings, diagrams, pictures, flow charts, and codes of any form [.%‘I]. One of the basic concepts upon which information theory is based is that a message, in order to achieve the goals for which it has been generated, must be understood by the recipient. Hence, common knowledge of the codes (usually written) or languages used by the two persons in contact must exist prior to the transmission of the information. The higher the degree of understanding between them, the better the dispatch is interpreted. The message can assume a very concise configuration, and it can also be stated by the use of special terms or by particular construction of phrases, as in computer languages [9]. Managers must know that every piece of information is influenced by three factors which reduce its quality:

(1) The pureness of any conveyed message is affected by a component of "noise," an extraneous addition to, or subtraction from, the pure information. This noise becomes an integral part of the dispatch.

(2) The completeness of a message could always be improved by using a better means, or combination of means, of transmission. Hence the communication is, at all times and in principle, incomplete; i.e., it carries a gap of intelligence.

(3) The message could be influenced by personal interpretation or motivation, or by external forces which modify the meaning of the information from its original intent.
These three disturbances are schematically represented in Fig. 2. It is a specific task of the manager to reduce to a minimum these three inconveniences. The noise can be lowered by selecting more appropriate means of communication. The diminished intensity of noise in a network may justify the cost involved in improving or changing the media used. The completeness of a message can be improved by training the involved persons on the typical nature of the news which will be exchanged among them. External forces can be reduced by instructing the dispatcher to follow (when possible) a well-defined scheme of reference. The use of pre-established forms, in which cells must be filled and explicit questions answered, is a good way to avoid personal interpretation and biased
FIG. 2. Factors influencing the quality of a transmitted message: the ideal message; the message carrying "noise"; the incompleteness of the message (gap of intelligence); and the message suffering external influences (interpretation, motivation, ...), whence the information may be biased or subjective.
description of facts and data. The quantification and codification of messages are the usual ways selected to reduce individual modifications in transmitting a report [20]. Managers, knowing that information can be distorted by these three influencing factors, must plan, organize, and control the network system to insure that sound and error-free messages are received at destination.

3.3 Data Needed by Managers
Managers, being responsible decision makers, must rely on the accuracy, timeliness, and completeness of all data and information received with reference to problems (detected during the negative transition period) and to solutions (during the positive transition period). Managers must consider the presence of the three disturbances cited above when analyzing original data. The highest possible degree of pureness and understanding of the information received is a must for achieving the best fulfillment of the firm's objectives. Managers will transmit orders for implementing the solution by modifying the allocation and content of the eight typical limited available resources. In a society which has a very high standard of living, as does ours, managers have to respond properly to the market demand, which requires that commodities be obtained at an increasingly fast rate. Their tasks are no longer so simple as they were a half century ago. Time runs fast, as do the changes in the behavior of society. In order to properly direct an enterprise, and to solve its pressing competitive problems of long and short range, managers need to use new and more responsive decision-making techniques [10].
Sound data must be known promptly, decisions implemented on time, and feedback reactions immediately sensed. For these reasons scientists and managers have defined a series of principles described under the label "management science techniques." These rules are capable of helping the responsible executive in the accomplishment of his difficult task as decision maker.

4. Management Science
4.1 Development of a Deterministic Model
The development of a deterministic model can be better described by means of an example. A manager, having access to all the data needed for understanding an industrial problem, selects the experimental way of solving his query. He has the requisite time and money to find the most satisfactory solution to his problem, which is, for instance, an inventory situation to improve. He wants to define the size of stocks, the minimum cost for maintenance, turnover, space, and facilities, with reference to a given period of time of the plant activity. The best solution is found by utilizing the scientific method (i.e., by experimentation). Later, the same manager has a very similar problem in another plant of his company. He repeats the experimental steps for this new case, which can be summarized as follows. He (1) defines the nature of the problem; (2) establishes the variables involved; (3) collects pertinent data; (4) formulates some hypothetical solutions; (5) tests them by experimentation; (6) finds the solution which provided the most adequate outcome; (7) detects the corresponding payoff; and (8) implements the found solution. Again, the problem is satisfactorily solved. Later the same inventory situation arises in another company's warehouse. The manager repeats once more the above listed steps of the scientific process, finding the optimal solution. He has acquired useful experience in handling all these cases. At this point the manager recognizes that "to similar problems correspond similar optimal solutions." Being quite familiar with this type of inventory query and its solution, the manager tries to describe both problem and solution through a "pattern." The best way of portraying them is by using algebraic symbols for the variables and by indicating the existing relationships among the behavior of these variables by mathematical expressions and equations. Undoubtedly, this is a difficult task to pursue, but the direct relationship between that problem and its best solution is detectable, as well as the sensitivity of the solution to the model specification and the
reliability of the input data. By mathematically comparing the various cases of application, the manager is able to face the next case of inventory situation with a different set of means. The manager describes the poser in terms of values to be entered in the “mathematical pattern” or “model.” The expression carrying all the data referring to the new state of the inventory allows the executive to directly obtain the solution without spending the time and money he had previously invested when the experimental method was applied. Figure 3 schematically shows this concept. In order to solve a problem mathematically the manager can apply different versions of the structured model, the selection of which one to use being solely dependent on the complexity of the query. For a simple case with a few variables which are not complexly interrelated, the solution
FIG. 3. Scheme of the generation of a deterministic model. (A first problem is solved by the experimental approach, yielding a solution; a second and a third similar problem are solved as above, yielding similar solutions; from these a pattern emerges and is expressed as a mathematically described model. The nth similar problem is then handled by defining its variables and parameters as in the model and solving the mathematical expression, e.g., by computer, to obtain the optimal solution in mathematical notation, followed by managerial interpretation and decision.)
could be found by making the computations by hand, i.e., using pencil and paper, a slide rule, or a desk calculator. For more elaborate formulations this approach may take too long. Managers facing problems need to have almost immediate access to an "optimal solution," because the reduction of the time involved in the problem-solving process is mandatory [14]. Now digital computers and mathematical models are providing much of the help that a manager may require for his decisions. The use of deterministic (and/or stochastic) models gives managers a way of obtaining an optimal and objective solution to problems. The resulting outcomes of a model obtained through a computer run are not to be considered substitutions for the managerial decisions, but only objective suggestions [1]. The responsible executive, customarily, must subjectively modify the solution in order to define the most appropriate final course of action.

4.2 Model Building
In accordance with what has been pointed out above, it can be said that many of the deterministic models used by managers, and collected in management science and operations research texts, have been devised by considering the results obtained by applying the scientific (i.e., experimental) method [27]. Other models have been designed by pure mathematical and statistical approaches, and others by performing appropriate simulations. There are times, however, when managers need to have specific models to solve particular problems. In this event it is necessary to "build a model" for subsequent and iterated usage. Model building is indeed one of the fundamentals of management science and operations research techniques [A$]. There is no precisely defined way of constructing a mathematical or normative model, even if the experimental method is the one most often applied. The type of modeling depends on: (1) the nature of the problem, (2) the objectives to be reached, (3) the imposed firm's policies, strategies, and tactics, and (4) the degree of accuracy desired. Usually the various expected outcomes can only be qualitatively expressed. They are ranked, and a score is assigned to the variegated results. A quasiquantitative model can be defined by subjectively assigning values of merit to the payoffs. When the outcomes are quantitatively measurable, but their relationships are not expressible in mathematical terms, the model is again termed "quasiquantitative." Operations research methodology cannot utilize qualitative models. All variable values and their relationships must be quantified. This characteristic allows the utilization of management science models by digital computers. Even if there is no established series of rules to follow for building deterministic models, here are some suggestions to consider:
(1) Recognition of the nature of the problem.

(2) Description of pertinent facts (diagnosis) and definition of the span of time related to the problem and its solution.

(3) Definition and analysis of the elements involved, namely, (a) the environmental conditions, (b) the external forces influencing the events, (c) the variables, which must belong to one of these two categories: controllable (or decision) variables or uncontrollable variables (e.g., governed by chance), (d) the relationships among variables, (e) the restrictions (or constraints) within which the variables behave, and (f) the goal that must be reached.

(4) Determination of the characteristics of the original data (i.e., historical, technological, or estimated), and the degree of accuracy desired in the collection, classification, and manipulation of these input data.

(5) Definition by mathematical expressions of the relationships among variables, as detected at Step 3 above, in accordance with the selected variables and their behavior (which could also be probabilistic).

(6) Definition of the objective function that has to be optimized by maximization or minimization. This expression carries the cost or profit of the real variables entered in the set of equations and/or inequalities, and describes the various restrictions (or constraints) implicated in the problem.

(7) Experimentation and testing for improving the model by using, if possible, special sets of controllable input and known output.

(8) Determination of the sensitivity of the solution. This is another form of examination, made by checking to see whether the defined model specifications are providing a sound and reliable solution. Before validating the model it is necessary to test whether the expected outcomes (in quality and quantity) are supplied by the built algorithm. If not, the complete structure and procedure used to design the formulations must be reformulated. When the sensitivity of the solution reaches a satisfactory value, the normative model can be fully utilized.

(9) Implementation of the model, which in effect is the utilization of the complete original set of data. During this phase further corrections and improvements (over and above what has been done in Step 7) may have to be visualized and made.

(10) Maintenance and updating of the model. Obviously, the deterministic model as such is not destroyed after its first utilization. It is kept and, if necessary, modified and improved for future usage [7].
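As a concrete illustration of Steps 5 and 6, the minimal sketch below casts a hypothetical two-product product-mix problem as a linear program and solves it with the linprog routine of the SciPy library. The profit figures, the hour coefficients, and the capacity limits are invented for illustration only; they stand in for the original data a real model would receive.

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical product-mix model: maximize 40*x1 + 30*x2 (profit) subject to
    # machine-hour and labour-hour limits.  linprog minimizes, so profit is negated.
    profit = np.array([40.0, 30.0])
    hours = np.array([[2.0, 1.0],       # machine-hours per unit of each product
                      [1.0, 3.0]])      # labour-hours per unit of each product
    limits = np.array([100.0, 90.0])    # hours available in the planning period

    result = linprog(c=-profit, A_ub=hours, b_ub=limits, bounds=[(0, None), (0, None)])
    x1, x2 = result.x
    print(f"make {x1:.1f} units of product 1 and {x2:.1f} units of product 2")
    print(f"maximum profit: {-result.fun:.2f}")

The objective function of Step 6 appears here as the (negated) profit vector, and the restrictions of Step 3(e) appear as the constraint rows; changing the input data changes the problem instance without touching the model itself.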
4.3 Computer Utilization of Models
A deterministic model customarily assumes the form of a computer program. All mathematical formulas, restrictions (constraints), variables, parameters, and their relationships are translated into computer languages. The variables, parameters, and the cost (or profit) values pertaining to a given problem are left outside the program, since they constitute the input data, recognizable by the processor by virtue of "names" or "codes" assigned to each of them. In this way the program can be utilized for any set of original data, provided that the problem to be solved is of the same type, size, and structure as the model. If not, the computer routine (and consequently the model itself) can be modified to provide the optimal solution under the newly detected conditions. When the solution is supplied by the application of a normative model (e.g., by a computer run), the manager has a mathematical expression to interpret. He must understand the meaning of the resulting formulas and values in correspondence to the "nature" of each variable. He must extract from the computer printout the elements which constitute the decision variables, and define their optimal values for insuring the maximization (or minimization) of the objective function (i.e., of the goal to be reached). When this extraction is completed and the manager knows what the deterministic model objectively suggests, he must decide whether to implement the obtained solution or to modify it. Quite often the manager, even if he has a high degree of belief in mathematical models, modifies the decision variables so acquired for the purpose of complying with the enterprise's policies, or for other reasons. This is because, once again, operations research models are tools, not substitutes for managerial decision making.

5. When and How Managers Have to Implement Management Science Models
5.1 Problem-Solution Interpretation
Managers, being rational thinkers by definition and conscious of their responsibilities as decision makers, are continuously exposed to the duty of properly solving problems. What kinds of problems actually have to be solved by managers? Generally speaking, when an operative system reveals some deflection from what is considered a "normal" flow of activities, the trouble is likely due to a variation in the performance of one (or more) of the working elements (the 8 M's). As systems analysis indicates, the factors of production (i.e., the eight limited resources) have to work as planned in order to provide the expected satisfactory output [20].
Managers become aware of a changed status of the operative elements when the system is in the negative transition period (as previously discussed and depicted in Fig. 1). Essentially, the most common ways by which this knowledge is acquired are:

(1) Managers may be directly notified by colleagues, subordinates, or outsiders when "something goes wrong." This type of message may refer to the case when an anomaly appears and evidence for the reasons of its happening is detectable. The perturbation can have been generated within the enterprise or outside it. An immediate control has to be performed to certify the soundness of the information, consideration being given to the above mentioned three disturbances (summarized in Fig. 2).

(2) Managers could learn about troubles via the established channels of communication, for example, by reading a written report. This report could be issued at determinate scheduled periods, such as periodical printouts of financial, production, and distribution statements. Obviously, these printed accounts can also be issued any time the need emerges. By examining the data carried by these reports, the responsible executive may recognize the necessity for implementing a corrective action to the system.

(3) Managers can be formally informed by groups of people to whom the task of controlling the performance of specific activities has been assigned. This is the case when the enterprise has established a specialized service to handle these problems, e.g., a value analysis team, a method analysis department, a comptroller's office, or, generically, a "problem-finding" group of specialists who are sensitive to undesirable variations in performance within the firm. These people, holding staff positions, can investigate, examine, analyze, compare, and evaluate output at any level of the organization. Their investigating work can be scheduled, they can search at random, or they can follow an ad hoc sequence.
When managers learn of the detection of variations in the operative elements, they must determine whether a problem really exists, rather than merely a personal interpretation of facts or a mistake in detecting a variance in performance. Sometimes what is defined to be a problem by one person is not so defined by another; hence, managers have to recognize the real existence of posers in light of the firm's objectives. The knowledge of a problem may not imply the immediate determination of a solution. Sometimes the source of the mishap has to be found in related activities connected with the elements which suffered the anomaly. It is well known that factors which have a slight or an indirect influence on the fulfillment of the enterprise's plans are difficult to detect, while something which harms the accomplishment of the firm's major objectives is rather evident and, consequently, quickly detected. Managers must be
able to recognize which are the operative functions sensitive to problems, and which are beyond the possibility of being touched (directly or indirectly) by variations in the behavior of the system elements [22]. The size of the enterprise may determine the importance of a problem. When a solution is found, for example, by using an operations research deterministic model and by applying management science principles, the outcomes tend to be in direct proportion to the size of the enterprise, while the cost of utilizing a computerized model could be (more or less) the same for a firm of any size [28]. For instance, a change in production methods may provide a reduction in the unit cost of the manufactured items. If the plant produces a large amount of that commodity, the "return" from the application of the solution is relatively greater than for a small firm manufacturing a limited amount of such items. But the cost of utilizing a model which, for example, minimizes the cost of production does not follow the same pattern. Hence, the cost of solving a problem appears to be fixed no matter who uses the model [22]. From the economic point of view, managers may believe that finding a solution to a problem and implementing it may cost more than accepting the trouble as it is. If the output obtained from the system (even if it is not in perfect operative equilibrium) is acceptable, the decision might be to "wait, and do not do anything now." In acting in this way, the responsible executive is gambling. In fact, he prefers to wager on two possibilities. First, he believes that the problem will disappear and the system will adjust itself along the line. In this case he saves the monetary cost of finding a solution and of implementing it. Second, he prefers to pay more later if the problem degenerates into a worse situation. If the executive wins, it is because the first case took place. Gambling on decisions is, by itself, not a good policy, but sometimes managers decide to defer the courses of action to be taken because they prefer to "watch what happens next." They delay the decision even if a probability of being correct in "waiting and seeing" can be assigned, as it is in a wager evaluation [8]. When a solution has been properly found, a series of investigations, researches, collections of data, analyses, and even experimentations have been made. It could be the case that while searching for a solution the problem itself disappears. For example, observing and interviewing clerks to discover the reasons for their low performance might, by virtue of the inquiry, cause the personnel to improve their work efficiency, and the problem may vanish. It could also be the case that by collecting and analyzing data about a supposed problem, it could be demonstrated that the trouble did not exist, for example, in the case of misinterpretation of financial figures supplied in special reports. Sometimes it happens that in searching for a solution through the investigation and collection of data,
other related problems are detected. These problems then influence the way of solving the initial one, and their solution may even cause the first to disappear [1]. It is evident that understanding a problem is imperative before beginning the search for its solution. Managers must consider all the related disturbances and influences provoked by the problem and even by the solution. Any managerial decision has to keep in view the firm's objectives, stated policies, and expected (evaluated) or estimated outcomes. It is quite advisable, when time is available between the detection of a problem and the implementation of the optimal (or most satisfactory) solution, to perform a feasibility study of the situation.

5.2 Types of Problems Solved
Managers are well aware that business and industrial problems can be solved by the use of deterministic (and/or stochastic) models. Operations research techniques provide the means to reach an objective solution when the query is quantitatively described by mathematical (and/or statistical) expressions. It is interesting to present briefly those problems which can be objectively solved by the deterministic approach. Management science gives a list of typical cases by placing the various models in groups having common characteristics. Customarily, nine formal categories are distinguished [1]:

1. Inventory
2. Transportation (allocation)
3. Sequencing
4. Routing
5. Replacement (maintenance)
6. Search
7. Queuing
8. Competition (game)
9. Mixed problems

(The sequence of these nine types of models does not indicate rank of importance or any other significant priority.)
An accurate survey of the use of these models in business and industry has not yet been made, but it seems that the utilization of deterministic models exhibits a clear trend to increase rapidly in the years to come. The advent of the computer real-time era has made a major contribution to the expansion of the utilization of mathematical models. Computer manufacturers have organized associations among computer users for the purpose of disseminating knowledge of existing programs, stemming from operations research models, to solve specific problems.
For example, IBM founded in 1955 a society called SHARE (Society to Help to Avoid Redundant Efforts) for the purpose of providing to its members computer programs (i.e., models) designed to solve business and industrial problems. In 1965 GUIDE was formed for IBM 705, 1401, 7000 series, and System 360 or higher computer users. In 1962 COMMON was established for the purpose of allowing the barter of programming knowledge among users of smaller computers, such as the IBM 1130, 1620, 1800, or smaller System 360.

The nine categories indicated above represent the types of models most utilized by managers. These models can be briefly illustrated by indicating their relevant characteristics. In each one of these classes a considerable series of related programs is available. Models are diversified, within each group, by the number of variables they can handle, their particular behavior, their operational boundaries, and the relationships existing among them. In each category problems are similar, while their representation may require an appropriate set of variables defined with reference to the available limited resources related to the problem. In particular:

(1) Inventory. In any inventory problem two typical costs are encountered: one which increases with the growth of the stock size, and one which decreases as it expands. In the first type (of increasing cost), the carrying expenses, which may include storage, maintenance, insurance, obsolescence, spoilage, turnover, and tax costs, can be considered. Among the second (decreasing) type, the many costs that can be analyzed include the following: (1) setup and takedown costs, which are met each time that a new production cycle is implemented for manufacturing, assembling, and closing out an order (the larger the batch process, the smaller is the quota to be assigned to each item for these expenses); (2) direct production cost, which may also include the purchasing price of the raw materials used; (3) shortage (or outage) cost, which is calculated when the demand for the manufactured items diminishes or is deferred in time; (4) manpower stabilization cost (in order to minimize inventories, it could be necessary to follow the demand closely with production, which may require the hiring and instructing of new personnel). In reality, not all these costs are simultaneously encountered. There are inventory problems which consider only a few of the possible events. In any case, the purpose of this type of model is to reduce to a minimum the total cost of handling stocks of merchandise during given periods of time. The mathematical models for solving this type of problem become quite complex, especially if the number of variables (e.g., the various cost components) entered in the formulations is large. Calculus, probability theory, queuing theory, and matrix algebra can be profitably used here. These problems, if static in nature, are solved by using computer routines dealing with linear programming techniques; if time variations are considered, dynamic programming methods have to be applied [6].
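The opposing cost behaviors just described are captured in the classical economic-order-quantity formulation, in which the lot size Q* = sqrt(2DS/H) balances the setup cost against the carrying cost. The sketch below is a minimal illustration of that trade-off; the demand, setup, and holding figures are invented, and a real inventory model would carry many more cost components.

    import math

    def eoq(annual_demand, setup_cost, holding_cost_per_unit):
        """Lot size minimizing the sum of annual setup and carrying costs."""
        return math.sqrt(2 * annual_demand * setup_cost / holding_cost_per_unit)

    def annual_cost(lot_size, annual_demand, setup_cost, holding_cost_per_unit):
        """Annual setup cost (decreasing in lot size) plus carrying cost (increasing)."""
        return (annual_demand / lot_size) * setup_cost + (lot_size / 2) * holding_cost_per_unit

    demand, setup, holding = 12000, 90.0, 1.50      # illustrative figures only
    q_star = eoq(demand, setup, holding)
    print(f"economic lot size: {q_star:.0f} units")
    print(f"minimum annual cost: {annual_cost(q_star, demand, setup, holding):.2f}")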
(2) Transportation (or allocation). This class of problems (and their related models) is subdivided into three subgroups, namely: (a) transportation, or distribution of items; (b) allocation of limited resources to demanding points; and (c) allocation of controlled limited resources. The first type is concerned with cases in which there are two or more supplying sources (providing the same kind of commodity) and two or more points demanding quantities of that specific item. The problem consists of defining the number of products to be shipped from one delivery point (e.g., a factory) to a receiving one (e.g., a warehouse) in such a way that the total cost of the distribution (in a given period of time) is minimized. The transportation model requires, for its utilization, the observance of a few restrictions, such as the certainty (i.e., no probabilities involved) of the quantities available at each one of the sources and requested at each one of the receiving places, as well as all the elementary costs of transportation per unit of measurement from each point to every other one. The second type concerns the allocation of jobs to men, men to jobs, groups of students to classes, drivers to trucks, buses to routes, and so forth. The problem may assume a complex configuration, but the objective is to optimize the efficiency of the entire operative system, e.g., to maximize profits or minimize the total cost. A variation of this type of model is met in the so-called "product-mix problem." For example, in the gasoline industry there is a large demand for diversified petroleum products to be sold at various prices. The objective is to satisfy the requirements in accordance with some well-established specifications at the minimum total cost of production (or maximum profit). The third kind of problem is characterized by the need for optimizing the allocation of supply and demand when the manager has the possibility of controlling the available limited resources. This is the case when a decision has to be made about where to build a new plant, where to place a new commercial agency, the site for a warehouse, how many salesmen to send into a territory, how many trucks to use; or, in the opposite case, which shop to close, which plant to shut down, which salesmen to withdraw from the field, and so forth. Also the allocation of budgetary monetary values to the various demanding departments is a query of this type.
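As a minimal sketch of the second subgroup (the allocation of jobs to men), the fragment below solves a hypothetical four-worker, four-job assignment with the Hungarian-method routine provided by SciPy; the cost matrix is invented for illustration only.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Hypothetical cost (say, hours) for each of four workers to do each of four jobs.
    cost = np.array([
        [9, 2, 7, 8],
        [6, 4, 3, 7],
        [5, 8, 1, 8],
        [7, 6, 9, 4],
    ])

    workers, jobs = linear_sum_assignment(cost)      # assignment minimizing total cost
    for w, j in zip(workers, jobs):
        print(f"worker {w} -> job {j} (cost {cost[w, j]})")
    print("total cost:", cost[workers, jobs].sum())

Transportation and product-mix instances of the first and third subgroups can be written as linear programs in the same spirit as the product-mix sketch of Section 4.2.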
FIG. 4. Efficiency of the two typical items used in machinery. (Degree of efficiency plotted against time: items which wear out lose efficiency gradually until they are replaced; items which die retain their efficiency until they are no longer usable and are then replaced.)
(3) Sequencing. This type of problem is encountered when a set of operations can be performed in a variable sequence under certain well-defined constraints connected with the goals to be achieved. Usually, the objective is to minimize the time involved in the performance of all the operations required, i.e., from the first one to the last or finished product. It is possible that a series of priorities in activities must be observed, or some penalties on cost or time might be imposed. Generally, this type of problem can be solved by applying models called CPM (Critical Path Method) or PERT (Program Evaluation and Review Technique). Both are methods designed to find the shortest way of sequencing activities in accordance with some established conditions of priority. For example, these models are extensively used in (1) the construction of buildings, bridges, and similar works, (2) production scheduling, (3) assembly line operations, (4) military problems where strategic and tactical plans have to be implemented, (5) budgetary allocation of funds, and (6) missile countdown operative procedures [19].

(4) Routing. A manager may have the problem of instructing salesmen on the sequence they should follow when visiting customers. He wants to minimize the time that each agent spends in covering the various addresses. This case is also called the "traveling salesman problem." It is considered immaterial which person is visited first as well as which will be next. If the cost (in time or money) of visiting A and then B is equal to the cost of visiting B and then A, the problem is defined as "symmetric"; otherwise, if A → B is not the same as B → A, it is called "asymmetric." If the salesman has three customers to visit, i.e., A, B, and C, he has six routes to choose from, namely: A-B-C; A-C-B; B-A-C; B-C-A; C-A-B; and C-B-A. In the case of four customers, there are 24 possible solutions (i.e., 4! = 24). With ten customers to visit under the symmetric assumption, the salesman has 362,880 ways to select from; with twenty customers he has 243,290 × 10¹³ possibilities, and so forth.
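The factorial growth just cited is why small routing problems can be handled by simple enumeration while large ones cannot. The sketch below enumerates every route for the three-customer symmetric case of the text; the travel times, and the assumption that the salesman starts and ends at his office, are invented for illustration.

    from itertools import permutations

    # Hypothetical symmetric travel times (minutes) between the office and customers A, B, C.
    time = {
        ("office", "A"): 20, ("office", "B"): 35, ("office", "C"): 30,
        ("A", "B"): 15, ("A", "C"): 25, ("B", "C"): 10,
    }

    def travel(a, b):
        return time[(a, b)] if (a, b) in time else time[(b, a)]

    def route_time(route):
        stops = ("office",) + route + ("office",)
        return sum(travel(a, b) for a, b in zip(stops, stops[1:]))

    best = min(permutations(("A", "B", "C")), key=route_time)
    print("best route:", " -> ".join(("office",) + best + ("office",)))
    print("total travel time:", route_time(best), "minutes")

With ten or twenty customers the same enumeration becomes hopeless, which is why practical routing models rely on mathematical programming and heuristic methods.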
(5) Replacement. This type of problem is especially encountered in machinery maintenance. The manager of a production plant needs to be sure that machines are working properly and that breakdowns are minimized. In any object composed of parts (e.g., an assembly or machine) two types of items are recognizable: those which degenerate or wear out, and those which after a certain period of time are simply no longer usable, i.e., they die. Figure 4 shows how the efficiency typically behaves in both cases. Items which wear out need maintenance in order to delay their degenerating process. To keep these items within the range of efficiency costs money and time. There is, consequently, a point at which it is economically better to replace these pieces rather than spend on their adjustment. Figure 5 shows the effects of an appropriate maintenance program with reference to a hypothetical item. Elements which simply die can also be susceptible to adjustments for extending their life. Usually, appropriate environmental working conditions insure this effect. Machines are composed of both types of these items; hence, the problem is to repair or maintain the parts to assure the needed level of efficiency (between the two limits of a range of acceptance). The life of those parts (of both types) can be statistically determined by testing or by researching past records. Computerized mathematical models capable of handling this complex problem are available, even when very many variables are considered.
FIG. 5. Effect of proper maintenance in a degenerating item. (Degree of efficiency plotted against time over the item's normal life.) Dashed line, without maintenance; solid line, with maintenance; double vertical, adjustments and repairs.
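A much-simplified numerical sketch of the replacement trade-off for items which wear out follows: the item is replaced when the average cost per period (purchase price spread over the item's life plus accumulated maintenance) stops falling. The purchase price and the rising maintenance costs are invented for illustration; real replacement models also treat survival statistics for the items which die.

    # Illustrative replacement-age calculation (all figures invented).
    purchase_price = 500.0
    maintenance = [20, 30, 45, 70, 110, 170, 260]   # cost for each successive year of age

    def average_cost(keep_years):
        """Average yearly cost if the item is replaced after keep_years years."""
        return (purchase_price + sum(maintenance[:keep_years])) / keep_years

    costs = {n: average_cost(n) for n in range(1, len(maintenance) + 1)}
    for years, cost in costs.items():
        print(f"replace after {years} year(s): average cost {cost:.2f} per year")
    print("economic replacement age:", min(costs, key=costs.get), "years")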
(6) Search. This type of model helps the analyst (in this case, the searcher) to find what he is looking for, e.g., items in a warehouse, submarines in the sea, water or crude oil reservoirs in the ground, coal or gold veins in mines, special figures in accounting ledgers, or particular information to be retrieved from a backlog of stored data, and so on. The objective of this deterministic model is to achieve the goal at the minimum cost in time or money, under the maximum possible level of accuracy. Accuracy in searching augments the cost of the process. It is a managerial decision to balance these two basic elements of the problem. Statistical sampling and estimation theory are the two fundamental sources from which these mathematical models are derived.

(7) Queuing. Any time that a group of persons (or items) stands in line waiting to receive a service (or to perform an action), a queue is formed. Typical cases include people waiting at a teller window in a bank, at the checkout counter of a supermarket, at the door of a theater, or at the entrance of a bridge when driving. Queuing theory is a well-established discipline which, statistically, mathematically, and deterministically defines the problem by investigating (1) the rate of arrival of the customers at the station(s), which is the point where the applicants (or, e.g., objects in the case of an assembly line process) receive the requested service, and (2) the rate of service performed at the station(s). The objective to be reached in solving this type of problem is to minimize the total monetary costs encountered in the operative system, i.e., the evaluated cost of the time spent in waiting for service (when a queue is formed), and the cost of the idle time of the servicing personnel when there is no demand for action. In the case of bottlenecks arising at production or assembly lines, the rates of arrival and of service are well defined since they are detectable data. Queuing models provide the desired solution to the most complex problems of this type.
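As a minimal sketch of this cost balance, the fragment below uses the standard single-server (M/M/1) waiting-time formula to compare several possible service rates; the arrival rate and the two unit costs are invented for illustration, and a real queuing study would of course verify that the single-server assumptions apply.

    # Single-server queue: weigh customers' waiting time against the server's idle time.
    arrival_rate = 8.0        # customers per hour
    wait_cost = 12.0          # cost per customer-hour spent waiting in the queue
    idle_cost = 25.0          # cost per hour of idle service capacity

    def hourly_cost(service_rate):
        if service_rate <= arrival_rate:
            return float("inf")                    # the queue grows without bound
        utilization = arrival_rate / service_rate
        wait_per_customer = arrival_rate / (service_rate * (service_rate - arrival_rate))
        waiting = wait_cost * arrival_rate * wait_per_customer   # expected waiting cost per hour
        idle = idle_cost * (1.0 - utilization)                   # expected idle-time cost per hour
        return waiting + idle

    for mu in (9, 10, 12, 15):
        print(f"service rate {mu}/hour: expected cost {hourly_cost(mu):.2f} per hour")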
(8) Competition. When two individuals are playing cards, the move made by one may influence the decision of the opponent. Each tries to maximize his total outcome. The rules of the game are conceivably well known by both players, who are in competition for winning the pot. The sequence of decisions is not known with certainty because human judgment is the basic element involved in the game. Hence, the estimation of what will be the next step performed by the competitor precedes any game decision. There is always the risk of being wrong in this evaluation. A similar situation can be visualized in business. The players are the managers competing for customers in a free market. The game is played by making decisions in many fields, such as the definition of selling prices, the selection of the types of raw material to be used, the organization of the distribution system, the presentation of the products (e.g., wrapping), the implementation of advertising campaigns, the establishment of customers' credit policies, and the definition of the auxiliary benefits to grant to customers. When more than two competitors are "playing" this sort of business game, the formulation of the "rules of the game" becomes extremely difficult and cumbersome. In this case, the selection of the variables to be considered (controllable and uncontrollable) is a very arduous task, especially if the investigation is to be accurate. The theory of games gives good insights and suggestions for the description of problems of this type and for the preparation of the related models. Algorithms for business games are extensively utilized in business for training purposes. Managers, or potential managers, are grouped in teams playing, competitively, series of decisions, e.g., allocation of money for production, research and development, plant improvement, definition of selling prices in various marketable areas, and so on, simulating, in this way, what managers would have done in conducting that business. Each set of decisions could represent, for example, a quarter of a year in reality. The computer, utilizing a programmed model, determines the results obtained by the competitive actions of the players. Each printout issued by the computer serves as a base for the next decision, since it carries the resulting reactions of the market (quantities of items purchased, inventory levels, etc.) as a consequence of the team's competitive decisions. By using this business simulation game approach, potential managers can become familiar with business problems and their solutions.

(9) Mixed Problems. In the real world it is quite rare that a complex problem can be solved by utilizing just one of the above-cited deterministic models. A distribution problem may be connected with scheduling production in order to avoid queues and the formation of inventories. A transportation problem could be related to an assignment, a routing, or a sequencing situation. A maintenance problem could be linked with inventory or search needs. This class of mathematical models includes combinations of the above-mentioned "pure" types.

5.3 Considerations on Priorities in Solving Problems
Managers know that problems appear in bunches. During the process of solving one, others emerge. While the preparation for defining a solution to one trouble is being made, another demands to be solved. Large and small problems are continuously under consideration by the responsible decision maker. Managers must have a priority list, assuming that the appearance of problems is predictable. Executives not only must define a criterion for
choosing which problem-solving process should be handled first when several of them are detected and some are already under study, but they have to establish a general norm to observe. Is the rule "solve first the problems which were detected first" a good one? Apparently not, since there are cases in which the solution of a "second" problem also resolves the "first." Is the idea of simultaneously implementing two or more solutions a valid one? Probably, but it is technically impossible to do this in all instances. Is the suggestion to implement first those solutions that require more time, and then the others, sound? The answers to these questions lie in the state of need, which in its turn depends on the urgency of solving the key problem first. It is a matter of evaluating the foreseen outcomes of each solution in order to visualize the payoffs that can be expected. Which one will improve the general situation to a greater extent? The definition of the "key problem" depends on which will be the "key solution." This last is the leading concept for the establishment of priorities. Usually problems which involve the entire enterprise are considered to deserve precedence over those which are concerned with one sector or function. Also, problems affecting the main objective or the corporate policies must have the "right of way" over the others. In order to better visualize which problem-solving process has to assume the front position, it is appropriate to briefly discuss the "content" of the most commonly encountered queries in business and industry which can be solved by the use of deterministic models. Since problems having an influence on the enterprise as a whole have priority over those affecting only a section of it (such as a division, a department, an office), the following list of examples starts with the first type and concludes with the others [1].

Overall Planning. Operations research techniques can be utilized to solve problems concerned with the planning of the enterprise activity as a unique entity. General long-range plans developed in accordance with the firm's objectives can be programmed, as can plans for the establishment of criteria to be used for the best allocation of the available limited resources (i.e., the 8 M's). Also the determination of the optimal policy and organization to be actuated in the long run can be defined by the use of operations research normative (or deterministic) models and by simulation.

Financial Plans. Budgetary restrictions can be defined by applying operations research models. The best and most appropriate accounting system can be designed and tested by using management science techniques. The same can be said for the implementation of control and feedback systems within the communication network. Financial and credit policies can be established through the application of algorithms or deterministic models and the use of simulation techniques.

Personnel. To assign the right man to the right work, and vice versa, is a problem that must be solved for all jobs in the enterprise, at any level of the organization.
Recruiting personnel to maximize the efficiency of all the firm's services is a complex problem that operations research techniques can solve. Incentive plans to minimize absenteeism and turnover can be designed with the help of mathematical or normative models and by simulation.

Research and Development. A complex problem may arise when a long-range plan has to be established concerning the organization of a research and development department. Questions such as how large it should be, how much its services must cost, which projects it has to take care of, and in which areas of knowledge and investigation it has to work, can be answered via the utilization of algorithms or deterministic models.

Marketing. Problems concerned with, e.g., the shipment of products (or services), the appropriate geographical location of the receiving warehouses and their size and capacity, the consequent area of distribution, as well as the determination of the best position of retail outlets in accordance with the density of potential customers, can be solved by the use of management science models. Price policies can be tested, as well as the efficacy of advertising campaigns for given areas, periods of time, and types of media utilized. Promotion strategies can be defined and studied by using algorithms. Many other marketing problems can be solved by making use of deterministic operations research and simulation models.

Purchasing. The best type of raw materials to be used in planned production can be established mathematically when the technical characteristics, specifications, prices, lead times, and availabilities are given as original data, as well as the needed quantities and the manufacturing plans. Warehouse requirements, and the definition of the conditions of supply, storage, and turnover, can be determined by means of operations research algorithms. The same can be said for establishing the most appropriate usage of auxiliary materials and their contribution (technical and/or in terms of value) to the final products. All the financial and bookkeeping problems, with reference to the purchasing of any type of material, can be handled by using computerized deterministic models, or simulation.

Production. Mathematical models are extensively used to determine the best location, size, organization, capacity, and degree of automation of plants, departments, shops, offices, services, and so forth. Also, problems in scheduling, dispatching, and routing production and assembly operations are solved by the use of normative models. The indication of the best sequence to follow in performing the various manufacturing and assembly activities, as well as the handling of materials and of finished or semifinished products, can be optimized by the use of operations research models or by simulation.
There are almost no business or industrial problems that cannot be solved by the application of deterministic (and/or stochastic) models utilizing digital computers [1].

5.4 Utilization of Computers for Solving Managerial Problems
A common remark made by some managers who are not too familiar with deterministic models and management science principles is that a computer is indispensable for finding and implementing objective solutions to problems. The truth is that electronic computers can greatly help to solve complex quantified problems and/or those with voluminous input data. There are cases in which a normative model, having few variables, a restricted matrix, or a simple formulation, can be solved using a desk calculator.* For difficult cases, where the time and effort to resolve them by "paper and pencil" would become prohibitive, the use of a computer is welcome [31].

*Students in management science classes learn to apply mathematical models to solve simple problems utilizing slide rules or desk calculators. This allows the student first to grasp the content of the model, and then to handle more complex problems using computer programs.

If the enterprise has a computer with a real-time operative system, managers and management science teams may have a console-terminal in their offices with which they can interrogate the processor for computing formulas or partial calculations of complex mathematical expressions included in deterministic or normative operations research models [11]. Time-sharing also allows managers to perform sensitivity analysis online, giving an immediate response to proposed solutions. The possibility of executing many thousands of calculations in a few seconds is a great help to the analysts seeking optimal objective solutions to problems. Managerial decisions can be reached in a very short time, giving the executive the time to ponder, without pressure, the courses of action to be taken. Consequently, managers can operate in a state of confidence, having the opportunity to consider choices in solutions, especially when computer simulation techniques are applied [28]. The advent of computers indeed had an effect on the use of mathematical models. The possibility of working out complex problems in a very short time determined a considerable improvement of the quantitative formulation of queries as well as of their solutions [26]. Computers are tremendously useful tools in the design and development of operating systems for decision-making problems where limited resources (i.e., the 8 M's) have to be allocated in order to optimize (maximize or minimize) the objective function within a well-established set of restrictions. The rate of acceptance by managers of the use of operations research algorithms is also in direct proportion to the availability of real-time, time-sharing computer equipment, and to the accuracy that software and computer languages can provide with respect to the managerial need for decision making.
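The following is a sketch of the kind of online what-if interrogation described above, reusing the illustrative economic-order-quantity figures from the inventory sketch of Section 5.2: the recommended lot size is recomputed for a range of demand estimates so the manager can see at once how sensitive the recommendation is to the accuracy of his data.

    import math

    def optimal_lot_size(annual_demand, setup_cost=90.0, holding_cost=1.50):
        """Economic order quantity for the illustrative inventory figures used earlier."""
        return math.sqrt(2 * annual_demand * setup_cost / holding_cost)

    # Sensitivity run: how does the recommendation move as the demand estimate varies?
    for demand in (8000, 10000, 12000, 14000, 16000):
        print(f"demand {demand:>6}: order about {optimal_lot_size(demand):,.0f} units per lot")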
It seems appropriate to reemphasize now the three major characteristics of the operations research techniques as generators of the analysis and systematization of business and industrial problems and of the formation of deterministic (and/or stochastic) models. These essentials are summarized as follows [1]:

(1) The system orientation concept must always be present in any problem description as well as in the finding of its solution. Managers have the responsibility of conducting business using the enterprise's organization as an operative system. All troubles, deflections from the norm, and organizational slips needing improvements have to be treated as "unbalanced states of the system" (i.e., as a negative transition period). Solutions consider the factors of production (the 8 M's) as the elements to be used for putting the system back into the desired conditions of productivity (i.e., through the positive transition period).

(2) The mixed knowledge team concept. It is necessary to have the combined contribution of qualified experts to properly define problems, collect and study pertinent data, design normative models, and experiment with or simulate cases to construct optimal solutions to quantified problems. The same type of mixed team can help the manager in implementing his decisions as final sets of actions.

(3) The quantitative expressions used to describe the query and the methods used to solve problems. Mathematical formulas are the best way to describe problems and to design operations research models. It is by virtue of this last characteristic that digital computers are so extensively utilized.

6. Will Computers Eliminate Managerial Decision Making?
6.1 The Present Expansion of Computer Utilization
Modern electronic technological advances have allowed the construction of faster and less expensive digital computers. These "human energy saving" devices, capable of dealing with an enormous quantity of data and of making very complex computations, have determined a progressively increasing expansion of their utilization. This trend will undoubtedly continue. Computer efficiency in performance and usage will improve, and the cost of operating such systems will decrease. Moreover, the techniques developed to communicate with computers will develop so rapidly that soon managers will "talk" to an input apparatus, inquiring about data or computed results, and will receive an immediate audible output response.
With "real-time, time-sharing" computer systems, executives can already interrogate and work quantitatively expressed problems by transmitting the original data via a keyboard console. The computed answer is immediately returned to the inquirer, usually by means of an automatically typed message at the same device. Many of these secluded terminals (consoles), each working independently, can be connected to a computer by utilizing the existing telephone cable network. The immediate response obtained from the computer gives the operator the impression that the electronic machine is running exclusively for him, even if several hundred terminals are simultaneously connected with the central processor. This extremely fast way of obtaining returns from the machine is only possible because of the very short time required for the computation of data. The large capacity of a computer memory allows the writing in, the retaining, and the retrieval of very large amounts of data, as well as the utilization of computational subprograms standing by in the processor's storage. There are series of operations or calculations which can be designed and recorded in fixed patterns and entered in the computer memory. Any time a manager needs to use these "subroutines" he can insert into the model the appropriate original data pertinent to his specific problem. The processor and the computational portion of the computer utilize the input information within the frame of prefixed mathematical formulas. The calculations needed, for example, to issue a payroll can be prepared in advance, stored in the computer memory, and utilized, with the suitable set of original input data, any time they are desired. A bank can have subroutines available to make the various computations of financial interest, and/or for continuously updating the accounts of customers borrowing or depositing money. An enterprise may need to have an instantaneous inventory situation any time movements of materials are made. A subprogram, taking care of the supply and withdrawals of goods in all the warehouses, can furnish to any inquiring remote terminal the quantity level of the moved items and related information, such as the reordering dates, the cost involved, the value of the stored commodities, and many other statistical data useful to managers for their everyday decision-making responsibility. As already indicated, the advent of more sophisticated technology in computing machines, not only that concerned with the hardware and the memory size, but with the software as well and the concept of utilizing stored programs in a real-time mode, has determined the possibility of greatly extending the utilization of computers. Consequently, very many related applications and concatenated works can be performed practically simultaneously.
More and more automated procedures are now handled by computers. It is possible today to register a few original input data into the processor memory and to have as output a large series of objectively computed results. For example, an industrial firm can enter as input into a computer the demanded quantity of some well-specified items to be manufactured for a certain date. The printouts that could be obtained as output are bookkeeping analyses, billing registrations, printed invoices, the values of the changed levels of inventories from which components and parts for the final product have to be withdrawn, the updated reordering timetable for the warehouses, the appropriate set of orders to the production and assembly lines, the sequence of the mechanical operations to be followed, the indication of the tools needed at the various stages of the working procedure combined with the current series of manufacturing activities, the schedule for the distribution of the finished goods, the pertinent notification to the salesmen about their commissions, and all the numerous related financial and technical statistics which may be needed. Today any bank acts as a "clearing house" when dealing with checks to be paid and amounts to be acquired by any and all of its customers. Computers are performing this heavy task by calculating the various amounts in fractions of seconds. The result is that each account is instantaneously updated and kept in memory. At the same time, interest is computed and properly assigned to each account. Totals and statistics are also immediately available. There are many banks which have provided the headquarters and the branch offices with separate independent terminals, each directly connected with a central computer. This equipment allows tellers to promptly verify the consistency of any account while the customer is standing at the counter. When the transaction is accomplished, the computer is automatically informed of it, and all the accounting and financial ledgers are at once revised and the current values made available. Education has also greatly changed. The traditional way of teaching is being replaced by a computerized process. Each student can learn, sitting at a computer console, by answering questions which appear on the screen (TV type) of the terminal. The pertinent instructions are stored in the computer memory and retrieved by the student in accordance with his grade of study and the progress he has made in the subject. If the answer given by the inquiring person to the machine is appropriate, the next question appears at the console and the process continues. If the answer is incorrect, a set of explanations comes out at the terminal in visual form (on the screen) as well as on a typed sheet of paper. The interrogation of the computer can be performed by a large number of students simultaneously. At the end of the learning session, examinations can also be taken by computer.
also be taken by computer. Answers to the examination questions are immediately evaluated by the machine and the final grade calculated and issued. The advantage of this mechanized method of learning lies in the fact that each student can define his own path in assimilating new concepts. It is not a question of memorizing formulas or sentences; rather, it is a matter of understanding. Students can retrieve previous topics and review learned applications without being required to ask questions to the teacher in front of all the classmates. Students are in this way self-governing and self-controlling. A similar computerized setting can be established in any enterprise where the responsible executives, foremen, warehouse directors, inventory and storage keepers, financial, accounting, and distributing department managers can interrogate the continuously updated computer memory on the various statistics and facts pertaining to a particular department activity or to the entire operative system. The terminals located in the various firm offices can be used for communication and instantaneous transmittal of data. If copies are needed, a temporary storage in memory can be utilized to allow the retrieval of the information by anyone interested at any time it is needed. The computerized airline reservation system is today a reality. Passengers, either by phone or by going to the appropriate office, can make reservations for certain flights. The airline ticket clerk sends (by a console) this information to a central computer (sometimes located thousands of miles away) which searches in its memory for an available seat on the requested flight. If it is not possible to accommodate the traveler's desires, the computer immediately provides alternate ways for the trip. When agreement is reached with the passenger, the computer books that seat for that flight in its memory. This type of investigation and transaction can be performed by a large number of simultaneously inquiring terminals. The computer provides an instantaneous response to each one of them. The widespread use of computers for supplementing human activities can be hypothetically extended. One can visualize a society where not only money is unnecessary, but almost any traditional form of personal communication for business transactions is unnecessary as well. Data stored in the computer memory could be for recording of happenings, for supplying elements for statistics or for other conceivable needs, and for collecting any type of information stored for future utilization. Each manager will also have at home a computer terminal which will become as indispensable a part of his life as the telephone is today. A person will utilize computers any time he feels the necessity to know something, to transmit or record facts and values, or to make any type of computation. Undoubtedly, electronic machines are penetrating deeper and deeper
into the business world as a technical necessity. No one denies that all the scientific achievements, all the successful space flights, and all the improved methodologies have been possible only by virtue of the availability of the appropriate electronic machines. Even the high standard of living that nations are enjoying today is indirectly due to the increasing utilization of digital computers. Today's recognized trend in business, industry, as well as in social life, is for ever more augmented usage of automatized computing machinery.
6.2 The Future Utilization of Computers
Data can be retrieved from any computer memory when the relevant searching code is provided. Just as items are found by their qualifying numbers, so a person can be recognized by a digital code (e.g., his Social Security number). Consequently, it is possible to provide the holder of a bank account with a punched plastic card carrying his code and the other data technically needed to recognize where (i.e., in which bank) his financial records are kept. With this document anyone can pass money from his account to any other one, as well as add capital to his own funds. The owner of this card can receive his salary without seeing it. The amount can be passed from the account kept in the computer memory under the firm's credit card code to the employee's personal account. He, the worker, will receive a slip of paper issued by a computer terminal, notifying him of the transferred money. The worker can now spend what he wants without having seen the currency. He can go to the supermarket, buy, and present his punched plastic card to the cashier. By inserting the customer's credit card into the appropriate slot and recording the total figure of the bill with the keyboard of the computer terminal located at the checkout counter, the corresponding quantity of dollars and cents is withdrawn from the purchaser's account and charged to the supermarket's account. It is possible to reserve seats at the football field or at a theater by instructing the computer of the need for tickets. The game or show is indicated by the code number of the stadium or theater, the day and hour of the performance, and the type of desired seats. The computer will search its memory for the availability of the request. If it is not possible to satisfy the demand exactly, the processor will display alternative choices at the console screen, or type the information at the terminal. The inquirer can then make his decision. Payment for the tickets could also be made automatically. The computer will transfer the appropriate amount of money from the customer's bank funds to the stadium or theater account.
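The card transaction just described amounts to a single debit-and-credit operation against two stored accounts. The following is a minimal sketch, with account names and structure invented purely for illustration (it does not describe any actual 1972 banking system):

accounts = {"customer-0417": 250.00, "supermarket-9001": 12_000.00}

def card_transaction(card_code, merchant_code, amount):
    # The amount keyed in at the checkout terminal is withdrawn from the
    # purchaser's account and charged (credited) to the merchant's account.
    if accounts[card_code] < amount:
        raise ValueError("insufficient funds")
    accounts[card_code] -= amount
    accounts[merchant_code] += amount
    return f"debited {amount:.2f} from {card_code}, credited {merchant_code}"

print(card_transaction("customer-0417", "supermarket-9001", 37.50))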
Housewives will use the punched plastic card when ordering groceries or merchandise by phone. Upon receiving the goods, payments will be made by withdrawing the appropriate amount from one fund and by electronically sending it to the bank carrying the store account. The "home type" of computer console will be greatly simplified in its usage and the terminal will be inexpensive. Computers will take care of almost all paper work by automatically utilizing models for the computation of data and the printout of results. A manager can retrieve from the computer memory information and data referring to a complex business or industrial problem that he has to solve. By collecting new facts or new insights, he can increase in number, or improve in quality, at any time, the existing backlog of data already stored in memory. The computer can be instructed to systematize those elements, summarize them, make statistical or mathematical computations, and so on, in order to provide the decision maker with a revised and always current set of pertinent original data and/or semifinished computational findings. When the manager considers that the time for decision has come, he can retrieve from memory all the computerized results and enter them in an "operations research decision-making model," obtaining an objective final solution to his problem. This approach can be applied to any type of managerial quantifiable problem. As is generally recognized, portions of business and industrial activities are, little by little, being continuously automated.
6.3 The Interference of Computers in Decision Making
In the near future the total utilization of computers in a firm could be a reality, but not at the Utopian level of a "push button factory" which produces commodities in such a highly automatized way that it does not need any human intervention or control. The increasing dependence of business on electronic calculating machinery is already greatly sensed. There are no concerns, no industrial activities, which are not already directly or indirectly related to operations performed by means of digital computers. Is the foreseen massive interference of electronic computers in business, and consequently in human life, an absolutely unrealistic assumption? Is it actually improbable that in the future social behavior will heavily depend on computers? The development of automation, as an ever more extensive substitution for human activities, is already a vivid reality. If the answer to the question "will computers become an indispensable element of human life?" is "yes," to what extent will electronic machinery participate in business and social activities? The example of the instantaneous updating of corporate and personal banking accounts by means of automatic adding and subtracting figures
to and from a backlog of money, which exists simply as a numerical series of digits, indicates only one of the many possible cases of computer utilization. A firm, a concern, as well as an individual, by technical necessity, are treated as items sitting in a storage room. An item can be moved, modified, and changed in its characteristics, one of which could be its monetary value, corresponding to the available money recorded in its or his banking account(s). The "depersonalization" resulting from this type of automatized and instantaneous administration is highly sensed by individuals working in enterprises where computers are the logical substitution for human work. Financial intelligence is only a small part of the material stored in a computer memory. Data with reference to an individual's health, employment, cultural, political, and social activities can be recorded along with the well-known "census" type data. With reference to an enterprise, a very great variety of information and statistics can be sent and kept available by the computer. Everything can be coded and stored in a very slight portion of a gigantic computer memory. When needed, data can be extracted and utilized by managers or by other interested people. For a firm there are facts which are delicate, confidential, or reserved. Not all data can be obtained without formal permission of the owner. Who then can retrieve these data? Who has the right to investigate the private life of an enterprise, or of a person, by inquiring what is stored in a computer memory about it or him? For example, can the bank, the insurance company, or the car dealer, when a person is buying a new vehicle, read the existing financial records in the computer memory with reference to that customer? None of them can legally do so. By extending this concept, when something is bought the transaction must be preceded by a credit examination of the parties involved since the payment is done by a computerized transfer of capital. If it is permissible to know everything about a customer before signing a contract (of whatever nature) by interrogating the computer (assuming that both parties know the other's code number), the ethic and the way of conducting business will assume a new form. Any transaction would be preceded by this computer practice. Society will certainly resent this manner of conducting business. It would probably only be justified in case of important and complex transactions. It would probably become a normal procedure to investigate all facts and data about a firm and its manager before having any meeting with him as representative of his enterprise. An attitude of reciprocal mistrust will be generally enforced. No one will have confidence in anyone without computerized proof. This could happen if, and only if, every individual, every enterprise, firm, shop, or institution of any size, is also
known by his or its computer numerical code. This knowledge will be part of the name. Consoles capable of being connected with the computers of a unique national (and one day international) operative system must always be available at the places where transactions are performed. The selection of the right computer memory retaining the searched data or the required information will be made automatically by the reading of the various codes entered at the inquiring computer terminal. Computers will not only be used to keep track of business happenings from the statistical point of view, but, as they are already extensively utilized today, will solve various kinds of quantified problems. The managerial decision-making process is already heavily oriented toward the objective solution of problems using mathematical or deterministic models, or by performing simulations of any possible business situation. A great variety of standing-by computer subroutines and normative models will be available to every manager who needs to find an optimal solution to a problem, business, industrial, social, or whatever character it may have. Consequently, the skill of reasoning and of utilizing personal judgment, interpretation of facts, and the consequent responsibility of deciding will pass from the subjective to the highly objective (i.e., computerized or scientific) approach. No matter what the problem is, the first attempt made by a responsible executive to solve it will be to utilize an existing computer program. Since the stored and generally available subprograms (e.g., the operations research deterministic or stochastic models) can be used by an immense number of operators in a large variety of instances, an increasing uniformity of solutions can be expected. This fact also will tend to reduce the human personality of a manager. He will heavily rely on computerized solutions since they are easily obtainable. It will suffice to insert the original data at the keyboard console, indicate the type of model to be used, the code of which is found in an ad hoc list of models, and obtain the objectively found optimal solution. What about the possible case of a computer failure? In a society mainly dependent on stored records and programs, what will the consequences be of computer calculation mistakes or of technical dysfunctions of the system? Such a disaster can have tremendous impact on business and industry in general as well as on society. This happening will be similar to a natural cataclysm. A small power failure (even not as severe as the blackout in New York State in 1965) may cause the destruction forever of precious factual records. True, auxiliary memories can be provided and technical methodology can be devised to avoid losses of data in these unfortunate cases, but these preventive policies will impose a highly complex computer system design, one which will be expensive and difficult to construct, organize, and maintain. The vulnerability of this gigantic system will be
tremendously increased by the size of its utilization. Sabotage of computers and other sources of misfortunes can also be expected. Even industrial espionage can be visualized since the records stored in the computer memory could refer, for instance, to governmental, state, and local activities, including, probably, some data on military achievements. Big corporations, powerful enterprises, research companies, universities, and key industrial firms may have data stored without knowing what has been sent to the memory by other related users of the computer system. Unauthorized withdrawals of data and information can be avoided by assigning specific reserved codes known only to the owner of the records. Still, this procedure does not ensure the absolute secrecy of the stored information. Computers are ever more invading the privacy of this increasingly mechanized society by performing many human activities (i.e., computing and memorizing). Society itself will determine the boundaries of this invasion. At this point it is appropriate to cite Norbert Wiener's warning that because computers work so fast men cannot be sure to be able to "pull the plug" in time. It will be the silent struggle between the technology applied to perfecting digital computers and the managerial utilization that will deviate or arrest the above-mentioned impressive trend. Hence, computers will never eliminate human personality or managerial decision-making responsibility, either now or in the near or distant future. Computers will always be tools in the service of man, who is indeed a clever creature (. . . in fact, he developed the computer. . .).
Uses of the Computer in Music Composition and Research
HARRY B. LINCOLN
Department of Music, State University of New York at Binghamton, Binghamton, New York
1. Introduction . 73
2. Composition of Music by Computer . 74
2.1 Music Composed for Performance by Traditional Means . 75
2.2 Music Composed for Performance by Electronic Synthesizer . 78
2.3 Real-Time Composition by Digital-Analog Computer . 85
3. Music Research Using the Computer . 88
3.1 Problems of Music Representation . 88
3.2 Music Theory and Style Analysis . 92
3.3 Ethnomusicology . 104
3.4 Thematic Indexing . 105
3.5 Bibliographies and Information Retrieval Systems . 106
4. Automated Music Typography for Composition and Research . 107
Discography . 109
References . 110
1. Introduction
It is now some twenty-two years since the first use of automated procedures in music research. Although limited to sorting routines, Bronson's work in folk music [11] is nonetheless considered the earliest computer-oriented music research project. The use of the computer in music composition has almost as long a history. The earliest efforts, dating from the early 1950s, are summarized in historical surveys by Hiller [34], Kostka [61], and Bowles [9]. The piece Illiac Suite, published by Hiller and Isaacson in 1957 [37], was the first computer-generated music to be performed and recorded, and receive wide publicity. Simon and Newell [74] were moved to predict in 1957 that "within ten years a digital computer will write music that will be accepted by critics as possessing considerable aesthetic value." In the years since then there has been less accomplished in music research than early optimists had predicted, and computer-generated music
compositions most certainly cannot be said to have been widely accepted by critics. But although the accomplishments to date may seem modest to both musicians and computer experts, it must be recognized that the problems are formidable. In fact, it can be argued that the slow progress has been due in part to computing problems more complex than appreciated by the musician and to musical problems more complex than appreciated by the computer expert. But it is not reasonable to be pessimistic, since that which has been accomplished to date will provide a base for increasingly significant research and interesting compositions in the future. Much has been written by both composers and researchers, but publications of research results are rather limited to date, as are recordings and scores of computer-generated music. Although there is undoubtedly work in progress which has not come to the author's attention, this article will attempt to indicate the wide range of effort in both composition and research by describing a number of representative projects.
2. Composition of Music by Computer
There have been two basic approaches to the composition of music by computer. In one the computer is used as a tool to provide output in the form of an alphanumeric representation of standard music notation or the actual printing of that notation. Then the notated music is performed by musicians using standard instrumental and vocal ensembles. In other words, the sound of the music itself would not inform the listener that a computer had played a role in its composition. In the second approach, which has become increasingly prominent in the past few years, the composer programs instructions for a digital computer whose output in turn serves as input to an analog computer and this, in turn, produces electronic sounds via loudspeakers or transmits the electronic signals to recording tape for later performance. The sounds produced may, in some cases, reflect an attempt to imitate existing standard instruments, or, more likely, they will be newly created sounds. The taped results may provide the total composition or may be used in conjunction with standard instruments and/or voices. Again, as with notated music, the sound of the electronically synthesized music would not inform the average listener as to whether the source was the more traditional electronic studio (such as the Moog synthesizer) or whether the electronic signals had been programmed via digital-analog computers. It is expected, however, that increased sophistication in programming will make possible greater flexibility and speed in electronic synthesis than possible with the patch cords and keyboard of the present electronic studio.
2.1 Music Composed for Performance by Traditional Means
Although most of our attention in this article will be directed to composition by digital-analog devices, i.e., various types of sound synthesis, a brief summary is given of music composed by computer for eventual performance by traditional instruments. The most complete survey of computer composition to 1967 is found in Hiller [34]. The very earliest efforts at using the computer for composition were based on analysis of simple folklike melodies followed by varied syntheses of these melodies, developed from probability tables. Pinkerton [67] wrote a "banal tunemaker" for the generation of new nursery tunes based on his previous analysis of 39 nursery tunes. According to J. Cohen [18], J. Sowa developed "A Machine to Compose Music" that was based on Pinkerton's ideas. A brief summary of work by Olson and Belar is given by Hiller [34] and in more detail by the two researchers in earlier articles in acoustical journals [65, 66]. Their "composing machine" is described by Hiller as a prototype of the famous RCA Electronic Music Synthesizers. Olson and Belar obtained various frequency counts of pitches in eleven Stephen Foster tunes. From these melodies various transition probabilities were developed, and synthesized Stephen Foster tunes were generated. Further work in analysis coupled with stochastic procedures to invent or "compose" new melodies was carried out by Brooks, Hopkins, Neumann, and Wright [13]. Their results confirmed three predictions which are quoted by Hiller
[34]:
(a) If the order of synthesis is too low, it results in note sequences not contained in and not typical of the sample analyzed.
(b) If the order of synthesis is too high, it results in excessive duplication, in whole or in part, of the original sample.
(c) Between these two extremes, synthesis results in tunes which are recognizable members of the class from which the sample was drawn.
The most widely discussed early attempt at computer composition was the Illiac Suite for String Quartet by Hiller and Isaacson. The score of the work was published by Presser [37] and the composers give a detailed description of the preparation of the work in the book Experimental Music [38]. The piece is based on computer composition of music according to the "rules" of simple "academic" counterpoint. While most listeners found very limited inherent musical interest in Illiac Suite, the idea of machine-composed music attracted wide attention. The output of the piece was in the form of a music representation which had to be converted (by hand) into a musical score for performance by string quartet. For a recording of this piece see the Discography at the end of this chapter.
FIG. 1. Use of ML.ROW as an example of actual programming (from Hiller [34]). The calling sequence is CALL ML.ROW, R, ..., E followed by PZE i, f, j, with return at p + 3, where R is the first location of a list of n items, i is the ordinal number of the item sought, f is the form of the row sought, and j is the transposition sought. Example: if n = 12, i = 11, f = 2, j = 7, we obtain the 11th note of the retrograde version of a 12-tone row transposed upwards a perfect fifth; in the program this is written CALL ML.ROW, R, ..., 12 followed by PZE 11, 2, 7.
In 1958, Hiller began his directorship of the "Experimental Music Studio" at the University of Illinois, from which emanated several computer compositions by various persons for a period of some ten years. This work is described in some detail in Hiller's survey [34] and includes such well-known pieces as Hiller's Computer Cantata [40], his Avalanche for Pitchman, Prima Donna, Player Piano, Percussionist, and Prerecorded Tape, and HPSCHD, for one to seven harpsichords and one to 51 tapes by Cage and Hiller [35]. Many of the compositions of this period, and especially from the Illinois school, feature stochastic processes, controls on degree of probability and indeterminacy, etc. Some details of these stochastic procedures are given by Hiller in [36]. Figure 1 illustrates a subroutine in his compositional programming language known as MUSICOMP. Regarding this example he writes ([34] pp. 74-75): ML.ROW, for example permits us to extract any pitch from any transposition of any of the four forms-original, inversion, retrograde, or retrograde inversion-of a row such as a 12-tone row. By setting up appropriate logical loops and entries into this subroutine, we are able to handle one important component process of serial composition directly. The three parameters I, F, and J must, in the actual calling sequence, be replaced by specific integers that represent the element of the row, the transposition of the row, and the form of the row, respectively. The calling sequence shown provides an explicit example of how this is done.
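As a concrete illustration of what such a row-extraction routine computes, the following is a minimal sketch in a modern language (Python); it is not Hiller's MUSICOMP code, and the numeric form codes and zero-based indexing are assumptions made for the example, following the parameter meanings given in Fig. 1.

def ml_row(row, i, f, j):
    # row: a 12-tone row as pitch classes 0-11; i: element index (zero-based here);
    # f: form code (1 original, 2 retrograde, 3 inversion, 4 retrograde inversion);
    # j: transposition in semitones.
    inversion = [(2 * row[0] - p) % 12 for p in row]
    forms = {
        1: row,
        2: list(reversed(row)),
        3: inversion,
        4: list(reversed(inversion)),
    }
    return (forms[f][i] + j) % 12

# Mirroring the example in Fig. 1: the 11th note (i = 10 zero-based) of the
# retrograde form (f = 2), transposed up a perfect fifth (j = 7).
row = [0, 11, 3, 4, 8, 7, 9, 5, 6, 1, 2, 10]   # an arbitrary 12-tone row
print(ml_row(row, 10, 2, 7))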
Two of the works from the Illinois studio make use of the CALCOMP plotter as a means of communication to the performers or the performance media.
For example, in Three Pieces for Solo Percussionist by Herbert Brün "the plotter draws standard symbols (triangles, circles, lines, and so forth) from the CALCOMP library. The language of the score is given by the distribution, size, and position of symbols on each page of the score; in effect, each page is a plot of dynamic level versus time. Once the performer is provided with a short introductory explanation of the notation, he is able to perform directly from the score ([34] p. 59)." In the Avalanche piece mentioned above, one of the instruments is a player piano whose roll has perforations cut along markings inscribed by the CALCOMP plotter. In the early 1960s important efforts in computer composition were undertaken in France by the composers Barbaud and Blanchard working at the "Centre de Calcul Électronique de la Compagnie des Machines Bull" and by Xenakis under arrangements with IBM in Paris. Barbaud and Blanchard produced a composition in 1961 which utilizes the twelve-tone technique. They first randomly generated a tone row and then developed structures from the row using various combinatorial operations. The musical procedures parallel mathematical ones; for example, transposition by addition and subtraction, interval expansion by multiplication, inversion by sign change around a chosen point, and so forth. The results expressed in a music representation were then scored for performance; Barbaud's book describing this work [2] includes sample programs. Xenakis described his work in a series of articles in Gravesaner Blätter [83]. His earliest composition, Achorripsis, is available in score [82]. The role of the computer in some of his other compositions was not clear in some of the publications about them until clarified in discussions with Hiller ([34] p. 78). Several examples of computer music using various stochastic processes were heard at the Cybernetic Serendipity show in London in 1968, concurrent with the IFIP meeting in Edinburgh. Kassler, in reviewing some of this music for Current Musicology, wrote ([48] p. 50): The penchant for utilizing rules from certain non-musical information theories seems to avoid the essential requirement of discovering a musical information theory. We seriously question that a theory derived from another universe can produce anything more than minimal compositional requirements. There may indeed be some convenient analogous properties between music and games, or between music and language (the 18th century noticed them), but it must not be forgotten that the data are different, that music will remain a universe in itself with its own inherent problems-even though the computer will play a part in deciding what these problems are to be.
Hiller's survey [34] includes descriptions of work in several other European countries. Especially interesting among these is the work of Papworth to solve some of the problems of "change ringing," that is, the various permitted permutations of ringing a set of bells that have been developed
over several hundred years of this activity in England. The basic problem, as described by Hiller ([34] pp. 84-85), follows: Given the numbers, 1, 2, . . . , n, representing church bells of different pitches in descending order, find rules for generating in some order all n! permutations or "changes" or subsets thereof. However, the following restrictions must be observed: (1) the first and last permutations of any sequence or subsequence must be the original row, 1, 2, . . . , n, which is known as a "round"; otherwise, no two rows may be the same; (2) in any row, no number may occupy a position more than one place removed from its position in the preceding row; (3) no number may remain in the same position in more than two consecutive rows.
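The quoted restrictions can be checked mechanically. The following minimal sketch (in Python, and not anything Papworth wrote) tests a sequence of rows against the three rules, using "plain hunting" on four bells as a small worked example:

def legal_transition(prev_row, next_row):
    # Rule (2): no bell may move more than one place between successive rows.
    return all(abs(prev_row.index(b) - next_row.index(b)) <= 1 for b in prev_row)

def legal_sequence(rows):
    n = len(rows[0])
    rounds = tuple(range(1, n + 1))
    # Rule (1): starts and ends with rounds, and no other row repeats.
    if rows[0] != rounds or rows[-1] != rounds:
        return False
    interior = rows[1:-1]
    if len(set(interior)) != len(interior) or rounds in interior:
        return False
    # Rule (2): adjacent rows differ only by swaps of neighbouring bells.
    if not all(legal_transition(a, b) for a, b in zip(rows, rows[1:])):
        return False
    # Rule (3): no bell keeps the same position in more than two consecutive rows.
    for i in range(len(rows) - 2):
        for pos in range(n):
            if rows[i][pos] == rows[i + 1][pos] == rows[i + 2][pos]:
                return False
    return True

# Plain hunting on four bells: each bell traces a zig-zag path back to rounds.
plain_hunt_4 = [
    (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), (4, 2, 3, 1),
    (4, 3, 2, 1), (3, 4, 1, 2), (3, 1, 4, 2), (1, 3, 2, 4),
    (1, 2, 3, 4),
]
print(legal_sequence(plain_hunt_4))   # True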
Papworth was able to solve the system of change ringing known as "plain bob major." He was concerned, first with proving that "plain bob major" is sufficient to generate all 8! or 40,320 possible permutations of eight numbers, and second, with generating sample compositions starting with random numbers. Specifically, each successive "lead-end" (the first row of each treble lead) was tested against all previous lead-ends and stored if new. Alternately, it was rejected and a new lead-end generated. Papworth says that his greatest difficulty involved making a composition end with rounds at the correct point. To achieve this, he found it necessary to write "alternation routines" to have the composition come out right. (Hiller [34] p. 85)
2.2 Music Composed For Performance by Electronic Synthesizer
Probably the most striking development in the past decade in the field of music composition has been the wide acceptance of the electronic synthesizer. Almost every campus with a composition program has a Moog, Buchla, or other synthesizer as part of its equipment, and many composers who, until a few years ago, were content to work in other idioms have turned to the synthesizer as an important tool for composition. It is in no sense replacing the performer, but rather complementing him. The total range of musical sounds and means of expression has been expanded by the synthesizer. Indeed, "Moog," for example, is almost a household word, thanks to a famous record by Carlos, Switched on Bach [15], whatever opinion serious composers may have of that effort. It was inevitable that two streams of effort, computer composition and the synthesis of music, should merge. The merger has taken the form of computer control of sound synthesis by means of the digital-analog computer. The lay reader, if unfamiliar with this technology, would do well to begin with Mathews' book [60], which is a textbook and a manual for use with the MUSIC V programming language. Sample problems and tutorial examples are included. Figure 2 illustrates concisely the basic procedure for converting digital information to sound. A sound can be considered as a time-varying pressure
FIG. 2. Conversion of digital information to sound (from Mathews [60]): a sequence of numbers from the computer is converted to a sequence of voltage pulses, smoothed by a low-pass filter (0 to 5 kHz), and fed to a loudspeaker, producing the desired sound pressure wave.
in the air, and these pressure functions which we hear as sound are generated by applying the corresponding voltage functions to a loudspeaker. The illustration shows how numbers stored in the computer memory are successively transferred to a digital-to-analog converter. For each number the converter generates a pulse of voltage whose amplitude is proportional to the number. These pulses are shown on the graph in the lower part of the illustration. The square corners of the pulses are smoothed with a filter (low-pass filter) to produce the smooth voltage function drawn through the pulses. This voltage, supplied to the loudspeaker, produces the desired pressure wave. Mathews continues this exposition of fundamentals by showing that it is necessary to have a sampling rate equal to twice the desired bandwidth. For example, to achieve high-fidelity sound with a bandwidth of 15,000 Hz, it is necessary to have 30,000 samples per second. The meaning of this last information for the computer is summarized by Mathews ([60] p. 6): We can now begin to appreciate the huge task facing the computer. For each second of high-fidelity sound, it must supply 30,000 numbers to the digital-to-analog converter. Indeed, it must put out numbers steadily at a rate of 30,000 per second. Modern computers are capable of this performance, but only if they are expertly used. We can also begin to appreciate the inherent complexity of pressure functions producing sound. We said such a pressure could not be described by one number; now it is clear that a few minutes of sound require millions of numbers.
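To make the arithmetic concrete, the following minimal sketch (not MUSIC V code; the function name and amplitude convention are assumptions) generates the stream of numbers that would be handed to a digital-to-analog converter for a single sine tone at the 30,000-samples-per-second rate just mentioned:

import math

SAMPLE_RATE = 30_000   # samples per second, corresponding to a 15,000 Hz bandwidth

def sine_tone(freq_hz, seconds, amplitude=0.5):
    # One number per sample: the instantaneous value of the pressure function.
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * k / SAMPLE_RATE)
            for k in range(n)]

samples = sine_tone(440.0, 2.0)   # two seconds of A-440
print(len(samples))               # 60,000 numbers for just two seconds of one plain tone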
Mathews stresses the importance of an absolutely uniform rate of sampling, otherwise the equivalent of flutter or wow in an ordinary tape
recorder will result. He also describes the need for a buffer memory and control mechanisms to permit a uniform rate of transmittal of data ([60] pp. 31-33). Computer composition by means of sound synthesis can become a very expensive procedure in terms of computer time since so much data is needed for even a few seconds of music. Mathews describes some alternative procedures including the storage of samples in computer memory, a sample being read from memory rather than recomputed when that particular sound is needed. Although stored functions take memory space, they save time. Mathews summarizes this part of the argument with these comments ([60] p. 35): We have considered sound synthesis from the position of the computer and it has led us to stored functions. Now let us look from the composer's standpoint. He would like to have a very powerful and flexible language in which he can specify any sequence of sounds. At the same time he would like a very simple language in which much can be said in a few words, that is, one in which much sound can be described with little work. The most powerful and universal possibility would be to write each of the millions of samples of the pressure wave directly. This is unthinkable. At the other extreme, the computer could operate like a piano, producing one and only one sound each time one of the 88 numbers was inserted. This would be an expensive way to build a piano. The unit-generator building blocks make it possible for the composer to have the best of both these extremes.
The unit generator in this system is called the orchestra, and the different subprograms are known as instruments. The subprograms perform functions which experience (including, for example, users of synthesizers such as the Moog) has shown to be useful. Among these are oscillators, adders, noise generators, attack generators, and similar functions found in electronic sound synthesis. The final principle for specifying sound sequences in the MUSIC V system is the note concept. Mathews pointedly remarks that in music "notes have been around for some time" and argues that for practical reasons it is necessary to retain the idea that the sound sequences of the instruments (of the synthesizer) must be chopped into discrete pieces, each of which has a duration time. Once a note is started, no further information is given to the note-the complexity of the instruments determines the complexity of the sound of the note ([60] p. 36). The above remarks can serve only as an introduction to the MUSIC V programs, which represent the present state of many years of work by this pioneer and leading researcher in the field. More will be said about recent work by Mathews in the area of real-time synthesis in Section 2.3 below. Two other writers, Divilbis and Slawson, give basic information on computer synthesis as well as offering variations or alternatives to procedures described so far. Divilbis [21], after describing the computer
process needed to generate a sine tone of one second duration, continues: In general, musically interesting sounds are much more complex than the sine wave example . . . for example, a sound may be the sum of several "voices," each with a distinct pitch, waveform, and envelope. Under these conditions even a very high speed computer cannot calculate 10,000 sample points per second. Instead, the computer evaluates the function at the rate of approximately 1000 points per second and stores these calculated values on magnetic tape. The digital-to-analog converter converts 10,000 sample points per second quite independently of the computation necessary to obtain these points. We see now that the magnetic tape is a vital part of the system since it allows the computer to spend ten seconds (or longer) calculating the sample points for one second of music. The ten-to-one ratio cited is typical although the exact ratio depends on the complexity of the sound generated.
Divilbis carried out his work at the Coordinated Science Laboratory at the University of Illinois using a smaller computer than that available to Mathews and has put particular emphasis on economy of operation. In his article he outlines an approach and system somewhat different from Mathews and one which has certain advantages and disadvantages. It has the advantage of offering some of the elements of a real-time system and the disadvantage of a very limited range of options. Slawson has written a description of a speech-oriented synthesizer of computer music [75] which provides a good introduction to some of the techniques of computer composition. He effectively answers a common question from musicians, namely, how can a synthesizer represent two different "voices" or "instruments" and thus be able to produce polyphonic music. He writes ([75] p. 97): Another troublesome, but on second thought obvious, fact is that the multiple voices in a contrapuntal composition reach the ears, not as "separate" pressure wave forms, but as a single varying pressure (or two varying pressures, one to each ear). At any given time, the instantaneous pressures arising in the several instruments are added algebraically to make up the single pressure that excites our hearing mechanism. Thus, the claim that any sound can, in principle, be synthesized applies as well to "combinations" of sounds. Any waveform is "single-valued"; it has only one pressure at a time.
Slawson clearly states that he is writing about synthesis of computer music, and not about composition of music by computer, and he has voiced his objection to some of the claims and work of those who regard themselves as composers of computer music [76]. Slawson's synthesizer design is based on a model of speech production and he summarizes this "source-transfer function model" as follows ([75] pp. 107-108): Basic to the model are the assumptions that acoustic energy originates at a single point-the "source," that the energy is modified in certain ways by a passive network of resonances, and that the source and the passive network are independent of each other-nothing the source does affects what the resonances do and vice-versa. The ratio of the output of the resonances to the output of the source at all frequencies is called
the "transfer function." In other words, the transfer function is a description of what the resonance network does to any acoustic energy reaching it from the source. The template-like appearance of the transfer function, when plotted as a function of frequency, suggests the more graphic term "spectrum envelope," which is almost a synonym for transfer function. In the speech model, the energy source corresponds to the action of the vocal chords; the resonance network, to the throat and mouth. In the model as adapted in the present programs for the synthesis of music the source is simply a train of pulses unconstrained in either frequency or amplitude. The pulse train may be made to be aperiodic for the production of noisy sounds. The resonance network-the transfer function or spectrum envelope-has four simple resonances (poles whose frequencies and bandwidths are also unconstrained). At present the transfer function can contain no antiresonances (zeros). Each voice in a segment of sound is assigned to one source-transfer function combination. The total number of variables to be controlled in each voice, therefore, is twelve: the source amplitude and frequency, the frequencies and bandwidths of four resonances, a mode variable that determines, among other things, whether or not the source is periodic, and a voice number. Each specification of these variables is fixed temporally by a thirteenth variable-the time in milliseconds between this specification of the variables and the next.
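For concreteness, the thirteen per-voice control variables Slawson enumerates can be pictured as a record like the following sketch; the field names are assumptions chosen for illustration and are not drawn from Slawson's programs.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class VoiceFrame:
    source_amplitude: float
    source_frequency: float                        # Hz
    resonances: Tuple[Tuple[float, float], ...]    # four (frequency, bandwidth) pairs
    mode: int                                      # e.g., periodic vs. aperiodic source
    voice_number: int
    millisec_to_next: float                        # thirteenth variable: time to next frame

# One specification of the twelve controlled variables plus its timing value.
frame = VoiceFrame(0.5, 440.0,
                   ((500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0), (3500.0, 150.0)),
                   mode=1, voice_number=1, millisec_to_next=50.0)
print(frame)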
A synthesizer has, of course, the possibility of producing a vast range of sounds, many of them outside the scope of traditional music notation and instruments. Slawson, to illustrate a programming procedure, chose a sequence of events illustratable in standard musical notation and describes in detail the steps taken to synthesize the excerpt. Figure 3 shows his example ([75] p. 100). A concise summary of Slawson's extended description of this excerpt ([75] pp. 101-107) follows: The statement "SETUP 2" prepares the preprocessor to accept a two-voice texture. The statement BEATI 594 sets the "beat" to its initial value of 594 msecs, or the equivalent of 108 beats per minute.
FIG. 3. Synthesis of computer music (from Slawson [75]): a two-voice score in the preprocessor's input language, built from SETUP, BEATI, VOICE, REST, and FINISH statements together with note statements such as 5A, 1.5, MF, MF.
VOICE 1 signals the beginning of the first voice. The note is A in the fifth octave, 5A (A-440); its duration is "1.5" beats, and its beginning and closing loudness is MF (mezzo-forte). The dash (-) at the close of the seventh statement calls for a slight accent with a numerical equivalent of 1.5. From the limited information given here one can translate the program as it relates to the music example. Slawson adds much detail to this brief summary, and illustrates how one may call for glissandi, crescendi, and decrescendi over several events, precise control of staccato, and so forth. The various programs for music synthesis developed by Mathews are used by a number of composers. In some cases they have been modified according to needs of a particular hardware configuration or to the idiosyncratic needs of a particular composer. One of the first variants of the MUSIC IV program was written by Howe and Winham at Princeton. Known as MUSIC 4B, it was designed for the IBM 7094 computer and was later revised into the MUSIC 4BF program (FORTRAN version) for third generation hardware. Randall, who uses this program at Princeton, has written speculative essays on various aspects of acoustics and music [69] and has had computer compositions on recent record releases [68]. Dodge's use of MUSIC 4B [22] stresses performance more than composition. More recently Howe has written a version, MUSIC 7, for use on the X.D.S. Sigma 7 computer [41]. Howe's broad outline of the way the program proceeds is generally descriptive of the MUSIC 4 and the MUSIC 5 programs ([41] p. 0.1): The program normally makes three "passes" over the score. Each pass processes all of the data and transmits its results to the next pass by means of a scratch file. Each card in the score specifies one operation and 12 parameters or variables to the program. Pass 1 reads and prints the score, providing a record of the input data. The user may generate or modify his score by means of optional subprograms during this pass. The input data to Pass 1 is normally a deck of punched cards, but it may also be a file prepared at an ASR teletype. Pass 2 sorts the elements in the score into chronological order and performs a number of additional options, such as converting starting times and durations from beats into seconds. Finally, the revised score is printed showing all modifications. Pass 3 now reads the score and calls the user's orchestra to generate the music specified. Pass 3 prints a record of the performance showing optional function displays and a record of the maximum amplitude reached in each output channel during each time segment. Only Pass 3 requires a significant amount of computer time.
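The three-pass organization Howe describes can be pictured with a toy sketch; the score format (a list of event tuples) and the function names below are assumptions chosen for illustration, not MUSIC 7's actual interfaces.

def pass1_read(score_cards):
    # Read and print the score; optional user subprograms could modify it here.
    events = [tuple(card) for card in score_cards]
    for e in events:
        print("PASS 1:", e)
    return events

def pass2_sort(events, beat_seconds):
    # Sort events chronologically and convert start times and durations from beats to seconds.
    converted = [(start * beat_seconds, dur * beat_seconds, *rest)
                 for start, dur, *rest in events]
    return sorted(converted, key=lambda e: e[0])

def pass3_perform(events, orchestra):
    # Call the user's "orchestra" to generate the sound for each event (the costly pass).
    output = []
    for start, dur, *params in events:
        output.extend(orchestra(start, dur, params))
    return output

# A toy "orchestra" that merely reports what it would synthesize.
toy_orchestra = lambda start, dur, params: [f"{dur:.2f}s of sound for {params}"]
score = [(0, 1, 440), (2, 0.5, 660), (1, 1, 550)]   # (start beat, length in beats, frequency)
print(pass3_perform(pass2_sort(pass1_read(score), beat_seconds=0.5), toy_orchestra))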
Batstone [4] uses the MUSIC 4BF program in operation with a CDC 6400, with conversion done via a Librascope L3055 computer and a DACOFIL-V digital-analog converter. Final recording equipment includes a Dolby A-301 noise reduction system, and work is in progress on the ingenious idea of developing a programmed function to approximate the Dolby in record mode at the time the digital tape is written.
Whereas the work of Mathews (in the MUSIC V program) is based on mathematical simulation of oscillators, Hiller and Ruiz [39] propose to set up differential equations describing the motions of vibrating strings, bars, plates, membranes, spheres, and so forth, and to mathematically define the boundary conditions in such a manner as to synthesize a tone produced under varying conditions. Plucked and bowed string tones have already been simulated with varying degrees of success by Ruiz, and the sequence of steps may be summarized as follows [39]:
(1) The physical dimensions of the vibrating object and characteristics such as density and elasticity are used to set up differential equations describing its motion.
(2) Boundary conditions, such as the stiffness of a vibrating string and the position and rigidity of its end supports, are specified.
(3) Transient behavior due to friction and sound radiation is defined.
(4) The mode of excitation is described mathematically.
(5) The resulting differential equations are solved by means of a standard iterative procedure with the aid of a computer.
(6) Discrete values of the solution corresponding to the motion of a selected point of the object are written on a file of digital magnetic tape.
(7) The numerical values on this tape are converted into analog signals by means of a digital-to-analog conversion and recorded on audio tape.
(8) A few cycles of the solution are also plotted by means of a microfilm plotter in order to compare the visual appearance of the vibration with the sound of the tape.
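A minimal sketch of steps (1)-(6) for the simplest case, the ideal (lossless) string with rigid ends, is given below; it uses a standard explicit finite-difference iteration and is written in Python for illustration (Ruiz's actual programs, described next, were in FORTRAN and COMPASS and also modeled stiffness, friction, and radiation losses).

N = 100                 # interior spatial points along the string
SAMPLE_RATE = 30_000
courant = 1.0           # time step chosen as dx / c, so the scheme is stable

# Initial pluck: triangular displacement, string initially at rest.
pluck_at = N // 4
y_prev = [min(i / pluck_at, (N - 1 - i) / (N - 1 - pluck_at)) for i in range(N)]
y_now = y_prev[:]        # zero initial velocity
pickup = 3 * N // 4      # the selected point whose motion is recorded
samples = []

for _ in range(SAMPLE_RATE):                 # one second of sound
    y_next = [0.0] * N                       # rigid ends: endpoints stay at zero
    for i in range(1, N - 1):
        y_next[i] = (2 * y_now[i] - y_prev[i]
                     + courant ** 2 * (y_now[i + 1] - 2 * y_now[i] + y_now[i - 1]))
    samples.append(y_next[pickup])
    y_prev, y_now = y_now, y_next

print(len(samples))   # 30,000 numbers ready for digital-to-analog conversion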
The programs developed by Ruiz for this project were written in FORTRAN, and more recently in COMPASS, the CDC-6400 assembly language. The project continues at the University of Buffalo under Hiller's direction. In some synthesis systems there is an important role for computer graphics. The "composer" can learn to work with a system of graphic displays which, even if entirely different from standard musical notation, will give him a visual description of the present state of his program or of the components making up a particular sound. Mathews and Rosler [62] developed a graphical language for portraying the "scores" of computer-generated sounds. Mathews' GROOVE system, described in Section 2.3, uses a CRT display as an important element in a real-time situation. Henke [33] has developed an interactive computer graphics and audio system at M.I.T., which emphasizes economy of operation and makes effective use of the CRT display. At present the emphasis is on control of timbre, using a program in FORTRAN IV, known as MITSYN (Musical Interactive Tone Synthesis). A totally different approach to music "composition" by computer is found in the work of Rosenboom.
He makes use of the ability (inherent or teachable) of some persons to control their own heart rate, blood pressure, or brainwaves. By using electroencephalographic alpha feedback as an input signal he produces a "music" which has been generated by the brain waves of the subject(s). In a summary paragraph he describes one of his procedures as follows ([71] p. 14): The system . . . includes the triggering of a sequence of sound events by the ARP synthesizer and associated equipment, rather than just on-off or amplitude indicating. Each synchronous alpha burst, sensed by a high quality analog multiplier and followed by a threshold detector in an EXACT voltage controlled sweep generator, by pairs of participants triggered a slow rising sweep of the harmonic series which slight, automatically induced sequence changes each time through initiation of voltages that determined the starting pitch of the resonant filter, provided by a sample hold circuit being fed filtered white noise, and some randomly introduced bell like accentuations of various harmonic tones, produced by shocking the resonant filters with narrow pulses, at the attack initiated by each alpha burst and throughout the sequence. The result was indeed a music which is that of inner space, etherial, yet soothing and representative of consciously experiencable universal quantifiers of tension-integrity universe.
2.3 Real-Time Composition by Digital-Analog Computer
In much of the work described thus far there is a significant disadvantage in music synthesis by computer compared to similar work in the electronic studio. This disadvantage has been the lack of real-time control of the resultant timbres, durations, attacks, and other components of the output. The composer in the electronic studio can manipulate his patch cords and various controls until he has the exact effect he wants. The computer composer has had to live with the necessary delays in getting his material processed. The development of real-time composition has been the subject of a major effort by several persons in the field. Two projects, those by Clough [16, 17] and by Mathews and Moore [61], are reported here. Clough [17], assisted by Leonard of Digital Equipment Corporation, has developed a system known as IRMA (Interactive Real-Time Music Assembler). In this system a segment of computer main memory called the "measure" contains acoustic samples that are treated as a digital loop, which is analogous to a tape loop. The contents of the measure are continuously and repeatedly converted to analog form and monitored by the composer-operator, who may affect the contents of the measure from the console typewriter without interrupting the conversion process. The measure may be modified by insertion and deletion of specified events, reverberation, volume control, change of starting point and duration on a specified or random basis, selected erasure of any measure segment, and execution of macro commands which the composer has previously defined.
FIG. 4. Block diagram of the GROOVE system, showing the computer with its magnetic disk file, display, typewriter, A-to-D converter, and sampling oscillator. From Mathews and Moore [61]. Copyright © 1970, Association for Computing Machinery, Inc.
The user may also call, for example, for a "jumble," which delivers a series of loop segments whose lengths are determined by random selection of values between progressively changing minimum and maximum limits. A more elaborate system, and one which requires special hardware, is GROOVE, developed by Mathews and Moore [61] at Bell Telephone Laboratories. Although described as a general purpose program, with potential applications to various automated processes, its initial application has been to control an electronic music synthesizer. Figure 4 ([61] p. 716) shows a block diagram of the GROOVE system. Important elements in the real-time situation of this system are the seven real-time inputs to the multiplexer and analog-to-digital converter. Four of these voltage inputs are controlled by potentiometers. The other three are from a three-dimensional linear wand, nicknamed the "joy-stick," which projects from a square box. Other equipment includes the input typewriter, a specially built piano keyboard input device, and an oscilloscope. Figure 5 shows a block diagram of the steps that can be followed in carrying out a real-time decision in the GROOVE program. The operator can change the value of a specified disk function without, however, losing the older function. After the new function is sampled and approved it can be made permanent and the old function removed. The actual modification of a function is carefully controlled by the various input devices and the results are not only heard but are visually observed on the oscilloscope. One of the important features of GROOVE is the flexible control of
"program time," which is summarized by Mathews and Moore, who write ([61] pp. 719-720): Coarse control of "program time" may be accomplished by typing TIME N, where N is a disk buffer. If N = 0, the computer will simply go back to the beginning of the disk functions. At any point, we may set a switch which will cause the computer to recycle continuously through the same disk buffer. We may also slow down the progress of program time by reducing the frequency of the interrupt oscillator. Or we may stop the progress of time altogether by throwing a switch which essentially tells the computer: "Don't progress time normally at all, but instead, use the value of a knob to give the current position of time within one disk buffer." The z axis of the 3-dimensional wand is drafted for this task, since moving it from left to right most resembles the perceptual task of moving a time pointer along an abscissa. Along with the visual display of the time functions and the perceptual feedback from our controlled system, we now have a fine control over the exact position of program time. This is a very powerful feature of the editing system.
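As a rough illustration of the "program time" idea only (this is not GROOVE's code or interface; the class name, method, and the 0-to-1 knob range are assumptions), the control described in the quotation can be pictured as a time pointer that normally advances with the sampling clock but can be recycled, frozen, and scrubbed by a knob:

class ProgramTime:
    def __init__(self, buffer_len):
        self.buffer_len = buffer_len
        self.sample = 0          # current position within the disk functions
        self.frozen = False

    def tick(self, knob=0.0, recycle=False):
        # Called once per interrupt from the sampling-rate oscillator.
        if self.frozen:
            # Fine control: the knob value gives the position within one buffer.
            self.sample = int(knob * (self.buffer_len - 1))
        else:
            self.sample += 1
            if recycle:
                self.sample %= self.buffer_len   # loop through the same disk buffer
        return self.sample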
FIG. 5. Block diagram of GROOVE program (entry by interrupt from the sampling-rate oscillator; read instructions from typewriter; unpack one sample of the disk functions and read current knob values; compute one sample of any periodic functions, of the new disk functions, and of the 14 output functions; initiate outputting of the CRT display; service disk I/O requests; return from interrupt). From Mathews and Moore [61]. Copyright © 1970, Association for Computing Machinery, Inc.
Mathews and Moore point out that, using these features, it is quite possible to stop in the middle of a run and “tune up a chord” and then go right on. The present short description can outline only a few of the features of this system. The GROOVE program represents an important development in the effort to achieve real-time synthesis of musical sounds.
3. Music Research Using the Computer
3.1 Problems of Music Representation
The computer is a poor musician: it cannot read music. Although flippant, this statement poses one of the first problems confronting a person wishing to use the computer to solve an analytical problem. (An even more basic problem is precisely describing the problem to be solved, although the researcher may not be aware of this fact at first.) The notation of music, as it has gradually evolved in the West, is complex and two-dimensional. Each character depicts pitch, a “vertical” dimension in the notation, and rhythm, a “horizontal” dimension in the notation. The computer, designed for scientific and business applications, has no particular qualifications for use by the musician. The researcher must develop a way of translating musical notation into machine-readable language or, more precisely, into a music representation. There are almost as many music representations as there are researchers. It is interesting to study various representations and to see the solutions to the basic questions which quickly come to the fore. Should the representation be determined by the needs of the programmer and the characteristics of the programming language? Should the researcher encode only those elements of notation of interest to him in his project, or should he represent all elements of notation with the view of meeting possible future needs or the needs of others who might use his data? Finally, should he develop a representation that is easy to program but tedious to encode, or a representation more quickly encoded but more difficult to program? These are difficult questions. Some argue that the questions are not important because translation programs will make it possible to move easily from one representation to another. A comparison of just three languages, ALMA [32], Ford-Columbia [5], and “Plaine and Easie” [12], illustrates the complexities of translation. As might be expected, researchers working with FORTRAN show a preference for numeric representations. Baroni [3] at the University of Bologna encodes six-digit numbers, in fixed fields, for each note, with each digit standing for a different component of the note, e.g., duration, pitch, octave register, etc. With such an encoding, using FORTRAN, it is relatively easy to make studies of any particular component.
Encoding and keypunching a six-digit number for every note is, however, more time consuming than with most other representations. Most representations attempt to make use of mnemonics wherever possible as a means of speeding encoding and also to make it possible for the researcher to “read” back the representation as a melody. A brief survey of several representations in terms of just two components of notation, pitch and duration, will show the variety of possibilities available to the researcher. If a mnemonic is used for pitch designation it is necessary to indicate an octave designation, since pitch names are repeated in each octave. Asuar [1a], for example, uses the solfeggio terms for scale degrees (Do, Re, Mi, Fa, etc.) and up to eight octave designations; thus MI5 is fourth-space E. The next obvious mnemonic for pitch designation is simply the letter name of the note, and here again octave designations are necessary. In ALMA, a system developed by Gould and Logemann [32] as an expansion of Brook’s “Plaine and Easie Code” [12], apostrophes precede notes for registers above middle C and commas precede notes for those below middle C. For example, E would designate the same note as MI5 in Asuar’s system. Wenker [80] uses plus or minus signs, instead of commas and apostrophes, to indicate pitches in octaves above or below the octave lying above middle C. In both systems duration is indicated by the numerical mnemonics of 1 for whole note, 2 for half note, 4 for quarter note, etc. These or similar mnemonics for duration are common to several representations. Jackson and Bernzott [46] use an entirely different mnemonic system for designating pitch. They simply number the keys of the piano 1-88 and designate any pitch according to its corresponding position on the piano keyboard. Pitches beyond the piano’s range are represented by minus numbers for those in low register, and 89-98 for those in higher register. The number 99 is reserved to designate a rest. In this system, enharmonic equivalents are not distinguished, e.g., G sharp and A flat are numbered the same. If necessary, provision is made for a distinction, but it calls for a second pitch card. Jackson and Bernzott do not encode duration per se, but indicate it by assigning a fixed number of columns to various rhythmic values. For example, one measure of 4/4 meter would take 64 columns. It is apparent that this system calls for many more columns of information per note of music than others under discussion.
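As an illustration of the keyboard-numbering idea, the sketch below converts a note name and octave into a key number from 1 to 88, with 99 reserved for a rest, in the general spirit of the Jackson and Bernzott scheme. It is written in Python for illustration only; the exact conventions of their system may differ, and the names are invented.

```python
# Sketch of a piano-key encoding in the spirit of Jackson and Bernzott:
# pitches become their key number on an 88-key piano (A0 = 1 ... C8 = 88),
# 99 encodes a rest, and enharmonic equivalents (G sharp / A flat) coincide.
# The exact conventions of the published system may differ.

PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
               "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
               "A#": 10, "Bb": 10, "B": 11}

REST = 99

def piano_key(name, octave):
    """Map a note name and scientific octave (middle C = C4) to a key number."""
    return 12 * (octave + 1) + PITCH_CLASS[name] - 20

assert piano_key("C", 4) == 40                     # middle C
assert piano_key("A", 0) == 1                      # lowest piano key
assert piano_key("G#", 4) == piano_key("Ab", 4)    # enharmonics coincide
```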
One of the better known music research projects using the computer is the “Josquin Project” at Princeton University, dedicated to the study of the Masses of the Renaissance composer Josquin des Prez. Early stages of this project included the development of a programming language for Music Information Retrieval (MIR), first written by Kassler [50] for the IBM 7094 computer, revised by Tobias Robison, a professional programmer, and rewritten for the IBM 360 series by John Selleck, a research assistant at Princeton. The input representation for this language is known as IML (Intermediary Music Language), developed by Jones and Howe at Princeton. In the IML-MIR system the convention of indicating duration by the mnemonics of 1 for whole note, 2 for half note, etc., as already described above, is adhered to. However, pitch is encoded by a numerical designation of staff position, that is, a numbering of staff lines and spaces from ledger lines well below the bottom line of the staff to ones well above the staff ([50], p. 304). This dispenses with the convention of octave registers found in several of the representations already discussed.
Probably the most widely used system for encoding music in machine-readable form is DARMS (Digital Alternate Representation of Music Symbols), generally known as the Ford-Columbia Representation, which was developed by Bauer-Mengelberg [5] as part of a larger project in automatic music printing, financed by the Ford Foundation through Columbia University. It is widely used not only for its efficiency and speed of encoding but also because it was the representation taught at the Binghamton Seminar in Music Research and the Computer in 1966. In Ford-Columbia, mnemonics are used for rhythms, but instead of numbers (e.g., 2 for half note) the note values are indicated by letter, for example, Q for quarter note, H for half note, etc. Pitch is indicated by staff-position codes, in a manner similar to IML-MIR (see above), but with a numbering system which is more easily remembered by the non-musician encoder, that is, simply 1, 3, 5, 7, 9 for staff lines and 2, 4, 6, and 8 for staff spaces. With the use of abbreviation conventions, the representation is very compact, with a single number often sufficing for a note. Because it is graphics-oriented, the Ford-Columbia system can be encoded by clerical help unskilled in music, although it is the author’s experience that student musicians will do the job much faster, a statement which is probably true for any representation. The usefulness of the Ford-Columbia Representation will be greatly increased with the completion of a software package in PL/I currently under preparation by Raymond Erickson under a Fellowship at the IBM Systems Research Institute in New York.
FIG. 6. Gould’s encoding system for Gregorian chant (the example reads 3C VCBA). From Gould [31].
FIG. 7. Encoding system for Byzantine chant. From Schiødt and Svejgaard [72].
The three main programs in this package are:
(1) a syntax-error detector, performing a parse on the data to evaluate correctness of form of the raw dataset;
(2) a display of the “valid” (corrected) dataset in some two-dimensional form, illustrating vertical alignments encoded in the data string and stratifying the information to facilitate checking of pitches, rhythmic values, etc., against the original score;
(3) transformation of the data (still in string form) into a set of cross-indexed tables for musical analysis purposes.
Certain repertories of music use a notation, or call for transcription in a notation, which differs from the standard Western notation used in the above representations. The Gregorian chant repertory is still notated with special characters on a four-line staff. Gould [31] developed an encoding system for keypunching the Liber Usualis, the principal book of chant in current usage. His system is illustrated in Fig. 6. The 3C means a C clef on the third line. The letter V stands for the notational sign known as the Virga, and VC means a Virga on note C. The B and A signify the notes by the same letters. Other more complex notational patterns are handled in a similar manner. Two Danish scholars, the musicologist Nanna Schiødt and the mathematician Bjarner Svejgaard [72], used a code system of letters for neumes and numbers for accent and rhythm neumes to encode Byzantine chant melodies. An example of this representation is shown in Fig. 7, in which I stands for Ison, E for Elaphron, P for Petaste, A for Apostrophus, and X for Oxeia. The rhythms are indicated by 1 for Bareia, 6 for Diple, and 8 for Tzakisma. Medieval notation has note values and groupings of notes rarely found later than the sixteenth century. Erickson, in the project described in the preceding paragraphs, has expanded the Ford-Columbia Representation to include symbols for longa, breve, and the various ligatures (groupings of two or three notes) found in this notation.
Instrumental music in past centuries was often notated in what is known as tablature, in which notes are not used, but instead numbers and letters are written on the lines of a staff.
FIG. 8. Wenker’s representation. From Wenker [80].
In the case of string instruments such as the guitar or lute the lines stand for the strings of the instrument and the symbols indicate the position of the fingers on the string. Thus any system of encoding this music must translate the tablature into modern notation. Certain performance practices create difficulties in this work, since the duration of a tone was not always specifically notated but must be inferred from the notes (and harmonies) immediately following it. Hultberg has developed an intermediate program which reads an encoding of Spanish lute tablature and translates it into Ford-Columbia Representation. A vast repertory of lute music from the Renaissance awaits transcription into modern notation. With the development of automated printing of music, Hultberg’s intermediate program and tablature encoding system would make possible more economical publication of this material. The whole field of ethnomusicology raises complex problems in music representation. The field worker and the person transcribing recordings into modern notation must indicate microtones and other instances of performance and musical tradition not encompassed in Western notation. Ethnomusicologists are not in complete agreement among themselves on the notation of music from other cultures. An idea of the complexity of the problem confronting these researchers can be gained from an example of Wenker’s representation [80], as illustrated in Fig. 8. The use of portamento (P), hold (H), slightly flatted pitch (-), quarter tone flat (-2), etc., is illustrated. If required, Wenker’s representation allows for the designation of pitch in terms of cycles per second, as well as in cents or savarts. The above remarks on music representations have made comparisons only on the basis of the encoding of pitch and duration (rhythm) of a note. The variety of approaches among the different systems becomes even more apparent when considering such other components as accidentals, groupettes (triplets, etc.), slurs, ties, expression marks, etc. Another system of entering music information is the use of a piano or organ keyboard linked to an analog/digital computer. Recent work on this possibility has been carried out at the University of Utah by Ashton [1] and by Knowlton [50a].
3.2 Music Theory and Style Analysis
A wide range of activities is encompassed in the broad area of music theory and style analysis. While there are differences between the two fields, there is much overlapping in procedures and techniques, and for our purposes the two are discussed together. Both the music theorist and the musicologist make use of a variety of analytical tools which lend themselves to computerization. Among these are interval counts; citations and descriptions of such components as particular rhythmic patterns, specified intervals, or uses of accidentals; statistical counts of chord roots, inversions, or variants; and correlations of text accents with metrical or agogical accents, to name but a few of the many attempts to use the computer effectively as a tool for research. The amount of data and the range of analysis may vary greatly from one project to another. The computer can be as useful for simple operations on thousands of short melodies in a thematic index as it is for multiple analyses of one particular composition. To date, most publications of computer-assisted analysis have stressed methodology, as might be expected in a new field. Probably the most difficult problem confronting the theorist is not how to measure, but what to measure. Because of the demands for precision in this work, the researcher is obliged to define his problem in the most specific terms, and it is this precise structuring of the problem and the means of attacking it which has proved a stumbling block in many projects. Music is one of the arts, and objective measurement, while superficially easy (anyone can count intervals), can be meaningful only in a carefully defined and structured problem. Researchers are also handicapped by the lack of a consistent terminology for the components of music. There are several widely used terminologies for descriptions of vertical chord structures, to say nothing of several schools of thought on chord functions. Because of the multiplicity of problems one might be led to agree with Kassler, who writes [49]:
. . . it is easier today to keypunch music than to process significantly the musical data so obtained. Would not effort expended to construct suitable theories be a more progressive investment at present than participation in standards committees [to develop a common music representation]? . . .
The survey which follows will show the range of historical periods, types of problems, and approaches to their solutions which can be found among contemporary researchers. Compositions from all historical periods have been the subject of computer-assisted studies. One of the earliest repertories is Gregorian chant, with some two thousand melodies and a long history of change and development. These melodies were often borrowed by polyphonic composers throughout the middle ages and the renaissance, and the borrowings were not only from the beginnings of melodies but also from internal phrases. Gould’s encoding system for keypunching the Liber Usualis has already been discussed.
Gould set up tables of various patterns of two to seven notes (these groupings are known as neumes in chant notations) and showed how a simple alphabetical representation of these patterns could be used as basic material for analytical work. Selleck and Bakeman [73] used a few chants as material for a pilot project in analysis. They were interested in the identification of melodic patterns or “cells” which may recur in various places in a chant melody. Their first analysis compared the content of every phrase with that of every other phrase, yielding an indication of identity (represented by one) or nonidentity (represented by zero). Further tests cited occurrences of large cells made up of groupings of small cells, usually two or three notes in length. One of the interesting results of their work showed that “in a larger chant such as the Mode V Credo with 43 phrases, almost none of which are identical, the identification by machine of the cells and their arrangement afterwards reveals the same three ‘ur-phrases’ for this piece.” It was their experience that computer techniques “not only allow the musical analyst to ask questions, the solutions of which would otherwise be beyond the range of practicality, but the data so generated often suggest new approaches, new problems that would not be suggested by the original material itself.” Schiødt and Svejgaard, in their work with Byzantine chant as cited above, also looked for configurations of neumes common to various chants. A particular configuration is known as a “formula,” and a long chant is made up of certain arrangements of formulas. It was found that the strict laws of the formula scheme were abandoned at the textual and melodic climax of the chant. The computer can be queried for the location of any particular formula in the total body of chant melodies. For example, “if we ask the computer to find A, B, C, and D in the encoded material, the answer from the machine concerning this hymn will be Hymn No. 32 A.a.7, B.d.11, A.a.23, etc., meaning: In hymn No. 32 the formula A arrives on neume (or neume group) No. 7 and starts on the tone a, formula B arrives on neume No. 11 and starts on the tone d, etc.”
The musicologist working in medieval and renaissance music has a host of problems in collations of manuscript sources, identification of cantus firmi, searches for recurring rhythmic patterns, and questions of borrowings, to say nothing of the broad problem of describing a particular composer’s style. The computer is being used in a remarkable variety of projects, ranging from simple statistical counts to complex analyses of harmony, counterpoint, rhythm, and structure. An example of the varieties of information to be sought in a particular body of music or repertory is provided by the projects of Karp [47]. His recent work with medieval melodies illustrates several types of searches of the simply encoded material.
The program is designed to analyze music set to “lyric” texts, i.e., stanzaic forms with not more than 20 lines of verse per strophe. The output includes a line-by-line printout of the music. If the musical structure is different from the poetic structure, the printout is twofold, first by musical structure, then according to poetic structure. Various factors contributing to rhythmic design are analyzed, including the degree of floridity for the entire piece and for each line, a comparison between the floridity of the opening and concluding sections, the degree of rhythmic homogeneity, and the increase or decrease of motion from beginning to end of a phrase. Modal structure is indicated and the melody is identified by class, range, and tessitura, the latter with peak and low point recorded. Percentage summaries of pitch content are provided for each work, and a series of probability profiles will be generated for each class of melody, together with information regarding melodic borrowing. Karp began his work using FORTRAN and has continued with that language, although he recognizes that other languages lend themselves better to the string manipulations called for by his analysis.
Brender and Brender also work with medieval music, concentrating on the complex problems of mid-thirteenth-century notation [10]. In this notation the duration of a note is determined not only by its graphic form but also by its context with other notes, and the rules for transcription into modern notation are quite complicated. The Brenders encode the music with a simple representation which describes the graphic arrangement of the notes. The encoder makes no decision regarding the rhythmic pattern to be derived. This is left to the computer program, which uses a technique called “sliding pattern match,” and the output cites the rule number which was applied to achieve each grouping. Once the transcription is achieved, several analytical programs are employed. Among these are measurements of voice crossings, tabulation of melodic intervals in each part, and a listing of harmonic intervals on a two-dimensional chart. The study of melodic intervals includes an interesting study of “average rate of movement.” Brender and Brender write ([10], p. 206):
Each melodic interval is divided by the duration of the first note of each pair and the average taken over the whole melodic line. By analogy to physical terms, the interval represents a change of position and the duration the time for that change; hence the quotient is a sort of velocity or speed. It was expected that, averaged over the entire piece, this quantity would be close to zero, in some cases slightly negative, i.e., that, on the average, increasing interval rates are as common as decreasing interval rates. It was surprising to observe that this quantity was consistently negative. Only one out of the twenty-one lines analyzed had a positive value. This means that descending lines tend to change or move faster and over larger intervals than ascending lines. The motetus was normally the most negative, followed by the triplum, then the tenor.
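The quantity described in this quotation can be computed directly from a list of pitch and duration pairs. The sketch below, in Python and with an invented melody, simply restates their definition; it is not the Brenders’ program.

```python
# "Average rate of movement" as described by Brender and Brender:
# each melodic interval (in semitones) is divided by the duration of the
# first note of the pair, and the quotients are averaged over the line.

def average_rate_of_movement(line):
    """line: list of (pitch_in_semitones, duration) pairs for one voice."""
    quotients = [
        (line[i + 1][0] - line[i][0]) / line[i][1]
        for i in range(len(line) - 1)
    ]
    return sum(quotients) / len(quotients)

# A descending line yields a negative value, as in their observation.
melody = [(67, 1.0), (65, 0.5), (62, 0.5), (60, 2.0)]
print(average_rate_of_movement(melody))   # negative: the line descends
```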
In the harmonic analysis each triad is displayed on a two-dimensional chart (Fig. 9), with the left ordinate representing the low-to-inner interval and the horizontal ordinate the low-to-high interval of the chord. The resultant chart provides a graphic description of overall chord usage in each piece.
FIG. 9. Graphic harmonic analysis of Cruci Domini-Cruci Forma-Portare. Note that the major triad (circled) occurs eight times, and that an octave with an inner fifth (boxed) occurs thirteen times.
In another study of medieval music, Erickson [24] has attacked the complex question of rhythm in twelfth-century music as represented in the Magnus liber organi of Leonin. This much-studied collection has been the subject of debate on interpreting the notation of rhythm. Using a modified Ford-Columbia Representation, Erickson developed a “phrase-structure analysis” program which uncovered a number of significant stylistic characteristics of this music. It was discovered “that (1) some ten formulae (incorporating at least three turning points) account for the beginnings of almost half of the more than 900 duplum ordines tested; (2) specific melodic contours are associated with specific pitch levels and tenor notes; (3) families of organa can be grouped according to the use of certain melodic formulations; (4) a given melodic cliche often recurs several times in the course of a single composition; (5) the phenomenon of organum contrafacts, hitherto unrecorded in the literature, is to be found in the Magnus liber.” A distinctive feature of Erickson’s work is his use of syntax-directed compiling techniques to transform the data as encoded (in string form) into internally stored tables. Specifically he implemented, in PL/I, the Cheatham-Sattley compiler [23, 25].
Beginning with the Renaissance the problems of notation and rhythm are not as tangled as in medieval music, but there are other vexing questions which seem to lend themselves to computer-oriented procedures. The sheer quantity of manuscript and printed sources and the frequent confusion in attributions have always proved a challenge to maintaining control of data in hand card file systems. The apparent stylistic homogeneity among composers, and within the total repertory of any single composer, has made dating of works difficult. One must stress the word apparent in the previous sentence. Much progress has been made in defining the components of a musical style, and new tools for measuring them have been developed by theorists and historians. With carefully structured procedures one can define stylistic differences among compositions which past historians have tended to lump together into one style. In spite of this progress, however, there is still a strong element of intuition which enters into much stylistic analysis. The researcher using the computer does not want to sacrifice this element of intuition, and yet is compelled to define his problems in the very precise terms demanded by computer-assisted analysis. It can probably be argued that these more precise definitions of problems have contributed as much to date as have the results showing on computer printouts.
The Josquin project at Princeton is a long-range effort to analyze the Masses of Josquin des Prez in order to define certain stylistic characteristics and help to develop a more accurate chronology of the works than is now known. Lockwood [58] has described efforts to determine those points in the Josquin Masses at which musica ficta might be required. This involves citing the locations of melodic leaps of the augmented fourth or the presence of that interval in a vertical sonority. Other tests are being devised to determine the range of each voice and the median note (pitch) of each voice as well as for the whole composition. There is evidence that a composer may adjust these matters slightly depending upon his geographical location or the singers available to him. By developing various statistical profiles of a genre of composition, one can argue for certain groupings and perhaps, eventually, establish the chronological sequence of a composer’s works. This work is very tentative, no matter by whom it is being carried out, because there is yet so much to be learned about the validity of various measurements.
An interesting attempt to apply certain standard numerical methods, already used in other disciplines, to musical style analysis has been made by Crane and Fiehler [20]. They argue that three classes of characters may be distinguished: two-state characters, which are either present or absent in a composition; multistate characters, such as meter in a set of works that show several different meters; and continuous characters, those that have any value within a certain range, such as beats per chord. Using standard formulas, the coefficients of association, correlation, or distance are measured. “The result of the affinity computations will be a matrix like a mileage-between-cities table, whose columns and rows are headed by the identifications of each work. At the intersection of row i and column j will be entered the affinity between works i and j ([20], p. 215).” Of the various ways available to show the distribution and clustering of points, the authors have chosen a dendrogram to represent this information graphically. Figure 10 illustrates the clustering of a group of twenty chansons. In their article on this work the authors also briefly describe other possible statistical techniques which can reveal new insights from stylistic data. Among these techniques are seriation analysis, factor analysis, and discriminant analysis.
FIG. 10. A dendrogram showing twenty chansons clustered according to style. Each horizontal line shows the coefficient of distance at which the two works or clusters below it join. A lower coefficient of distance indicates a greater affinity.
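The affinity computation Crane and Fiehler describe can be sketched, for continuous characters, with a simple Euclidean distance between works; building the dendrogram would then amount to repeatedly joining the closest pair. The Python fragment below is an illustration of that idea only; the characters, values, and coefficients are invented and are not those of their study.

```python
# Sketch of a "mileage-between-cities" style distance matrix over works,
# each described by a few continuous characters (e.g., beats per chord).
# Illustrative only; Crane and Fiehler's actual characters and coefficients differ.
import math

works = {
    "Chanson A": [2.1, 0.40, 3.0],
    "Chanson B": [2.3, 0.38, 3.2],
    "Chanson C": [4.0, 0.10, 1.5],
}

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

matrix = {
    (i, j): distance(works[i], works[j])
    for i in works for j in works
}

# The closest pair would be joined first when building a dendrogram.
closest = min(((i, j) for i in works for j in works if i < j),
              key=lambda pair: matrix[pair])
print(closest, round(matrix[closest], 3))
```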
In an early project, with procedures overlapping those used in thematic indexing (see Section 3.4), Collins [19] encoded a number of madrigals by the English composer Thomas Weelkes. Weelkes’ mannerism of borrowing melodic fragments from his own earlier works was the subject of a systematic search of the bass lines of a large group of works to identify borrowings of melodic contours. The computer was directed, in effect, to compare intervals 1 to 6 with intervals 2 to 7, then 3 to 8, and so on, and then to compare intervals 2 through 7 throughout. When two segments of intervals were found to be identical, the citation would read, for instance, “Piece A:27, piece B:265,” meaning that the 27th to the 32nd intervals of piece A were the same as the 265th to the 270th intervals of piece B. Further firsthand comparisons of the pieces are then necessary to determine whether the borrowing is simply a standard melodic formula or cliche pattern, or the signal of a more important and longer quotation from another work.
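Collins’ search amounts to a sliding comparison of six-interval windows between two bass lines. The sketch below, in Python with invented interval data, shows that comparison; the window length and the reporting format only approximate the published procedure.

```python
# Sliding comparison of interval segments, in the spirit of Collins' search:
# every six-interval window of piece A is compared with every window of piece B,
# and matching positions are reported (1-based, as in the citations quoted above).

def matching_segments(intervals_a, intervals_b, window=6):
    matches = []
    for i in range(len(intervals_a) - window + 1):
        segment = intervals_a[i:i + window]
        for j in range(len(intervals_b) - window + 1):
            if intervals_b[j:j + window] == segment:
                matches.append((i + 1, j + 1))
    return matches

a = [2, 2, -4, 1, 2, 2, -7, 5]
b = [0, 2, 2, -4, 1, 2, 2, -1]
print(matching_segments(a, b))   # [(1, 2)]: intervals 1-6 of A match 2-7 of B
```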
Sixteenth-century secular repertories which have been the subject of computer analysis include the French chanson and the Italian madrigal and frottola. Studies by Hudson [43] and the present author [55, 57] are described under thematic indexing below (Section 3.4). Bernstein [8] began with a thematic indexing project in the chanson repertory but decided that a procedure analogous to the literary concordance would reveal more information than the thematic index. He chose 300 chansons for a pilot project which involved encoding the complete works, using CLML (Chicago Linear Music Language), which is a modification of the Ford-Columbia Representation. He gives the following detailed description of one of several programs developed ([7], p. 159):
The input data, compared to a table of relevant parameters, was decoded and stored in two parallel tables. One such table, in integer form, gives the actual note value, counting each semitone as an integer starting with the lowest note on the piano. The other table contains, in decimal form, the time at which the note will stop sounding. The unit in the latter table may be varied by the user. A third table indicates the ends of phrases and is similar in structure to the table of note duration. In the performance of harmonic analysis, the chords had to be examined each time a note changed in any of the voices. A chord was rejected for either of the following reasons: (1) if it contained less than three notes; or (2) if it included intervals other than thirds and fifths. If neither of these conditions prevailed, the root of the chord was determined by examining the intervals, concentrating on the interval of the fifth, and designating its lower member the root.
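The chord test quoted above translates into a short rule: reject sonorities with fewer than three notes or with anything other than thirds and fifths above a candidate root, and otherwise take the lower member of the fifth as the root. The Python sketch below is an approximation of those rules, not the published program.

```python
# A rough restatement of the chord test Bernstein describes: a sonority is
# rejected if it has fewer than three notes or contains anything but thirds
# and fifths above a candidate root; otherwise the lower member of the fifth
# is taken as the root. Pitches are pitch classes (C = 0 ... B = 11).
# Illustrative approximation only.

ALLOWED_ABOVE_ROOT = {3, 4, 7}   # minor third, major third, perfect fifth

def chord_root(pitches):
    notes = sorted(set(p % 12 for p in pitches))
    if len(notes) < 3:
        return None                       # fewer than three notes: rejected
    for candidate in notes:
        above = {(n - candidate) % 12 for n in notes if n != candidate}
        if above <= ALLOWED_ABOVE_ROOT and 7 in above:
            return candidate              # lower member of the fifth
    return None                           # non-tertian sonority: rejected

print(chord_root([60, 64, 67]))   # C major triad -> 0 (C)
print(chord_root([62, 65, 69]))   # D minor triad -> 2 (D)
print(chord_root([60, 62, 67]))   # contains a second -> None
```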
Bernstein indicates that he hopes to expand the analysis programs to permit identification of the roots of seventh and diminished chords or even of incomplete triads in which two tones are given and the third implied.
There has been very little use of the computer for analysis of music between the Renaissance and the twentieth century. Although the reasons for this are not clear, they may be due to the length and size of the major genres of these periods: encoding a complete Brahms symphony would be a real task. Also, the harmonic vocabulary becomes increasingly complex, and the analysis of chromaticism in the nineteenth century, for example, is fraught with differences of opinion regarding nomenclature and analytical tools. However, two projects, one in the Baroque and one in the Classical period, have been reported. Logemann [59] explored a problem that seems designed for computer analysis. Since the middle ages composers have written works in the form of puzzles in which the correct performance of the work is hidden by an enigmatic key signature, clefs, or veiled instructions (often in the form of a pun or conundrum) for performance. Sometimes the correct answer can only be found by “trying” a number of possibilities. Logemann worked with two canons (a canon is a form of consistent imitative counterpoint) from Bach’s Musical Offering in which Bach indicated the transposition or the inversion of the second voice but omitted the sign indicating the point at which the second voice begins. Pragmatically, one simply has to try beginning the second voice at all possible entry points and selecting those (or the one) which lead to the best sound. After encoding the melody of the canon, Logemann translated the data into a table of numbers representing the pitches by assigning to each note its integer number of semitones above middle C. Tests were made of the distribution of intervals obtained at each possible entry point of the second voice.
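The pragmatic search described here, trying every entry point for the second voice and examining the vertical intervals that result, can be sketched as follows. The consonance set and the scoring are invented placeholders; Logemann’s distribution tests were more refined.

```python
# Trying every entry point for the second voice of a canon and scoring the
# vertical intervals that result. The consonance set and the score are
# invented placeholders, not Logemann's actual tests.

CONSONANT = {0, 3, 4, 7, 8, 9}   # unison, thirds, fifth, sixths (mod 12)

def score_entry(melody, offset):
    """melody: note numbers in semitones, one per time unit; the second voice
    states the same melody starting `offset` units later."""
    hits = []
    for t in range(offset, len(melody)):
        interval = abs(melody[t] - melody[t - offset]) % 12
        hits.append(interval in CONSONANT)
    return sum(hits) / len(hits) if hits else 0.0

def best_entries(melody):
    scores = {off: score_entry(melody, off) for off in range(1, len(melody))}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:3]

theme = [0, 2, 4, 5, 7, 5, 4, 2, 0, 7, 5, 4]
print(best_entries(theme))   # the highest-scoring entry points come first
```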
Logemann illustrates, in music notation, the results obtained for differing degrees of “harmonicity” ([59], pp. 70-71).
The project from the Classical period to be reported is that of LaRue [52], who had already established the effectiveness of a large hand card system for clarifying the vast confusion in attributions among hundreds of symphonies by contemporaries of Haydn. With the help of Logemann and Berlind at New York University, he devised procedures for style analysis that included a computer program to cite changes in the articulation of rhythm. Points of change were tabulated in reference to measure numbers or finer points on a time line. When applied to the symphonies of Haydn, the initial work pointed to some interesting observations. LaRue writes ([52], p. 200):
. . . in the second result, however, one aspect was found in which Haydn seems to behave consistently. He uses more themes in the second part of the exposition sections of his first movements, after the modulation to the dominant key, than he does in the beginning of his symphonies, in the tonic area. I owe this possible breakthrough, this first crack in Haydn’s armor, entirely to the computer. It had simply never occurred to me to look at Haydn in this way. All sorts of experimental tabulations had been made, exploiting the wonderful flexibility that the computer provides; and in studying these tabulations, suddenly this unsuspected correlation just popped out. One can immediately see all the interesting questions this raises. If Haydn uses more themes in a tension area such as the dominant, does he use more themes, relatively, in other tension areas, such as modulations in the development? And what happens in recapitulations, where Haydn does not modulate to the dominant, so there is no tension area: does he eliminate some of the themes? Everyone knows that Haydn makes changes in recapitulations, but no one has been able to say why. Maybe we have a clue at last.
Some researchers have attempted to develop computer-assisted systems for musical analysis which are applicable to music of any style. In some instances the emphasis is on the development of a research tool rather than describing its application to the music of a particular historical period. Forte, who is responsible for much of the fundamental work in this area, has developed a program for the analytic reading of scores [27] which, while designed for use with atonal music, illustrates logic and procedures which can be applied to a wide variety of styles. Using the Ford-Columbia Representation and the SNOBOL programming language, Forte has established various procedures to parse strings for blocks and segments of information and to identify accurately their temporal locations in the string. Forte writes ([27], p. 357):
[Other programs] yield increasingly higher representations of structural relations. If the collection of pitch-class representatives in each segment of the output strings from the reading program is scanned to remove duplicates, the resulting collection may be called a compositional set. Then, from a listing of all the compositional sets the analysis programs (1) determine the class to which the set belongs; (2) list and count all occurrences of each set-class represented; (3) compute, for each pair of set-class representatives, an index of order-similarity;
(4) determine the transposition-inversion relation for each pair of set-class representatives; (5) list, for each set-class represented, those classes which are in one of three defined similarity relations to it and which occur in the work being examined; (6) summarize in matrix format the set-complex structure of classes represented in the work; (7) accumulate and retrieve historical and other informal comments in natural language.
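One elementary ingredient of such set-class bookkeeping is the interval vector of a pitch-class collection, the count of interval classes between all pairs of its members. The sketch below shows only that step and is not Forte’s analytic-reading program.

```python
# Interval vector of a pitch-class set: counts of interval classes 1-6
# between every pair of members. One small ingredient of set-class analysis;
# this sketch is not Forte's program.
from itertools import combinations

def interval_vector(pitch_classes):
    pcs = sorted(set(p % 12 for p in pitch_classes))
    vector = [0] * 6
    for a, b in combinations(pcs, 2):
        ic = min((b - a) % 12, (a - b) % 12)   # interval class, 1..6
        vector[ic - 1] += 1
    return vector

print(interval_vector([0, 4, 7]))      # major triad -> [0, 0, 1, 1, 1, 0]
print(interval_vector([0, 1, 4, 6]))   # all-interval tetrachord -> [1, 1, 1, 1, 1, 1]
```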
Forte concludes that “any syntactic class or combination of classes could be defined as delimiters in a similar way, thus creating new analytic strata. It might be pointed out in this connection that aspects of an individual composer’s ‘style’ can be investigated to any depth, provided, of course, that the researcher can specify the syntactic conditions with sufficient precision ([27], p. 362).” This concluding remark is, of course, the nub of the problem and the challenge to anyone using the computer for analysis of music.
An ambitious set of programs for style analysis was developed by Gabura [29] and tested on compositions by Haydn, Mozart, and Beethoven. Gabura relies heavily on statistical procedures such as the computation of frequency distributions for pitch classes. Music theorists in general appear wary of statistical procedures, although it is the conviction of some that, when properly used, they can provide much helpful general information as well as objective confirmation of intuitive insights. Gabura shows a variety of interesting tabulations of results, including graphs, correlograms, and tabulations of pitch structures.
Traditional systems of musical analysis, which work with reasonable satisfaction for earlier music, have proved inadequate for atonal and serial compositions of the twentieth century, to name but two styles which hardly fit the traditional tonal schemes of most music from the seventeenth through the nineteenth centuries. The newer tools of analysis for serial composition, for example, already include studies of combinatoriality, permutation schemes, and other procedures which lend themselves to computation. It is not surprising, then, to find contemporary theorists turning to the computer to test various analytical ideas, some of them of great complexity. Using the year 1910 as a dividing line, Jackson [45] studied a group of compositions in terms of their harmonic structure, with emphasis on the presence or absence of certain harmonies. His comparisons were in three categories: “(1) those involving the chordal or intervallic content of the pieces; (2) those involving the dissonance content; and finally (3) those involving what shall be called the ‘recurrent chord content’ ([45], pp. 133-134).” The program for chordal and intervallic content used a chord table and an interval vector table against which the various chord forms in a piece were matched. Studies of dissonance content were difficult because there is no agreement on what constitutes dissonance.
There was also a problem in handling sonorities which contained two intervals consonant in themselves, but which produced a dissonant clash because of their combination. Results were shown in terms of the frequency of dissonant intervals, expressed in percentages.
The music of Anton Webern is the subject of two recent studies, both of them concerned with some aspect of the composer’s use of the twelve-tone (serial) technique in the Piano Variations, Opus 27. Fiore [26], working with the Ford-Columbia Representation and SNOBOL, has analyzed the complete score for harmony as a function of the theme-and-variations form. Among other conclusions she demonstrates “the change in Webern’s late works from an emphasis on the preferred tritone, half-an-octave, to the minor third in juxtaposition to the major seventh.” Working with the same music, Fuller [28] developed the program CONHAN (CONtextual Harmonic ANalysis) to find the most important roots in a musical passage, arguing that there are “points of harmonic reference which stand out from the dissonant texture.”
The calculation of permutations of the twelve-tone row gives useful information for the analysis of music based on the twelve-tone system. The twelve transpositions of the row, along with the four basic orderings of the row (original, inversion, retrograde, and retrograde inversion), yield forty-eight permutations of the row. Within these permutations of a particular row one may find segments with identical pitch-class sets, and this information can be of value to both the composer and the theorist. Lefkoff [53] has developed programs which calculate and print recurrent pitch-class sets in various useful formats. Theorists of the twelve-tone system are also interested in properties and relations of sets of fewer than twelve pitch classes. A summary of some of these relations and their applications to combinatorial music systems is given by Howe in an article [42] which includes an analysis of a few measures of the fourth movement of Webern’s Fünf Sätze as well as a FORTRAN program to calculate pitch structures of size 2-11 (semitones). In an unusual departure from traditional analysis Howe suggests a reworking of one measure of the piece to make it more representative of the analysis. In a remarkable concluding paragraph Howe writes ([42], p. 59): “The possibility of making such a simple revision in order to clarify the compositional procedures in this passage indicates that either our analysis or the composition itself is inadequate in its present form. In an analysis, we usually attempt to make the best possible sense out of a composition, and we feel free to adjust our concepts to clarify the piece; we can also adjust the piece to clarify our concepts. But in the latter case we have not given a sufficient explanation of the composition as it stands.”
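The forty-eight forms mentioned above follow mechanically from a single row. The sketch below generates the prime, inversion, retrograde, and retrograde inversion at each of the twelve transposition levels; the example row is arbitrary.

```python
# Generating the 48 forms of a twelve-tone row: the four orderings
# (prime, inversion, retrograde, retrograde inversion), each at twelve
# transposition levels.

def row_forms(row):
    prime = [p % 12 for p in row]
    inversion = [(2 * prime[0] - p) % 12 for p in prime]   # mirror about the first note
    forms = {}
    for t in range(12):
        forms[("P", t)] = [(p + t) % 12 for p in prime]
        forms[("I", t)] = [(p + t) % 12 for p in inversion]
        forms[("R", t)] = list(reversed(forms[("P", t)]))
        forms[("RI", t)] = list(reversed(forms[("I", t)]))
    return forms

row = [0, 11, 3, 4, 8, 7, 9, 5, 6, 1, 2, 10]   # an example twelve-tone row
forms = row_forms(row)
print(len(forms))             # 48
print(forms[("RI", 0)][:4])   # opening of the untransposed retrograde inversion
```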
For centuries musicians and theorists have struggled with the problem of various tunings of the octave scale found in Western music. The established system of 12 semitones to the octave has often been challenged, and systems offering anywhere from 19 to 72 degrees have been proposed. Stoney [77] developed programs to test various systems and has drawn conclusions of interest to composers and theorists experimenting in this area. He classifies systems of equal temperament as either positive systems, that is, systems with wide fifths, or negative systems having narrow fifths, and further charts the most effective of the systems from 12 to 72 degrees. He concludes that “the most promising low-order systems are those of 24, 22, and 19 degrees. For these last three systems practical experimentation over an adequate period of time and utilizing suitably designed instruments would be required in order to assess their respective deficiencies and merits ([77], p. 171).”
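Stoney’s distinction between wide-fifth and narrow-fifth systems can be illustrated by computing the best approximation to the fifth in an n-degree equal temperament. The sketch below uses one common convention, comparing that fifth with the 700-cent fifth of the 12-degree system; Stoney’s precise criteria may differ.

```python
# Classifying n-degree equal temperaments by their fifths, in the spirit of
# the positive (wide-fifth) versus negative (narrow-fifth) distinction.
# The comparison with 700 cents is one common convention and may not match
# Stoney's exact criteria.
import math

JUST_FIFTH_CENTS = 1200 * math.log2(3 / 2)   # about 701.955 cents

def classify(n_degrees):
    step = 1200 / n_degrees
    fifth_steps = round(JUST_FIFTH_CENTS / step)   # best approximation to 3:2
    fifth_cents = fifth_steps * step
    if fifth_cents > 700:
        kind = "positive (wide fifths)"
    elif fifth_cents < 700:
        kind = "negative (narrow fifths)"
    else:
        kind = "neither (same fifth as the 12-degree system)"
    return fifth_steps, round(fifth_cents, 1), kind

for n in (12, 19, 22, 24, 31, 53):
    print(n, classify(n))
```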
3.3 Ethnomusicology
Reference has already been made (see Section 3.1) to the special problems in music representation faced by the ethnomusicologist. Since he works with vast repertories of music, usually monophonic, the ethnomusicologist has often relied on large card file systems to control his materials. He often works with many subtle variants of the same melody, variants transcribed from field recordings, and must devise systems for distinguishing among these variants as well as grouping them into clusters with common characteristics. An idea of the size of these files can be gained by noting the work of Dr. Ladislav Galko, Director of the Institute of Musicology at the Slovak Academy of Sciences in Bratislava, who has a manual system of 25,000 cards as well as some 200,000 melodies, on tape, awaiting transcription. The computer has had limited use by ethnomusicologists to date, in spite of the fact that, as noted in the Introduction, the earliest use of data processing techniques in music research was by a researcher in folk music. Lieberman [54] has reported on a project at the Institute of Ethnomusicology at UCLA. The purpose of the project was to clarify the concept of patet (mode) in Javanese gamelan music, and to learn how the requirements of patet affect and guide group improvisation. While aware of the dangers of overuse of statistical methods, Lieberman points out a rationale for their use in studying the music of other cultures when he writes ([54], p. 187):
Statistics have the advantage, however, of objectivity; and since ethnomusicologists frequently deal with musical cultures as external observers rather than as native practitioners, objective techniques are welcome safeguards against unconscious superimposition of alien values, which seldom apply.
In addition to studies of frequency counts, programs were developed to cite two-, three-, and four-note patterns. To assist further in pattern search, “a program was developed which, once provided with an archetypical four-note formula, could recognize, isolate, and label direct or retrograde patterns with embellishments or extensions of any length.”
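The pattern finder described in that passage, which recognizes a four-note archetype even when embellished or extended, can be approximated by a subsequence test: the formula’s notes must occur in order (or in reverse order for the retrograde), with any number of intervening tones. The sketch below shows only that approximation, not Lieberman’s program.

```python
# Recognizing a four-note formula, direct or retrograde, allowing
# embellishments: the formula must appear as an ordered subsequence of the
# passage. An approximation only.

def contains_as_subsequence(passage, formula):
    it = iter(passage)
    return all(note in it for note in formula)   # consumes `it` left to right

def find_formula(passage, formula):
    if contains_as_subsequence(passage, formula):
        return "direct"
    if contains_as_subsequence(passage, list(reversed(formula))):
        return "retrograde"
    return None

formula = [2, 3, 5, 6]                  # an archetypal four-note pattern (scale degrees)
embellished = [2, 1, 2, 3, 4, 3, 5, 6]  # direct statement with passing tones
print(find_formula(embellished, formula))                   # "direct"
print(find_formula(list(reversed(embellished)), formula))   # "retrograde"
```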
In another example of computer applications to ethnomusicology, Suchoff [78] has continued and extended the work begun by Béla Bartók, who was not only a great composer but also a pioneer in the systematic study of folk music. Suchoff’s work in the systematic indexing of these materials is discussed in Section 3.4.
3.4 Thematic Indexing
The thematic index, first developed in the late eighteenth century, has been used in many forms, both in printed indices, usually of the works of a particular composer, and in private card files for the researcher. A thematic index, in addition to listing the composer and title of each piece of music, includes a quotation of the opening few notes of the piece in question, and thus is comparable to an index of first lines of poetry. If the researcher already knows the composer or title he has no problem locating the melody in question. However, in some cases a researcher wishes to trace a particular melody which he may simply have in his mind or, more likely, have found in a manuscript or printed source where it is unidentified. If he can characterize the unknown melody as a pattern described by letters and/or numbers he should be able to consult a listing of melodies arranged by some ordering of these letters and numbers and find other instances of the use of his particular melody, if they exist. Several systems of ordering themes have been proposed, including simple alphabetization of the letter names of the notes, transposition of the theme into the key of C major and then identifying it by the names of the notes, or, more commonly, some scheme of denoting the sequence of intervals in the melody. The latter is usually preferred because it permits citation of transposed melodies which would otherwise be lost in an alphabetized file. This is illustrated by an example which shows two melodies which are identical except for transposition: the letter names of the notes differ, but the interval sequences are the same. The problem becomes more complex when a composer borrows a melody and makes only a slight change in the contour, or fills in a large interval with one or two smaller ones. Various schemes have been proposed for computing the broad contour of a melody, ignoring repeated notes, designating each note’s relation to the first note rather than to the one immediately preceding, etc. Hudson [43] gives a good discussion of some of the alternatives in these procedures in his description of a catalog of the renaissance French chanson. Some of the fundamental problems of ordering incipits in any repertory are discussed by Meylan [63] in his description of work with the fifteenth-century basse danse.
Trowbridge, working with a group of 702 compositions in fifteenth-century chansonniers, has devised effective techniques for finding closely (but not identically) related incipits. He first attempted to represent an entire incipit by a single complex polynomial equation, but due to practical limitations of the mathematical routines as well as the complexity and variety of the melodic curves, the method failed to produce satisfactory results. In one of his conclusions he writes [79]:
Any mathematical representation of a melody must be an approximation based on certain musical assumptions. Some types of relationships must be judged more important than others and weighted accordingly. The solution of this problem poses the most difficult questions of a musical nature, since almost all generalizations about music fail in certain specific instances. One of my assumptions, for instance, has had to be that the direction of a melodic line at a given point is more important than the actual rate of change, but that the latter (the rate of ascent or descent) is more important than a particular rhythmic configuration at this point.
The present author [55, 57] has developed a large file of incipits (opening melodies) of sixteenth-century Italian music. Some 40,000 incipits have been encoded from the frottola, madrigal, and motet repertories. The computer extracts the interval sequence from the encoding (in Ford-Columbia Representation) of an incipit. Interval sequences are listed in numerical order along with information on composer and title. Interesting examples of borrowings within the frottola repertory have been cited [57]. The large data bank of melodies has been used for testing of a new music typography discussed below. A similar plan of indexing is used by Suchoff to catalog Eastern European folk melodies. In addition to questions of borrowings, he is interested in citing variants among peasant melodies as notated firsthand by Bartók. Suchoff has used the GRIPHOS program for organizing and annotating the text materials in this large project [78].
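The interval-sequence key used for such an index is simply the succession of pitch differences in the incipit, so that transposed statements of a melody reduce to the same key. The sketch below illustrates the idea with plain note numbers rather than Ford-Columbia encoding; the entries are invented.

```python
# Interval-sequence keys for a thematic index: transposed incipits reduce to
# the same key. Input here is plain semitone numbers, not Ford-Columbia encoding.

def interval_sequence(incipit):
    return tuple(b - a for a, b in zip(incipit, incipit[1:]))

index = {}

def add_to_index(composer, title, incipit):
    index.setdefault(interval_sequence(incipit), []).append((composer, title))

add_to_index("Anon.", "Frottola in G", [67, 69, 71, 67, 74])
add_to_index("Anon.", "Frottola in C", [60, 62, 64, 60, 67])   # same melody, transposed

# Both entries share one key, so borrowings survive transposition.
print(interval_sequence([67, 69, 71, 67, 74]))
print(index[interval_sequence([60, 62, 64, 60, 67])])
```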
3.5 Bibliographies and Information Retrieval Systems
The thematic indexing plans described above, while essentially information retrieval systems, pose problems peculiar to music in their demand for music representations. It is of course possible to use established procedures for developing bibliographies, concordances, and similar data banks and information retrieval systems which can prove useful in many situations. Two such projects are briefly described at this time. Nagosky [64] has developed the Opera Workshop Codex, whose object is “to place into a single volume as much objective information on operas as would be most useful to college and university opera and voice programs.”
The data bank for this project is a thoroughly cross-indexed file of information on 250 operas, including details as to title, composer, date, type of opera, duration of the opera, size of chorus, details on the ensembles within the opera, and so forth, for some two dozen items of information. Nagosky proposes that this data file be accessible from terminals throughout the state or a larger area and thus provide information retrieval for large numbers of music students in different institutions. Willis [81] developed a data bank of specific factual information on all performances at Orchestra Hall in Chicago, including details on ticket prices, attendance, type of concert program, persons participating, and so forth. From this data he drew various conclusions about the history of the Hall during periods of changing tastes and, of course, his output provides detailed information for the interested researcher.
4. Automated Music Typography for Composition and Research
The development of automated printing procedures such as photocomposition during the past two decades has left music printing relatively untouched. Most music copy is still prepared by autography, music typewriter, or hand engraving. The fields of music composition and music research would both benefit greatly if economies in the music printing industry would permit publication of serious works not presently feasible or profitable. Three methods of automated music printing are currently being explored: photocomposition, the computer plotter, and special typography for the high-speed computer printer. Photocomposition of music is the subject of a project at Columbia funded by the Ford Foundation. Bauer-Mengelberg and Ferentz developed a music representation called DARMS (Digital Alternate Representation of Music Symbols), more generally known as the Ford-Columbia Representation, for the encoding of music to be printed (see Section 3.1). The Photon disk was engraved with some 1400 symbols for the printing of music, symbols including all the usual music characters, upper and lower case alphabets, and a wide range of slurs and ties. Work has begun on the complex programming necessary to justify right margins (a more difficult matter than with printed text), extract parts from scores or build a score from parts, etc., but to date the project remains incomplete. The possibility of using the computer plotter for printing music has attracted the attention of several researchers. Byrd [14], Gabura [30], and Raskin [70] have developed programs which do a very respectable job of music printing. Byrd’s system is machine dependent (CDC 3600) and uses a program known as PLOTZ 9.
FIG. 11. Music printing by computer plotter. Courtesy of Professor J. Raskin.
It is, of course, difficult to draw smooth curves or slanting lines on the incremental plotter, but if the original is fairly large, reduced copy minimizes this problem. Figure 11 shows a sample of Raskin’s music printing by plotter. Each of the three researchers using the plotter developed his own music representation, and unfortunately none of these representations is currently widely used by other researchers. The plotter also has the disadvantage of being slow (although cheap to run offline), and of course it is not as widely available on campuses as the line printer. A third possibility for the printing of music, the use of special type characters for the line printer, has been the subject of a project currently under the direction of the present writer [56]. Special type slugs have been designed in cooperation with the Glendale Laboratory of the International Business Machines Corporation in Endicott, New York. In designing the typography it was decided to work with the printer set for eight lines to the inch, thus spacing staff lines an eighth inch apart. This gives a half-inch staff (the distance between the outer lines), a size easily read in printout and yet permitting a reduction of up to 50% for publishing of thematic indexes by any photo-offset printing process. Figure 12 illustrates an example of the current state of this typography. Since the paper moves in only one direction through the printer, and since each music character is made up of two to six type pieces, the information for a line of music must be stored in a two-dimensional array and then the various pieces of type on each line are printed as the paper passes through the printer. Programming for the project has been carried out by Granger in the Computer Center at the State University of New York at Binghamton. With the development of more type characters the possible uses of the typography might include, in addition to the present thematic indexing project of the author, such applications as special library cards and the printing of scores.
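Because the paper advances in only one direction, each line of music must be assembled in a two-dimensional character array and then emitted row by row. The sketch below illustrates that buffering idea with invented glyph pieces; it does not reproduce the project’s actual type slugs or programs.

```python
# Assembling a line of music in a two-dimensional character array and printing
# it row by row, as one-directional paper movement requires. The glyph pieces
# here are invented stand-ins for the special type slugs.

ROWS, COLS = 9, 40                      # staff lines an eighth inch apart -> 9 print rows
buffer = [[" "] * COLS for _ in range(ROWS)]

def draw_staff():
    for r in range(0, ROWS, 2):         # five staff lines on alternate rows
        for c in range(COLS):
            buffer[r][c] = "-"

def place(row, col, pieces):
    """Place the pieces of one music character, one piece per row."""
    for dr, ch in enumerate(pieces):
        buffer[row + dr][col] = ch

draw_staff()
place(3, 10, ["|", "O"])                # a crude note head with a stem piece
place(1, 20, ["|", "|", "O"])           # a higher note built from three pieces

for row in buffer:                      # the paper passes once; print top to bottom
    print("".join(row))
```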
FIG. 12. Computer typography for music.
ACKNOWLEDGMENTS
The author gratefully acknowledges helpful discussions with M. Mathews and F. Moore at Bell Telephone Laboratories, L. Hiller at the State University of New York at Buffalo, J. K. Randall at Princeton, and M. Gallery and C. Granger at the State University of New York at Binghamton.
A DISCOGRAPHY OF COMPUTER MUSIC
Prepared by M. Gallery, Computer Center, SUNY/Binghamton, N.Y. The compositions listed with each record are only those using the computer. The letters enclosed in brackets indicate whether the computer was used in composition [C], sound synthesis [S], or both [C, S].
Angel S-36656 [C]
  Yannis Xenakis: ST/10-1,080262 for Ten Instruments
Angel S-36560 [C]
  ST/4 for String Quartet
Decca DL 79103 [C, S] Music from Mathematics (this was the first computer music record)
  M. V. Mathews: May Carol; Numerology; The Second Law; Joy to the World (arrangement); Bicycle Built for Two (arrangement); Molto Amoroso
  J. R. Pierce: Variation in Timbre and Attack; Stochatta; Five Against Seven (Random Canon); Beat Canon; Melodie
  S. D. Speeth: Theme and Variation
  D. Lewin: Study #1; Study #2
  N. Guttman: Pitch Variations
  J. Tenney: Noise Study
  Frère Jacques; Fantasia by Orlando Gibbons
Decca 710180 [C, S] Voice of the Computer
  M. V. Mathews: Masquerades; Slider; Swansong
  J. R. Pierce: Eight-Tone Canon
  Jean-Claude Risset: Computer Suite from “Little Boy”
  R. N. Shepard: Shepard’s Tunes
  Wayne Slawson: Wishful Thinking About Winter
  James Tenney: Stochastic Quartet
Deutsche Grammophon Gesellschaft DG-2543005 [C]
  Lejaren Hiller: Algorithms I, Versions I and IV
Heliodor HS-25053 [C] Computer music from the University of Illinois (note: this record has been deleted by the publisher)
  Lejaren Hiller and Leonard Isaacson: Illiac Suite for String Quartet
  Lejaren Hiller and Robert Baker: Computer Cantata
Heliodor 2549006 [C]
    Lejaren Hiller: Computer Music for Tape and Percussion; Avalanche for Pitchman, Prima Donna, Player Piano, Percussionist and Prerecorded Tape
Nonesuch H-71224 [C, S]
    John Cage and Lejaren Hiller: HPSCHD (for Harpsichord and Computer-Generated Sound Tapes)
Nonesuch H-71245 [C, S] Computer Music
    J. K. Randall: Quartets in Pairs; Quartersines; Mudgett: Monologues by a Mass Murderer
    Barry Vercoe: Synthesism
    Charles Dodge: Changes
Nonesuch H-71250 [C, S]
    Charles Dodge: Earth's Magnetic Field
Vanguard VCS-10057 [C, S]
    J. K. Randall: Lyric Variations for Violin and Computer

REFERENCES

1. Ashton, A. C., Electronics, music and computers. Ph.D. dissertation, Univ. of Utah, Salt Lake City, Utah, 1970.
1a. Asuar, J. V. (Director of Tecnologia del Sonido, Facultad de Ciencias y Artes Musicales, Universidad de Chile), interview with Mr. Asuar, February 1971.
2. Barbaud, P., Initiation a la composition musicale automatique. Dunod, Paris, 1966.
3. Baroni, M., Computer studies of the style of renaissance four-voice instrumental canzone. Unpublished manuscript, Institute for Music History of the University of Bologna, 1970.
4. Batstone, P., letter to author dated February 23, 1971, describing use of MUSIC 4BF, University of Colorado at Boulder.
5. Bauer-Mengelberg, S., The Ford-Columbia input language, in Musicology and the Computer (B. Brook, ed.), pp. 48-52. City Univ. of New York Press, New York, 1970.
6. Beauchamp, J. W., A computer system for time-variant harmonic analysis and synthesis of musical tones, in Music by Computers (H. von Foerster and J. Beauchamp, eds.), pp. 19-62. Wiley, New York, 1969.
7. Bernstein, L., and Olive, J., Computers and the 16th-century chanson, a pilot project at the University of Chicago. Computers and the Humanities III(3), 153-161 (1969).
8. Bernstein, L., Data processing and the thematic index. Fontes Artis Musicae XI(3), 159-165 (1964).
9. Bowles, E., Musicke's handmaiden: Or technology in the service of the arts, in The Computer and Music (H. Lincoln, ed.), pp. 3-20. Cornell Univ. Press, Ithaca, New York, 1970.
10. Brender, M., and Brender, R., Computer transcription and analysis of mid-thirteenth century musical notation. Journal of Music Theory 11(3), 198-221 (1967).
11. Bronson, B. H., Mechanical help in the study of folksong. Journal of American Folklore 63, 81-86 (1949).
12. Brook, B., The simplified "Plaine and Easie Code System" for notating music. Fontes Artis Musicae XII(2/3), 156-160 (1965).
13. Brooks, F. P., Jr., Hopkins, A. L., Jr., Neumann, P. G., and Wright, W., An experiment in musical composition. IRE Trans. Electron. Comput. EC-6, 175 (1957).
14. Byrd, D., Transcription by plotter. Random Bits 5(9), 1, 6-8 (1970). Indiana University Research Computing Center, Bloomington, Indiana.
15. Carlos, W., Switched on Bach. Columbia Record No. MS 7194 (1969).
16. Clough, J., TEMPO: A composer's programming language. Perspectives of New Music Fall/Winter, 113-125 (1970).
17. Clough, J., A report from Oberlin. Computer Music Newsletter No. 1, pp. 2-5 (1971). Music Division, Purdue University, West Lafayette, Indiana. (Mimeographed.)
18. Cohen, J. E., Information theory and music. Behavioral Science VII(2), 137-163 (1962).
19. Collins, W., A new tool for musicology. Music and Letters XLVI, 122-126 (1965).
20. Crane, F., and Fiehler, J., Numerical methods of comparing musical styles, in The Computer and Music (H. Lincoln, ed.), pp. 209-222. Cornell Univ. Press, Ithaca, New York, 1970.
21. Divilbiss, J. L., The real-time generation of music with a digital computer. Journal of Music Theory 8(1), 99-111 (1964).
22. Dodge, C., The composition of "Changes" and its computer performance. Ph.D. dissertation, Columbia Univ. School of the Arts, New York, 1970. See Discography, above.
23. Erickson, R., A general-purpose system for computer aided musical studies. Journal of Music Theory 13(2), 276-294 (1969).
24. Erickson, R., Rhythmic problems and melodic structure in Organum Purum: A computer assisted study. Ph.D. dissertation, Yale Univ., New Haven, Connecticut, 1970.
25. Erickson, R., Syntax-directed compiling. Proc. AFIPS Eastern Joint Computer Conf. 1964, pp. 31-57. Reprinted in Programming Systems and Languages (S. Rosen, ed.), pp. 264-297. McGraw-Hill, New York, 1967.
26. Fiore, M., Webern's use of motive in the "Piano Variations," in The Computer and Music (H. Lincoln, ed.), pp. 115-122. Cornell Univ. Press, Ithaca, New York, 1970.
27. Forte, A., A program for the analytic reading of scores. Journal of Music Theory 10(2), 330-363 (1966).
28. Fuller, R., Toward a theory of Webernian harmony, via analysis with a digital computer, in The Computer and Music (H. Lincoln, ed.), pp. 123-131. Cornell Univ. Press, Ithaca, New York, 1970.
29. Gabura, A. J., Music style analysis by computer, in The Computer and Music (H. Lincoln, ed.), pp. 223-276. Cornell Univ. Press, Ithaca, New York, 1970.
30. Gabura, A. J., Music style analysis by computer. Master's thesis, Univ. of Toronto, Toronto, Canada, 1967.
31. Gould, M., A keypunchable notation for the Liber Usualis, in Elektronische Datenverarbeitung in der Musikwissenschaft (H. Heckmann, ed.), pp. 25-40. Gustav Bosse Verlag, Regensburg, 1967.
32. Gould, M., and Logemann, G., ALMA: Alphameric language for music analysis, in Musicology and the Computer (B. Brook, ed.), pp. 57-90. City Univ. of New York Press, New York, 1970.
33. Henke, W., Musical Interactive Tone Synthesis System. Mass. Inst. of Technol., Cambridge, Massachusetts, December, 1970. (Users Manual, Mimeographed.)
34. Hiller, L., Music composed with computers - a historical survey, in The Computer and Music (H. Lincoln, ed.), pp. 42-96. Cornell Univ. Press, Ithaca, New York, 1970.
35. Hiller, L., HPSCHD, Avalanche, and Algorithms. See Discography, above.
36. Hiller, L., Some compositional techniques involving the use of computers, in Music by Computers (H. von Foerster and J. Beauchamp, eds.), pp. 71-83. Wiley, New York, 1969.
37. Hiller, L., and Isaacson, L., Illiac Suite for String Quartet, New Music Ed., Vol. 30, No. 3. Theodore Presser Co., Bryn Mawr, Pennsylvania, 1957.
38. Hiller, L., and Isaacson, L., Experimental Music. McGraw-Hill, New York, 1959.
39. Hiller, L., and Ruiz, P., Synthesizing musical sounds by solving the wave equation for vibrating objects. Unpublished manuscript. (Abstracted from P. Ruiz, Mus.M thesis, Univ. of Illinois, Urbana, Illinois, 1970.)
40. Hiller, L., and Baker, R., Computer Cantata: A study in compositional method. Perspectives of New Music 3(Fall/Winter), 62-90 (1964).
41. Howe, H. S., Jr., Music 7 Reference Manual. Queens College Press, New York, 1970.
42. Howe, H. S., Jr., Some combinational properties of pitch structures. Perspectives of New Music Fall/Winter, 45-61 (1965).
43. Hudson, B., Towards a French chanson catalog, in The Computer and Music (H. Lincoln, ed.), pp. 277-287. Cornell Univ. Press, Ithaca, New York, 1970.
44. Hultberg, W. E., Transcription of tablature to standard notation, in The Computer and Music (H. Lincoln, ed.), pp. 288-292. Cornell Univ. Press, Ithaca, New York, 1970.
45. Jackson, R., Harmony before and after 1910: A computer comparison, in The Computer and Music (H. Lincoln, ed.), pp. 132-146. Cornell Univ. Press, Ithaca, New York, 1970.
46. Jackson, R., and Bernzott, P., A musical input language and a sample program for musical analysis, in Musicology and the Computer (B. Brook, ed.), pp. 130-150. City Univ. of New York Press, New York, 1970.
47. Karp, T., A test for melodic borrowings among Notre Dame Organa Dupla, in The Computer and Music (H. Lincoln, ed.), pp. 293-297. Cornell Univ. Press, Ithaca, New York, 1970.
48. Kassler, J., Report from London: Cybernetic serendipity. Current Musicology No. 7, p. 50 (1968).
49. Kassler, M., Review of H. Lincoln, The current state of music research and the computer. [Orig. article published in Computers and the Humanities 5(1), (1970).] Computing Reviews 11(12), 652-653 (1970).
50. Kassler, M., MIR - A simple programming language for musical information retrieval, in The Computer and Music (H. Lincoln, ed.), pp. 300-327. Cornell Univ. Press, Ithaca, New York, 1970.
50a. Knowlton, P. H., Interactive communication and display of keyboard music. Ph.D. dissertation, Univ. of Utah, Salt Lake City, Utah, 1971.
51. Kostka, S., The Hindemith String Quartets: A computer-assisted study of selected aspects of style. Ph.D. dissertation, Univ. of Wisconsin, Madison, Wisconsin, 1970.
52. LaRue, J., Two problems in musical analysis: The computer lends a hand, in Computers in Humanistic Research: Readings and Perspectives (E. A. Bowles, ed.), pp. 194-203. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
53. Lefkoff, G., Automated discovery of similar segments in the forty-eight permutations of a twelve-tone row, in The Computer and Music (H. Lincoln, ed.), pp. 147-153. Cornell Univ. Press, Ithaca, New York, 1970.
54. Lieberman, F., Computer-aided analysis of Javanese music, in The Computer and Music (H. Lincoln, ed.), pp. 181-192. Cornell Univ. Press, Ithaca, New York, 1970.
55. Lincoln, H., A computer application to musicology. Information Processing (IFIPS) 68, 957-961 (1968).
56. Lincoln, H., Toward a computer typography for music research: A progress report. Paper presented to the International Federation of Information Processing Societies, Ljubljana, 1971.
57. Lincoln, H., The thematic index: A computer application to musicology. Computers and the Humanities II(5), 215-220 (1968).
58. Lockwood, L., Computer assistance in the investigation of accidentals in renaissance music. Proceedings of the Tenth Congress of the International Musicological Society, Ljubljana, 1967.
59. Logemann, G., The canon in the Musical Offering of J. S. Bach: An example of computational musicology, in Elektronische Datenverarbeitung in der Musikwissenschaft (H. Heckmann, ed.), pp. 63-87. Gustav Bosse Verlag, Regensburg, 1967.
60. Mathews, M. V., The Technology of Computer Music. MIT Press, Cambridge, Massachusetts, 1969.
61. Mathews, M. V., and Moore, F., GROOVE - A program to compose, store, and edit functions of time. Commun. ACM 13(12), 715-721 (1970).
62. Mathews, M. V., and Rosler, L., Graphical language for the scores of computer-generated sounds. Perspectives of New Music 6(2), 92-118 (1968).
63. Meylan, R., Utilisation des calculatrices electroniques pour la comparaison interne de repertoire des basses danses du quinzieme siecle. Fontes Artis Musicae XII(2), 128-134 (1965).
64. Nagosky, J., Opera Workshop Codex. Dept. of Music, Univ. of South Florida, Tampa, Florida, 1969. (Mimeographed.)
65. Olson, H., and Belar, H., Aid to music composition employing a random probability system. J. Acoust. Soc. Amer. 33, 1163 (1961).
66. Olson, H., and Belar, H., Electronic music synthesizer. J. Acoust. Soc. Amer. 26, 595-612 (1955).
67. Pinkerton, R. C., Information theory and melody. Sci. Amer. 194, 76 (1956).
68. Randall, J. K., For recordings, see Discography, above.
69. Randall, J. K., Three lectures to scientists. Perspectives of New Music 5(2), 124-128 (1967).
70. Raskin, J., A hardware independent computer graphics system. Master's thesis in Computing Science, Pennsylvania State Univ., University Park, Pennsylvania, 1967.
71. Rosenboom, D., Homuncular homophony. Paper presented at the Spring Joint Computer Conference, 1971. (Mimeographed.)
72. Schiødt, N., and Svejgaard, B., Application of computer techniques to the analysis of byzantine sticherarion melodies, in Elektronische Datenverarbeitung in der Musikwissenschaft (H. Heckmann, ed.), pp. 187-201. Gustav Bosse Verlag, Regensburg, 1967.
73. Selleck, J., and Bakeman, R., Procedures for the analysis of form: Two computer applications. Journal of Music Theory 9(2), 281-293 (1965).
74. Simon, H. A., and Newell, A., Heuristic problem solving: the next advance in operations research. Oper. Res. 6, 1-10 (1958); quoted in Fraenkel, A. S., Legal information retrieval, in Advances in Computers (F. L. Alt and M. Rubinoff, eds.), Vol. 9, p. 114. Academic Press, New York, 1969.
75. Slawson, W., A speech-oriented synthesizer of computer music. Journal of Music Theory 13(1), 94-127 (1969).
76. Slawson, W., Review of G. Lefkoff, Computer Applications in Music. (West Virginia Univ. Libr., Morgantown, West Virginia, 1967.) Journal of Music Theory 12(1), 108-111 (1968).
77. Stoney, W., Theoretical possibilities for equally tempered musical systems, in The Computer and Music (H. Lincoln, ed.), pp. 163-171. Cornell Univ. Press, Ithaca, New York, 1970.
78. Suchoff, B., Computer applications to Bartok's Serbo-Croatian material. Tempo LXXX, 15-19 (1967).
79. Trowbridge, L., A computer programming system for renaissance music. Unpublished paper presented to the American Musicological Society, Toronto, November, 1970.
80. Wenker, J., A computer-oriented music notation, in Musicology and the Computer (B. Brook, ed.), pp. 91-129. City Univ. of New York Press, New York, 1970.
81. Willis, T. C., Music in Orchestra Hall: A pilot study in the use of computers and other data processing equipment for research in the history of music performance. Ph.D. dissertation, Northwestern Univ., Evanston, Illinois, 1966.
82. Xenakis, Y., Achorripsis. Bote & Bock, Berlin, 1959.
83. Xenakis, Y., Free stochastic music from the computer. Gravesaner Blaetter VI, 69-92 (1962).
File Organization Techniques
DAVID C. ROBERTS
Informatics, Inc.
Rockville, Maryland
1. Introduction                                          115
2. Survey of File Organizations                          116
   2.1 Introduction                                      116
   2.2 Sequential Organization                           117
   2.3 Random Organization                               118
   2.4 List Organization                                 121
   2.5 Tree Structures                                   128
3. Random File Structures                                130
   3.1 Introduction                                      130
   3.2 Direct Address                                    131
   3.3 Dictionary Lookup                                 131
   3.4 Calculation Methods                               131
4. List File Structures                                  143
   4.1 Introduction                                      143
   4.2 List Structures                                   145
   4.3 Linear Lists                                      146
   4.4 Inverted File and Multilist Organization          148
   4.5 Ring Organizations                                150
5. Tree File Structures                                  151
   5.1 Introduction                                      151
   5.2 Tree Structures                                   152
   5.3 Minimizing Search Time                            155
   5.4 Representations of Trees                          157
6. Implementation of File Structures                     160
   6.1 Introduction                                      160
   6.2 Sequential Data Set Organization                  161
   6.3 Partitioned Data Set Organization                 162
   6.4 Indexed Sequential Data Set Organization          162
   6.5 Direct Data Set Organization                      164
   6.6 A Procedure for File Design                       164
References                                               166
1. Introduction
This paper is a survey of file organization techniques. Section 2 is an elementary introduction to file structures, and also introduces the more detailed discussions of Sections 3 through 5. Section 3 discusses random file structures, including direct address, dictionary lookup, and calculation
methods of file addressing. Section 4 describes list structures: lists, inverted lists, multilists, and rings are discussed. Tree file structures, including symbol trees, immediate decoding trees, and directory trees, conclude the discussion of file structures. In Section 6, the file structure survey material is related to an available computer system, the IBM System/360. The data set organizations that are supported by Operating System/360 are each briefly described, followed by a presentation of a systematic method for file design.

The objective of this paper is to introduce the reader with limited knowledge of file organization techniques to this area in such a manner that he will have some feeling for the tradeoffs that are made in the design of a file organization. For this reason the discussion of each technique, rather than summarizing the methods that have been developed to obtain the utmost in efficiency, emphasizes instead the more fundamental considerations that are important in the selection of a file organization technique.

2. Survey of File Organizations

2.1 Introduction
This section classifies and introduces various techniques of file organization. Extensive work is being performed in this area, and, because of this, the terminology is expanding somewhat faster than it is being standardized. Therefore, the definitions offered here are not absolute, and some authors may use different terminology. The overall classification scheme presented here, and the order of presentation, is a modification of Dodd's classification.

TABLE I
EXAMPLE FILE

Record    Automobile                          Owner
number    License   Make        Model year    Name               Street address        City           State
1         477-018   Corvair     1962          John T. Mayo       5559 Oak Lane         Laurel         Md.
2         481-062   Plymouth    1968          Stanley N. Rudd    8003 Rossburg Drive   Waldorf        Md.
3         791-548   Mercury     1967          Patrick J. Berry   1627 Daisy Lane       Silver Spring  Md.
4         521-607   Cadillac    1969          Roger Johnson      1901 Bruce Place      Waldorf        Md.
5         521-608   Volkswagen  1964          Vera C. Johnson    1901 Bruce Place      Waldorf        Md.

FIG. 1. Record format for example file (fields: last name; first name, initial; street address; city; state; auto make; model year; license number).
For the basic terminology, the standard COBOL definitions are available [55]. An elementary data item is a piece of information that is stored, retrieved, or processed. A collection of one or more data items that describes some object is called a record. For handling convenience, records are usually grouped into logical units called files. A data management system, then, deals with records in files. Each record contains some number of elementary data items which are the object of processing operations. An elementary data item that is used to organize the file is called a key.

Table I is a sample file that is used to illustrate this discussion. The file contains data on automobiles and their owners, as might be maintained by a state department of motor vehicles. Figure 1 shows a record layout that might be used for each record of this file. The number of bits, characters, or machine words allocated to each field in the record would depend on the characteristics of the computer to be used, and are therefore ignored here.

2.2 Sequential Organization
Sequential organization was, historically, the first to be developed and is the best known. In a sequential file, records are arranged in position according to some common attribute of the records. Sometimes records in a sequential file are stored in order of their arrival in the file. Figure 2 shows part of the example file organized as a sequential file, where the file is ordered by license number. The principal advantage offered by sequential organization is rapid access to successive records. That is, if the nth record has just been accessed, then the (n + 1)th record can be accessed very quickly. This is always true if the sequential file is resident on a single-access auxiliary
FIG. 2. Sequential organization of example file.
storage device, such as a tape drive. However, if the file is stored on a multi-access device such as a disk drive, then, in a multiprogramming environment, head motion caused by other users may reduce this advantage.

A sequential file is searched by scanning the entire file until the desired record is found. For a large file, this is a lengthy process. For this reason, transactions for a sequential file are usually accumulated, sorted so that they are in the same order as the file, and then presented all at once for processing, so that only one pass through the file is required. Using such batching techniques, it is possible to process sequential files at a very low cost per transaction. Obviously, a sequential file is not well suited to an online application, where the transaction batch size is small.

When a file is to be processed on the basis of more than one key, efficiencies that are achieved by ordering the file on a single key are impossible unless a duplicate copy of the file, ordered on the second key, is maintained. The alternate copy can then be used to determine the primary keys of the record to be processed, and standard sequential processing techniques can be used. This technique adds many steps to any process accessing the file, greatly reducing efficiency.

In summary, sequential organization permits rapid access to successive records and is the most economical implementation of a file that will have large batches of transactions processed against it according to one key. However, processing and retrieving records out of sequence is very slow, and small-volume updates are inefficient, since any update requires recopying of the entire file.
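The batched, one-pass style of update described above amounts to a merge of a sorted master file with a sorted transaction batch. The following is a minimal sketch in present-day Python notation, not part of the original survey; the record layout and the replace-or-insert update rule are illustrative assumptions.

    # Minimal sketch: one-pass update of a sequential file from a sorted
    # transaction batch.  Both inputs are lists of (key, data) pairs
    # sorted on the key; unmatched transactions are treated as insertions.
    def sequential_update(master, transactions):
        out, i, j = [], 0, 0
        while i < len(master) and j < len(transactions):
            if master[i][0] < transactions[j][0]:
                out.append(master[i]); i += 1           # copy unchanged record
            elif master[i][0] > transactions[j][0]:
                out.append(transactions[j]); j += 1     # insertion
            else:
                out.append(transactions[j]); i += 1; j += 1   # replacement
        out.extend(master[i:]); out.extend(transactions[j:])
        return out                                      # the new master file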
2.3 Random Organization

The records in a randomly organized file are arranged according to some established relationship between the key of the record and the location of the record on direct-access storage; records are stored and retrieved through the use of this relationship. There are three types of random organization: direct address, dictionary lookup, and calculation.
FIG. 3. Direct address organization of example file.

2.3.1 Direct Address
When the address of the record is known, this address can be used directly for storage and retrieval. This presumes that some address bookkeeping is performed outside the computer system, or that each record contains some field that is used directly as the key. Figure 3 shows a direct organization of the example file. In this case the last three digits of the vehicle license number must be known to access a record. A direct address organization cannot often be used because usually the programs that are to access a file must also keep track of the locations assigned to records. Whenever direct addressing can be used, of course, it is optimally efficient, since no accesses to any file or table need to be made in order to access a record in the file.

2.3.2 Dictionary Lookup
A dictionary is a table of two-element entries, each of which specifies a key-to-address transformation. When a record is added to the file, an entry is added to the dictionary; when a record is deleted, a dictionary entry is removed. Retrievals are performed by looking up the desired key in the dictionary, and then using the address obtained from the dictionary to access the record. Figure 4 shows a dictionary for the example file, which could be used for access based on owner’s surname.
FIG. 4. Dictionary for example file.
Since each reference to a record in the file requires a search of the dictionary, the search strategy used has great influence on the effectiveness of the file design. The two search strategies that are commonly employed are the binary search and sequential scan. If the dictionary is not maintained in any collating sequence, a sequential scan is the only method that can be used to obtain an address. If the dictionary has n entries, then, on the average, (n + 1)/2 accesses will be necessary to obtain an address. On the other hand, if the dictionary is maintained in collating sequence of the keys, a binary search is possible. The binary search begins by first testing the key at the location that is a power of 2 nearest to the middle of the dictionary. A comparison indicates whether it is the desired key, and, if not, in which half of the file the key is located. This operation is then repeated, eliminating half the remaining dictionary at each step, until the desired key is located, or its absence is established.

In order to use a dictionary with maximum effectiveness, the entire dictionary should be kept in core memory. But for many files of practical size, the dictionary is so large that this becomes impossible. In this event, it is necessary to segment the dictionary. Segmented dictionaries are often cascaded, so that there is a hierarchy of dictionaries that are searched. Such an organization is really a tree structure, which is discussed below. The degradation in efficiency produced by linear segmentation of a dictionary depends on the frequency of reference to parts of the dictionary that are not resident in main memory. If the most frequently referenced entries are kept in main memory, this degradation may be slight.
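A dictionary kept in collating sequence can be probed by an ordinary binary search. The sketch below is in present-day Python notation and is illustrative only; the (key, address) pair layout is an assumption.

    # Minimal sketch: binary search of an in-core dictionary held as a
    # list of (key, address) pairs sorted on key.
    def dictionary_lookup(dictionary, key):
        low, high = 0, len(dictionary) - 1
        while low <= high:
            mid = (low + high) // 2
            k, address = dictionary[mid]
            if k == key:
                return address          # found; one storage access follows
            if k < key:
                low = mid + 1           # desired key is in the upper half
            else:
                high = mid - 1          # desired key is in the lower half
        return None                     # absence established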
2.3.3 Calculation
In the calculation method, a key is converted into an address by some computational manipulation of the key. Since the set of addresses produced
is much smaller than the set of possible keys, the addresses generated from distinct keys are not always distinct. Two processes, compression and hashing, are discussed here; only hashing, however, is an address calculation method. Compression is included because an important performance measure of both compression and hashing algorithms is the extent to which they map different keys into the same transformed key.

Compression (called "abbreviation" by Bourne and Ford [25]) is the transformation of the key so that it requires as little storage as possible, and yet retains as much as possible the discrimination and uniqueness of the original key [25]. Compression is normally applied to keys that are words or names in natural language, so as to remove characters that add the least to the information content of the keys. Hashing is a transformation on keys that produces a uniform spread of addresses across the available file addresses. Thus, hashing is used to transform a key (that may have been compressed) to an address. A popular hashing algorithm is to split the key into two parts, multiply the two halves of the key, and use the middle bits of the product as the hash address.

Compression techniques are especially useful in cases where the keys may contain errors, such as may occur in systems performing retrievals based on names, particularly if the name has been transmitted verbally. Specialized compression techniques based on phonetics have been developed that map various spellings of the same word into the same compressed key. A name-based retrieval system is an example where both compression and hashing might be used together.

The main problem connected with the use of calculated addressing concerns the ambiguity that is introduced by the transformation to a shorter key. When two keys calculate to the same file address, a "collision" is said to have occurred; some method of resolution of the conflict must be employed. The general effect of incorporating collision handling into a file design is to require a large increase in the number of processing steps that are performed whenever a collision is encountered. This effect can substantially degrade the performance of a calculated addressing scheme, despite its intuitive simplicity. Therefore, in a calculated addressing scheme, careful selection of transformation algorithms and the collision-handling scheme must be made.
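The multiply-the-halves algorithm mentioned above can be sketched as follows; the 32-bit key width and the table size are assumptions made purely for illustration.

    # Minimal sketch: hash a 32-bit integer key by multiplying its two
    # 16-bit halves and taking bits from the middle of the product.
    def hash_address(key, table_bits=10):
        upper, lower = (key >> 16) & 0xFFFF, key & 0xFFFF
        product = upper * lower                     # up to 32 bits
        middle = product >> ((32 - table_bits) // 2)
        return middle & ((1 << table_bits) - 1)     # address in [0, 2**table_bits)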
2.4 List Organization

A pointer to a record is the address of that record, expressed in a way that permits the direct location of the record. Thus, a pointer can be an actual disk address, or it can be an address relative to the disk address of
the first record of the file, or some other quantity. By the use of pointers to imply linking relationships between records, it is possible to completely divorce the physical and logical arrangement of a file. In fact, through the use of pointers, it is even possible to represent recursive data structures, which have no representation without pointers.

The fundamental component of a list is a record, as defined above, where one or more of the fields may be pointers. Then a list can be defined as a finite sequence of one or more records or lists. Lists that are not recursive and do not have loops or intersections can be represented without the use of pointers, if physical ordering is used to represent the linking relationship between records. This type of allocation is called sequential allocation; in contrast, the use of pointers to join related records is called linked allocation. Linked allocation schemes are easier to update, but require more storage for pointers. In the figures showing file organizations with pointers, pointers are represented by arrows from the record containing the pointer. The end of a list is indicated by a pointer of some special value, often zero; in the figures this end-of-list indicator is represented by the symbol used to represent "ground" in circuit diagrams, after Knuth [107].
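In a linked allocation each record simply carries the pointer of its successor. The following Python sketch is illustrative only (the field names are assumptions); None plays the role of the "ground" end-of-list mark, and the records shown are the Waldorf entries of the example file.

    # Minimal sketch: records stored in a table addressed by record number;
    # a link field holds the record number of the next record on the list.
    records = {
        2: {"owner": "Rudd",    "city": "Waldorf", "link": 4},
        4: {"owner": "Johnson", "city": "Waldorf", "link": 5},
        5: {"owner": "Johnson", "city": "Waldorf", "link": None},
    }

    def traverse(head):
        """Yield the records on the list whose first record number is head."""
        while head is not None:
            yield records[head]
            head = records[head]["link"]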
2.4.1 Linear Lists
A linear list is a set of records whose only structure is the relative linear positions of the records [107]. There are special names for linear lists in which insertions and deletions are made at one end of the list [107]:

    A stack is a linear list for which all insertions and deletions are made at the same end, called the top.
    A queue is a linear list for which insertions are made at one end, called the back, and deletions are made at the other end, called the front.
    A deque (contraction of "double-ended queue") is a linear list for which insertions and deletions can be made at either end, called the left and right of the deque.
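A minimal sketch of the three disciplines, in present-day Python notation and not part of the original survey, using a double-ended queue as the underlying linear list:

    from collections import deque

    items = deque()

    # Stack (LIFO): insert and delete at the same end, the top.
    items.append("a"); top = items.pop()

    # Queue (FIFO): insert at the back, delete at the front.
    items.append("a"); items.append("b"); front = items.popleft()

    # Deque: insertions and deletions at either end, left and right.
    items.appendleft("x"); items.append("y")
    left, right = items.popleft(), items.pop()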
These three structures are encountered frequently. They are sometimes also called queueing disciplines; a stack is a LIFO (last in first out) queue, a queue is a FIFO (first in first out) queue, and a deque is a queue that can be used in either way. These names reflect the primary use of these structures: the construction of various types of task queues. They also occasionally are useful as intermediate files in applications that require complex retrieval processes on large files. For example, a stack might be used to accumulate the results of a search of a file on a primary key for later qualification by other processes. A linear list may be implemented using either sequential or linked allocation. Linear lists are not limited to these restricted cases.
FIG. 5. List organization of example file.
In general, additions and deletions may be made at any point in a list, and a record may be on several lists at once. The great ease with which additions and deletions can be made in a linked list is one of the chief advantages of linked allocation. Figure 5 shows a linear list organization for the example file, where a separate list has been used for each distinct city of residence in the file. The first spare field has been used as the list pointer. The Waldorf list of Fig. 5 could also be implemented by constructing a sequentially allocated list consisting of record No. 2 followed by record No. 4, followed by record No. 5, without any pointers.

2.4.2 Inverted File
An inverted file is composed of a number of inverted lists. Each inverted list is associated with one particular value of some key field of a record and contains pointers to all records in the file that contain that value of the key field [120]. Inverted lists are normally produced for all fields of the record, permitting the file to be accessed on the basis of any field. Figure 6 shows an inverted file structure for the example file. Since the file is inverted on the basis of all fields, any one can be used to access a record.
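The construction of an inverted file can be sketched as a table of inverted lists, one per (field, value) pair, each holding the record numbers that carry that value. The sketch below uses present-day Python notation; the function and field names are illustrative assumptions, and the data are a fragment of the example file.

    # Minimal sketch: build an inverted list for every field of every record.
    def invert(records):            # records: {record_number: {field: value}}
        inverted = {}
        for number, fields in records.items():
            for field, value in fields.items():
                inverted.setdefault((field, value), []).append(number)
        return inverted

    example = {
        2: {"city": "Waldorf", "make": "Plymouth"},
        4: {"city": "Waldorf", "make": "Cadillac"},
        5: {"city": "Waldorf", "make": "Volkswagen"},
    }
    inverted = invert(example)
    waldorf = inverted[("city", "Waldorf")]     # -> [2, 4, 5]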
FIG. 7. Multilist organization of example file.
Note that the longest inverted list is the one for "Md.," which appears in all records in the example file. An inverted file permits very rapid access to records based on any key. However, updating an inverted file structure is difficult, because all the appropriate inverted lists must be updated. For this reason, an inverted file structure is most useful for retrieval if the update volume is relatively low or if updates can be batched.

A variation of inverted file structure that includes features of lists is multilist structure. A multilist consists of a sequential index that gives, for each key, the location of the start of a list that links together all records characterized by that key value. A multilist can be regarded as an inverted list structure in which all entries after the first in each inverted list have been represented by lists in the file rather than by entries in the inverted list. Figure 7 shows a multilist organization of the example file. Note that one link field is needed for each of the key fields in the original record, since exactly one list will pass through every record for each field in the record. If the lists are divided into pages, if pointers in the index refer to records by page and record number within page, rather than record number within the file, and if each list is restricted in length to one page, the structure is called a cellular multilist. In a cellular multilist, then, each inverted list is represented by a number of sublists, where a sublist is a linked list within a page. The index points to the first record of each sublist.

A multilist is easier to update than an inverted file because it avoids the necessity for complete reorganization of the sequentially allocated inverted lists, but retrievals are slower than with an inverted file because the lists must be traversed to perform a retrieval. A cellular multilist organization lies midway between the inverted file and multilist, both in updating difficulty and in retrieval speed, because it represents the inverted lists by a structure that is partially linked (the file) and partially sequentially allocated (the index).
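A multilist can be sketched as an index that keeps only the head of each list, with the records themselves carrying one link field per key field. The Python sketch below is illustrative; the field names are assumptions, and the city list is the Waldorf list of the example file.

    # Minimal sketch of a multilist: the index maps each key value to the
    # first record on its list; each record carries a link per key field.
    index = {("city", "Waldorf"): 2}
    records = {
        2: {"city": "Waldorf", "city_link": 4},
        4: {"city": "Waldorf", "city_link": 5},
        5: {"city": "Waldorf", "city_link": None},
    }

    def records_with(field, value):
        head = index.get((field, value))
        while head is not None:                 # traverse the list for this key
            yield head
            head = records[head][field + "_link"]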
2.4.3 Rings
A ring is simply a linear list that closes upon itself. A ring is very easy to search: a search starting at any element of the ring can search the entire ring. This facilitates entry into a ring from other rings. The danger that an unsuccessful search may continue endlessly around the ring is solved very simply if the search program saves the record number of the first record it searches; the record number of each record that is searched is then compared to this stored number, and the search terminates unsuccessfully when it returns to its starting point [195].
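The stopping rule described above, remember the starting record and stop when the traversal returns to it, can be sketched directly; the Python notation and the link field name are illustrative assumptions.

    # Minimal sketch: search a ring for a record satisfying some predicate,
    # terminating unsuccessfully when the traversal returns to its start.
    def search_ring(records, start, wanted):
        current = start
        while True:
            if wanted(records[current]):
                return current                  # successful search
            current = records[current]["ring_link"]
            if current == start:
                return None                     # back at the starting record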
FIG. 8. Ring organization of example file.
One important use of rings is to show classifications of data. A ring is used to represent each class; all records in that class are linked to its ring. If a tag field is included in each record along with each pointer, then a class can be searched for the records it contains, or a record can be searched for the classes which contain it, with equal ease. The classification scheme can be hierarchic; if it is, a hierarchic search of the file can be performed very easily, starting at the highest level ring. Figure 8 shows a ring organization of the example file. This example shows only a city ring; other rings could be incorporated if the file was to be entered using other keys. The organizations of Figs. 7 and 8 are very similar; in fact, the only two differences are the circularity of the ring (rather than having some end) and the ring versus sequential organization of the index information. If a file is to be accessed using a variety of types of data, using a ring structure to classify the index information reduces the average access time.

Updating a ring structure is slightly easier than with other list structures because all the pointers to be modified during an update can always be found by traversing the ring. There are two chief disadvantages associated with the use of ring structures: the overhead introduced by the pointers (which is essentially identical to the overhead associated with any list organization), and the number of operations required to locate a record. With an inverted file, for example, if a record is to be located on the basis of three keys, the three appropriate inverted lists can be used in combination to locate the desired record very quickly; with a ring structure, one of the three rings would have to be searched sequentially. Thus, if the rings are very long, the search time to locate a record can be long.
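The multiple-key retrieval contrasted above, combining several inverted lists rather than walking one long ring, reduces to an intersection of the lists. A short illustrative sketch, assuming an inverted table like the one built in the earlier sketch:

    # Minimal sketch: locate records satisfying several keys at once by
    # intersecting the corresponding inverted lists.
    def query(inverted, *keys):                 # keys are (field, value) pairs
        result = set(inverted.get(keys[0], []))
        for key in keys[1:]:
            result &= set(inverted.get(key, []))
        return result

    hits = query(inverted, ("city", "Waldorf"), ("make", "Cadillac"))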
2.5 Tree Structures

A tree file structure corresponds to the hierarchic segmentation of a dictionary. The highest level dictionary points to a second-level dictionary, and so on, until the bottom level dictionary points to records. A tree is a hierarchic structure that corresponds to the Dewey Decimal notation used for classifying books [107]. The root of the tree might be given the number 1, and the first-level branches would be 1.1, 1.2, and 1.3, and so on. Another example of a tree structure is the numbering system used for this article, which is also a Dewey Decimal notation. Trees can be named by the information content of each record. If each record contains only one symbol, the tree is called a symbol tree; if each record contains part of a directory, the tree is called a directory tree [120]. Figures 9 and 10 show symbol and directory trees, respectively, for searching the example file by license number. In both figures, *n is used to
FIG. 9. Example file symbol trees.
represent a pointer to the location of the nth record. Three symbol trees are required because there are three starting symbols. With this structure, a record address is obtained in six comparisons; using the directory tree, in two to five comparisons.

The correspondence between a tree structure and a binary searching algorithm is very strong; in fact, a tree can be drawn to represent the actions of a binary search. The root of the tree would correspond to the middle record of the file; the two elements on the second level would correspond to the one-quarter and three-quarter elements in the file, and so on [283]. In this way, a tree can be searched with the same number of operations as a binary search, but without the necessity for sequential allocation. Thus, at the cost of storage for pointers, the speed of a binary search can be obtained with the ease of linked updating. This is the primary motivation for the use of tree structures in large random-access files.
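The correspondence can be sketched with one node per key, each carrying left and right pointers: searching costs the same number of comparisons as a binary search, while insertion needs no sequential reallocation. The Python sketch below is illustrative; the node layout is an assumption, and balancing is ignored.

    # Minimal sketch: each node holds a key, a record address, and pointers
    # to its left and right subtrees (None marks an empty subtree).
    def tree_search(node, key):
        while node is not None:
            if key == node["key"]:
                return node["address"]
            node = node["left"] if key < node["key"] else node["right"]
        return None

    def tree_insert(node, key, address):
        if node is None:                        # linked updating: no recopying
            return {"key": key, "address": address, "left": None, "right": None}
        if key < node["key"]:
            node["left"] = tree_insert(node["left"], key, address)
        elif key > node["key"]:
            node["right"] = tree_insert(node["right"], key, address)
        return node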
FIG. 10. Example file directory tree.
3. Random File Structures

3.1 Introduction
The availability of direct-access auxiliary storage devices naturally suggests the use of random file organization. The use of a single file address or record number to retrieve any record in a file with only one access to auxiliary storage is intuitively attractive, both in terms of programming simplicity and speed of execution. If some key is available whose number of possible distinct values is equal to the capacity of the file in records, and if some one-to-one mapping of these keys onto record addresses can be constructed, then random organization is very simple to use. In such a case the method called direct address organization, described below, is used. But, more often, the number of possible distinct key values is much greater than the number of record addresses that can be allocated for the file, and the distribution of key values to be expected may not be known. In this case, if random organization is to be used, some method must be developed to transform key values into addresses of records in the file for all possible key values. The two principal methods of performing this transformation, namely, dictionary lookup and calculation, are discussed below.

Note that the contents of a record are also often under the control of the file designer. In particular, each physically addressable storage unit may contain several records. In this case, the addressable unit of auxiliary storage is called a physical record, the file record a logical record. In this discussion, the term record always refers to logical record.
3.2 Direct Address
Direct addressing, when it can be used, is optimally efficient, since only one auxiliary storage access is required to obtain any record. In this case, the file design task is trivial and is completed immediately: auxiliary storage for one record is allocated for each possible key, and the key-to-record-number transformation is programmed.
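As in Fig. 3, the last three digits of the license number can serve directly as the record address. A minimal Python sketch, illustrative only (the storage array size is an assumption):

    # Minimal sketch: the key itself is the address; one slot per possible key.
    storage = [None] * 1000                     # one record per 3-digit address

    def store(license_number, record):
        storage[int(license_number[-3:])] = record

    def fetch(license_number):
        return storage[int(license_number[-3:])]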
3.3 Dictionary Lookup

Dictionary lookup is a very effective file accessing method if the dictionary can be kept in main memory and if all keys have the same length. In this situation, the dictionary can be kept in collating order of the keys, a binary search can be used to find the address of any record very quickly, and only one access to auxiliary storage is needed to obtain any record. If the keys are not all the same length, two alternatives are available. Space sufficient for the longest key in the dictionary can be allotted to every key, in which case a binary search of the dictionary can still be used, or just sufficient space can be allocated for each key, in which case a sequential scan must replace the binary search. The first alternative increases the size of the dictionary; the second increases the average number of operations needed to access a record.

If the dictionary is too large to be kept in core, it must be segmented. A hierarchically segmented dictionary is a tree, discussed in Section 5. A sequentially segmented dictionary requires complicated programming techniques to make it an efficient addressing mechanism. The basic problem is that if the dictionary is arranged so that its most frequently accessed entries are always kept in main memory in order to minimize auxiliary storage accesses, a binary search of the dictionary is impossible.

Dictionary lookup can be used to provide access to a file based on more than one key: a separate dictionary is used for each key. If a dictionary is provided for every field, the structure becomes an inverted file, as discussed in Section 4. Multiple dictionaries tend to slow down updating. If a record is added to or deleted from a multiple-dictionary file, all the dictionaries must be updated. Since they are normally kept ordered by collating sequence, sorting of the dictionaries is also necessary.
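Multiple-key access simply keeps one dictionary per key field, all pointing into the same file. A very small illustrative sketch (the field names and addresses are assumptions):

    # Minimal sketch: one dictionary, sorted on its own key, per key field.
    by_license = {"477-018": 10, "481-062": 11}     # key -> record address
    by_surname = {"Mayo": 10, "Rudd": 11}

    def address_for(field, key):
        dictionary = {"license": by_license, "surname": by_surname}[field]
        return dictionary.get(key)                  # then one storage access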
3.4 Calculation Methods

Dictionary lookup imposes the overhead of an index between the key and the address of the desired record. If the key is known imprecisely, and several probes of the dictionary might be necessary to locate the file key
that is closest to the desired key, maximum possible use of the information content of the dictionary has been made. On the other hand, if the key is known precisely, it seems reasonable to structure the file so that the key can be used to retrieve the record without access to any intermediate tables [17]. Calculation addressing methods have been developed to permit access to a file without the use of intermediate tables.

Hash addressing is the use of a key of the desired record to locate the record in storage by means of some transformation, called a hash, on the key. The ideal hash spreads the calculated addresses uniformly throughout the available addresses. A collision is said to have occurred when two keys calculate to the same address. Some method of handling collisions must be provided as part of any file structure incorporating hash addressing techniques [142]. In the sense that a key representing part of the contents of a record is used to address the record, a file organization using hash addressing acts as an associative memory [117].

Sometimes the key field to be used for calculation may contain some characters that do not contribute to the information content of the key; the classic example of this is the redundancy of words in natural language. If keys with redundancy are used as the basis for hash addressing, the distribution of calculated addresses produced by the hash algorithm will be adversely affected. Of course, the performance of any other scheme of key-to-address transformation will be similarly degraded by redundancy in keys; for example, key redundancy would cause every dictionary entry to be longer than otherwise required. In the case of proper nouns and English language words, compression techniques have been developed to reduce this redundancy. Compression is not the same process as hashing: a compression technique transforms a key into another key, but a hash transforms a key into a file address.

3.4.1 Compression
As stated by Bourne and Ford [25], "the major objective [for a compression technique is] to provide as much condensation as possible while maintaining a maximum amount of discrimination between the [keys]." They tested a large number of compression techniques and tabulated the results of their testing. Each algorithm was tested on a number of different sample data bases, including English words and various collections of names. To test an algorithm, it was applied to all the members of a set of test keys. The number of unique compressed keys that were produced by the algorithm was counted, and this number was used as a performance measure.
Some of the more important compression measures tested were:

(1) Selective dropout of every nth letter. Starting from the left end of the word, every nth letter is dropped. The algorithm is applied as many times as necessary to shorten the word to the desired length. For example, consider the word "abdication" for n = 3, compressed to four letters:

    First pass    abictin
    Second pass   abctn
    Third pass    abtn

(2) Selective dropout by character usage rankings for each letter position. Using a separate ranking of character usage for each letter position, eliminate the most popular letters, in order of decreasing popularity, until the desired compressed key length has been reached.

(3) Selective dropout by a single ranking of bigram usage. Using a single ranking of bigram usage, starting with the most popular bigram, delete bigrams in order of popularity until the desired compressed key length is reached. A bigram is a pair of adjacent letters; each letter of a word except the first and last contributes to two bigrams.

(4) Selective dropout by a single ranking of letter usage. From a single ranking of the usage of letters for all letter positions, remove letters in order of popularity until the desired compressed key length has been reached. Bourne and Ford used different rankings for common and proper words:
    Common words   EIROATNSLCPMDUHGYBFVKWXZJQ
    Proper words   EARNLOISTHDMCBGUWYJKPFVZXQ

(5) Vowel elimination. Starting from the left, eliminate the vowels a, e, i, o, and u until the desired compressed key length is reached. If there are not enough vowels in the key to reach the desired compressed key length, then truncate from the right to the desired length.

(6) Truncation from the right. Starting from the right, remove letters until the desired compressed key length is reached.

Bourne and Ford augmented one of their techniques by generating a "check" character from the characters that had been removed, and appending it to each compressed key. Generation of a check character is straightforward. For example, if the letters of the alphabet are represented by the integers 1 to 26, the deleted characters can be summed modulo 26 to obtain an integer in the required range for a check letter. Consider the
word "abdication" truncated above to "abtn":

    a     1
    b     2
    t    20
    n    14
         --
         37 = 11 mod 26
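Selective dropout with a check character can be sketched as follows. This is an illustrative Python sketch only; the worked example above uses n = 3, and the check character here is summed over the letters that remain (as in that example), whereas the prose describes summing the dropped letters, so the summation convention is an assumption.

    # Minimal sketch of selective dropout plus a check character.
    def selective_dropout(word, target, n=3):
        while len(word) > target:
            word = "".join(c for i, c in enumerate(word, 1) if i % n != 0)
        return word

    def check_character(word):
        total = sum(ord(c) - ord("a") + 1 for c in word)   # a=1 ... z=26
        return chr(ord("a") + (total - 1) % 26)

    key = selective_dropout("abdication", 4)               # -> "abtn"
    key += check_character(key)                            # -> "abtnk"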
In this case, the check character would be the eleventh letter of the alphabet, "k".

Tables II through V present some of Bourne and Ford's results (copyright 1961, Association for Computing Machinery, Inc.; reprinted by permission of the Association). Table II shows the symbols that are used to identify the compression techniques in the tables that follow. Table III is a ranking of the techniques by their performance on words, for various lengths of compressed keys. Note that selective dropout with n = 2 and a check character appended gave the best results in every case. The techniques of selective dropout by character usage ranking were uneven performers and gave good results only for certain compressed key lengths. Vowel elimination and truncation from the right performed very poorly.

Table IV lists the number of unique words remaining in a collection of 2082 words, after compression to various lengths. This table is useful for the selection of a compressed key length.
TABLE II
KEY TO SYMBOLS USED IN TABLES III TO V

    A   Selective dropout of every 2nd letter; check character added
    B   Selective dropout by character usage rankings for each letter position
    C   Selective dropout by a single ranking of bigram usage
    D   Selective dropout of every 2nd letter; no check character
    E   Selective dropout of every 3rd letter; no check character
    F   Selective dropout by a single ranking of letter usage
    G   Vowel elimination
    H   Truncation from the right
TABLE III
PERFORMANCE RANKING OF VARIOUS TECHNIQUES BY PERFORMANCE ON COMMON WORDS, for compressed key lengths of 3 to 6 characters (see Table II for key to symbols).
For technique A, the number of distinct compressed keys obtained with a compressed key length of four characters was nearly identical to the original size of the test set, so that a compressed key length of four characters should suffice for technique A for a key collection similar to the test collection. With the use of any other technique, a compressed key length of at least six characters would be required to obtain the same performance.

TABLE IV
NUMBER OF UNIQUE COMPRESSED KEYS GENERATED FROM COMMON WORDS

    Compressed key           Technique symbol*
    length, in
    characters       A     B     C     D     E     F     G     H
         1          26    26    26    26    26    26    26    26
         2         511   418   472   401   196   388   300   196
         3        1831  1511  1611  1576  1060  1545  1087   841
         4        2056  1997  1965  1991  1912  1871  1653  1456
         5        2078  1960  2043  2054  2048  1957  1968  1762
         6        2080  2077  2069  2075  2068  2073  2051  1938
         7        2082  2081  2076  2077  2077  2079  2078  2012
         8              2082  2082  2079        2081  2082  2054
         9                          2080        2082        2073
        10                                                  2081
        11                                                  2082

    * See Table II for key to symbols.
TABLE V
NUMBER OF UNIQUE COMPRESSED KEYS GENERATED FROM NAMES

    Compressed key    Technique symbol*
    length, in
    characters       A     D     E     H
         1          25    25    25    25
         2         606   557   245   245
         3        7117  6313  2559  1542
         4        8122  8013  7115  3561
         5              8171        4914
         6                          5875
         7                          6756
         8                          7377
         9                          7766
        10                          7953
        11                          8042
        12                          8117

    * See Table II for key to symbols.
Table V lists the number of unique proper names remaining in a test collection of 8184 unique names of people. Before the compression algorithms were tested, the names were edited by removing blanks and other nonalphabetic characters to form a single word of no more than 22 characters. In this case, very good performance could be obtained using algorithm A and a compressed key length of four letters. In every case, however, the names, which originally contained six to 22 characters, could be represented with very little loss of uniqueness by a compressed key of ten characters.

In a system such as an airline reservation system, a bank account record system, or a credit card account system, the system must be able to perform retrievals based on names, and the names may be misspelled, especially if they are being communicated verbally. Special compression techniques, such as Soundex and Consonant Coding, have been developed for such systems. These techniques are designed to map, as much as possible, all the possible spellings of any given name into the same compressed key.

The Soundex compression scheme is usually used to convert names into one alphabetic character and three digits, although other compressed keys could be produced. The rules are as follows [25]:

(1) The first character of the name is used for the first character of the compressed key.
(2) Except for the first letter, drop all occurrences of a, e, i, o, u, y, w, and h.

(3) Assign the following digits to the following similar-sounding sets of characters:

    Characters                    Digit
    B, F, P, V                      1
    C, G, J, K, Q, S, X, Z          2
    D, T                            3
    L                               4
    M, N                            5
    R                               6
    Insufficient consonants         0

(4) Delete the second letter in a pair of adjacent identical letters or of a pair of letters in the same group.

(5) If there are insufficient consonants, fill out with zeros.

The Soundex technique has been used widely and is described in the literature of several computer manufacturers, without reference to its originator.
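A direct transcription of rules (1) through (5), in present-day Python notation and illustrative only; it reproduces the Soundex column of Table VI below.

    # Minimal sketch of the Soundex rules given above.
    GROUPS = {"B": "1", "F": "1", "P": "1", "V": "1",
              "C": "2", "G": "2", "J": "2", "K": "2",
              "Q": "2", "S": "2", "X": "2", "Z": "2",
              "D": "3", "T": "3", "L": "4",
              "M": "5", "N": "5", "R": "6"}

    def soundex(name):
        name = name.upper()
        digits = []
        for letter in name[1:]:
            code = GROUPS.get(letter)          # vowels, H, W, Y drop out (rule 2)
            if code is None:
                continue
            if digits and digits[-1] == code:  # rule 4: adjacent same-group letters
                continue
            digits.append(code)
        digits = (digits + ["0", "0", "0"])[:3]    # rule 5: fill out with zeros
        return name[0] + "".join(digits)

    # soundex("Korfhage") -> "K612", soundex("Herron") -> "H650"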
Another compression technique used in a system that tolerates misspelling of names is described by Davidson [49]; he calls this scheme "consonant code," and cites an IBM publication as its source [141]. This method produces a compressed key of five alphabetic characters.

(1) Use the first character of the surname as the first character of the compressed key.
(2) Use the first letter of the first name as the fifth character of the compressed key. If there is no first name, use a dummy character that is programmed to match all characters.
(3) Initialize the second, third, and fourth characters of the compressed key as blanks.
(4) Replace the second, third, and fourth characters of the compressed key using the following rules:
    (a) Delete all vowels and all occurrences of H, W, or Y, except for the first letter of the surname.
    (b) If any letter repeats, delete all but the first occurrence.
    (c) Insert the first three remaining letters after the first letter into the second, third, and fourth positions in the compressed key.

With this compression scheme, the first four characters of the compressed key are called the compressed surname. Table VI shows the results of applying these two algorithms to several spellings of three easily misspelled names.
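A sketch of the compressed-surname portion of the consonant code, illustrative only (the fifth character, the first-name initial or a wild-card dummy, is omitted); it reproduces the consonant-coding column of Table VI below.

    # Minimal sketch of consonant-code rules (a)-(c) above.
    def compressed_surname(surname):
        surname = surname.upper()
        kept, seen = [], set()
        for letter in surname[1:]:
            if letter in "AEIOUHWY":           # rule (a): drop vowels and H, W, Y
                continue
            if letter in seen:                 # rule (b): keep only first occurrence
                continue
            seen.add(letter)
            kept.append(letter)
        return surname[0] + "".join(kept[:3])  # rule (c): first three remaining letters

    # compressed_surname("Rodgers") -> "RDGR", compressed_surname("Lo") -> "L"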
TABLE VI
RESULTS OF SPECIALIZED COMPRESSION TECHNIQUES APPLIED TO SAMPLE SURNAMES

    Surname      Soundex   Consonant coding
    Miller       M460      MLR
    Mueller      M460      MLR
    Muller       M460      MLR
    Lo           L000      L
    Loew         L000      L
    Lowe         L000      L
    Korfhage     K612      KRFG
    Korvage      K612      KRVG
    Rogers       R262      RGRS
    Rodgers      R326      RDGR
    Herring      H652      HRNG
    Herron       H650      HRN
The vowel-elimination technique enables both algorithms to handle Miller, Mueller, and Muller; the elimination of the letter W makes Lo, Loew, and Lowe map into the same compressed key for both algorithms. The advantage of the use of equivalence classes of similar-sounding letters in Soundex permits it to compress Korfhage and Korvage into the same compressed key, while consonant coding fails in this case. In the last two cases, Rogers and Rodgers and Herring and Herron, both algorithms fail. This shows the necessity of including searches on secondary keys in name-retrieval systems, in case the specialized compression algorithm fails. The airline passenger record system described by Davidson, for example, includes facility for a search on telephone number for use in such cases.

3.4.2 Hash Addressing
In hash addressing, the key is operated upon by some mathematical algorithm called a hash that generates an address in secondary storage directly from the key. The hash maps elements of the set of keys into the set of auxiliary storage addresses, so it is useful to examine the characteristics of these two sets to develop the requirements for a hashing algorithm. The set of keys is a set of variable length quantities which may have varying restrictions on the permitted values for portions of the keys (BCD, integer, etc.). Therefore, the key may in fact be a mixed radix expression; the set of keys will usually be a small subset of the set of possible keys; and the keys will be clustered in various ways, and the way in which they are
grouped may change as deletions and additions to the file are made. In contrast to this, the set of secondary storage addresses is numeric, restricted in range, densely occupied, consecutively ordered, and time-invariant [117]. It is desired that the transformation algorithm enable any record to be located in only one secondary storage access, which is to say that each record should be stored at its calculated address, called its home address. Due to the unpredictable characteristics of the key set, the best possible performance will be obtained from a hashing algorithm that maps keys into what would correspond to a random selection of addresses [117]. No matter how carefully the hashing algorithm is chosen, collisions will inevitably occur. Therefore, it is necessary to choose some method for handling collisions. Hash addressing, then, consists of two stages: a mapping to allocate calculated addresses as uniformly as possible among available addresses, and overflow procedures to direct collisions to other addresses [117].
Secondary storage addressing is usually described in terms of the addressing of buckets, where a bucket is simply the largest quantity of data that can be retrieved in one auxiliary storage access [132, 141]. Usually, a bucket is large enough to hold more than one record. The effect of transforming to bucket addresses, where each bucket holds several records, rather than to a single record address, is generally to improve the performance of the addressing technique in use. In this discussion, the grouping of records into buckets is ignored, since it does not affect the relative merits of these addressing schemes.

(a) Transformation Algorithms. Several algorithms have been developed for transforming keys to addresses, and a number of them have been in common use for several years.
(1) If the key occupies several machine words, form the product of the words and use the middle bits of the product for the hash address. This technique is not useful when part of the key can be zero, such as may occur when short keys are justified by filling with zeros, or if blanks are represented by zeros [142].

(2) If the key is longer than the desired address, cut the key into several pieces, each as long as the address, and form the sum of the pieces. This method can be used for single-word and multiple-word keys [142].

(3) Square the key, and use some bits from the middle of the square as the address. The middle of the square depends on all the bits of the key, so there is a high probability that different keys will hash to different addresses [142].

(4) Exclusive-or the words of a multiword key, then divide the result
by the table size in records and use the remainder as the calculated address. The table length must be odd, or the rightmost bit of the key will become the rightmost bit of the hash address. This method produces a calculated address within the range of file addresses that has very attractive randomness properties [132] (a brief sketch of this method appears after this list).

(5) Interpret the key as a series of 4-bit digits of some radix p (where p ≤ 16) and transform the key representation into some radix q, where p is relatively prime to q; then truncate the result to obtain the calculated address. This method produces a very uniform distribution of calculated addresses, and is especially good at breaking up clusters of adjacent keys [117].
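A minimal sketch of method (4) in C, assuming the key is held in an array of unsigned machine words and that the table length in records has been chosen as described; the function name is ours:

    /* Method (4): exclusive-or the words of a multiword key, then divide by the
       table length in records and take the remainder as the calculated address.
       The table length should be odd, as noted in the text. */
    unsigned long hash_address(const unsigned long key[], int nwords,
                               unsigned long table_length)
    {
        unsigned long h = 0;
        for (int i = 0; i < nwords; i++)
            h ^= key[i];                 /* fold the key words together          */
        return h % table_length;         /* calculated address, 0 .. length - 1  */
    }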
Experimentation is presently the best method available for choosing a hashing algorithm. Generally, some sort of representative sample of the keys is obtained, and a selection of hashing algorithms is used to produce several sets of calculated addresses. Each of these sets is then counted, and the number of distinct calculated addresses is compared to the number of distinct sample keys as a measure of the performance of each algorithm. This process is essentially identical to the measures used by Bourne and Ford [25] for the performance of compression algorithms.

(b) Scatter Table Construction. When records are being added to a file using hash addressing, first a calculated address is produced. Then, if there are no records stored at that address, the record is entered at its home address and the process is complete. However, if there is already a record at the home address of the new record, either the record that is already there must be moved, or the new record must be stored at some other location, and methods for later locating both records must be provided. To retrieve a record, first a calculated address is produced, using the same algorithm that was used to load the file, and then that address is accessed. If the desired record is stored in the home address, or if that location is empty, the operation is complete. If another record is stored at the home address, some means of searching other locations where the record might have been stored in case of a collision must be employed. Three popular organizations for scatter tables are considered here, along with two other specialized techniques [142]. For each method, the insertion procedure is given; the retrieval procedure is virtually identical except that it does not include writing.
(1) Linear Probing. If the calculated address is occupied, then begin scanning forward in the file until an empty location is found, and write the record into that location. This method is very easy to program, but it is the least efficient, because it tends to cluster overflow records. If a collision has occurred at location j, this method raises the probability that a collision will occur at location j + 1 above the average collision probability of the file.
(2) Random Probing. When a collision occurs, a pseudorandom number generator is used to generate an offset; that offset is added to the calculated address, and the resulting address is probed. This process continues until an empty location is found, or until the entire file is full. The pseudorandom number generator must be such that it generates all the addresses in the file once and only once before repeating. This method is more efficient than linear probing because it tends to distribute overflow records. However, deletion of records using random probing is difficult. If other records have collided at the deleted record, they may become unlocatable. The only way to locate all such records is to recompute the calculated address for every record in the file. Alternatively, a special deletion flag may be inserted in the location, and the search program designed so that a deleted record is treated as an unequal probe. Eventually, however, this technique leads to an excessive number of searches, and the file must be reorganized.

(3) Direct Chaining. When a collision occurs, if the record in the home address has the same calculated address as the record to be stored, the new record is stored in any available location obtained by any means, and a pointer to the overflow record is stored in the home location. If the record occupying the home location does not have the same calculated address as the record to be stored, it is moved anywhere else, its overflow chain is adjusted, and the new record is stored in its home location. This is the most efficient scheme for handling collisions. Storage space for a chain pointer is used in each record to permit more rapid search. The greatest programming complexity is encountered in the necessity to move records from one location to another.

(4) Scatter Index Tables. It is possible to separate the data area completely from the scatter table if every record is treated as an overflow record. The scatter table then consists only of pointers to chains of records in the data area. This technique is particularly advantageous in the case of variable-length records; if variable-length records are to be stored in a scatter table, the table must include, for each record, space for the longest record in the file. Deletion is not difficult with the use of scatter index tables; the deleted record is simply removed from its chain, and its space is returned to free space.

(5) Randomized Check Code. When the key is very complex, or is a variable-length string of characters, the comparison of the key with the key of each record encountered during a probe may be very time consuming. In this case, to speed this comparison, a calculated address that consists of more bits than are needed for the address can be computed, and the extra bits can be stored with each record. Then, when a record is encountered during a probe, these extra bits are compared, and the full keys are compared only when these two supplementary keys are equal.
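The following sketch illustrates the simplest of these schemes, linear probing, for a table of fixed-length records held in core. The record layout, the table size, and the use of a zero key to mark an empty slot are assumptions made for the illustration, not part of the method as described.

    #define TABLE_SIZE 1024                 /* number of record slots (assumed)      */

    struct record {
        unsigned long key;                  /* zero marks an empty slot (assumption) */
        /* ... other fields of the record ... */
    };

    static struct record table[TABLE_SIZE];

    /* Insert a record by linear probing: probe the home address, then scan
       forward (wrapping around) until an empty slot is found.  Returns the slot
       used, or -1 if the table is full.  Retrieval is the same scan, stopping at
       an equal key or an empty slot. */
    int insert_linear(struct record r)
    {
        unsigned long home = r.key % TABLE_SIZE;       /* calculated (home) address */
        for (unsigned long i = 0; i < TABLE_SIZE; i++) {
            unsigned long slot = (home + i) % TABLE_SIZE;
            if (table[slot].key == 0 || table[slot].key == r.key) {
                table[slot] = r;
                return (int)slot;
            }
        }
        return -1;                                     /* every slot is occupied    */
    }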
TABLE VII
AVERAGE NUMBER OF PROBES TO RETRIEVE A RECORD FOR VARIOUS SCATTER TABLE ORGANIZATIONS

Load factor     E-random probing    E-linear probing    E-direct chaining
.1              1.05                1.06                1.05
.5              1.39                1.50                1.25
.75             1.83                2.50                1.38
.9              2.56                5.50                1.45
If an equal number of bits are used for the calculated address and the supplementary key, it is possible that no two keys will ever be compared unless they are equal. Morris [142] has determined the average number of probes needed to locate a record for each of the first three methods above; his results are tabulated in Table VII and shown as a graph in Fig. 11.
FIG. 11. Average number of probes to retrieve a record for various scatter table organizations. Copyright 1968, Association for Computing Machinery, Inc.; reprinted by permission of the Association.
The load factor is simply the fraction of the available locations in the file that are occupied, and the average number of probes depends only on the load factor, and not on the file size. For small load factors, all three methods give practically identical performance; therefore, for a load factor between .1 and .5, linear probing, which is the simplest to program, could economically be used. For higher load factors, however, random probing gives better performance than linear probing, at a moderate increase in programming complexity. For the best possible performance at high load factors, and at the highest cost in programming complexity, direct chaining is used. Thus, the selection of a scatter table organization depends on the occupancy factors to be experienced by the file and the performance requirements.
4. List File Structures

4.1 Introduction
This section describes list structures, which include lists, linear lists, inverted lists, multilists, and rings. For each structure, methods for searching and updating are outlined, along with a discussion of the various tradeoffs that must be made in selecting a particular organization. In dealing with list structures, it is desirable to have a compact notation for specifying algorithms for manipulating them. A convenient notation for this purpose has been introduced by Knuth [107]. Every record consists of a number of fields, some of which are pointers; therefore, all operations on lists of records can be expressed in terms of operations on fields of records. Let the notation also include link variables, whose values are pointers. Then, let a reference to any field of a record consist of the name of the field, followed by the name of a pointer variable pointing to the record, enclosed in parentheses. To use the example of Section 2, let each record consist of the fields LAST, FIRST, STREET, CITY, STATE, MAKE, YEAR, and LICENSE, corresponding to the fields of the record layout shown in Fig. 1. In addition, augment that record by adding one pointer field, and call the pointer field NEXT. Consider the example of Fig. 5, which shows the file with such a field added to each record. Suppose the initial value of the pointer variable LINK is 2; that is, it points to record number 2. Then the value of NEXT(LINK) is 4, the value of NEXT(NEXT(LINK)) is 5, and NEXT(NEXT(NEXT(LINK))) is the null link, which is represented by Λ, the Greek letter lambda. The only operator included in the notation is ←, which is used as an assignment operator. A ← B is read "assign the contents of B to A." The usual mathematical symbols for equal, greater than, less than, and so forth will be used with their conventional meanings.
Let the notation also include one reserved word: LOC. If V is a variable or field, let LOC(V) be the storage address of V. To return to the example given, LOC(NEXT(NEXT(LINK))) = 4. The last notational convention needed is the use of square brackets to indicate relative position within a list. If W is the name of the list in the example above, then
YEAR(W[1]) = 1968   and   NEXT(W[3]) = Λ,
where YEAR and NEXT are field names, and W[n] is treated as a link variable. List processing languages were originally developed during work in artificial intelligence, in which the structure of data files, and not just their contents, must be changed drastically during a program, and cannot be predicted. For applications of this type, list structures are used not only for files that are stored on auxiliary storage devices, but also for core memory. A variety of list-processing languages were developed to simplify programming of applications that make extensive use of list structures; these languages typically provide extremely sophisticated capabilities, and usually are rather slow in their execution. Because of the existence of these languages, a common notion has developed that a list-processing programming language must be used to write any program that manipulates list structures. The use of Knuth's method of expressing algorithms in this discussion, which is very similar to several programming languages, shows that these manipulations can easily be performed in a popular programming language, such as FORTRAN, PL/I, or COBOL. Whenever lists are being manipulated by a program, no matter what the type of list, one problem always must be solved: the maintenance of a pool of available space. Before any list manipulation takes place, the storage space that is available for all list structures is first initialized in some way so that all the records in this area are linked to the pool of available space, called the free list. Then, when a record is to be added to a list, it is unlinked from the free list, its fields are set to their desired values, and the record is linked onto the desired list. The same effect can be achieved by the use of an availability table containing one bit for each record. When a record is used, its bit is set to 1; when it is released, its bit is returned to zero. During execution of a program, an interchange of records between the pool of available space and the lists being processed will take place.
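As a sketch of this arrangement, the free list can be threaded through the records themselves using the NEXT field of the earlier example; the pool size, the array representation, and the routine names below are ours:

    #define NRECORDS 1000
    #define NULLPTR  (-1)                   /* plays the role of the null link      */

    struct rec {
        int next;                           /* the NEXT link field                  */
        /* ... LAST, FIRST, STREET, and the other data fields ... */
    };

    static struct rec pool[NRECORDS];
    static int free_list = NULLPTR;

    /* Link every record in the pool onto the free list before any list processing. */
    void init_free_list(void)
    {
        for (int i = 0; i < NRECORDS - 1; i++)
            pool[i].next = i + 1;
        pool[NRECORDS - 1].next = NULLPTR;
        free_list = 0;
    }

    /* Unlink one record from the free list; returns NULLPTR if none remain. */
    int get_record(void)
    {
        int r = free_list;
        if (r != NULLPTR)
            free_list = pool[r].next;
        return r;
    }

    /* Return a record to the free list when the program has finished with it. */
    void release_record(int r)
    {
        pool[r].next = free_list;
        free_list = r;
    }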
Since the processing associated with this function is a pure overhead task, and does not contribute directly to solving a problem, considerable attention has been devoted to the methods of performing this interchange, with the goal of minimizing the number of instruction executions needed. There are two basic approaches to the maintenance of the free list:

(1) Return a record to the free list whenever it is no longer being used by the program.
(2) When a record is released by the program, do not link it to the free list. Rather, wait until requests for records from the free list exhaust it, and then locate and link to the free list all records that are not currently in use. The procedure of identifying and linking up all free records is known as garbage collection.

Three principal methods have been described, two that use variations on the first approach, and one that uses garbage collection [166]. Newell [146] describes the method incorporated in the IPL-V programming language: leave the responsibility to the programmer. Thus, whenever a node is deleted from a list, the program must make a decision regarding its return to the free list. If the only structures being manipulated are relatively simple, this is not an unacceptable solution; however, if, for example, lists are permitted to be parts of other lists, the task of determining when a record should be returned to the free list becomes complicated, and this approach loses much of its appeal. A second approach is due to Gelernter et al. [74]. A reference counter is maintained for each record, showing the number of references to it. When a reference count is decremented to zero during a deletion from a list, the record is returned to the free list. This approach rapidly becomes more complicated as the permitted list structures become more complex. For example, if part of a list is added to another list, a new reference counter for the part must be established. In this way, reference counters tend to proliferate. McCarthy [126] originally proposed the scheme of garbage collection. Using this method, once the free list has been exhausted, processing of the program is temporarily suspended, and the garbage collection routine is initiated. This routine traces through all the lists in use by the program, marking each node that is encountered in a list. Every record in the file is then read in order, and those that have not been marked are placed on the free list. Finally, the marks are removed from the marked nodes, to initialize for the next garbage collection. A variety of algorithms for traversing lists have been proposed; for further details in this area, see Schorr and Waite [166] or Knuth [107].

4.2 List Structures
The completely general definition of a list is a sequence of one or more records or lists. This definition, due to Knuth, permits a list to be an element of itself.
Such a facility is useful in applications requiring recursive structures, particularly artificial intelligence applications, but it is not really applicable to the general file organization problem, so it is not considered at any great length in this discussion.

4.3 Linear Lists
As defined in Section 2, a linear list is a set of records whose only structure is the relative linear positions of the records. Note the difference between a list and a linear list; a list can include circular or recursive structures, whose structural relationships are more complex than a simple linear arrangement. The three important classes of linear lists in which insertions and deletions are made at the ends of the list are defined in Section 2 (see page 122). Linear lists can be implemented using either sequential or linked allocation. Sequential allocation simply places the records in the list one after another, so that

LOC(X[i + 1]) = LOC(X[i]) + c,

where c is the number of words in each record (this implies that all records are the same length). Sequential allocation is especially convenient for the implementation of a stack. The only pointer variable needed is a pointer to the top of the stack, called TOP. To place an element X onto stack Z, two steps are necessary: TOP is incremented, and X is placed into the storage location pointed to by TOP:

TOP ← TOP + c
Z[TOP] ← X.

Removing an item from the top of the stack, provided the stack is not empty, is easily accomplished by reversing the above procedure:

X ← Z[TOP]
TOP ← TOP - c.

From these two algorithms, it is obvious that a stack is very convenient to implement using sequential allocation. Most stacks that are resident in main memory are implemented in this fashion. Instead of using sequential storage locations, it is possible to arrange a linear list using pointers so that the logical structure and physical structure are completely independent. This is accomplished by representing the logical connection between two records by a pointer that points from one record to another. Linked allocation permits greater flexibility in the use of storage, at the cost of additional storage space for the pointers.
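The two stack operations above translate directly into C; the sketch below stores one word per record (c = 1) and adds the overflow and underflow checks that a practical implementation would need (names ours):

    #define STACK_WORDS 1000

    static int z[STACK_WORDS];              /* sequentially allocated stack storage  */
    static int top = -1;                    /* TOP: index of the current top element */

    /* TOP <- TOP + c ; Z[TOP] <- X */
    int push(int x)
    {
        if (top + 1 >= STACK_WORDS)
            return -1;                      /* stack overflow                        */
        z[++top] = x;
        return 0;
    }

    /* X <- Z[TOP] ; TOP <- TOP - c, provided the stack is not empty */
    int pop(int *x)
    {
        if (top < 0)
            return -1;                      /* stack underflow: the stack is empty   */
        *x = z[top--];
        return 0;
    }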
Knuth gives these points in a comparison of linked and sequential allocation:

(1) Linked allocation requires more storage, because of the pointers.
(2) Deletion from a linked list is much easier (one pointer is changed) than from a sequential list, which may require shifting of the entire list.
(3) Insertion into the midst of a linked list is much easier.
(4) Random access to any element of the list is much faster for a sequential list.
(5) It is easy to join or break two linked lists.
(6) Intricate structures, such as lists of variable-length records, can be represented more easily with linked lists.

The number of links in a record is not by any means limited to one; varying numbers of links may be included in each record for different list organizations. One such structure is a doubly linked list, illustrated in Fig. 12. The extra space required to store the pointers for a doubly linked list provides two advantages: the list can be searched in either direction, and insertion and deletion of records can be accomplished very easily. For example, the algorithm for deletion of the record pointed to by the link variable X is

RLINK(LLINK(X)) ← RLINK(X)
LLINK(RLINK(X)) ← LLINK(X).
In this case, the two pointer fields of each record are called RLINK and LLINK. This algorithm permits the deletion of a record, given only its record number; no tracing through the list is required.
FIG.12. Doubly linked linear list.
Similarly, a record may be inserted to the right or left of any record in the list in an equally simple manner. Consider the insertion of record V to the right of record U:

LLINK(V) ← U
RLINK(V) ← RLINK(U)
LLINK(RLINK(V)) ← V
RLINK(U) ← V.
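In C, with the records of the list held in an array and the RLINK and LLINK fields stored as array indices, the two algorithms become the following sketch. It assumes the list is circular through a head record, so that every LLINK and RLINK is a valid index; the names are ours.

    struct node {
        int llink, rlink;                   /* the LLINK and RLINK fields            */
        /* ... data fields ... */
    };

    static struct node list[1000];          /* assumed pool of records               */

    /* Delete the record with index x:
       RLINK(LLINK(X)) <- RLINK(X) ;  LLINK(RLINK(X)) <- LLINK(X) */
    void delete_node(int x)
    {
        list[list[x].llink].rlink = list[x].rlink;
        list[list[x].rlink].llink = list[x].llink;
    }

    /* Insert record v to the right of record u:
       LLINK(V) <- U ; RLINK(V) <- RLINK(U) ; LLINK(RLINK(V)) <- V ; RLINK(U) <- V */
    void insert_right(int u, int v)
    {
        list[v].llink = u;
        list[v].rlink = list[u].rlink;
        list[list[v].rlink].llink = v;
        list[u].rlink = v;
    }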
Singly and doubly linked lists are used in many applications, both for structures in core memory and for auxiliary storage. One example of the use of a singly linked list is the chaining method of scatter table organization described in Section 3. In this structure, the pointers to overflow chains are list pointers; this is an example of a singly linked list. Since the overflow chains are searched in only one direction, and since deletion is performed only during a forward search of the list (so that all the pointer values needed are immediately available), there is no need for double linking in this case.

4.4 Inverted File and Multilist Organizations

The inverted file organization has been developed to minimize the time needed to retrieve from a file, at the expense of update time [17]. The ideal file structure for retrieval purposes would read only the desired records from auxiliary storage and no others, performing all preliminary searching in core. If the complete inverted file can be kept in core, the inverted file structure accomplishes this goal. Even if the inverted lists must be stored on auxiliary storage, one access to auxiliary storage can read an inverted list pointing to many more records than could be read in one access; so even in this situation the number of accesses to auxiliary storage is less than would be required to search the records themselves. In order to facilitate conjunction and disjunction ("and" and "or") operations on inverted lists, the simplified structure of Fig. 6 is usually augmented by the addition to each inverted list of a count giving the number of items in the list. This count is used to select the shortest lists for scanning, if the search permits such a choice. For example, consider the inverted file of Fig. 7. If the file is to be searched for all persons who are residents of Maryland, have the name Johnson, and own Volkswagens, then it is obviously easier to intersect the Volkswagen and Johnson lists first and then intersect the results of that search with the list for Maryland; in this way, only one candidate record remains after the first intersection. This advantage is more striking when the two lists are very different in
length. If one wants to find a sergeant who speaks Urdu and Sanskrit in an army personnel file, it is obviously easier to first intersect the Urdu and Sanskrit lists, and then intersect the result with the list on sergeant, than to start with the sergeant list. Union and intersection operations on inverted lists are facilitated if each list is maintained in collating order of the record addresses. In this case, two lists can be intersected or their union can be found with one pass through both lists. For this reason, inverted files are almost always kept in collating sequence. An inverted file is difficult to update. To add a record, the inverted list corresponding to the value for each field in the record must have a pointer to the new record added. The necessity of keeping each inverted list in order by record number (to facilitate searches and merges) tends to increase the complexity of this operation. Deleting a record from an inverted file organization similarly requires the modification of inverted lists corresponding to the value of every field in the record, with the same problems. A data management system that has been designed around the inverted file organization is TDMS (Time-shared Data Management System) [20], produced by the System Development Corporation. TDMS uses inverted files, with a hierarchically segmented dictionary of several levels that is used to locate the inverted lists. Naturally, the performance characteristics of TDMS are those of an inverted file organization. Retrievals can be performed on the basis of any field, which is a useful feature if the queries are unpredictable as to field requirements, and logical operations in queries can be performed efficiently. Of course, the use of an inverted file organization optimizes the retrieval performance at the expense of updating, so updating TDMS files is very time consuming. The particular organization of the inverted lists chosen for TDMS is rather elaborate, so the storage of the inverted file requires as much storage space as does the original file itself. Thus, a substantial penalty in ease of updating and storage economy has been incurred in order to optimize the response to unpredictable queries. Multilist file organizations are also usually augmented by the addition of a list length to each index entry, so that the shortest lists can be searched. A multilist structure is conceptually an inverted file, where only the heading of each inverted list is kept in the index, and the rest of each list is indicated by links among the records. Therefore, the multilist is easier to update, because the index information is not stored in a link organization, but it is more time consuming to search, because now every record on each inverted list must be brought into core.
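The one-pass intersection of two inverted lists kept in collating order, mentioned above, can be sketched as follows; the representation of a list as an array of record addresses and the function name are ours:

    /* Intersect two inverted lists a[0..na-1] and b[0..nb-1], each sorted in
       ascending order of record address, writing the common addresses to out.
       Returns the number of addresses present on both lists. */
    int intersect_lists(const long a[], int na, const long b[], int nb, long out[])
    {
        int i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])
                i++;                 /* advance the list with the smaller address */
            else if (a[i] > b[j])
                j++;
            else {
                out[n++] = a[i];     /* address appears on both lists             */
                i++;
                j++;
            }
        }
        return n;
    }

Forming the union is the same single pass, except that every address from either list is written to the output.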
TABLE VIII
TIMING COMPARISONS OF INVERTED FILE AND MULTILIST ORGANIZATIONS

Transaction                                     Inverted file     Multilist
Retrieval                                       13.9              86.8
Whole record addition                           15.0               4.0
Whole record deletion                           15.8               1.3
Non-key modification (without relocation)        1.3               1.3
Non-key modification (with relocation)          25.8               4.8
If the lists are long and the queries being processed are conjunctions containing a number of keys, the ratio of records read from auxiliary storage to the number of records satisfying the query may be as low as 1/1000, which means that 99.9% of the reads from auxiliary storage retrieve no useful information [115]. Table VIII presents a comparison of timing calculations to perform various operations on multilists and inverted files. The transactions also include an initial decoding using a three-level tree as a directory to the index. The time units in the table are seconds [115]; however, their values for any specific implementation of these structures would depend on the details of the hardware and software used in the implementation. Therefore, these numbers should be used only as a basis for comparing the relative speeds of the two organizations for the transactions presented. The inverted file is much faster in retrieval, but much slower in updating. The choice of one or the other organization depends on the relative volume of updates and retrievals and the response time requirements for retrievals and updates. A ring is no more than a linked list that closes upon itself. The ring structure shown in Fig. 8 can be viewed as a variant of the multilist organization. If the multilist index is placed in a linked list that closes upon itself, and if the null link that terminates every list in a multilist is modified to point to the head of the list, the multilist has been converted to a ring organization. The performance characteristics of a ring and a multilist are similar.

4.5 Ring Organizations
Because a ring has no end and no beginning, and because, given some record number x, the record on each side of x can always be located by moving forward through the ring, it is somewhat easier to construct a general-purpose system using rings than multilists. In fact, two such systems have been constructed. One of these, Integrated Data Store (IDS), produced by General Electric, is available on a variety of GE computers.
IDS is a set of subroutines for processing rings; facilities to create and destroy, add to and delete from rings are provided. IDS provides a convenient way to mechanize a file that may change drastically and that requires the representation of complex interrelationships of records [10]. Rings can also be used to represent matrices; a separate ring can be used to represent each row and column, and each record will therefore reside on exactly two rings. This storage method is useful only for sparsely occupied matrices; for dense matrices, the sequential allocation techniques that have been developed for use with programming languages such as FORTRAN (see, e.g., Knuth [107]) are more economical of storage space. Rings have been used to construct cylinders, where a cylinder is a pair of rings with cross linking between them. Various types of information can be represented by these links. For example, a figure consisting of n points joined by line segments can be represented by a cylinder whose two rings each have n records [195]. A line connecting point j with point k on the figure is represented by a link between the jth record in one ring and the kth record in the other ring of the cylinder. At the present time it is not clear that cylinders are useful for other than research applications; however, useful applications in the area of graphic data representation may be developed.

5. Tree File Structures

5.1 Introduction
A binary search is used for a sequentially allocated random-access file that is stored in order of the collating sequence of its keys. This arrangement reduces search time at the expense of update time. For a file that is updated more often than it is searched, linked allocation can be used to minimize update time at the expense of search time. For a file that is updated and searched with similar frequency, however, neither of these approaches is very practical, and some sort of compromise must be struck. A tree structure is such a compromise, combining the speed of a binary search with the updating ease of linked allocation. A precise definition of a tree structure can be expressed easily with the use of elementary graph theory. The following is a modification of definitions used by Sussenguth [183] and Birkhoff and Bartee [18]:

A graph is a set of nodes and branches joining pairs of nodes.
A path is a sequence of branches such that the terminal node of each branch coincides with the initial node of each succeeding branch.
A graph is said to be strongly connected if there exists a path between any two nodes of the graph.
A circuit is a path in which the initial and final node are identical.
A tree is a strongly connected graph with no circuits.
The root of a tree is a node with no branches entering it. The root of the tree is said to be at the first level; a node which lies at the end of a path containing j branches from the root is said to be at the (j + 1)th level.
A leaf is a node with no branches leaving it. The filial set of a node is the set of nodes which lie at the end of a path of length one from the node. The set of nodes reachable from a node by moving toward the leaves is said to be governed by the node.
From the above definitions, it follows that a tree will always have one and only one root; a tree of n nodes will always have exactly n - 1 branches; and n - 1 is the smallest number of branches required for the tree to be strongly connected. Tree-organized files are used most often as indexes to some other file. Such an arrangement permits a record in the tree file to consist of only keys, pointers to other records in the tree, and addresses in the file. This approach is particularly useful if the records in the file are variable in length. If the file consists of short, fixed-length records, then the entire record can be placed within the tree structure. In this discussion, it is assumed that the tree structures under consideration are being used for key-to-address transformation, also called decoding. A key, of course, can be any information that can be represented in a computer, such as a series of mixed letters and digits, binary integers, or some other quantity. Since decoding is an application of tree searching, which is the fundamental operation performed in any tree-structured file, this discussion is also applicable to any tree-structured file. It is important to be aware of the distinction between the structure of a tree, which consists of certain items of information and some defined relationship between pairs of these items, and the representation of a tree within a computer, which is composed of words of storage containing numeric values. To index a given file, one of several different tree structures might be used, each of which could be represented in several different ways in the computer. Thus, to design a tree-structured index to a file, first the tree itself, which is the embodiment of certain indexing relationships among the records, must be selected, and then a representation of that tree must be designed. This section first considers various tree organizations, then discusses methods of reducing search time for each of them. Finally, several representations of trees are considered.

5.2 Tree Structures
Many names have been used for various tree structures; no standard terminology exists. For this discussion, trees are classified into three types: symbol trees, immediate decoding trees, and directory trees.
These three categories provide a reasonable basis for a survey of tree organizations. However, the following discussion by no means exhausts all the possible combinations of node contents and linkage arrangements that can be made. Any key can be decomposed into a string of individual symbols. For example, a decimal number can be decomposed into a string of single-digit symbols, a series of double-digit symbols, and so on; a 36-bit binary number can be decomposed into twelve symbols of three bits each, nine symbols of four bits each, and other groups of symbols. In the construction of a symbol tree, the leaves of the tree are placed in one-to-one correspondence with the addresses in the file that is indexed by the tree. The non-leaf nodes of the tree are used only for searching; each leaf node of the tree contains a file address. Each key is broken into a number of symbols, and one node is used for each symbol. The tree will have one node on the first level for each distinct first symbol of a key; one node on the second level for each distinct second symbol; and so on [183]. The filial set for each symbol will have one node for each distinct following symbol in the key set. Figure 13 shows a symbol tree for the words platypus, platinum, peasant, poet, and playboy.
FIG.13. Symbol tree.
Note that keys with common initial polygrams (sequences of symbols) share nodes in the tree. Such polygram sharing tends to reduce the storage requirements of the tree. A symbol tree is therefore especially useful when the keys are long and share initial polygrams. A further advantage of symbol trees is that variable-length keys present no additional complications. With a symbol tree, decoding is never complete until a leaf is reached. There is another type of tree that has appeared in the literature [8], called in this discussion an immediate decoding tree, in which an entire key is stored at each node, and decoding can be completed without reaching a leaf. In this structure, one node is used for each distinct key value. The tree is searched until an exact match is found, and then the file address in that node is taken as the result of decoding. An immediate decoding tree for keys which take the integral values from one to fifteen is shown in Fig. 14. This structure is particularly suitable for relatively short keys that can be represented in one or two machine words. For long keys, particularly if they share many initial polygrams, a symbol tree makes more efficient use of storage. When an immediate decoding tree is searched, the search key is compared with the key at the root. If the two keys are equal, the file address stored at the root is the decoded address. If the search key is less than the key at the root, the left branch is taken and the previous step is repeated. If the search key is greater than the key at the root, the right branch is taken and the previous step is repeated. If a leaf is reached without an equal comparison, the search terminates unsuccessfully.
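For a linked representation of an immediate decoding tree such as that of Fig. 17, the search just described can be sketched as follows; the field names and the use of -1 to signal an unsuccessful search are ours:

    struct tnode {
        long key;                           /* the entire key stored at this node     */
        long address;                       /* the file address decoded at this node  */
        struct tnode *left, *right;         /* branches to the two filial nodes       */
    };

    /* Search an immediate decoding tree; returns the file address, or -1 if a
       leaf is passed without an equal comparison. */
    long decode(const struct tnode *root, long search_key)
    {
        const struct tnode *p = root;
        while (p != 0) {
            if (search_key == p->key)
                return p->address;          /* decoding is complete at this node      */
            p = (search_key < p->key) ? p->left : p->right;
        }
        return -1;                          /* unsuccessful search                    */
    }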
FIG.14. Immediate decoding tree No. 1.
A directory tree lies in the middle ground between a symbol tree and an immediate decoding tree. Each node of a directory tree contains several entire keys, but no file addresses. During the search, if the search key is greater than the jth key at a node but less than the (j + 1)th key, then the branch corresponding to the jth key is tested next. When a leaf is reached, that leaf will contain the desired file address, if the search key is in the file. Figure 10 shows a directory tree. A directory tree is most useful if the file keys are all the same length, or can be compressed to the same length. In that situation, the keys stored at each node can be searched using a binary search, greatly speeding the search process.
5.3 Minimizing Search Time
Considerable attention in the literature has been given to methods of reducing the search time for a tree. Various methods have been proposed for the three tree types under consideration here. Sussenguth [183] first addressed the problem of searching a symbol tree. He claims that the searching time requirements for such a tree are only 24% greater than for a binary search. However, his results are limited by his assumptions about the computational steps needed for each process and by his assumption that all leaves lie at the same level of the tree, which amounts to assuming that all keys are of equal length. Patt [150] removes some of the limitations of Sussenguth's analysis. Patt's work has been criticized by Stanfel [178] because of his assumption that no strings share initial polygrams. Nevertheless, one of Patt's theorems is useful in reducing the search time of a symbol tree: the average search length for this type of structure is minimized if the nodes of every filial set are ordered according to the number of terminal nodes reachable from each.
If the symbol tree is constructed in the way Patt suggests, the symbols stored at those nodes that are elements of the largest number of key strings will be the first to be compared with symbols from the search key. Scidmore and Weinberg [167] propose a slight modification of Sussenguth's structure; instead of using terminal nodes for data record pointers, they allow such a pointer to be placed in any node. Their analysis therefore does not require that every such pointer lie at the same level of a tree, or even that the pointers terminate a path through the tree. But in their analysis they assume a uniform random distribution of keys. This assumption is open to question; several factors act to cluster keys. For example, if keys are words from natural language, rules of spelling and phonetics act to cluster the keys. Updating a symbol tree is not difficult, and it can be done online. The only additional structure that is needed is a list of available space, maintained as discussed in Section 4.
With an immediate decoding tree, since the root does not correspond to the first symbol in the key, it is possible to completely reorganize the tree for faster searching. In particular, if some paths from the root to the leaves are considerably longer than others, the upper bound on search time can be lowered by reorganizing the tree to make all the path lengths more nearly equal. Consider as an example the immediate decoding tree illustrated in Fig. 15. The longest path from the root to a leaf has length six; the shortest, two. To reorganize this tree, note that it is simply a representation of a total ordering relation of the keys; this can be demonstrated very easily by traversing the tree in postorder [107]:

1. traverse the left subtree (in postorder)
2. visit the root
3. traverse the right subtree (in postorder).

Tracing the immediate decoding tree of Fig. 15 in postorder gives this ordering of the keys: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. Note that this is the same ordering that is obtained by traversing the immediate decoding tree of Fig. 14 in postorder. Therefore, these two trees are alternative representations of the same total ordering. The simplest method to reorganize an immediate decoding tree is to traverse it in postorder, constructing a sequentially allocated random access file that corresponds to the tree.
FIG.15. Immediate decoding tree No. 2.
This file can then be used to completely reconstruct the tree by assigning the middle element to the root, the one-quarter and three-quarters elements to the first level, and so on. More complicated algorithms that avoid complete reconstruction can also be devised. Minimization of search time for directory trees has been treated by Landauer, who suggests the use of balanced directory trees. Using a balanced directory tree of n levels, all keys are decoded in exactly n or n - 1 levels [111]. Landauer made an extensive theoretical investigation of the branching properties of directory trees and derived an optimum branching factor for such trees. However, his analysis does not include the amount of information that can be obtained in one access to auxiliary storage, and in a practical situation this parameter is of great importance. If the number of keys at each node is selected so that every node exactly fills the amount of auxiliary storage that can be read in one access, decoding will be more rapid than with the use of Landauer's value for the optimal branching factor.
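A sketch of this reconstruction: the node pointers, listed in postorder into a sequential array, are reassigned so that the middle element becomes the root and each half is treated in the same way recursively. The node structure matches the earlier search sketch, and the names are ours.

    struct tnode {
        long key;
        long address;
        struct tnode *left, *right;
    };

    /* Rebuild a balanced immediate decoding tree from nodes[lo..hi], an array of
       node pointers listed in postorder (that is, in ascending key order); the
       middle element becomes the root of the subtree.  Returns that root. */
    struct tnode *rebuild(struct tnode *nodes[], int lo, int hi)
    {
        if (lo > hi)
            return 0;                       /* an empty range produces no node        */
        int mid = (lo + hi) / 2;
        struct tnode *root = nodes[mid];
        root->left  = rebuild(nodes, lo, mid - 1);
        root->right = rebuild(nodes, mid + 1, hi);
        return root;
    }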
5.4 Representations of Trees

A number of different computer representations of trees have been developed. This discussion by no means exhausts all the representations that have been suggested; even if it did, the number of possible new representations is unlimited. Rather, this discussion is intended only to present the most important representations, to serve as a starting point for a file design. The most popular representation of a symbol tree, suggested by Sussenguth, is double chaining. Each record contains three items: a symbol, a pointer to the node's filial set, and a pointer to the next node within the same filial set as the node. Figure 16 shows a doubly chained representation of the symbol tree of Fig. 13. When such a tree is searched, every time a match is encountered the filial set of that node is searched. If a match is not found, the search continues through the siblings of the node. The search terminates successfully when a * symbol is encountered; that node contains the file address. The search terminates unsuccessfully when a sibling set is exhausted without finding a match. Thus, the two links provide sufficient information for the complete search, and no other links are needed for a symbol tree. An immediate decoding tree never requires the search of a set of siblings; therefore, the double chaining of Fig. 16 would not be the best arrangement for such a tree. Rather, a scheme such as that of Fig. 17 would be used. In this case, each node includes pointers to the two members of its filial set. During the search, either the left or right branch is taken from each node, or, if the search key and file key are equal, the file address, which is stored in the fourth field, is taken as the result of the search.
FIG.16. Doubly chained symbol tree.
FIG.17. Doubly chained immediate decoding tree.
(a)
Degree   0  2  0  2  0  2  0  2  0  2  0  2  0  2  0
Key      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

(b)
Degree   0  2  0  2  0  2  0  1  1  2  0  2  0  2  0
Key      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

FIG.18. Postorder sequential representation of immediate decoding trees. (a) Immediate decoding tree No. 1; (b) immediate decoding tree No. 2.
Some tree searching applications require frequent references upward as well as downward in the tree. Although this is not usually the case in decoding, it can arise if the tree structure is being used for the storage of hierarchically organized files. In this case, Knuth has suggested the addition of another pointer to each node: a pointer to the predecessor node. Directory trees are usually multiply chained, with one pointer stored for each key at each node. The above representations of trees have all assumed the use of linked allocation. However, if sequential allocation is used for trees that are stored on auxiliary storage, the saving of the space used for the links can permit the searching of more nodes with one auxiliary storage access than is possible with linked allocation. From the discussion of postorder traversal above, it is clear that a tree structure represents an ordering relation among the nodes. A postorder sequential representation of a tree can be constructed by listing the nodes in postorder and giving the degree of each node [107], where the degree is the number of branches leaving that node. Figure 18 shows the postorder sequential representation of the immediate decoding trees of Figs. 14 and 15.

6. Implementation of File Structures
6.1 Introduction
When a file is implemented, the implementation must be performed within the constraints imposed by an available computer system. These constraints are in two areas: physical constraints arising from the hardware characteristics of the computer and its peripherals, and software constraints imposed by the data management facilities provided by the operating system. This section discusses the latter constraints as imposed by the data management facilities of Operating System/360 (OS/360), and suggests how these facilities can be used to implement a file.
The term "data set" is used in OS/360 documentation to refer to a logically distinct file, and the term is used with that meaning in this discussion. A data set consists of some number of "volumes," which are standard physical units of auxiliary storage, such as reels of magnetic tape. OS/360 provides for four data set organizations: sequential, partitioned, indexed sequential, and direct-access. This does not at all mean that a user cannot implement some other file organization, such as a tree structure; it does mean, however, that (if OS/360 data management is to be used) any structure must be implemented as a collection of data sets of the four permitted organizations. A tree structure, for example, could be implemented using a direct-access data set or, perhaps, an indexed sequential data set. Thus, the data set organizations that are supported by OS/360 provide an environment in which file organizations can be implemented, rather than limiting in any way which organizations can be implemented. Naturally, any similarity between a desired file organization and one of the data set organizations should be exploited in order to simplify programming. OS/360 has been designed so that data set organizations are, as much as possible, independent of the physical device upon which the data set resides. There are, of course, basic restrictions that arise from the physical characteristics of storage media; for example, a data set residing on a sequential-access peripheral device, such as magnetic tape, must have sequential organization. All four types of data set organization can be used on a direct-access storage device, such as drum, disk, or data cell, while storage media that are essentially sequential in nature, such as magnetic tape, can contain only sequentially organized data sets. Therefore, if a program is to be run with a data set that can be resident on sequential or direct-access devices, the data set organization used should be sequential, or the data set must be reformatted before some executions of the program. Another type of "independence" that is available using OS/360 is the ability to make a direct-access data set independent of its location within a volume. This is accomplished by making all pointers within the data set relative to the start of the data set. In this way the data set becomes "relocatable," similarly to the way that programs that use only addresses relative to the instruction counter are relocatable.
Sequential data set organization is analogous to a deck of punched cards, where each card is one record. All records must be processed sequentially; that is, to process the (n + 1)th record, the first n records must first be processed.
The average access time to any record in a sequential data set is a linear function of the number of records in the data set, since locating any record requires a linear search of the data set until the desired record is located. For this reason, sequential data sets are not efficiently used for applications that require random-access capabilities. 6.3 Partitioned Data Set Organization
A partitioned data set is a collection of sequential files (called "members") along with a directory that gives the name of each member and its location within the data set. The members are organized sequentially. The directory is a series of records at the beginning of the data set. The directory contains an entry for each member of the data set. Each entry contains a member name and the starting location of the member within the data set; the entries are arranged in alphabetic collating sequence by member name. By use of the directory, any member of the data set can be retrieved directly. Members are added to a partitioned data set by writing them after the last member of the set; members are deleted by deleting their directory entry. Deleted members leave gaps of unused storage in the data set; when an addition to a partitioned data set cannot be accomplished because of insufficient available space at the end of the data set, it is "reorganized." Reorganization consists of recopying all members of the data set, discarding gaps between members. Partitioned data sets are particularly useful for program libraries; any program within the library can be loaded without access to any other. A partitioned data set could also be used for files that are loaded into core in their entirety and searched in core, such as dictionaries and inverted lists (see Section 4). Since a member of a partitioned data set cannot be read other than all at once, partitioned data sets are not useful for storage of files that are searched one record at a time.
Indexed sequential organization is analogous to a file of punched cards with a table that gives, for each file drawer, the contents of the last card in the drawer. The data set consists of a hierarchical set of these tables, called indexes, and a data area that contains the records that comprise the data file. The index structure corresponds loosely to a directory tree, with a very large number of keys at each node. The records in the data area are stored in collating sequence of a key field of each record. Each block of storage in the data area is preceded by a key field that gives the key value for the last record in the block.
There are three levels of index: track, cylinder, and master. A track index gives the key of the last record on each track; there is one track index per cylinder in the data area. A cylinder index gives the key of the last record on each cylinder; there is one cylinder index for each data set. When the size of the cylinder index exceeds a user-set threshold, a master index is created, which gives the key of the last record in each track of the cylinder index. Up to three levels of master index can be provided; the number generated is under user control. Each index entry consists of three fields: a count, a key, and a data area. The key field gives the key of the last record that is indexed by the entry, and the data area contains the full address of the track or record, the level of index, and the entry type. Updating an indexed sequential data set is complicated by the necessity of maintaining the records in collating sequence by key. When a record is added, it is written into its correct position. Records after the new record are moved up one position. If the last record in the cylinder will not fit, then it is written into the first available location in the overflow area. Overflow areas are specified by the user and are reserved to accommodate overflow during updating. They can be allocated either as part of each cylinder in the data area, or as an independent overflow area, into which overflow from all cylinders is written. Use of an independent overflow area reduces the unused space that is used for overflow, but has the disadvantage that the time to search records in the overflow area is higher because of the disk head motion required to access the independent area. When a record is written into an overflow area, a pointer is appended to the record to connect it logically to the correct track, and appropriate adjustments are made to the track indexes. Once overflow to an independent overflow area has occurred for a substantial number of cylinders in a data set, the time needed to locate a record will be greatly increased because more than one head motion will be needed to access each cylinder that has overflowed. For this reason, and also because overflow areas can become full, an indexed sequential data set must occasionally be reorganized. The user can initiate reorganization, and the system makes available three types of information on which that decision can be based: the number of cylinder overflow areas in use, the number of unused tracks in the independent overflow area, and the cumulative number of references to cylinder overflow areas other than the first. Indexed sequential organization has been developed to satisfy a frequent requirement in the design of random-access files: a file that must be accessed randomly according to some fixed key field of each record, where the set of possible keys has a much greater range than the number of records in the file. In order to implement such a file, a method must be
developed to transform all possible keys into record addresses, where the record addresses are a set of compact integers. Various techniques for performing this transformation have been described in this document (see Sections 2, 3, 4, and 5); however, any of these techniques require special programming, while the indexed sequential organization is readily available. Although the directory approach used in indexed sequential organization requires at least two disk accesses to locate any record (first to the directory, then to the record), it greatly simplifies programming. For applications where performance requirements are not critical, the use of indexed sequential organization provides an economical implementation method. The most serious limitation of indexed sequential organization concerns the use of multiple key fields; if record selection based on a part of the record other than a single key field is to be performed, then the entire file must be scanned sequentially. Another limitation concerns updating. If a file is subject to a very large transaction volume, the degradation in performance caused by the necessity to process overflow conditions can be severe.

6.5 Direct Data Set Organization
A data set of direct organization has space for one record allocated for each possible key value. A fixed relationship is established between the key of a record and its address; then, using this relationship, any record can be accessed without the use of any intermediate tables or indexes. System/360 disk storage is divided into tracks, where a track is a bucket, the largest unit of information that can be retrieved in one access. Although track addresses are not contiguous (for example, track 0100 may follow track 0045), the operating system permits reference by relative track address. Thus, the address of a record in a direct data set is a relative track address and a record number within the track. The key of a record can be used as its address by a simple transformation. The key must be numeric, and the records must be fixed in length. The key is divided by the number of records per track; the quotient is used as the relative track address, and the remainder plus one is the record number.
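For example, with 10 records per track (an assumed figure), key 57 maps to relative track 5, record 8. A sketch of the transformation, with a function name of our choosing:

    /* Transform a numeric key into a direct data set address: the quotient is the
       relative track address, and the remainder plus one is the record number
       within that track. */
    void key_to_address(unsigned long key, unsigned long records_per_track,
                        unsigned long *rel_track, unsigned long *rec_number)
    {
        *rel_track  = key / records_per_track;
        *rec_number = key % records_per_track + 1;
    }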
6.6 A Procedure for File Design

Before a file can be implemented, both a file structure and a data set organization (if the implementation is being performed on System/360) must be selected. It is necessary to use a systematic approach to file design, and not to try to specify everything at once. An attempt to settle all aspects of the file design problem simultaneously leads to confusion and poor design. To discuss a procedure for file design, it is necessary to establish terminology for discussing various aspects of the problem. These definitions are offered for that purpose.
Information structures. Representations of the elements of a problem or of an applicable solution procedure for the problem.

Data structures. Representations of the ordering and accessibility relationships among data items, without regard to storage or implementation considerations. Data structures are highly dependent upon the class of problem to be solved and the environment in which the solution is to be used.

Storage structures. Representations of the logical accessibility between data items as stored in a computer memory.

Performance requirements. Requirements specifying the performance that is to be achieved with the file; these are characteristics that are expressed in units of time, such as maximum access time and update time.

The inputs to the file design process are performance requirements, which are determined by the user of the file, and information structures, which are the data to be placed in the file in their original form. Data structures are formulated as an intermediate step toward storage structures. A systematic procedure for designing a file, using the above terminology, is:

(1) Identify the information structures that are to be stored and manipulated.
(2) Determine the performance requirements to be met by the file design.
(3) Identify the ordering and accessibility relationships that must exist among the data items, and use these to determine the data structures that must be incorporated into the file design.
(4) Select representations of the data structures to be used within the machine that will meet the performance requirements; this determines the storage structures.

Steps 1 and 2 are concerned with the problem that is to be solved with the file. A complete understanding of that problem and of the level of performance that will constitute an acceptable solution is a necessary prerequisite to any more detailed work with the design. Once the problem and the solution requirements are definitely established, the relationships among the data items can be established. These relationships will be such characteristics as which fields will be used as keys, which data items must be directly accessible from other data items, and so on. The data structures embody the linking relationships among the records in the file. These relationships must be represented, either explicitly by pointers or implicitly by position. They are the key to the final file design; a file structure must be selected to embody them. Sufficient information has now been developed so that the data structures and performance requirements can be used to select the storage
structures, which include both a file structure and a data set organization well suited to that file structure.

ACKNOWLEDGMENTS

The author wishes to thank Dr. John Hutchings, Mr. Sy Kaback, and Mr. Malcolm Tanigawa for their support, encouragement, and suggestions, and Dr. Thomas C. Lowe for his assistance in this and many other endeavors.
Systems Programming Languages

R. D. BERGERON,¹ J. D. GANNON,² D. P. SHECTER, F. W. TOMPA,² and A. VAN DAM
Department of Computer and Information Sciences, Brown University, Providence, Rhode Island
1. Introduction . 176
2. Criteria for a Systems Programming Language . 180
2.1 Target Code Efficiency . 180
2.2 Run-Time Environment . 184
2.3 Error Checking . 185
2.4 Debugging Facilities . 186
2.5 Syntax Considerations . 187
2.6 Adaptability . 188
2.7 Program Modularity . 189
2.8 Machine Dependence . 190
2.9 Multiple-User Systems . 190
2.10 Miscellaneous . 191
3. Specific Constructs . 192
3.1 Data Facilities . 191
3.2 Storage Allocation . 192
3.3 Data Manipulation . 194
3.4 Program Control and Segmentation . 194
3.5 I/O and Debugging . 195
4. Reviews of Several Systems Programming Languages . 196
4.1 PL/I . 196
4.2 AED . 211
4.3 BLISS . 214
4.4 PL360 . 232
5. Extensibility and Systems Programming . 236
5.1 Universal Language or Universal Processor . 236
5.2 Facilities of Extensible Languages . 238
6. Language for Systems Development . 239
6.1 Overview of the Base Language . 240
6.2 Variables and the Declaration Statement . 242
6.3 Program Segmentation and Storage Allocation . 247
6.4 Procedures and Their Invocation . 255
6.5 Statements and Operators . 262
6.6 I/O and Debugging Facilities . 269
¹ Present address: Faculteit Wiskunde, Katholieke Universiteit, Nijmegen, The Netherlands.
² University of Toronto, Toronto, Ontario, Canada.
6.7 Compile-Time Facilities . 273
6.8 LSD Extensibility . 276
6.9 Implementation . 278
Annotated Bibliography . 279
References and Additional Bibliography . 283
1. Introduction
The purpose of this paper is to state the rationale for systems programming languages, to enumerate some criteria for judging them, to subject some existing systems programming languages to close scrutiny, and finally to describe an extensible language currently being implemented which is addressed specifically to our criteria. As with so many topics in computer science, the matter of choosing which language to program in or what language features to emphasize is largely a matter of personal taste (bias) and experience. The criteria which are applied here to systems programming languages are a product of five years' collective experience of the software technology group at Brown University. This experience includes the design and implementation of production systems, particularly in the areas of interactive computer graphics, time sharing, operating systems, information retrieval, and software paging. Furthermore, our viewpoint is colored by having extensive personal experience almost exclusively with IBM hardware and software, specifically with assembler language, PL/I, and several proprietary systems languages for the /360 (both of the assembler language and machine language producing varieties). Thus we really do not offer absolute and incontestable statements in this paper; while some points will be obvious, many others are controversial and open to vigorous debate. Some of our judgments will not be applicable in some other environment where resources and constraints are widely different. (Consider an industrial environment in which management, because of the frequent changes in hardware systems, has decreed that FORTRAN is the universal programming solvent, or that if programs are inefficient in use of space and cannot run in the available core then it is cheaper to buy more core than to reprogram.) Other arguments may also be inaccurate in the near future, when hardware characteristics change or when newer (and more sophisticated) implementations of languages bear a closer resemblance to their postulated ideals, while retaining the ability to generate "good" code.

Before proceeding to justify systems programming languages and to enumerate their characteristics, some attempt will be made to define them. Since a systems programming language is a language in which one programs systems, a definition for the term "system" is useful. Webster's defines a
system, appropriately enough for our purposes, as an "assemblage of objects united by some form of regular interaction or interdependence; an organized whole...." This definition is adapted in common programming parlance to mean that a system program is an integrated set of subprograms, together forming a whole greater than the sum of its parts, and exceeding some threshold of size and/or complexity. Typical examples are systems for multiprogramming, translating, simulating, managing information, and time sharing. Note that the term system program is more inclusive than the term programming system (which is more nearly synonymous with operating system).

Having defined the term system program, it is now possible to list some identifying characteristics. The following is a partial set of properties, some of which are found in non-systems, and not all of which need be present in a given system.

(1) The problem to be solved is of a broad nature consisting of many, and usually quite varied, sub-problems.
(2) The system program is likely to be used to support other software and applications programs, but may also be a complete applications package itself.
(3) It is designed for continued "production" use rather than a one-shot solution to a single applications problem.
(4) It is likely to be continuously evolving in the number and types of features it supports.
(5) A system program requires a certain discipline or structure, both within and between modules (i.e., "communication"), and is usually designed and implemented by more than one person.

Being rather facetious, one might say that until recently, an operational distinction between system programs and other (application) programs was that the latter were written in high level (i.e., inefficient) FORTRAN by effete applications programmers. On the other hand, system programs, i.e., "real" programs, were handcrafted in assembler language by members of the systems programmer priesthood, to be lovingly tailored to the idiosyncrasies of the actual machine used. (In all fairness to FORTRAN, particularly early implementations, it should be mentioned that great attention was paid to optimizing for the target machine, i.e., taking into consideration drum latency optimization on the IBM 650, generating code to take account of frequency of occurrence of variables, taking computation out of loops, etc. Furthermore, FORTRAN was not touted for anything but applications programming.) Thus programmers were traditionally faced with these two extremes: potentially efficient, but hard to
write and debug assembler language, versus inefficient, limited facility, but easy to write and debug high level language. For systems programmers this choice was typically resolved in favor of assembler language. As the size and complexity of systems increased over the past decade,¹ systems designing, writing, and debugging time increased alarmingly, while the comprehensibility and maintainability of the assembler code correspondingly decreased. At the same time, cost per memory bit and per logic unit was going down, making substantially faster and bigger machines economically viable, and making the efficiency argument for assembler language less attractive. This was particularly true because, as programmers' salaries rose, software began to cost more than hardware. Furthermore, it was found in many efficiency experiments that programmer productivity, as measured in lines of code written and debugged per day, was about the same for the more expressive, yet compact high level languages as for assembler language [10]. Thus the greater productivity of high level programming, coupled with lessening requirements for efficiency, has led to increasing use of high level languages that were originally intended primarily for applications programming (FORTRAN and PL/I).

A better compromise, of course, would be to design a special purpose language with the systems programmer in mind. This type of effort began with JOVIAL and NELIAC, which were based on ALGOL 58, and has included languages such as EPL,² MOL360, ESPOL, and others. These systems programming languages were usually created for use by professional programmers to write large, complex data structure manipulation programs. The goal of a systems programming language is to provide a language which can be used without undue concern for "bit twiddling" considerations, yet will generate code that is not appreciably worse than that generated by hand. Such a language should combine the conciseness and readability of high level languages with the space and time efficiency and the ability to "get at" machine and operating system facilities obtainable in assembler language. Designing, writing, and debugging time should be minimized without imposing unnecessary overhead on systems resources.

The question arises why a modern implementation of a general purpose language like PL/I would not suffice. Simply put, if the language had a special purpose systems programming implementation, then it would. Commercially available compilers, however, tend to be still too general purpose, all inclusive, and inefficient in terms of the code they generate or the run-time environment they invoke.

Space efficiency is still an issue; while the amount of core on medium and large scale computers has increased markedly during the last five years,³ so have the users' expectations and requirements. In fact, a variation on Parkinson's law has taken place. Although the additional core was sometimes made available to larger programs, more commonly it was used to support multiprogramming, "making better use of the system's resources" to improve throughput. Consequently, many users found themselves, for example, running in a 128K byte region of the larger machine instead of in a 32K 36-bit dedicated machine (less 6 to 8K for the operating system), experiencing almost no space advantage. As more core was attached, the machine supported more simultaneous users, and only a few privileged users were able to get more space for themselves. Thus the ordinary systems programmer still found himself concerned with overlaying his program and software paging (swapping) his data. Only the few hardware paged machines such as the Atlas, Titan II, Sigma 7, GE 645, RCA 70/46, PDP-10, or IBM 360/67 have facilities for letting programs run in conveniently large ("virtual") address space, and even then, not without restrictions. To summarize, space is still limited (and is likely to remain so for a few years despite the application of bulk core). Therefore, space efficiency is still a concern for the systems programmer today. Also, as satellite computers with their smaller core become more prevalent, their systems programmers should have the benefit of being able to program in a high level but efficient language.

Furthermore, time considerations, as always, may be of even more concern. Critical loops in a multi-user interactive program must be coded as efficiently as possible to allow real-time responses. Another consideration affecting time efficiency is the compatibility of a program and its execution environment. For example, code which allows address pointers to cross hardware or software page boundaries indiscriminately can severely overtax the hardware with excessive I/O activity to retrieve pages. Such "thrashing" could occur by executing a "pointer chasing" program (such as an operating system) in a virtual (paged) memory, although it was designed for contiguous core. A language which allows the responsible systems programmer to address his hardware/operating system environment (and its constraints) judiciously is therefore still required within today's state of the art.

¹ In ref. [10] (somewhat dated now), for example, the author states that the Multics System Programmer's Manual ran to 4000 single-spaced typewritten pages, while the system itself was represented by some 3000 pages of (high level, PL/I-like) source language code.
² An early version of PL/I designed for implementing MULTICS [10].
³ For example, many IBM 7090/7040s with 32K 36-bit words were replaced by IBM System/360s with 256K or 512K 8-bit bytes (64K or 128K 32-bit words).
2. Criteria for a Systems Programming Language
A systems programming language should be designed to be general enough to be used for writing as many different types of systems as possible. The term "system" encompasses such a wide range of constructions that the requirements imposed on a language by two different systems may be quite diverse; in fact, some may be contradictory. For example, a system which is to be used by an industrial organization for computer-aided design should be machine-independent so that it need not be rewritten when the company alters its computer configuration (possibly changing to a completely different machine). On the other hand, an operating system cannot afford to have any unnecessary inefficiencies, and therefore must be hand-tooled to reflect all the peculiarities of the particular machine. Design decisions in a systems programming language will affect the applicability of that language to a particular project, making one language better suited than another. In fact, most existing systems programming languages were designed for a particular class of systems, and then enlarged to provide for other types of systems. Therefore they cannot be classified as general systems programming languages.

The criteria which should be involved in designing a systems programming language or in evaluating an already existing language are enumerated below. The considerations begin with those which influence the running of systems written in the language, and end with those which influence the writing and designing of such systems. Naturally, the most important considerations from our point of view are listed first. Later, five languages will be examined in light of these criteria.

2.1 Target Code Efficiency
A systems programming language must produce efficient code for the machine on which the system is to run. The problem of inefficiency exists at all installations in one or more forms, due to limitations on processor speed, core size, or I/O activity. Each systems programmer should know the restrictions of his particular machine configuration and job mix, and tune his program to minimize the demand on the resource which is most scarce. For example, where a machine is predominantly I/O-bound, all input and output must be made as efficient as possible, at the expense of other resources such as memory or processor cycles if necessary. The problem of inefficiency is compounded considerably in a multitasking environment. Consider a program executing in a paging environment. If the program uses its data areas in an inefficient manner, the extra paging activity will certainly slow down the program's execution. In addition,
other programs will be slowed down because some pages which they require will have been swapped out of core to make room for those of the inefficient user. Therefore, each program in the system will require more than the usual amount of paging, and the degradation of execution times will snowball, reducing batch throughput and response time for interactive programs.

The target code of a systems language should make full use of all the capabilities of the machine. For example, a systems programming language should not be limited to conventional addressing schemes on a machine which has a hardware mechanism for indirect addressing, only because other machines are not so equipped. A serious attempt should be made to have every statement produce optimal code for the particular machine in the environment in which it is used. While some constructs in the language must produce a large amount of code, generate a large work area, or do a significant amount of input or output, the user of the language must take care that his system does not consist of unnecessary occurrences of this type of construct. Thus a user should be made aware of the fact that compilation of a given statement results in an expensive section of code. For example, the generation of implicit (i.e., not explicitly specified by the user) calls to run-time conversion routines seems like a great freedom for the programmer, yet the use of these routines might be impractical in the system, due to the time and storage space necessary to perform the conversion. For cases such as these, the user should be made aware of the cost of his code. The Users' Guide for the language should list the estimated storage requirements and execution overhead of all constructs. If a run-time routine will be invoked, he should be informed that something is being done "behind his back." When an expensive construct is actually detected, the compiler should signal the user by printing out warning messages and a cross-reference table of such routines, including their sizes, in the listing. Through such feedback a systems programmer can have full control over the consequences of his coding techniques.

2.1.1 Programmer Control of Compiler Facilities
For the majority of his program, the user of any high level language should not have to worry about particulars of the implementation of his task, but rather he should be able to describe that function and let the systems language work out the details. On the other hand, there are usually several critical points in the program where a running system’s overall performance can be greatly improved through better code which only the programmer himself can describe. For this reason it is important
that the systems programming language have facilities for the user to "help it do the right thing." The following example will illustrate the point: Suppose a particular set of variables is accessed by each module in some large system. Since the routines are compiled separately, no global optimizer could discover this condition. (Good optimization of a general purpose high level language is as yet an unsolved problem.) Either passing a pointer to each routine or denoting that pointer to be external may be too inefficient, since it would need to be accessed very often. The user should be given the option of informing the compiler that he will require that particular pointer in most of his code (perhaps by specifying that it should be kept in a register, thus eliminating redundant instructions).
In this case the user knows a peculiarity of his program that the compiler could not have known. The execution time for the system could be reduced (especially in the case of critical loops). Furthermore, when this situation occurs frequently in one program, there may be a significant saving of space due to the deletion of superfluous LOAD and STORE instructions. In addition to register usage, local efficiency with respect to space could be gained if the user were permitted to indicate that a common parameter area be used for all subroutine calls instead of each call establishing its own area. Efficiency requirements may also be specified globally, in less low level detail. In PL/I (hereafter, unless otherwise noted, PL/I refers to the IBM implementation, the F compiler), for example, the user may make use of the OPT option, through which he may indicate whether he wants compile-time, run-time, or core usage optimized.
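As a rough, present-day analogue of this kind of hint (a sketch only; C's register storage class is not a feature of any of the languages reviewed here, and the names below are invented), the programmer can tell the compiler that a particular pointer is used on nearly every statement and is worth keeping in a register:

```c
/* Hypothetical sketch: the programmer asserts that the pointer to a shared
 * common area is referenced constantly in this routine, so it is worth
 * keeping in a register rather than reloading it from memory at each use.
 * The 'register' keyword is only a hint; a compiler is free to ignore it. */
struct common_area {
    int device_status;
    int queue_length;
    /* ... other globally shared fields ... */
};

long busy_loop(struct common_area *shared, long iterations)
{
    register struct common_area *globals = shared;  /* hint: keep in a register */
    long work = 0;

    for (long i = 0; i < iterations; i++) {
        /* every pass touches the shared area through the pinned pointer */
        work += globals->device_status + globals->queue_length;
    }
    return work;
}
```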
2.1.2 Optimizing in Assembler Language

It is a highly controversial point whether or not a systems programmer should implement his system entirely at a high level. There are many who feel that since an algorithm can be coded completely from within a high level language, the programmer has no need to depart from the purely symbolic source code to consider the target code produced. That is, little value may be gained compared to the amount of time and energy needed to code in assembler language. A truly ideal environment is one in which all systems software, including the operating system and perhaps even the machine's "instruction set," is on a high level (i.e., PL/I-like). Thus, all designing, writing, and debugging can be done at this level. (Such an environment might evolve when there exist firmware translators for languages like PL/I.) Certainly in a completely bug-free environment, where a programmer can put all his faith into the operating system, the compiler, the interactive supervisor, and the debugging package, all programs can be written and debugged on a
purely symbolic level with none of the users ever needing to see the target code generated. Unfortunately in today's world, too many systems tools are not ideal: debugging packages lose control over the program being tested, compilers generate inefficient code, operating systems affect programs where they should not, and even the hardware occasionally "goes down." For example, at Brown University two perplexing I/O problems (dealing with the updating of more than one record per track and "negative track balances") were solved only after painstaking examination of the assembler language listings of IBM-written access method subroutines. As in this case, it is sometimes necessary to examine a system at its lowest level in order to completely determine its action.

A systems language compiler should be able to generate an (optional) assembler language listing in order to help satisfy the criteria described for target code. First of all, an assembler language listing is an ideal method for obtaining feedback on the efficiency or inefficiency of a particular statement. The amount of space required for implementing a given statement can immediately be determined from the assembler language produced. Furthermore, implicit subroutine calls could also be easily detected. For example, the expense of initializing a large array in automatic storage in PL/I can immediately be seen by the pages of assembler language in the listing. In addition, an assembler language listing is a good debugging tool when an obscure bug is detected six months after the compiler has been finished. It may be claimed that producing an assembler listing is too expensive in that it introduces an extra compile-time pass, i.e., an assembly step. However, this step is unnecessary if the compiler generates an assembler listing along with the machine code. One extra lookup in a table of opcodes and one extra I/O operation per instruction is well worth the expense when it can eventually save programmer time or the execution time or space needed by the system itself. Others say that a disassembler which translates from object code back to assembler language is much more practical. Besides the expense, other drawbacks of using this post mortem translation are that mnemonics representing lengths cannot be recovered, and the substitution of component names for the displacements into structures is impossible since these are indistinguishable from other constants in the machine code.

In a few cases, it is impossible for the user to describe a peculiarity of his system. Furthermore, every compiler will have features which restrict some systems programmers. These unsatisfactory conditions can be relieved by allowing the insertion of assembler language as in-line "open subroutines." The section of code thus produced will contradict some of the rules of a systems programming language, especially syntactic clarity,
but at times the advantages of low level coding are great enough to compensate for this loss. When an available facility is expensive in the systems language, judicious recoding of critical portions in assembler language may also be valuable. By taking the machine environment into consideration, a programmer may modify the compiler's target code to raise the overall efficiency of his system appreciably. The following example will show that a tight algorithm on the source code level does not necessarily ensure an efficient program: A linguistics group at Brown University was coding in an early version of PL/I which did not include the TRANSLATE and VERIFY functions. In a key routine, they needed to scan a character string for the first occurrence of one of several eight-bit configurations, returning the index of that character in the string. To implement this, they coded a very tight PL/I loop (which compiled into a large amount of machine code). Finally, a systems programmer realized that they had just simulated the single /360 TRT instruction. By recoding that small, key routine in assembler language, he was able to reduce the execution time of the whole program by two orders of magnitude. It was only through low level knowledge of the machine that the program could be made more efficient.
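For readers unfamiliar with TRT, the following sketch (hypothetical code, not the group's actual routine) shows the table-driven idea in C: a 256-entry table marks the characters being searched for, so the scan costs one table lookup per character, which is essentially what the single TRT instruction performs in hardware.

```c
/* Sketch of a TRT-style scan: nonzero table entries mark the "stop"
 * characters; the string is scanned one byte at a time against the table.
 * Returns the index of the first stop character, or -1 if none occurs. */
#include <string.h>

static int scan_for_stop(const unsigned char *str, size_t len,
                         const unsigned char *stops, size_t nstops)
{
    unsigned char table[256];

    memset(table, 0, sizeof table);
    for (size_t i = 0; i < nstops; i++)
        table[stops[i]] = 1;            /* mark each character of interest */

    for (size_t i = 0; i < len; i++)
        if (table[str[i]])
            return (int)i;              /* index of first occurrence */
    return -1;
}
```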
The trade-off between the efficiency of assembler language and its lack of clarity is easily resolved since such inserts should occur only infrequently when the relevant section of the program is completely thought out. Furthermore, documentation should be provided in great detail to explain the meaning of the code (in fact, the original code in the systems language provides excellent documentation in such a case). 2.2 Run-Time Environment
Since many systems have common run-time needs, such as I/O and dynamic storage allocation, it is helpful for a systems language to provide a set of run-time routines. The storage saved by not duplicating the in-line code every time a construct is used must adequately compensate for the extra instructions needed to link to this run-time environment. (The threshold point is a function of the number of parameters and the length of the code in the subroutine.) As stated above, the disadvantage of a run-time environment provided by the system is that the user is unaware of the cost of the routines he invokes because he does not know the implementation details. He should be informed of the approximate cost, in execution time and core, of each routine so that he can judge whether it is worth using. The routines may be made available to the user through the use of the subroutine library from which he can select those which are applicable to his program. Needless to say, the systems language should not burden the
user with all the system-generated run-time routines, but rather only the few that are actually needed. In this way, provisions in the language for a particular complex routine do not weigh down a user who does not need that facility. By making the run-time routines modular, the system can maximize storage use by allocating and freeing seldom used routines at execution time.
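As a rough worked example of the in-line versus closed-routine threshold mentioned at the start of this section (the figures are invented purely for illustration): suppose a construct open-codes into roughly 40 instructions, while the linkage to a shared run-time routine costs about 8 instructions per use. In-line expansion then occupies 40n instructions of storage for n uses, whereas the closed routine occupies 40 + 8n; the shared routine is the smaller choice as soon as 40 + 8n < 40n, that is, from the second use onward. If the body were only 10 instructions long, the break-even point would not be reached until the sixth use, which is why the threshold depends on both the length of the generated code and the cost of passing the parameters.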
In general, a system which is implemented using a high level language will have three basic types of error checking. The first of these occurs at compile-time when, for example, the compiler checks for legal syntax or for matched data types in an assignment statement. The second (implicit run-time error checking) occurs if the compiler generates code to check for legal contents of some variable, for example, one being used as an index into an array. Finally, there is explicit, problem-oriented error checking, that is, checks actually specified by the user. These might be needed in a time-sharing system to test the legality of its own users' input. Implicit run-time error checking in a systems programming language should not be automatically generated. When a complicated system is running, it cannot afford to be slowed down by unnecessary error checking in each instruction; the system may have to run very quickly, especially in an interactive environment where response time is crucial. By the time the system is in production use, "the program bugs have surely been removed." Thus most of the implicit checking would be just time-consuming without any positive effect. In those cases where system integrity would be in jeopardy without run-time checking, the user may specify explicit error checking. The next question that might arise is whether implicit error checking should be done at compile time. First of all, normal symbol table lookup and type checking must be done at compile time to determine which mode of instruction is to be generated. For example, in a /360, the compiler must decide whether to generate a LOAD FULLWORD or a LOAD HALFWORD for a given instruction in the high level language. Second, the time element involved is of an extremely different nature. At compile time, the running speed is not crucial since the compilation will not be done as often as the system is run. Thus as long as the program is not significantly slower in compilation (that is, even as much as 50% or more) error checking is manageable. Although the compiler should not terminate the user's job, it can check for abnormal conditions (such as branching into an iterative loop) and print an appropriate warning message. The word "warning" is important here, because in a systems programming language (intended for knowledgeable users) the compiler should not
attempt to correct the user's mistakes. At one time or another, almost every systems programmer has to get around some restriction placed on him by the design of the language. For example, a user may wish to pass a character string of length four to a subroutine which expects a full word. Since its users are aware of what they are doing, the compiler should allow such a "mistake" and merely issue a warning message, as long as it can generate code.5 Note how this differs from PL/I's handling of warnings. The PL/I compiler will attempt to correct syntax errors; will generate conversions where there are actually errors (without generating warnings); will not allow the user to get around restrictions such as printing pointers and offsets; and (in the F-compiler) will assign assumed attributes to undeclared variables without generating warnings.
5 Again the contradiction to semantic clarity is inherent here, but occasionally the trade-off must be settled in favor of "tricky" (but well-documented) code.
If a compiler does not automatically provide run-time error checking, a user of the language must be able to specify explicit error checking from within the language. Provision must be made, for instance, for the user to determine that on a given array he does want a check to be made for an illegal subscript. This facility can be provided in the form of actual instructions that the user must code after the access to the array, or it can be done through a statement which means "whenever the array ARRAY is coded, generate a check for an invalid subscript." Thus the user has the option to have error checking at those few places at which he actually needs it, instead of every place the compiler feels it is needed.
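As a rough sketch of how such a selective request might look, borrowing PL/I's condition-prefix notation (described later in the PL/I review), the user could ask for a check on one statement only; TABLE, I, and BAD_INDEX are hypothetical names:

      ON SUBSCRIPTRANGE GO TO BAD_INDEX;            /* user-supplied handler        */
      (SUBSCRIPTRANGE): TABLE (I) = TABLE (I) + 1;  /* only this access is checked  */

Every other reference to TABLE would be compiled without the extra test.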
2.4 Debugging Facilities
Simple I/O should be provided for the user so that he may debug with a minimum amount of excess work. Facilities for printing out the core contents of small areas of memory or for dumping values of given variables should be included for this purpose. Such facilities can be provided on demand (that is, the user indicates the point at which he wants the contents of the variable printed), or universally (that is, the user states that whenever a certain identifier is used, the value is to be dumped). Since a systems language should be designed for today's interactive environment, aids should be provided for debugging on-line as well as off-line. One such technique is to allow the user to "snap" selected variables and to have them printed on his console. In order to remain completely symbolic, a systems language could also provide a run-time symbol table (only on request) in order to print out the contents of any variable by name (i.e., the user issues the command PRINT VAR1). In many cases all of the user's core may be dumped symbolically, so that he never needs to worry about such low level considerations as the hexadecimal representation of a character string.
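PL/I's data-directed output already gives a limited form of such a symbolic dump; the following lines are only a hedged sketch of its use as a debugging aid (the variable names and the RESTART label are hypothetical):

      ON ERROR BEGIN;
         PUT DATA (LINE_NO, CARD_COUNT);   /* prints, e.g., LINE_NO= 12 CARD_COUNT= 3; */
         GO TO RESTART;                    /* resume at a known point                  */
      END;

The values are printed by name, so the programmer is not forced to decode a hexadecimal dump for these variables.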
2.5 Syntax Considerations
After the provision of efficiency and cost information, an important consideration in designing a language is to make it as readable as possible. While this seems terribly obvious, it is surprising how many languages violate the principle with awkward or contrived syntax. The instructions themselves should read as easily as if they were part of the documentation; comments within the program should not be the sole method for explaining the algorithm or logical flow, but rather they should enhance the meaning of a set of instructions. One aid in accomplishing self-documentation is to allow long names for identifiers, in order that the labels be self-explanatory. Furthermore, a general character set should be employed in order to avoid trumped-up notation. For example, in specifying that a pointer should be set to point at a ring element, a terse special-character construct is certainly not as mnemonic as
      GRAPHICS_POINTER = ADDR(RING_ELEMENT)
even though they may both do the same thing. In order to understand a program, the statements must not only be readable, but the use of each should clearly convey the corresponding semantics.6 The language should be rich enough to provide all the facilities a systems programmer wants. A user must be able to obtain the desired facilities and efficiencies through straightforward use of the commands of the language. However, these must exist without violating widely accepted notations. For example, a common symbol such as the plus sign (+) should be used only for arithmetic addition and not for something else such as logical or. One of the biggest advantages in using a high level language rather than assembler language is that the programmer can state what he wants to do without having to resort to "misusing" constructs. In a high level language an instruction should be used for the purpose for which it was designed, rather than to accomplish some side effect of that command.
6 Designers of languages should be aware that, due to past experience with other programming languages, their users will have a background in which certain syntactic constructs have associated semantics. The constructs of a language should avoid contradicting these "natural" meanings.
This feature seems quite obvious at first glance. However, it is often missing from low level languages. For example, in OS Assembler Language, a branch and link instruction, which was intended to branch to a subroutine saving the return address in some register, might be used to load the address of a parameter list into register one before a call is made to a subroutine.
              BAL   1,LABEL        BRANCH AROUND PARM LIST; LOAD ADDR IN R1
              DC    A(PARM1)       FIRST PARM
              DC    A(PARM2)       SECOND PARM
      LABEL   BAL   14,SUBROUTN    CALL THE SUBROUTINE
Since such trickery tends to accumulate, making the logic of the program difficult or impossible to follow, it should be avoided in a systems programming language. It is important to remember that often the code must be examined by other individuals who interface with the system as well as by the author himself. The readability of a language also tends to make it easier to learn. After an initial, short learning period, a programmer should not have to refer to a manual for determining the uses of various constructs. The rules of syntax and of semantics should be natural to the user, and there should be few special cases (i.e., few exceptions) to be memorized. Furthermore, simple commands should be provided for common tasks. All keywords must be meaningful to the user in his application; the semantics should be unambiguous; and misstating a construct should not yield another construct, but rather should be recognized as a mistake. For example, the IBM CMS program editor has both LOCATE and FIND commands which mean about the same thing in common parlance, but provide different facilities, thus resulting in confusion; OS Assembler Language has both SLR and SRL which look nearly the same, but represent a subtract (Subtract Logical Register) and a shift (Shift Right Logical), thus allowing keypunch errors to become programmers' nightmares.
2.6 Adaptability
No compiler can be expected to provide for every facility that the programming world will require. (The idea of a universal language has finally been discarded by most people interested in systems programming languages.) A language designer cannot even predict all the constructs needed within a particular environment. Thus the compiler ought to support definitions of new types of operations and operands.
A method for achieving a measure of adaptability is through compile-time statements. A good general purpose macroprocessor can aid considerably in making the language adaptable to the user. Furthermore, the ability to include strings of text from user or system libraries (e.g., the %INCLUDE statement in PL/I) is almost essential in an environment in which several programmers work on a large project. Another answer to the problem is for the systems language to be extensible. In this way, a user can include in his version of the language a construct that is somewhat unusual, but very useful to his application. Furthermore, he need not be concerned that another user's extensions are included in his version, possibly making his system more bulky and less efficient. Also, a user should be allowed to change keywords in order to make their meaning more natural to him. This almost trivial type of extensibility makes lexical changes possible in order to provide the capabilities as described in the section on syntax above. Substitution for operators or delimiters should also be allowed to aid in the legibility of the syntax to the user.
2.7 Program Modularity
Whenever a complex system is built, sections will be compiled separately, even though they will run at the same time. Parts of a large system may be written by many people who will have to link their modules together. Furthermore, all systems undergo transformations in time as techniques are improved and sections are redesigned, thus making the system’s modularity vital. In addition, some parts of a system might be coded in a different language in order to use each language to its best purpose, and thus improve the performance of the final system. For these reasons, the systems programming language must provide facilities for linking the object code modules it produces to each other, and to those produced by FORTRAN, assembler language, or any other language that may be used. In many cases it is not sufficient to only provide for a CALL to a subroutine which will return when completed. The expense of modularity is such that a systems language should provide linking mechanisms which are as flexible as possible. Components of large systems may require subroutines to be loaded dynamically (provided in the /360 by LINK). Occasionally, components may link in such a way that the calling routine does not require a return (provided by the /360 XCTL). Thus, the storage space for the calling routine may be returned to free storage. In a multiprogramming environment, routines are called to run asynchronously to the “mainline” (as /360’s ATTACH). Furthermore almost all systems have to use some routines which are invoked through the supervisor or
monitor. For these and other cases, various types of run-time linking must be provided by a language, so that the systems may take full advantage of the many modes of call.
2.8 Machine Dependence
Many people have promoted the idea of complete machine independence of high level languages in order to allow portability of programs between machines. Furthermore, machine independence at the source level ensures that the programmer is not merely forced to manipulate code, but allowed to program an algorithm. On the other hand, systems programmers need to use the full capabilities of the machine at hand. Such functions as input and output, though they exist on all machines in one (or more) forms, are completely different in their facilities and effects and thus the source code should reflect these differences. The language must provide facilities which will generate target code which takes advantage of all the strong features of the machine in order to allow the programmer to use that machine to the best of its abilities. For example, a language designed for the System/360 should provide attributes which correspond to half words and bytes and constructs that make use of the assembler language instruction TRT (a character scanning instruction). Hence complete machine independence cannot be a major design criterion of a truly efficient systems programming language. [See Section 2.1 on target code efficiency for a related discussion.]
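In PL/I terms, such machine-oriented attributes and constructs already exist in rudimentary form; the following lines are a hedged illustration (CARD is a hypothetical character variable), not a claim about any particular compiler:

      DCL COUNT FIXED BINARY (15);   /* maps naturally onto a /360 halfword              */
      DCL FLAG  CHARACTER (1);       /* a single byte                                    */
      I = VERIFY (CARD, ' ');        /* a scan that a /360 compiler may realize with TRT */

The point is that the source language should expose such machine strengths rather than hide them.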
2.9 Multiple-User Systems
Certain facilities are needed by systems that will have many users simultaneously. Some of these should be included as part of the language itself, whereas the rest should be available to the user only through "hooks" in the compiler. The hooks may be accessible either through syntactic and/or semantic extensions to the language, or by subroutining facilities. The systems programming language should therefore be only one part of a systems programming environment. The systems programming language must provide a compile-time option to generate reentrant code.7 This feature is essential for writing an interactive system. By allocating the data areas dynamically, the system will be able to support multiple users who are likely to use the same routine concurrently; any location which is likely to change must be allocated separately for each user.
7 Even for non-reentrant code, there should be no facility for the code to modify itself. Self-modification loses the algorithmic clarity that is desired in a program which might have to be understood by many people, and usually gains very little on today's machines.
A facility which should be included as part of the systems programming environment is a software paging mechanism. Even on a machine with hardware paging, software paging is necessary to facilitate handling of large files and to swap users and data in and out of core in a multiple-user system. Of course, this should be an option for the user to specify, so that other systems need not carry that overhead. Another such facility is to allow the user to include protection in his system. In a multiple-user environment, there are certain restricted files and particular areas of storage that can be accessed by only certain individuals. In some cases files need to be protected from modification, at other times from scrutiny or execution by other users. The environment might be able to provide mechanisms in the paging system for setting and inspecting keys for protection from reading, writing, execution, or any combination of these.
2.10 Miscellaneous
A language designer must remember that the language must satisfy the users' needs and still be pleasant to use. The users of the language will judge the language on syntactic and semantic clarity, ease of writing and debugging, and efficiency of target code. A proverb due to A. J. Perlis is: "One man's constant is another man's variable." Since all programmers have their own ideas of clarity and ease, as many default options as possible must be left unbound until the user indicates his preferences. There is a great variation between the ideals of programmers who implement systems. As an example, one group may prefer undeclared variables to assume some default attributes, whereas another prefers to have them flagged as errors. As many parts of the system as possible should be adaptable to the user; i.e., implemented either as a set of parameters changeable at each compilation of the system's components, or set in the compiler by the project manager for the duration of the coding of the project. For those parts of the system which cannot be adjusted to the peculiarities of the user, the designers of a systems programming language should obey the "Law of Least Astonishment." In short, this law states that every construct in the system should behave exactly as its syntax suggests. Widely accepted conventions should be followed whenever possible, and exceptions to previously established rules of the language should be minimal.
3. Specific Constructs
The previous section enumerated the global requirements of a good systems programming language. It considered the overall criteria of such a language both at the source and at the target levels. This section lists the specific facilities needed by programmers for implementation of systems which will function nearly as efficiently as if hand-coded in assembler language.
3.1 Data Facilities
Fixed point variables are used in every program to serve as counters, to contain quantities such as the current length of a varying length character string, or to be used as pointers to reference other data items. Naturally, all the basic arithmetic operations must also be available for manipulating these quantities. Some systems need floating point variables and operations as well. These might be used to contain such numerical data as graphic coordinates. Both bit and character data types are required: the first might contain flags and masks; the second, parsed data items and, in a translator writing system or compiler, generated code.
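In PL/I-like notation, the kinds of declarations implied here might look as follows (a sketch only; all names are hypothetical):

      DCL COUNT     FIXED BINARY (31);      /* counter or current string length   */
      DCL XCOORD    FLOAT BINARY;           /* numerical data such as a coordinate */
      DCL MASK      BIT (16);               /* flags and masks                    */
      DCL TOKEN     CHARACTER (32) VARYING; /* parsed data or generated code      */
      DCL NEXT_ITEM POINTER;                /* reference to another data item     */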
3.2 Storage Allocation
In order to perform data manipulation efficiently, the systems language should provide various methods of storage allocation on three basic levels as discussed below: program (static), segment (implicit dynamic), and statement (explicit dynamic). A data item which is allocated on the program level is accessible by that program (i.e., "module" or whatever name is given to an independent unit of an executing system) at any time. That is, space is reserved for the item for the duration of program execution. A value defined there during compile time or stored at any time during execution may still be obtained by the system at the end of execution or at any time in between. Data items requiring this static type of allocation include any constants used by the system. Many variables are only needed in a single routine, and therefore they need not be accessible throughout the whole program. In order to conserve space, the variables could be allocated on the segment level (implicit dynamic), that is, the space could be obtained from a free storage mechanism upon entry to the block or subroutine in which it is needed and then freed again (returned to free storage) upon exit, after which the variable may not be referenced.
Since segment level allocation is performed on entry to the routine, such variables may be used in a reentrant or recursive environment. If the routine calls itself, either directly or through another routine, the values of the variable will be saved (e.g., in a push down stack) and the variables themselves will be reallocated. Upon exit from the second instantiation, the second copy of the variables would be freed, and the first set reestablished as the current values. Similarly, if two separate users access the system at the same time in a reentrant environment, each would be allocated a separate copy of the variables on entry to the routine. The final mode of storage allocation (explicit dynamic) is done on the statement level and is directly controlled by the user. That is, the user issues a command to allocate a variable and at some time later, he issues an explicit command to free it again. During that period of accessibility, the variable is “based on” (referenced by) a pointer (whose value is an address) that has previously been allocated in the program (on the program, segment, or statement level). Such a variable is accessible regardless of segment nesting, as long as the pointer on which that variable is based is accessible. Every reference to the new variable is evaluated by first locating the basing pointer, then using its value to determine the location of the based variable. Thus the declaration of a based variable specifies the attributes of a template which may be located anywhere in core at run time since the value of the basing pointer is undetermined at compile time. In fact, one may reference any section of core as if it were that variable by changing the contents of the basing variable or by specifying another pointer as the base for the variable. Through the use of the second of these options, a user may have “multiple copies” of the variable in core at once, each one referenced by a different pointer. This type of allocation would be used typically to store the blocks of a linked list. The first copy of the variable (head of the list) would be based on an already existing pointer, and each new element would be referenced by a pointer in the previous block. Elements could be added to the list by allocating a new copy of the variable and basing it on the pointer contained in the last element of the list; others could be deleted by altering the pointer of the previous block to reference the following one, and then freeing the variable itself. The allocation of storage is closely related to the concept of scope. The name of a variable must be made known to the compiler whenever a user wishes to reference it. Local variables are those which are known only within the block in which they were declared (including all statically nested blocks). On the other hand, global variables are those which are known in any block in the system. The option of using variables globally requires that the language provide
two constructs. One of these is used to declare that a variable is to be allocated in the declaring block, but may be referenced in any block in the system. (This is often called entry or global.) The other construct specifies that a variable that is allocated in another block (as a global variable) is to be known by the declaring block. (This is usually referred to as external.)
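PL/I collapses the two constructs into a single attribute: a declaration with EXTERNAL both makes the variable known globally and lets any other separately compiled module share it by repeating the declaration. A minimal sketch (the name is hypothetical):

      /* in the module that conceptually owns the counter       */
      DCL LINE_COUNT FIXED BINARY EXTERNAL;
      /* in any other module that needs to reference it          */
      DCL LINE_COUNT FIXED BINARY EXTERNAL;

Languages closer to the assembler model keep the two roles separate, in the manner of ENTRY and EXTRN declarations.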
3.3 Data Manipulation
All systems must organize data and subsequently access that data as efficiently as possible. The most common method is conventional list processing ("pointer chasing") in which the data is organized into ordered lists, i.e., blocks ("structures" or "beads") randomly stored but linked together in some order by a chain of pointers. Typical operations which must be provided from within the systems language are constructing lists (establishing the pointers), searching lists for a particular data item, and deleting items from a list. Another method is to use arrays for storing data with list processing replaced by operations on indices. Additional types of data organizations may be implemented through hash coding techniques [13] and set theoretic structures [8]. String manipulation facilities must be provided to manage bit and character data. Typical operations are pattern matching, accessing a substring, concatenating two (or more) strings, inserting, deleting, substituting, rearranging, etc.
3.4 Program Control and Segmentation
Flow of control must be maintained within individual modules, as well as between separate components of the system. A clear program logic, which is probably the most crucial aspect of systems design, can be maintained only through the use of iterative loops and conditional and unconditional jumps. Typical constructs available for this purpose are IF...THEN...ELSE, WHILE, FOR, and CASE statements. Others suggested by Dijkstra provide for the elimination of GOTO's through the use of DO loops and escape expressions. A common method of program segmentation is through the use of subroutines. Control is passed to a subroutine (or procedure) which saves the environment of the calling routine (return address and register or accumulator contents). Upon entry, a procedure must allocate all its segment level (local) variables. This routine may invoke several other procedures in turn, but eventually must regain control. Finally the procedure frees its local variables, restores the environment of the calling procedure, and
returns control to that routine. Subroutining requires CALL and RETURN commands for maintaining control. Other useful facilities for run-time linking are the /360 commands ATTACH, XCTL, and LINK described in Section 2.7 on program modularity. In some cases, a user prefers to communicate between two equally important routines. The systems language may provide the facilities of coroutines, where control can pass from one routine to another and back again without freeing the segment level storage of either. An additional method for segmentation is block structure, such as found in ALGOL. A block is similar to a subroutine in that it signifies a logical unit both at compile time and at run time. Variables and labels may be declared and are addressable within this block and all of its inner blocks. Thus, as is the case for subroutines, data local to the block may be overlaid in storage by that of another block when it is not being used. Whereas a subroutine can only be entered by an explicit "call", a block is entered by passing sequentially through the block head. Thus a block has no need for facilities for returning to the point of entry, but merely passes control sequentially to the next segment. Another method of segmentation is to provide routines which will receive control after an asynchronous interrupt (one which occurs from some outside source at an unpredictable time). A typical example of such an interrupt is a machine check (such as an addressing error) or a lightpen detect in a graphics system. These routines behave similarly to subroutines in that they are usually not entered sequentially and they return to the point of execution at which the interrupt occurred. (It is often convenient to use a synchronous device for "calling" this type of segment, such as in PL/I's SIGNAL facility, which simulates an appropriate asynchronous interrupt.)
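As a small sketch of the block-structure case in PL/I-like form (WORK_BUFFER is a hypothetical name), storage local to the block exists only while control is inside it:

      BEGIN;
         DCL WORK_BUFFER CHARACTER (4000);   /* allocated on sequential entry to the block */
         WORK_BUFFER = '';                   /* used only within the block                 */
      END;                                   /* the storage may be overlaid after exit     */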
3.5 I/O and Debugging
Data may be transmitted to and from files in either stream or record format. Stream I/O may be treated by the user as a continuous flow of information. Although this type of I/O is usually associated with devices which do either input only or output only (such as a card reader, card punch, or printer), it can also be used to access sequential files on disks or tapes. On the other hand, record I/O requires that the data be segmented into logical units, having a fixed or a varying length. This type of I/O may be used for direct access as well as sequential files. In order to allow I/O to overlap program execution to improve efficiency, the systems environment should provide a means for buffering the data to be transmitted. In the case of input, this implies that several records be read into core before they are actually required by the program; in the case of output, records are temporarily stored in core before actual transmission, thus permitting the user to continue execution before the "write" is completed. Debugging facilities, as described in Section 2.4, must be provided for both interactive and batch systems. (Naturally the debugging tools should be such that they do not affect the rest of the program, but can be inserted and deleted without fear of causing havoc to the functioning of the system.)
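The contrast between the two formats can be sketched in PL/I terms as follows (file and variable names are hypothetical):

      DCL SYSIN FILE STREAM INPUT;
      DCL OLDF  FILE RECORD INPUT;
      DCL NEWF  FILE RECORD OUTPUT;
      GET FILE (SYSIN) LIST (A, B, C);    /* stream input, with conversion                  */
      READ FILE (OLDF) INTO (REC);        /* record input, possibly satisfied from a buffer */
      WRITE FILE (NEWF) FROM (REC);       /* record output; execution may continue          */
                                          /* before the data are physically written         */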
4. Reviews of Several Systems Programming Languages
Although many languages were investigated in our search for a good systems programming language, only four are summarized here. The languages were chosen because they satisfied a fair number of our criteria, and because both their compilers and documentation are generally available. The descriptions demonstrate the facilities of the languages measured against the criteria listed above. Each description is divided into the following sections: data attributes, program segmentation, control statements, data manipulation facilities, I/O and debugging facilities, macro facilities, and trouble spots. Several of the languages (e.g., BLISS) were still in the design and development stages when this article was submitted for publication (November, 1970); the current versions of these evolving languages may therefore deviate somewhat from their descriptions.
4.1 PL/I
This section will attempt to evaluate the implementation of PL/I introduced by IBM at the time that OS was introduced. IBM hoped that the language implemented could become a universal language with adequate facilities for the purposes of both science and business. Because of its attempts at universality, IBM's /360 PL/I satisfies most, but not all, of our requirements.
4.1.1 Data Attributes
(a) Type Attributes The type attributes available for variables in PL/I are arithmetic, pointer, label, logical, event, task, offset, and area. Arithmetic data provides the systems programmer with the necessary facilities for fixed and floating point arithmetic. Allowance is made for such nonessential operations as decimal and complex number arithmetic. Although these features cost the user nothing unless he uses them, their inclusion adds to the complexity of the language, making it even more difficult to learn.
A variable with the POINTER-type attribute may contain the core address of a variable. The displacement of a variable relative to the start of a reserved portion of storage can be assigned to a variable with the type attribute OFFSET. However, in order to access a data item referenced by a variable of type OFFSET, the offset must be assigned to a variable of type POINTER which can then be used to access the data item. An AREA can be reserved in which variables with the storage attribute BASED can be allocated. An AREA can be transmitted as an entity in assignment or I/O operations, allowing the programmer to handle large segments of data easily. Elements within an AREA may be referenced by variables with the type attributes OFFSET or POINTER. An example of this type of data is:
      DCL A AREA (2000);
which declares A to be an AREA of 2000 bytes. The following declarations
      DCL OO OFFSET (A);
      DCL XX FIXED BINARY BASED (PTR);
define OO as an OFFSET relative to the beginning of the AREA A and XX as an integer which, when allocated, will be referenced by the pointer PTR. [The phrase "BASED(PTR)" contextually declares a variable PTR which has the type attribute POINTER.] If XX is allocated in A with the statement
      ALLOCATE XX IN (A);
storage is reserved in the AREA A for XX and PTR is set to the address of the reserved storage. In order to establish the position of XX relative to the start of A, the following assignment statement must be written:
      OO = PTR;
The compiler generates code to perform an implicit conversion (subtracting the address of the start of the AREA from PTR) and to assign the difference to OO. If A were written out to disk and read back into core at a different location, PTR would refer to the old (invalid) position of XX since it contained only a core address. However, OO could still be used to reference XX after obtaining the address of the start of the new AREA. It is possible to assign a label in the source code to a variable with the type attribute LABEL. This feature can be used to implement a FORTRAN-like computed goto as described in Section 4.1.3 on control statements. Bit and character strings may have the length attributes fixed (the default specification) or VARYING. When a VARYING length string is
specified, the current compilers will allocate the maximum amount of storage requested in the declaration. However, unlike a fixed length string, a VARYING string is not padded on the right with blanks when its value is a string shorter than the maximum length. EVENT and TASK data are used by the multiprogramming facilities of PL/I. TASK variables are used to control the relative priorities of tasks demanding execution. EVENT variables are used to coordinate the concurrent execution of a number of procedures or the overlap between a record-oriented I/O operation and the procedure which invoked it. (See Section 4.1.3 on control statements.) Data elements can be organized into logical groups by using arrays and structures. Arrays are groupings which contain data elements with identical type attributes. Arrays may have up to fifteen dimensions in the Optimizing Compiler and up to thirty-two dimensions in the F-Compiler. For example, the declaration
      DCL STRINGS (10,3,5) CHARACTER (20);
defines the variable STRINGS which is a ten by three by five matrix of character strings of length twenty. The individual character strings can be accessed by indexing with ordinal numbers from the beginning of the array. The first such element would be STRINGS(1,1,1). Unlike arrays, structures allow grouping of data elements with different type attributes. The following structure named ELEMENT could be used as an element in a doubly linked list. It contains pointer-valued variables as well as a data area consisting of a fixed point binary number (NUMBER) and a character string of length twenty (STRING).
      DCL 1 ELEMENT,
            2 BACK_CHAIN POINTER,
            2 FWD_CHAIN POINTER,
            2 DATA,
              3 NUMBER FIXED BINARY,
              3 STRING CHARACTER (20);
Elements of a structure are addressed by name. ELEMENT.DATA.NUMBER or ELEMENT.NUMBER references the fixed point data and ELEMENT.DATA references the entire substructure DATA. It is possible to have structures as elements of arrays and conversely arrays as elements of structures. Therefore the user can create quite complex data items. (b) Scope Attributes Variables have the default scope attribute INTERNAL, which means that, as in ALGOL 60, they can be accessed
only in the block8 in which they are declared. The scope attribute of an identifier can be declared EXTERNAL, which means it will have the same meaning in all blocks in which it is declared. This feature can be used for communication between blocks.
8 The usual form of a block is either
      PROCEDURE ... ;
         statement
         ...
      END;
or
      BEGIN ... ;
         statement
         ...
      END;
Blocks form logical divisions in the user's program and provide a method by which incremental storage allocation can be done. Identifiers declared within the block are said to be local to that block.
(c) Storage Class Attributes The storage classes in PL/I are STATIC, AUTOMATIC, CONTROLLED, and BASED. Identifiers whose storage allocation attribute is STATIC are allocated on the program level, but are accessible globally only if they are declared to be EXTERNAL. Storage for a variable with the attribute AUTOMATIC is allocated on the segment level (i.e., on entrance to the block or procedure in which it is declared). CONTROLLED and BASED storage attributes allow allocation and freeing of variables at the user's explicit request, i.e., on the statement level. Declarations of variables with the CONTROLLED or BASED storage attributes serve only to specify templates; no storage is actually reserved until an ALLOCATE statement is executed. Storage for BASED and CONTROLLED variables is reserved with the same ALLOCATE construct, but only BASED variables have an explicitly associated pointer. The structure above could be allocated as a BASED variable by changing the first line of the declaration to read
      DCL 1 ELEMENT BASED (P),
This declaration also introduces a pointer P which will be used to reference the structure. The pointer has the default storage attribute AUTOMATIC and is therefore allocated on entry to the block. However, the allocation of the structure ELEMENT is deferred until execution of either the statement ALLOCATE ELEMENT; or ALLOCATE ELEMENT SET (Q); The phrase "SET(Q)", like the phrase "BASED(P)", is a contextual declaration of the variable Q which has the type attribute pointer. The first statement would result in the allocation of a copy of ELEMENT and the setting of P to point at ELEMENT. The second statement would also allocate a copy of ELEMENT but set Q to point at it. The pointers Q and P can be thought of as qualifiers which distinguish between two copies of ELEMENT. The PL/I phrases "P -> ELEMENT" and "Q -> ELEMENT" refer to the copies of ELEMENT pointed to by P and Q, respectively. The ability to associate the storage attribute BASED with structures gives the programmer a powerful data manipulation tool. In our example, a single declaration specifies the form of all members of a linked list. Assuming that two copies of ELEMENT were created by executing both statements above, the following instructions will create a two-member linked list, with the first copy (pointed to by P) as the head of the list.
      Q -> ELEMENT.BACK_CHAIN = P;
      ELEMENT.FWD_CHAIN = Q;
[Figure: the two copies of ELEMENT, with P addressing the head and the BACK_CHAIN and FWD_CHAIN pointers linking them.]
The first statement sets the BACK_CHAIN pointer in the copy of ELEMENT referred to by Q to the value of P (the address of the first copy of ELEMENT). The next statement assigns the value of Q (the address of the second copy of ELEMENT) to the FWD_CHAIN pointer in the copy of ELEMENT referenced by P. The pointer qualification phrase "P ->" need not be written, since P is implicitly associated with the structure. With this organization, it is possible to find the head of the list by chaining through the BACK_CHAIN pointers. If another allocation of ELEMENT is performed without using the SET clause, P will be changed to refer to this new copy and will no longer point to the head of the list. In order to make this copy the new head of the list, the following sequence must be executed:
      DCL R POINTER;
      R = Q -> ELEMENT.BACK_CHAIN;
      R -> ELEMENT.BACK_CHAIN = P;
      P -> ELEMENT.FWD_CHAIN = R;
[Figure: the new copy of ELEMENT chained in as the head of the list through its BACK_CHAIN and FWD_CHAIN pointers.]
An extraneous variable R has been introduced in order to reference back through two levels of pointers. This must be done using the F-Compiler since only one level of pointer chasing is allowed in a single statement. This unnecessary restriction has been dropped in the Optimizing Compiler, which allows the use of statements such as
      Q -> ELEMENT.BACK_CHAIN -> ELEMENT.BACK_CHAIN = P;
In this way the user can save two source statements and eliminate the needless introduction of R, making the program easier to read and more efficient. Unlike BASED variables, only one copy of a CONTROLLED variable is normally accessible at any given time. Successive allocations of a CONTROLLED variable produce a last-in first-out stack for that variable. The FREE statement deletes the current copy and "pops" the stack (i.e., reactivates the previous copy). Each allocation of a CONTROLLED variable requires a dope vector, and a dope vector is created for a BASED structure or array whenever it is passed as a parameter.9 Dope vectors are created by copying a skeleton dope vector kept in the static storage area. If the variable is complex, the dope vector and the skeleton dope vector may be of significant size so that use of the CONTROLLED or BASED attribute may be quite expensive.
9 A dope vector is an internal control element which is created for each copy of an array, structure, and varying length bit or character string. It provides a means to access elements of the more complex data types, determine the current length of varying length strings, and to perform error checks for the possibility of exceeding the length of a string or the upper or lower bounds of an array. The user has no control over the creation and use of dope vectors.
The problem of efficient allocation and compaction of the free storage area (i.e., avoiding fragmentation) is handled by run-time routines provided by the compiler. Each area has a free area queue associated with it. This queue is a linked list of the free storage within the area. The list is ordered by size. When storage is requested for a variable in an area, a run-time routine is called which chases down the linked list until an element of free storage is found which is not large enough to fill the request. Storage will be allocated from the previous element, which represents the "best fit" (i.e., provides the minimum amount of storage available to satisfy the request). Each time a variable allocated in an area is freed, its storage is returned to the free area queue associated with the area in which it was allocated. The elements of the free area queue are compared to determine if the storage just freed can be merged with any storage which is already free to form a larger free element. Whether or not any compaction is performed, the new free storage element is linked into the list in the appropriate place.
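The search over the size-ordered free area queue can be sketched in PL/I itself; the following lines only illustrate the strategy just described, not the compiler's actual run-time code (FREE_HEAD and REQUEST are hypothetical names):

      DCL (CUR, PREV) POINTER;
      DCL 1 FREE_ELEM BASED (CUR),
            2 NEXT POINTER,
            2 SIZE FIXED BINARY (31);
      PREV = NULL;
      CUR = FREE_HEAD;                                        /* largest free element first    */
      DO WHILE (CUR ^= NULL);
         IF CUR -> FREE_ELEM.SIZE < REQUEST THEN GO TO DONE;  /* too small: stop the scan      */
         PREV = CUR;                                          /* last element still big enough */
         CUR = CUR -> FREE_ELEM.NEXT;
      END;
DONE: ;                                                       /* PREV is the best fit (NULL if none) */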
4.1.2 Program Segmentation
The principal syntactic unit of PL/I is the statement. With the exception of assignments, all statements begin with keywords. Statements may be grouped in procedures, BEGIN blocks, DO groups, and ON-units. Procedure blocks are used to separate major routines. They can be activated only by a procedure call or function reference and may contain declarations for formal parameters10 and local variables. Procedures are of two types, those which are explicitly called in a CALL statement and those which are implicitly called as functions. Functional procedures are invoked by the appearance of their name in place of a variable in a statement. Thus, they must return a value whose type is specified at the procedure entry point. The RETURN statement may have as its argument an expression which is the value of the function.
10 A formal parameter is a variable that represents a value which is passed (i.e., made known) to the called procedure, but whose storage has been reserved in the calling procedure.
The BEGIN block in PL/I is identical to that of ALGOL 60 in that it can be used to group multiple statements into a single, simple syntactic unit as well as to declare local variables. It differs from the procedure in
SYSTEMS PROGRAMMING LANGUAGES
203
that the flow of control passes through it (i.e., no call is needed since it is "in-line"). DO groups, unlike those constructs mentioned above, may not have local variables associated with them. DO groups are most often used to group several statements into a single complex statement. It is like the BEGIN block in allowing the flow of control to pass through it. The DO statement also functions like an ALGOL 60 for statement, since it can contain an iteration clause. Error and interrupt handling routines can be coded as ON-units. An ON-unit consists of the keyword ON; followed by the condition which, when raised, will cause execution of the ON-unit; followed by a single statement or a BEGIN block containing several statements. In some ways an ON-unit resembles a conventional subroutine. Like a subroutine, the code for an ON-unit is not in the line of execution. If the condition mentioned in the ON-unit declaration occurs,11 control is passed to the statement associated with the ON-unit (somewhat analogous to a subroutine call). When the ON-unit has finished execution, control is returned to the point from which the ON-unit was invoked. The primary difference between an ON-unit and a subroutine is that an ON-unit need not be explicitly invoked but may be invoked asynchronously whenever the specified condition occurs. Furthermore, explicit parameters may not be passed to ON-units. For example, the user may write:
      ON ZERODIVIDE A = B;
If division by zero occurs, control is passed to the ON-unit and B is assigned to A. Conditions which raise ON-units can be divided into two basic categories: those which can be detected by the hardware such as ZERODIVIDE and OVERFLOW and those which must be checked by the software such as STRINGRANGE and SUBSCRIPTRANGE. The conditions detected by the hardware are initially enabled in PL/I (e.g., ZERODIVIDE is initially enabled). The second type, that which must be checked by software, is initially disabled. This type of condition can appear as a prefix to a source statement to request that the compiler generate code to check for the condition. For example,
      (STRINGRANGE): statement;
will cause code to be generated which checks if an attempt has been made to access before the beginning or past the end of any string used in the statement. The ON-unit is a "dynamic declaration" in that execution must pass through it before it is in effect. Once in effect, an ON-unit will be passed control whenever the specified condition occurs. This saves the programmer coding checks for the condition at several points in his program. ON-units can be disabled by execution of another ON-unit for the same condition. In effect, the new ON-unit supersedes the previous one. If the statement of the new ON-unit is SYSTEM as in
      ON ZERODIVIDE SYSTEM;
the pre-defined system action will occur for the indicated condition.
11 A programmer can force the execution of an ON-unit (usually for debugging purposes) through the use of the SIGNAL statement. For example, SIGNAL ZERODIVIDE; will simulate an interrupt caused by division by zero, and the appropriate action will be taken.
4.1.3 Control Statements
Program control is most often carried out by IF statements and DO loops with an iteration clause. The GO TO is mostly used as an unconditional branch. However, since label arrays of several dimensions are allowed, a GO TO may act as a switch. After statement labels have been assigned to the members of the array, the programmer may branch to different points in his program by varying the subscript of the array. PL/I has a CALL statement which allows synchronous and asynchronous operations. Both types of CALL statements may pass parameters. The asynchronous operations made available through the multitasking facilities of the language can be used to fashion a multiprogramming system which provides for the execution of a number of procedures concurrently. If the amount of overlap possible between procedures with different amounts of I/O is significant, the multiprogramming environment can permit more efficient use of the central processing unit and I/O channels by reducing the waiting time for each procedure. Two procedures can be made to execute concurrently by including the TASK parameter on the CALL statement. The procedure called in this way is known as a subtask; it is attached to the procedure which called it (the attaching task) and remains in core only as long as it is active. The continued execution of a task may be made dependent on the completion of a subtask by using the name of the EVENT variable associated with the second task in a WAIT statement.12 Although the tasks are said to execute concurrently, they are not necessarily executed simultaneously. Each task has associated with it a PRIORITY parameter (default value 0) which regulates the order in which tasks are executed. The task with the highest priority of those currently awaiting execution (i.e., not waiting for the completion of an I/O event or a subtask) will continue execution. The following statement demonstrates these facilities:
      CALL PROCA TASK (T3) EVENT (E3) PRIORITY (-2);
This statement calls PROCA as subtask T3 which has lower priority than the attaching task. The argument associated with the PRIORITY parameter specifies a priority relative to (and lower than) that of the attaching task. The statement also associates an event variable E3 with the subtask. At any point in the attaching task the statement
      WAIT (E3);
may be executed to postpone execution of the attaching task until the attached task has completed execution.
12 The WAIT statement prevents the currently executing task from proceeding until a specified event has occurred.
4.1.4 Data Manipulation Facilities
Besides the usual "and", "or", and "not" bit-string operators, PL/I has the built-in function BOOL which allows up to sixteen different user-specified Boolean operations to be applied to two specified bit strings. Character and bit-string manipulation are provided by the following operations:
1. Concatenation.
2. INDEX searches a string for the first occurrence of a specified substring and returns the location of the first bit or character of the substring relative to the beginning of the string.
3. SUBSTR allows a substring to be specified within a given string.
4. LENGTH returns the current length of a specified string whose length attribute is VARYING.
5. TRANSLATE will examine all the characters of a given string and replace each one with another character as specified by the user.
6. VERIFY will scan a given character or bit string from left to right to check that every character or bit is represented in another string specified by the user. If a character or bit in the first string is not represented in the second string, its index into the first string will be returned.
Although all these functions are available in PL/I, there is no simple method provided for the deletion of a particular substring from a character string. Assuming STRING_TWO has been declared a VARYING length
character string, the following example deletes STRING_TWO from STRING_ONE:
      I = INDEX (STRING_ONE, STRING_TWO);
      STRING_ONE = SUBSTR (STRING_ONE, 1, I - 1) ||
                   SUBSTR (STRING_ONE, I + LENGTH (STRING_TWO));
The deletion must be accomplished by concatenating (denoted by the || operator) two substrings because PL/I only allows replacement of character strings of the same size. The first argument of the first SUBSTR function is the name of the original string. The next two arguments are the starting position and length of the desired substring, respectively. In Fig. 4, the length of the first substring is found by searching STRING_ONE for an occurrence of STRING_TWO with the INDEX function and subtracting one. The second substring begins after the occurrence of STRING_TWO and runs to the end of STRING_ONE as there is no third parameter. The built-in LENGTH function returns the appropriate size of the VARYING length string.
[Fig. 4: STRING_ONE and STRING_TWO before the deletion, and the resulting STRING_ONE.]
PL/I provides additional options that may be helpful in data manipulation applications. Among these are:
(1) DEFINED provides that the data element with this option should occupy the same storage locations as the data element named in the DEFINED clause of a declaration. However, there are severe restrictions concerning the correspondence of types of the variable and the data element in its DEFINED clause. In the example below, the first ten characters of LINES and all of LINE1 occupy the same storage locations.
(2) For bit or character data, POSITION provides that a data element DEFINED at another data element may start at any displacement from the start of the storage reserved by the latter. For example, the following declarations:
      DCL LINES CHARACTER (20);
      DCL LINE1 CHARACTER (10) DEFINED LINES;
      DCL LINE2 CHARACTER (10) DEFINED LINES POSITION (11);
result in the allocation of only twenty characters called LINES. However,
the first ten characters can be referred to as LINE1 and the second ten characters as LINE2.
(3) ALIGNED specifies that a variable should be aligned on its natural boundary. For example, a fullword should have its storage allocated on a fullword boundary. In order to align bit strings on a fullword boundary DEFINED must be used, since the natural boundary of a bit is a byte. This can be done in the following manner:
      DCL FLAGS FIXED BINARY (31,0) ALIGNED;
      DCL STRING BIT (32) DEFINED FLAGS;13
13 FLAGS is a thirty-two bit word consisting of thirty-one bits and a sign.
(4) INITIAL allows the user to specify values of variables with the storage attributes STATIC and AUTOMATIC prior to execution. A character string could be initialized to the value 'ABC' with the declaration:
      DCL STRING CHAR (3) INIT ('ABC');
(5) The asterisk notation allowed by PL/I enables a user to process a cross section of an array. The phrase ARRAY(*,10) references all the elements in the tenth column of ARRAY. As can be expected, this type of accessing, though syntactically compact, is sometimes expensive.
4.1.5 I/O and Debugging Facilities
(a) Stream I/O The modes of stream transmission are list, data, and edit directed, which all use the GET and PUT statements. These three modes provide the user with several methods of performing an I/O operation with varying degrees of specification of the variable name and format of the data being manipulated. The data in files to be used in the list directed mode consist of sequential constants delimited by commas or blanks. Those in the data directed mode are of the form
      variable_name = constant
and are also delimited by commas or blanks. Files to be used in the edit directed mode are sequential characters which are deconcatenated through format specifications of the type and length of the data element which is receiving or outputting its value. Examples of card files in each of the modes are the following:
      list directed:   5, 'ABCDEFG' -8.7
      data directed:   A = 5, STRING = 'ABCDEFG', C = -8.7;
      edit directed:   5ABCDEFG-8.7
Since the GET and PUT statements must contain the names of the variables being operated upon, appropriate conversions are done on data in the stream. This is especially convenient for outputting numerical data. Although systems programmers sometimes must deal with highly formatted output, it is unlikely that they would need the three different methods provided by PL/I.
(b) Record I/O Record I/O is not hampered by conversions as is stream I/O, and consequently large segments of information such as AREAs can be transferred much more quickly this way. Depending on the type of file being used, READ and WRITE commands perform operations either directly on a file or through a buffer.
(c) Teleprocessing Support PL/I has teleprocessing features which can be used for terminal I/O between the user and a time-sharing system. All teleprocessing files are declared TRANSIENT, which specifies that records may be added to the file by one user while another user is removing records from the file. The file is considered to be a queue through which records pass between the control program (i.e., the time-sharing system) and the user's program. The PENDING condition is raised when there is an attempt to read a temporarily unavailable record. Because there is a condition raised, the systems programmer can code an ON-unit to prevent the time-sharing system user from accessing a file that is currently being used.
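The kind of ON-unit alluded to here might be sketched as follows; this is only a hedged illustration, with INQ, MSG, and NO_INPUT_YET hypothetical names for a TRANSIENT file, a record variable, and a label:

      ON PENDING (INQ) GO TO NO_INPUT_YET;   /* no record is queued at the moment     */
      READ FILE (INQ) INTO (MSG);            /* otherwise take the next queued record */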
(d) Debugging Aids In addition to the ON-unit and SIGNAL statement already described, PL/I also provides for a trace of statement numbers (only with the optimizing compiler) and the CHECK option. Whenever the value of a variable occurring in the user's CHECK list is altered, control is passed to the appropriate ON-unit. If no ON-unit exists, the name and value of the variable are printed in the data-directed format. This option can only be attached to PROCEDURE and BEGIN statements and is valid over a whole block.
(CHECK (A,B)): PROCEDURE. . . Other facilities available are SUBSCRIPTRANGE and STRINGRANGE. The first causes an interrupt to occur when a user tries to access a data element outside the declared boundaries of an array. STRINGRANGE causes an interrupt when a user attempts to access a character before the beginning or past the end of a string. In both cases, the user may specify an appropriate ON-unit. The debugging aids provided by PL/I are still missing key facilities.
There is no convenient way to print variables of type POINTER or OFFSET. It is not possible to access variables which have been CHECKed, and there is no way to allow physical asynchronous interrupts (e.g., for a lightpen), which the ON-units have been designed to handle. The language provides no means of examining the contents of variables after abnormal termination of the program (e.g., a symbolic dump). To do this, the user must recompile some portion of his system with output and/or CHECK statements. If he has not selected the correct portion of his system or the relevant variables, the process must be repeated, making PL/I debugging both time consuming and expensive.

4.1.6 Macro Facilities
PL/I has macro facilities which allow the inclusion of strings of text from user or system libraries, conditional compilation, and substitution for any identifier in the source program. A compile-time statement has a % before the keyword. Strings of text in libraries can be incorporated into the source program with the statement %INCLUDE filename; Conditional compilation can be achieved with the %GOTO and %IF statements. The preprocessor can be interrupted in its scanning and told to continue at the place specified by the label in the %GOTO statement. The %IF statement allows testing of compile-time values to determine where the preprocessor should resume scanning. Also, any identifiers in the source program may be changed using the %DECLARE and %ASSIGNMENT statements. For example, if a constant were to be changed from compilation to compilation, the following statements might be executed:

%DECLARE A FIXED;
%A = 27;

All occurrences of A in the source program would be changed to 27; thus M = N + A; would become M = N + 27; In order to change the constant for the next compilation, only the %ASSIGNMENT statement need be changed.
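The conditional side of the facility can be sketched as follows; the preprocessor variable, its value, and the label are invented for illustration, and the label form shown is one plausible spelling.

%DECLARE SYSTEM CHARACTER;
%SYSTEM = 'OS';
%IF SYSTEM = 'OS' %THEN %GOTO OSCODE;
   /* source text here is skipped by the preprocessor when SYSTEM = 'OS' */
%OSCODE: ;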
4.1.7 Trouble Spots
Some of the major deficiencies of PL/I have been discussed above. These include inadequate pointer chasing, character manipulation, and debugging facilities. Another major drawback of PL/I is the way in which it attempts to "help" the user by allowing the use of undeclared variables and correcting syntax errors. Undeclared variables, or "new" variables which result from typographical errors, take a set of default attributes determined by the first letter of the name of the variable. (These defaults can be set by the user in the Optimizing Compiler, but not in the F-Compiler. Furthermore, in the F-Compiler, no warning is given to the user that he is using an undeclared variable.) Through inserting semicolons and deleting portions of source code, the compiler attempts to produce a statement it can compile. It is rare that this feature aids a systems programmer. Often, an error will be interpreted by the compiler as a valid request for an implicit conversion. Of course, the most important criterion to be met is that a compiler must produce efficient code. In one project it was found that code written by programmers inexperienced in PL/I was five to ten times worse than that which they would have written in assembler language [10]. Another study [1] found that code written by experienced PL/I programmers using the F-Compiler was four to five times worse than assembler language. Furthermore, the Optimizing Compiler claims to generate code that is only approximately 20% better than the F-Compiler [42]. Although the manual warns of the cost of some of the constructs in general terms, no precise figures are given, and it is not easy to determine the cost from the poor assembly language listing which is generated. Other studies have also shown that, although PL/I is fairly machine independent and legible, it is such a rich and complex language that it is difficult to learn [32]. For example, it is almost impossible to remember the details of the data conversions which will be done in a statement with variables of mixed data types. The language is so complex that a user may not be able to determine the effect of a statement unless he has previously used it. It is not obvious that the expressions 5 < 6 < 7 and 7 < 6 < 5 are both true. (In the first expression, 5 < 6 will produce a Boolean 1 (true), which is converted to binary 1, which is compared to 7. In the second case, 7 < 6 produces Boolean 0, which when converted to binary 0 is less than 5.) PL/I does receive wide support, and some efforts are being made to make it more efficient. The ORDER and REORDER options are available to let the compiler do some controlled optimization by factoring common
expressions. Values are being retained in registers as long as possible and inline code is being generated for conversions in the Optimizing Compiler. Also, the option OPT has three values which specify the mode of compiling.
0 requests fast compilation and favors reduction in storage space over faster execution time.
1 requests fast compilation and favors faster execution time over storage saving.
2 requests faster execution time at the expense of core and of compilation time.

4.2 AED
The AED (Automated Engineering Design) System was developed at M.I.T. in the early sixties under Douglas Ross. The base of the AED System, the AED-0 language, was designed to facilitate the development of machine-independent programs. Based on ALGOL60 with extensions for character data, logical operations, and complex list processing, AED-0 runs on the /360 and the Univac 1108. There are /360 versions that produce code for the Raytheon 706 and the PDP-10; earlier 7094 versions for both batch and CTSS are still in use. AED was planned for use in environments in which changes of hardware are not unusual. In such an environment, portability, from machine to machine and even from installation to installation, is of major importance. The possible loss of efficiency is considered insignificant in comparison to the number of man hours saved in recoding. AED is designed to facilitate and separate the jobs of coding an algorithm and of creating and manipulating data structures. For maintaining separability of the two, the “extended referent system” was designed; for coding algorithms clearly, “phrase substitution” is used. Extended referents allow the same AED referent (e.g., A(B)) to invoke a variety of mechanisms, depending on how its parts are defined. For instance, A(B) might refer to data element A relative to pointer B, the Bth member of array A, the Bth member of switch A, procedure A with argument B or macro A with parameter B. Phrase substitution permits any operand in an expression to be replaced by an expression of the same type. All expressions, including statements, have both types and values (see Section 4.2.4.a). The freedom of allowing any pointer expression to be used for data access is one of the most important consequences of phrase substitution, as it allows multiple levels of “pointer chasing.” Two other interesting results of phrase substitution are multiple assignment statements and the use of a conditional expression within a statement.
The AED System consists of the AED-0 language and a set of run-time subroutine packages that deal with problems common to all systems programming, such as list processing, storage allocation, and I/O. The efficiency lost because of the generalized nature of these packages may be offset by the debugging and coding time they save. The AED packages allow the user to select either high level, generalized subroutines, or lower level subroutines from which he may build his own higher level ones. While AED lacks PL/I's data structure elegance, it permits the straightforward packing of data within a word. AED permits multiple levels of pointer chasing, but its bit facilities are more restrictive than PL/I's. AED does not perform implicit conversions, and it offers elaborate stack facilities. The Free Storage Package offers a choice of storage allocation strategies and does not require a maximum size for zones.

4.2.1 Data Attributes
There are four basic data types in AED-0: INTEGER, REAL, BOOLEAN, and POINTER. Character strings are declared as POINTERS and are described more fully below. More complex data types are defined by COMPONENTS and the referent system, as well as by ARRAYS and STACKS. Unlike PL/I AREAs, zones are not treated as variables (e.g., they cannot be the object of an assignment statement). Zones are discussed in Section 4.2.4(g). (a) Basic Data Types A REAL value is specified, as in FORTRAN, in normal E-format (e.g., 1.7E-3). An INTEGER value consists of an octal or decimal number, a letter specifying the base of the number, and an optional binary shift number. Character data is declared to be of type POINTER. The value that is stored in the variable is in fact a pointer to an arbitrary length character string of .C. format (see Section 4.2.4(b)). While such a declaration does not clearly state what the user is doing, it does clarify what the compiler is doing. (Compare this implementation to the hidden dope vector of PL/I.) This is the only format for character data for which any string functions are defined. A literal character string is specified by .C. and the literal string, delimited by any character not within the string. Thus, .C. /HELLO/ defines a character string of the literal HELLO.
(b) Components An “n-component element” or “bead” is the AED term for a block of contiguous storage, with one property or value per COMPONENT. A COMPONENT is a template describing a word or part of a word that has a certain data type and is always accessed relative to a POINTER. Unlike a PL/I structure element, the COMPONENT does
not refer to an element of a particular bead but refers to a displacement past a pointer. A COMPONENT's data type is declared as follows: REAL COMPONENT A $, ($, is the statement delimiter). The COMPONENT's position within a bead must be declared in a separate statement which uses the compile-time assignment operator $ = $. For example,

A $ = $ 2 $,

locates A(P) at the second word past the location referenced by P. While one can manufacture any data structure with these facilities, it requires more work than in PL/I. On the other hand, a forward pointer in the first word of several different types of bead need only be declared once. If one word is to contain several COMPONENTS, a PACK statement must be included. A COMPONENT will occupy the contiguous bits of the word specified by an INTEGER mask. Although the mask already indicates the position of the COMPONENT, the user must also specify the number of shifts needed to right justify it. On certain machines, field names may be used to specify that particular parts of the word are being used and thus permit more efficient code to be generated. Because of excellent macro facilities, the user may avoid writing out all the details of the PACK statement. This, however, is a detail that the compiler should provide automatically. For example, the following statements define a structure whose first word is a pointer called NEXT, the second word contains two integers, ID and MAX, and the third word contains two integers, MIN and CURR:

POINTER COMPONENT NEXT $,
INTEGER COMPONENT ID,MAX,MIN,CURR $,
NEXT $ = $ 0 $,
ID $ = $ MAX $ = $ 1 $,
MIN $ = $ CURR $ = $ 2 $,
PACK 177777C16,16,SPECIAL COMPONENTS ID,MIN $,
PACK 177777C,0,SPECIAL COMPONENTS MAX,CURR $,

The first PACK statement specifies that the COMPONENTS ID and MIN occupy the leftmost sixteen bits of the words in which they are contained; the second specifies that MAX and CURR occupy the right sixteen bits in their respective words. (The C indicates that the mask is specified in octal.) One drawback of PACKed data is that it cannot be modified as a parameter passed to a subroutine. This is because only word addresses can
be passed as parameters and since a PACKed COMPONENT is not a full word, its value must be stored in a temporary location whose address is passed to the subroutine. If the subroutine contains the proper packing information, the pointer can be passed and the PACKed data accessed this way. (c) Arrays Arrays are restricted to one dimension and all of their elements must be one of the four basic types. This means that all arrays must have unPACKed elements; there cannot be an array of halfword INTEGERS. A typical declaration is
INTEGER ARRAY A(9) $, which defines a ten member array of integers A(0) through A(9). (d) Stacks The final data type is the stack. A variable declared in a STACKS statement is treated as being of type POINTER, but since no type checking is performed, a stack can have variables of any type stored in it. A STACKS declaration simply causes an entry to be made in the symbol table; the procedure CREATE is used for defining the actual mechanism of the stack. The following call specifies that the stack S calls the procedure GEN when it tries to pop or read from an empty stack, that it initially contains the element PTR1, and that the stack itself (i.e., the components of the stack) takes storage from the free storage zone, Z (see Section 4.2.4(g) for the definition of a zone).
STACKS S $,
CREATE(S,GEN,PTR1,Z) $,

(e) Storage Classes All variables declared in a program, with the exception of OWN variables in recursive procedures, are allocated on the program level (i.e., similar to PL/I STATIC variables). OWN variables are allocated on entrance to the procedure in which they are declared, as PL/I AUTOMATIC variables. (Note that this usage of the term OWN is directly contrary to the ALGOL definition.) Statement level allocation is discussed in the section on data manipulation. Data declared and assigned storage in one compilation and referenced in other compilations is referred to as remote data. The two types of remote data in AED are EXTERNAL, which is accessed by name, and COMMON, which is located in a system-defined area and accessed positionally (by the order in which it is declared). The user may also use ALIASes to assign two variables the same storage location.
4.2.2 Program Segmentation
The two types of program segmentation are blocks and procedures. Both blocks and procedures may be valued, the former being referred to as valued blocks and the latter as functions. Block structure is introduced for logical divisions and to introduce lexical scope (i.e., to allow a name to be reused as a local variable without destroying the global one), but it does not affect storage allocation. AED blocks and procedures are equivalent to those of PL/I in their composition and scope of variables. Each block creates its own symbol table. A PROCEDURE definition includes the name of the PROCEDURE; whether or not the procedure is RECURSIVE; the parameters and their data types; the data type of the value returned by the procedure if it is a function; and the body of the PROCEDURE. The most important and interesting part of AED procedures is the elaborate parameter passing facilities. Parameters are passed by reference, that is, by the location that they occupy in core. In order to allow the user to write a procedure that can accept a varying number and even varying types of parameters, AED provides a number of system routines that let the user access the addresses and values of parameters as well as check the number and types of parameters passed. Such facilities allow the user to change a procedure to accept a different number of parameters without changing all calls to it, or to design procedures that act differently depending on the form of the parameter list. A procedure can be used to set up data areas for procedures nested within it. These nested procedures can then be declared EXTERNAL and thus be called from outside the outer procedure (called the set up procedure). The user must call the set up procedure before any of the nested procedures are called so that the proper data area may be allocated. This allows the nested procedures to access common data.

4.2.3 Control Statements
The most commonly used methods of flow of control are the procedure and function calls, GOTOs, and conditional expressions (see Section 4.2.4(d)). AED also provides the user with SWITCHes to perform computed GOTOs and the DOIT package. The latter allows the user to store program points, which are pointers to beads that contain the required information, in data structures and then activate them. The procedure DOIT picks up the address from this bead and transfers control. While DOIT works on procedures and labels already loaded into the program space, LDOIT and LDOITS will load a program dynamically. That is, the specified procedure name (a character string) will be used to locate
the procedure on secondary storage, allocate space for the program, load it, and then begin execution. LDOIT reloads the procedure on every call so that the space for the procedure may be reallocated after execution is completed. On the other hand, the initial call to LDOITS replaces the pointer to the character string provided in the call with a pointer to the procedure itself. Subsequent calls to LDOITS for the same procedure will be equivalent to a call to DOIT. This facility for dynamic loading is clearly an important asset of the language. For an example of its use, consider a general I/O system that makes use of "state beads", beads which contain pointers to the read and write procedures to be used.14 Assume a set of beads with two components, one pointing to the read procedure and the other to the write procedure:

POINTER COMPONENT READ $,
POINTER COMPONENT WRITE $,

Then the following statements set these pointers to the appropriate procedures for the TAPE bead.

POINTER TAPE $,
READ(TAPE) = .C. /TREAD/ $,
WRITE(TAPE) = .C. /TWRITE/ $,
Given any device whose bead is referenced by the pointer DEV, its read procedure can be activated by the statement

LDOIT(READ(DEV),ARG1, ..., ARGn) $,

LDOIT is a special "transparent" procedure which loads the appropriate procedure and then generates a call to it; the procedure is completely unaware that it was called using LDOIT. A procedure return transfers control to the statement following the LDOIT call. Since the I/O routines will be loaded when they are called, only those actually used will ever be loaded. Without LDOIT, the user would have to load all possible I/O routines and use a series of IF statements to check what device was specified and then explicitly call the appropriate procedure.
IF DEV = TAPE THEN TREAD() $,
ELSE IF DEV = CARD THEN CREAD() $,
...

14 This example is based on one in [31], pp. 153-154.
AED iterative loops consist of a set of conditions for which the statement specified is to be executed.

FOR A = 10*B, C STEP 2 UNTIL D, DO ...
For example, the above statement will cause the statement after DO to be executed as follows: First it will be executed with A set equal to 10*B, and then with A set equal to C. A will then be incremented by two and the statement re-executed until A is equal to D.

4.2.4 Data Manipulation

(a) List Processing For pointer chasing, AED provides the referent system. The referent system allows the user to write a component name referenced by any pointer-valued expression, and thus arbitrary levels of pointer chasing can be accomplished. Suppose NEXT is a POINTER component referencing the next member of a doubly linked list and PREV is a POINTER component referencing the previous element. The following instructions will insert the bead referenced by the POINTER NEW after that referenced by the POINTER OLD (see Fig. 5).
POINTER NEW, OLD $,
POINTER COMPONENT NEXT, PREV $,
NEXT $ = $ 1 $,
PREV $ = $ 2 $,
NEXT(NEW) = NEXT(OLD) $,
PREV(NEW) = OLD $,
PREV(NEXT(OLD)) = NEW $,
NEXT(OLD) = NEW $,

AED notation allows the user to abbreviate these four statements by using "phrase substitution," which allows nested statements. Every statement has a type and value, and therefore a statement may be used anywhere an expression of that type is permitted. Thus the last four lines of
pointer chasing in the above example may also be written as follows:

NEXT(OLD) = PREV(NEXT(NEW) = NEXT(PREV(NEW) = OLD)) = NEW $,

[Fig. 5. The NEXT and PREV links of the list before and after the insertion.]
The user may access the address of a variable by using the LOC function. Inversely, the value stored at an address can be accessed by using a COMPONENT defined to be the zeroth word relative to the pointer. For more elaborate list processing the Generalized String Package is provided. It is intended to handle any type of linked list, such as a pushdown stack, a queue, a table, a ring, or an ordinary list; in fact, it can be any set of beads that are somehow connected with an orderly mechanism. Each linked list must contain a first element and a mechanism for finding this element, as well as functions that specify how to follow the linking mechanism, how to access the value of each bead, how to search through the list, and how to create, copy, and destroy beads. (b) Character Manipulation The character functions access .C. character strings via the declared POINTER. The POINTER references a bead consisting of a header and an array of characters. The header includes the length of the array (i.e., the character string) and possibly a POINTER to a continuation bead. This indirectness is reasonable if all strings used are of varying length. For character strings where the length is known at compile time, it may introduce unnecessary overhead, since there is no reason for addressing these indirectly. While AED does provide a character type without this indirectness, it can only be manipulated as an INTEGER. The CEQLS procedure permits the user to test if two character strings are the same, but there is no reasonable way to alphabetize a series of character strings of varying lengths. The COMPAR procedure compares two character strings, but it first compares the lengths of the two strings, and only if they are of equal length does it compare the strings themselves. Thus, for example, "Z" would be alphabetized before "AA". COPYC is a function that copies a series of blocks of characters, combining them into
a single block and inserting the number of blanks the user requests between each block. Using these procedures and those provided for list processing in the Generalized String Package, the user can, with difficulty, create most character processing functions. To search for a character substring within a block one must use the SPRAY and GLUE functions, which unpack and pack character strings into word blocks. These, however, do not provide adequate character manipulation facilities. (c) Boolean Operators Boolean values may be assigned the values TRUE and FALSE directly, or their values can be computed using the Boolean operations NOT, AND, OR, and IMPlies. Because the BCD character set was all that was available when AED was begun, this minimal set has been used, and the relational operators are EQL, NEQ, LES, LEQ, GRT, and GEQ.
(d) Conditional Expressions Conditional expressions may be used for defining alternative statements or expressions to be executed. The objects of the THEN and ELSE must be consistent (i.e., both of them must be expressions of the type REAL, INTEGER, BOOLEAN, or POINTER, or both statements). The conditional can be used in any run-time computation, but cannot be used in any compile-time expressions. The following procedure call passes the larger of the two INTEGER expressions to the procedure BIG:

INTEGER A,B $,
BIG(IF X = A*B/2 GRT Y = 2*(A + B) THEN X ELSE Y) $,

(e) Stack Manipulation While stack manipulation may not be a major consideration for all systems programmers, the AED facilities (despite the clumsy syntax) are an excellent example of compact and powerful operators. By using appropriate graphic symbols rather than characters, the operators would be meaningful and easy to use. STACKS may be used as LIFO (Last In First Out) push down stacks or as a simple variable (that is, the top element is read from nondestructively and written into destructively). As stated earlier, the stack mechanism is specified by a call to CREATE. The stack functions are infix operators whose operands are two stacks. Operators are formed by concatenation of letters that define the transfer of the top elements of each of the stacks. The transfer can be from right
to left, left to right, or in both directions, and can be either destructive or nondestructive. For example, if the .LBERB. operator is used, it would perform as follows. (Note that symbols to the left of the E refer to the left stack, those on the right refer to the right stack.) The L and R indicate that transfers are to be made both from the right stack to the left stack and from the left stack to the right one. Because there is no B at left extreme of the operator, the left stack is pushed (written) normally; the B at the right end specifies that the right stack is written destructively, overwriting the top element. The first B indicates that the left stack is to be read nondestructively. Since there is no B directly to the right of the E, the right stack is popped (read) normally. Assume X and Y are STACKS containing ABC and DEF, respectively. (For the purpose of this example, the leftmost character is on the top of the stack.)
X .LBERB. Y has the following effect:

X      Y
ABC    DEF     before operation
ABC    EF      read X nondestructively, pop Y
DABC   AF      push X, store in Y destructively
(f) Bit Manipulation There are two forms of bit manipulation facilities in AED. For INTEGERS, there are operations for negation, logical and, right and left shifts, and inclusive and exclusive or. Varying length bit strings are handled by the .B. package with facilities similar to those of the .C. character string package. These facilities are not convenient since there is no simple way to work with individual bits (flags) without using PACKed COMPONENTS or subroutine calls. (g) Storage Allocation AED uses a hierarchical "zone" model for free storage. A free storage zone is a contiguous block of core storage which is managed by special dynamic storage allocation mechanisms, which may vary among zones even in a single program. The infinity zone is the root zone and has access to all core that the operating system and the AED run-time support allow. When a son zone (a zone within a zone) cannot satisfy a request, it obtains an extension from its ancestor zone. No zone can run out of storage if storage remains in some other parent zone, unless specifically indicated by the programmer. These extensions are of user-
specified fixed size, which may be overridden by a request for a bead larger than that size. The user may specify a procedure to be called whenever an extension is required, so that storage management strategies are completely under his control. AED offers the user three different storage allocation strategies: SPEC, REG, and GARB. The SPEC (special) strategy may be used if all beads to be allocated from the zone are of the same size. An arbitrarily ordered ring of available (i.e., used and returned) beads is maintained and the first bead on the ring is used to satisfy the request. This strategy keeps execution time at a minimum but requires the user to create a more complex zone structure if he wants different size beads. The REG (regular) strategy is the most appropriate for many programming problems. This strategy allows various size beads to be allocated from the same zone. In this case, two rings are maintained; the first orders unused extensions by location and the second is a ring of master beads ordered by size. Each master bead heads a ring of available beads of that size. If there are no available beads of the size requested, the storage is taken from the ring of unused extensions. The user can explicitly have all available beads returned to the unused extensions, with garbage collection being performed. This strategy minimizes the time spent in the "typical case," when an available bead of the right size exists. The last strategy, GARB (garbage-collecting), performs garbage collection on the fly. It is mainly used in zones at the top of the hierarchical structure which receive requests for large extensions of storage. The available beads are ordered by location and a first-fit rather than best-fit strategy is used (i.e., the first bead larger than or equal to the size requested is used). While a best-fit strategy (finding the smallest bead that will satisfy the request) conserves core, it requires more execution time, and the fragmentation that occurs when working with large size beads is not usually significant. The user is given the ability to hand tool any strategy to his needs by means of "help procedures." A help procedure is one that is called when any exceptional condition occurs. For example, in the REG strategy user-defined help procedures may be called if there are no available beads of the proper size or if a requested bead is larger than the largest unused extension. The help procedures are an important innovation of the AED System and are a basic method by which the user can control his program while using the run-time support facilities to their fullest. In statement level allocation, a bead is obtained by specifying the number of words it is to contain and the zone from which it is to be taken. The various procedures determine if the bead will be given as it is (FREE), whether it is to be zeroed first (FREZ), or whether it will contain a copy
of a bead the user specifies (FREC). FRET is used to return the bead to free storage. (h) Type Conversion The only implicit conversion in AED is in the assignment of REAL and INTEGER values. If an INTEGER value is assigned to a REAL variable or vice versa, automatic conversion takes place. AED also provides explicit conversion routines between INTEGER and REAL numbers and character strings.

4.2.5 I/O and Debugging
AED provides standard I/O routines for most users. These are GIN and GOUT for simple input and output, the ASEMBL package for writing highly formatted output, and the RWORD3 package for reading free format input. For more specialized I/O needs using sequential files, IOBCP can be used. IOBCP, the Input-Output Buffer Control Package, was designed to help the systems programmer interface with the operating system when working with sequential files. The package attempts to allow the user to take advantage of all system facilities without constant attention to details. For instance, such tasks as blocking and unblocking records and defining buffers are handled by IOBCP. In the /360 version, the user may work with fixed or varying length records, in either "locate" or "transmit" mode, in machine words or in bytes, etc. AED provides fairly powerful debugging facilities that may be invoked on entry to or exit from any subroutine. All subroutine linking passes through a single AED system routine. A debugging version of this routine permits the user to keep statistics, perform a trace or partial trace of subroutine calls, or call a user specified routine. The user routine call can be dependent upon the particular subroutine being entered or exited, or upon whether it is an entry or exit. A special system bead must be filled in by the user in order to establish these conditions. However, in an interactive environment, the user may specify on-line what action he wishes to take. User or system defined "help" procedures can also be useful for debugging. Such procedures can be called when there is an error or at strategic points in the program's execution. A generalized alarm-reporting package can also be incorporated into the user's systems for error treatment.

4.2.6 Macro Facilities
AED provides a powerful MACRO facility which allows MACROs to be redefined within a given block as well as in different blocks. The usual rules of scope apply. The user can DEFINE MACROs within MACROS,
thus having one MACRO not become active until another is encountered. The user is also provided with the ability to perform compile-time actions when defining a MACRO. The major difficulty with this facility is that it consists only of procedure calls. While this permits the usage of the whole AED language in compile-time facilities, it does so in an inconvenient manner. Other facilities include the SYNONYMS facility, which permits the user to specify that whenever the compiler recognizes a specified symbol, it should be replaced by another. Since this is done before syntactic recognition is performed, the user may change the "spelling" of any run-time operators or keywords. The following statement permits the user to write semicolons as sentence delimiters with no syntactic problems.

SYNONYMS BEGIN $, = ; END $,
This specifies that whenever a ; is encountered, it should be replaced by $, for syntactic analysis. Unlike a MACRO, a SYNONYMS may not be used recursively.

4.2.7 Trouble Spots
While many of AED's constructs and concepts were exciting innovations ten years ago, several of these have since become commonplace and have been worked into more elegant forms in other languages. Much of the AED syntax remains quite awkward. While this can be circumvented by the use of macros, the user should not have to design his own syntax. The use of non-mnemonic names, unconventional jargon, long keywords, similar procedure names, large numbers of procedures, and the use of characters instead of other symbols all detract from the pleasantness of using the language. The relational operators (e.g., EQL and NEQ) and the stack operators (e.g., .BLEBR. and .LBERB.) would be much more convenient if graphically specified. The names of the storage allocation procedures (e.g., FREZ, FREC, FRET, or FREE) are so similar as to be easily confused. FREE is particularly confusing: only in AED does it specify a request for free storage rather than returning a bead to it. The AED approach to bead definition prohibits the use of default basing parameters as in PL/I. This is especially inconvenient when there are several levels of pointer chasing, since all levels must always be specified explicitly. While such default parameters can be simulated by user-defined MACROS, the user should not be forced to deal with such details. AED contains no constructs for multiprogramming or multitasking, there is no direct access or partitioned I/O available, and the language does not permit the handling of asynchronous interrupts. Although the
debugging facilities are superior to those of the other systems languages, they are accessible only on subroutine entry or exit. The only debugging facilities available within a subroutine must be specified at compile time. Although AED produces an assembly listing, it is so inscrutable as to be of little or no use to the programmer. Other deficiencies in the language include the lack of powerful character string manipulation facilities and of a clearcut way to implement based arrays and multi-entry procedures.
4.3 BLISS

The Basic Language for the Implementation of System Software is a language for writing production software, such as operating systems and compilers. It was designed and implemented for Digital Equipment's PDP-10 by Professor William Wulf and his colleagues at Carnegie-Mellon University during 1969 and 1970. The language has PL/I-like block and procedure structure, but is lower level than PL/I. Among the more prominent features of the language is that each symbolic name in the language represents a pointer to a value rather than a value. In order to perform an assignment such as A = B in PL/I, it is necessary to write A ← .B in BLISS. The "←" is the assignment operator in BLISS and "." is an operator which refers to the contents of the location specified by B. Unlike most languages, which use the statement as the basic structural unit of a program, BLISS uses the expression. The BLISS expression is like a normal statement except that it yields a value and can therefore be used in a more elaborate expression. For example, an IF expression may appear as the right-hand side of an assignment operator. Only declarations of variables have no value. Also, declarations do not permit a type attribute to be associated with a variable.

4.3.1 Data Elements
(a) Type and Length Attributes The basic data element in BLISS is the PDP-10's 36-bit number. Although no specific type attribute is associated with any word, there are three types of data any word can hold: decimal integers, octal integers, and character strings. Strings may contain as many as five 7-bit ASCII characters. Strings of the form 'ABC' are left-justified in a word, while those of the form "ABC" are right-justified in a word. In either case, the remaining bits of a word are set to zero. Longer character strings can be associated with variables
whose storage elements are OWN or GLOBAL (see Section 4.3.1(b)). Strings of this type are placed in consecutive words, five to the word, and the last word is padded with zeros. Although no floating point facilities are available, the programmer is allowed to work in the PDP-10's assembly language, which provides instructions for handling this type of data. The special pointer notation in BLISS allows the programmer to reference an arbitrary contiguous portion of a word, as shown below. A pointer does not need to be an absolute address and may involve index register and indirect addressing specifications. In the expression
X(.Y,3)

the name X selects the location of X. (Remember symbolic names stand for addresses rather than values.) The expression in brackets selects a field within X which is three bits long and is positioned a number of bits from the right of X. This positioning number is the run-time contents of the location referenced by Y. Additional parameters in the brackets could specify one of sixteen index registers and indirect addressing. The ability to have the origin and length of a bit string specified at run time gives the language bit manipulation facilities much like DSPS [35] or L6 [27]. Identifiers can be declared to be associated with an integral number of words. In this case the identifier is considered to be a pointer to the first of these words. Identifiers associated with blocks of words are used with accessing algorithms to achieve data grouping. However, operations are defined only on words or portions of words. Therefore, data having a length greater than a single word must be accessed and processed a single word at a time. This clearly represents a problem for handling character strings. (b) Storage Allocation and Scope Attributes Storage is allocated only on the program and segment level. No facilities exist for allocation on the statement level. This can be a significant drawback. Variables whose storage allocation and scope is on the segment level may be declared to be LOCAL or REGISTER. The declaration

LOCAL V1[6];

will reserve six words of core storage referenced by V1 on entry to the block in which the declaration occurs. Similarly a variable may be associated with a register, in which case no core storage is reserved for it. The compiler must be able to evaluate the expression in square brackets at compile time, making dynamic length arrays impossible. Variables whose storage allocation is done on the program level may be declared to be OWN or GLOBAL. The difference between these classes is in scope. OWN variables are known only in the block in which they are
declared. GLOBAL variables are known throughout the procedure and may be accessed by other procedures which declare them to be EXTERNAL. (c) Data Grouping Data elements may be organized much like COMPONENTS in AED in that storage is reserved in advance and the user defines a mapping function over it. A programmer may organize data in this way with two declarations. The first of these is the structure definition, which specifies an algorithm by which the elements of a structure may be accessed. One might define the addressing function for a one-dimensional array with the declaration:
STRUCTURE VECTOR[INDEX] = .VECTOR + .INDEX;
No storage is reserved by this declaration. Instead, the declaration defines a procedure with two parameters, VECTOR and INDEX, which will be called to compute the address of the desired array element. When the procedure is invoked, the contents of VECTOR are added to the contents of INDEX to produce the address. In order to use the structure just declared, it is necessary to associate the accessing algorithm with storage which has been reserved by another declaration. For example, if the user wishes to associate the structure with the variable V1 which has been declared above, he will write the expression
MAP VECTOR: V1;

When an expression such as V1[.N] ← 5*6; occurs in the program, the accessing procedure is invoked with the parameters V1 and .N. This is equivalent to

VECTOR ← V1; INDEX ← .N;

The first statement stores the address of V1 into VECTOR; the second stores the contents of N into INDEX. Thus the mapping function will add the address of V1 and the run-time contents of N to locate the proper core address. Although the mapping function is referred to as an accessing procedure, it does not necessarily involve a procedure call. Simple accessing procedures may be handled by the compiler with inline code. It is possible to access a given block of memory in more than one way by giving it an alias with a BIND declaration and mapping a different structure on it. The declaration

BIND V2 = V1;
reserves no storage and merely makes V2 reference the same location t3hat V1 references. 4.3.2 Program Segmentation
BLISS has both BEGIN and procedure blocks. BEGIN blocks, like those of PL/I, give lexical scope to variables and allow control to pass through them. The two types of procedures available are FUNCTIONS and ROUTINES. Both types of procedures are potentially recursive and reentrant, and return a value. The main difference between them is that a ROUTINE requires much simpler preliminary linking, but may access neither nonlocal variables nor call functions. A ROUTINE can also be declared to be GLOBAL in which case, it is as if its declaration occurred in the outermost block of the procedure. Both types of procedures may be activated as coroutines. They may also be activated as asynchronous processes if the run-time environment or machine configuration will allow this type of execution. All parameters in BLISS are passed by value. However, since names of variables in BLISS refer to addresses, a parameter passed by value is equivalent to one passed by reference in most other languages. 4.3.3 Control Expressions (a) Conditional Expressions BLISS contains the IF construct with the usual six relational operators. It also has DO . . . WHILE and DO . . . UNTIL expressions for the conditional repetition of a section of code.
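As a small sketch of the conditional expression used as a value (the identifiers are invented for illustration, and GTR is the greater-than relational operator), the larger of two quantities can be stored with a single assignment:

OWN A, B, BIG;
BIG ← (IF .A GTR .B THEN .A ELSE .B);   ! the IF expression itself yields the selected value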
(b) Escape Expressions A distinctive characteristic of the flow of control in BLISS is the absence of the GOTO statement. The designers of BLISS felt that repeated use of the GOTO command and labels obscured the logic of the program and hindered code optimization.15 They felt that portions of code which were logically related should be grouped in control sections. Control may be passed out of sections by executing an exit expression which causes control to be passed to the first statement of an outer control section.
Expressions of the form

EXIT[2] .X + .Y;

may be used to leave blocks, loops, and compound and conditional expressions.

15 The occurrence of a label at a point in a program usually means that the point may be reached by more than one execution sequence. If this is the case, the compiler cannot retain values in registers. Each expression used by the programmer must be evaluated rather than "remembered" in a register.
The expression above terminates the number of innermost control sections indicated by the bracketed expression (i.e., 2) and returns a value (i.e., the contents of X added to the contents of Y). Bracketed expressions must be known at compile time. The other allowable keywords are EXITBLOCK, EXITCOMPOUND, EXITLOOP, and EXITCOND. These indicate that the innermost N of the control sections specified as part of the exit keyword (blocks, loops, compound, or conditional) will be terminated. Unfortunately, it is impossible to exit from a control scope by naming a label instead of a number of levels. BLISS has retained the RETURN expression to terminate procedures. It allows an operand to be appended which can be passed back as the value of the procedure. (c) Coroutine Expressions Any FUNCTION or ROUTINE in BLISS may be activated as a coroutine.16 There are two primitives in the language for handling coroutines, CREATE and EXCHJ (exchange jump). The CREATE command sets up a run-time stack for the procedure named and stores the parameters and return address in the stack. When more than one of these commands has been issued, it is possible to pass control between them with an EXCHJ expression.
(d) Parallel Structures Expressions may be grouped between SET and TES keywords instead of between BEGIN and END keywords. The difference is that expressions between SET and TES may be evaluated in any order the compiler decides, perhaps even in parallel, while expressions between BEGIN and END keywords are executed sequentially. The value of the SET-TES construct is the value of the last expression executed and is, therefore, quite unpredictable. There is no escape expression from the SET-TES environment. BLISS has two selector expressions which can function in a parallel processing environment. The first of these is the CASE expression which has the following syntax:
CASE V[1],V[2], ..., V[M] OF
SET
E[1],
E[2],
...
E[N]
TES;

16 A method for having two active processes at the same time. Unlike the usual task-subtask relation, these are both tasks on the same level.
First the V[i] are evaluated. Each of the V[i] selects an expression E[V[i]]. Finally all the expressions selected in this manner are executed as if they were in SET and TES brackets. This is a generalization of the CASE statement in other languages which allows only a single V and selects a single statement to be executed E[V]. The other selector construct in BLISS is the SELECT expression. Its syntax is
SELECT V[1],V[2], ..., V[N] OF
NSET
U[1]:E[1],
...
U[N]:E[N]
TESN;

First the V[i] are evaluated. Next the U[i] are evaluated. Finally, if V[i] = U[j], then the expression E[j] is selected for execution. All the expressions selected in this way are executed in any order the compiler desires. This is similar to the CASE expression except that the statements being selected are not sequentially numbered, but associated with a value. One use for this statement might be to select a proper action after recognizing an arithmetic operator.
SELECT .CHAR OF
NSET
"+":E[1],
"-":E[2],
"*":E[3],
"/":E[4]
TESN;

In the example, the contents of "CHAR" is retrieved and matched against the four "labels." The expression whose "label" matches the contents of "CHAR" is selected and executed. Although this is very similar to the CASE expression, the SELECT construct is much more flexible, as the use of "labels" implies that there need not be N expressions even if the range of the values of V[i] is from 1 to N.
4.3.4 Data Manipulation
Since the user can define pointer chasing methods, BLISS has very flexible operations. In addition, the ability to arbitrarily insert machine code allows the programmer access to the PDP-10 assembly language instructions, which include pushdown stack operations and floating point arithmetic. The character manipulation facilities of BLISS are quite low level and machine dependent. The implementation of these functions is based on a pointer. There is a command which increments the scanning pointer to reference the next character. If there is not enough space for a character remaining in the word being processed, the scanning pointer is updated to point at the next word. There is also a command to decrement the scanning pointer. In addition, each of the three primitives has one form which moves the pointer and another form which does not. The primitives are

1. SCAN picks off the next character of a designated string.
2. REPLACE substitutes the character at which the scanner is pointing by a designated expression of no more than seven bits.
3. COPY allows the user to copy a single character from one string to another.
While it is possible to implement many string operations with these primitives, the fact remains that a programmer must implement them himself.

4.3.5 I/O and Debugging Facilities
All I/O for BLISS must be done in PDP-10 assembly language. There are no debugging aids provided for the user. Indeed, it would be quite difficult to implement totally symbolic debugging in a language which does not allow type attributes to be associated with variables in their declarations. While it would be possible to dump core locations which correspond to identifiers, it would only be possible to display them in octal format.

4.3.6 Macro Facilities
BLISS has a very simple macro processor which allows string substitution and parameter passing. This facility may be used to extend the syntax in simple ways, such as adding an UNLESS clause to the conditional expression with the following definition:

MACRO UNLESS(X) = IF NOT(X) $;
It is also possible to use macros to introduce new functions to the language, such as absolute value:17

MACRO ABS(X) =
   BEGIN
   REGISTER TEMP;
   IF NEG(TEMP ← X) THEN - .TEMP ELSE .TEMP
   END $;
This macro causes storage to be reserved in a register named TEMP when the BEGIN block is entered. The parameter is assigned to TEMP in an expression which tests the sign of the variable which TEMP references. As with all expressions in BLISS, the block returns a value which is the value of the last statement executed in the block. If the sign of the contents of TEMP was originally negative, the block returns the opposite of the contents of TEMP. Otherwise it returns the contents of TEMP. There is a major restriction in that macro calls may not appear until lexically after the corresponding declaration so that recursive macro calls are not possible. 4.3.7 Trouble Spots
Although the pointer notation is interesting and the parallel processing constructs in BLISS are impressive, the language in general is too low level. Based on the available documentation, one of the serious limitations of BLISS is the lack of operations on more than a single word, especially for character strings. The absence of any debugging facilities would seem to make the language very difficult to use. Since there is no storage allocation on the statement level, it is not possible to have an arbitrary number of instances of a variable. This is a serious problem for systems programming applications which use data structures such as a linked list of indefinite length. In order to group data into PL/I-like structures, a mapping function must be defined for each subfield of data. As has been mentioned above, procedures may be used as asynchronous processes if the run-time environment or hardware configuration of a machine supports parallel processing. Although coroutines were included to facilitate parallel processing, they are hindered by this form of execution; no allowance is made for synchronization of processes which operate on the
17 Examples from [40].
same data base and for operations which should not be interruptable. This is a significant drawback for the EXCHJ operation.

4.4 PL360
PL360 was developed on the Burroughs B5500 and bootstrapped for the System/360 at Stanford by Niklaus Wirth. The language was designed to provide programmers with full use of the facilities of the /360 hardware along with a method of writing programs conveniently and comprehensibly. The introduction of program structure with the begin and procedure clauses, and of control statements with the if, for, and while clauses, helps improve program logic and avoid the tedious problem of addressing. However, the designers realize that since "PL360 allows (and requires) the programmer to denote almost every machine instruction explicitly, its use inherently bears some of the tediousness of assembly language programming" [38].

4.4.1 Data Types and Attributes
The intent of the designers of PL360 was that no storage element of the computer, including registers, should be hidden from the programmer. Therefore, the language allows two basic storage types, registers and memory cells. Both of these may be given the data attributes integer, logical, real, or long real. The types integer and logical are treated identically by the compiler. In addition, memory cells may have the data type attributes byte or short integer (actually halfword). The only data structuring facilities provided are arrays of a single dimension. Array elements must be addressed by the displacement in bytes from the origin of the array, and not by the ordinal number of the data element. To access the fourth element of an integer array named "line," the following phrase would be used, since integers have a length of four bytes:18

line(12)

Elements of arrays may be of different types and lengths. In fact, an array may be used as an element of another array. However, since the programmer must specify the displacement of the elements in bytes, this facility is nothing more than a DS or DC in assembler language. Memory cells and array elements may be initialized to values that are numerals, strings, or addresses. Elements of a single array can be initialized
18 Note that the first element is referenced by a displacement of zero.
to different values. Some examples of declarations are:19

short integer i, j, k = 5, baddr = @B5;
array 132 byte line = 132(" "), buff = 33(" ",2("*")," ");
array 15 integer x = 3(@line,"ABCD",3(5));

The first declaration identifies the short integer variables "i", "j", "k" (initialized to five), and "baddr" (initialized to the address of "B5"). The next declaration identifies "line" as an array of 132 single bytes, which are all initialized to 132 repetitions of " ", and "buff" as the same type, initialized to 33 repetitions of the pattern " ** ". The final variable declared is "x", which is an array of fifteen integers initialized to three repetitions of the pattern containing the address of "line", the string "ABCD", and three repetitions of the numeral five. It is possible to associate several data types with the same cell through synonym declarations. These declarations act as DEFINED in PL/I or EQU in assembler language, reserving no storage. The default scope for an identifier is local to the block in which it is declared. As in BLISS, the scope may be specified as global or external. Although most identifiers in PL360 have the storage attribute static, it is possible to obtain storage at run time by including the SVC for GETMAIN20 in a function statement.
function svc(10)

4.4.2 Program Segmentation
The block structure which the language exhibits is like that of PL/I. BEGIN blocks can be used to associate a group of statements into an entity which is syntactically a simple statement and to introduce new identifiers. Routines which can be activated only by a call are designated by the keyword procedure. Since the programmer rarely has a feel for how much code will be generated from a source program, and addressing is therefore hard to predict, the designers have allowed for addressing to be provided by the compiler. The programmer need only specify which register the compiler should use. (Since R0 cannot be used for addressing in assembler language, it cannot be used in PL360.) If the keyword segment precedes a procedure, the routine will be compiled separately. Despite this structure, the programmer must supply his own addressing in assembler

19 Examples from [39].
20 A macro used to request storage from the operating system.
21 R0 is used here to represent general register zero.
language to the main block of data allocated a t the end of the outer procedure. 4.4.3 Control Statements
PL360 does have some advantages over assembler language. In addition to the usual six relational operators, an if statement may test the overflow condition of the previous arithmetic operation. Furthermore, if the Boolean clause of an if statement consists of a relational operator without operands, the decision of which clause to execute is made on the basis of the condition code set by the previous instruction. The language also contains for and while statements, as well as a case statement which permits the selection and execution of a single statement in a BEGIN block (the statement itself might be a BEGIN block) corresponding to the value of the index register specified after the keyword. Its form is

case R1 of
begin
statement;
statement;
end;

4.4.4 Data Manipulation Facilities
Any data manipulation facilities of the System/360 which cannot be expressed in terms of LOAD and STORE must, unfortunately, be written in PL360 as a function statement. This uses an identifier (an assembler mnemonic) to designate a single instruction whose operands are given in a parameter field enclosed in parentheses, for example:

function MVC(15,line,buffer);

This instruction translates into a single /360 machine instruction:
MVC   LINE(15),BUFFER

which moves fifteen characters from BUFFER to LINE.

4.4.5 I/O and Debugging Aids
PL360 provides some basic I/O operations for ease in debugging. "read," "write," and "punch" are the names of several external procedures which are implicitly declared in PL360. They can be used in the same manner that PL/I record-oriented I/O can be used. The data to be transmitted is placed in a buffer of a fixed length specified by the user for the entire program and the appropriate procedure is invoked.
4.4.6 Trouble Spots
The semantics requires that all sequences of operators, including the assignment operator, be applied from left to right. This means that the normal precedence rules do not apply. Therefore,

R1 := R2 + R1;

is not equal to

R1 := R1 + R2;

but to the sequence (examples from [39])

R1 := R2; R1 := R1 + R1;
This is the case because the contents of R2 are placed in R1 before the addition occurs; when the addition finally takes place, the R1 which is added already contains the contents of R2. The language allows the programmer access to the registers in statements, but the programmer pays for this with assembler language-type headaches. For example, an odd numbered register must be specified for multiplication and division, and the side effect is that the higher register is changed by these operations although it is not explicitly mentioned in the statement. The basic drawback of PL360 is that it is too low level. It provides only those facilities found in System/360 Assembler Language and little more. For example, the string manipulation facilities are no better. In this case, the compiler does little more than translate a single source instruction to a single target instruction. Many of the features of the language could be provided with a good set of macros.

5. Extensibility and Systems Programming
In the previous sections we have proposed a considerable number of constructs and facilities that might be required by a systems programmer. Some of these facilities, such as data structure definitions, will be used in most systems. Others, however, are far more specialized and may be needed only in certain types of systems. The next problem, then, is to determine how best to accommodate all these requirements in a general purpose systems programming language. Our answer to this problem is the
Language for Systems Development, which is discussed in Section 6. LSD does not provide all facilities for all systems programmers. However, the language is equipped with an extensibility mechanism which will allow the user to define and implement his own facilities. The reasons for providing this type of systems programming language and a brief description of extensibility mechanisms are presented below.

5.1 Universal Language or Universal Processor
One solution to providing a general purpose systems programming language is to take the universal language approach: provide everything that any systems programmer might ever want. This approach has all the advantages and disadvantages mentioned earlier in the discussion of PL/I. The theoretical convenience of having one language that supplies the programmer with all the facilities he needs is offset by the actual impracticality of such a language. It is analogous to the politician who promises to be all things to all people. To many at election time, he may sound like the desperately needed panacea, but once in office, he is so bound by conflicting promises and Southern commitments that he cannot function effectively.
Another solution is to determine, by some means, which facilities are most commonly used by systems programmers and incorporate only those into the language. This is the most common approach among currently existing systems programming languages. As we have seen, the criteria for deciding which facilities should be implemented vary widely from one language designer to another. Because the scope of these languages is restricted, their compilers are able to generate relatively efficient code (which is an important criterion for a systems language). However, the facilities available to the systems programmer are also restricted; he must be satisfied with a language that may be inadequate for him, or else he must revert to using assembly language for some of his programming.
A third approach is to provide a universal language processor in order to develop special purpose languages that are individually designed for each particular application. Work in this direction can be classified into two categories: translator writing systems (TWS) and generalized macro processors. The TWS accepts the definition of a language in a suitable meta-notation that allows specification of the syntax and semantics of the language. Using these specifications, the TWS accepts a program written in the defined source language and compiles it into an appropriate target language. The meta-notations for defining the language are almost as varied as the many translator writing systems that have been implemented.
Feldman and Gries' survey [15] is an excellent review of the various techniques used in TWSs.
Generalized macro processors, which are sometimes referred to as meta-assemblers, have also been proposed as compiler writing tools. Proponents maintain that compilation is simply a special case of macroprocessing; that a compiler is given as input a statement in the source language and produces, as output, appropriate machine language. The most persuasive criticism of macro-defined languages is that the code produced is usually very poor.
There are two primary criticisms that apply to the basic concept of providing a universal language processor. Most significant is that much work is constantly duplicated. Often many different languages have a considerable number of similar characteristics which are redefined by each language designer. This would be especially true of a group of languages that are created for use by systems programmers. The other drawback is that generation of efficient code is very difficult using a special purpose language processor. By the nature of their generality, the designer must program most code generation himself and the burden is upon him to produce good code. Even when the language processor can generate the code, it could not possibly do as well as a compiler that is specifically designed and tooled for only one language. In other words, the flexibility which provides the most powerful facilities also creates a major obstacle for systems programmers.
The extensible language system attempts to eliminate the duplication and inefficiency while still providing the programmer with the power of a language processor. The extensible language concept might be considered a hybrid of the universal language and universal language processor concepts. It is often referred to as the "core" approach to the language problem. An extensible language system consists of two primary components: a base language and an extension meta-language. The base language contains all the facilities and constructs that compose the primitives for the language. The extension meta-language attempts to incorporate the powerful tools of translator writing systems and macro processors to allow definition of new constructs in terms of the primitives. It may allow for definitions of new syntactic types, new data types and operations upon them, or for altering the appearance of the language. The effect of applying extension mechanisms to the base language is to create a family of related languages.
If a systems programming language is extensible, the base language should include the most commonly used facilities. If these are not sufficient, or convenient, for a particular application, the systems programmer would use the extension mechanisms to enrich the language with only those facilities he needs. In effect he will be creating his own personal version
of the systems language. The significant advantage of this procedure over the language processors is that the vast majority of the language is already designed and implemented for him. Furthermore, the code generation of the base language can be hand tooled to any desired degree. (Since this is done only once, the expense would not be excessive.) Hence, since the majority of the programming would usually be done in the base language, the inefficiency introduced by flexible language design would probably not be significant.

5.2 Facilities of Extensible Languages
Extensibility mechanisms can be divided into three basic categories: lexical, syntactic, and semantic. Although the term "extensible" has only recently become fashionable, many of its basic concepts have existed for some time. The term lexical extensibility will apply to mechanisms that are implemented before syntactic analysis and primarily implemented by means of text substitution. Syntactic extensions are either implemented during syntactic analysis or include a syntactic analysis phase. Syntactic extensibility is very closely related to the concepts developed by translator writing systems (compiler-compilers). Semantic extensibility will refer to such facilities as adding new data types and defining operations upon them.
The earliest and most widespread extensibility mechanism is the macro facility which is available with nearly all assemblers. In their simplest form, macros can be used to add new "operations" to those provided by the assembler. These "operations" will be implemented in terms of the old operations either by generating in-line code or by setting up parameters and a call to a run-time subroutine. Once an "operation" is defined, it becomes an "old" operation and can be used to define subsequent "operations." This mechanism is powerful enough to define entire languages in terms of macros. (L6 [27] and others are languages which exemplify this technique.) For a time macro facilities were available only with assemblers. However, people soon began to appreciate the usefulness of including macro facilities in higher level languages [6, 18].
If a macro processor has the ability to test parameters to insure that they are of a specified syntactic type, then the macro processor must have a syntactic analysis phase. Macros of this type were called syntax macros by Leavenworth [28] and have become a primary vehicle for implementation of syntactic extensions. Leavenworth's macros allow macro definitions that could define new statements or in-line functions. This only allows augmentation of the two syntactic types statement and primary. A general syntactic extensibility facility requires that the programmer be able to augment any syntactic type or even define new types. The programmer must have a fairly good knowledge of the syntactic structure
of the base language. Extensions might be defined in a BNF-type notation in terms of previously defined syntactic types. The programmer must be sure that his new syntactic names do not conflict with those of the base language and, more important, that his new syntax rules obey all the requirements imposed by the particular syntax analysis technique used internally by the compiler. This facility is the basis for translator writing systems. (In fact, a TWS might be thought of as an extensible language system with a null base language.) The practical drawback to using such facilities as extension mechanisms is that the programmer must have an intimate knowledge of the compiling system and the base language implementation in order to prevent disastrous results while making extensions. For this reason, most syntactic extensions limit the semantic definition to some form of syntax macro.
Another method for syntactic extensibility might be desired to add new infix operators to a language. Although these can be implemented using syntax macros and a traditional BNF notation, it is very difficult to specify operator precedence relations in this manner. However, a modified BNF can be defined to simplify this task considerably. For example, in PROTEUS [4], Bell describes an extension to BNF, called Priority BNF, in which each production has a priority associated with it. During syntactic analysis, productions with the highest priorities are invoked first. In this way, the syntax macro approach can be used to define new operators and establish any desired precedence relations. In GPL [19], Garwick allows operator definition as the only form of syntactic extensibility. His base language is basically an operator precedence language and new operations need only be given a precedence value to be parsed correctly. The precedence of a new operator can be defined to be the same as an existing operator, or between two existing operators.
Semantic extensibility refers to the ability to introduce new meanings for old constructs and symbols. The most important device for accomplishing this is the ability to define new data types and to redefine old operators so that they may be applied to these new types. A common facility for defining new data types is to allow combinations of previously defined data types [37, 6, 18]. Each component may be accessed separately (usually by name) or the entire combination can be accessed. In the latter case, the programmer must specify the action to be taken upon encountering a variable of this new data type.

6. Language for Systems Development
The Extensible Language for Systems Development (Extensible LSD) attempts to provide the systems programmer with a programming environment
that will significantly reduce the effort required to produce a system, without impairing the effectiveness of the final result. The LSD base language, which will be discussed in detail later, is more complex than that of most existing extensible languages. However, this is justifiable in view of the programming tasks it is expected to perform, i.e., systems programming. By providing a fairly rich base language that is implemented by a sophisticated compiler, it is hoped that

(a) the majority of programming will be done in the base language, thus reducing the number of extensions the programmer will have to make;
(b) since the extensions will have a broader base to build upon, they should be defined with less effort;
(c) there will be greater similarity between the various dialects, thus reducing the learning effort required to understand or use another version of the language;
(d) the generated code will satisfy the efficiency requirements imposed by systems programming.

The LSD base language has been designed with the intention of implementing it on the IBM/360. Hence, some of the terminology, and even some facilities, are oriented to this type of machine. On the other hand, a concerted effort has been made to limit the number of machine-dependent constructs and to design all constructs in such a way that they may be interpreted in some reasonable manner on another machine.
The extensibility mechanisms provided by LSD will be essentially similar to those discussed in Section 5. Data-type extensibility and syntax macros will both be implemented. Furthermore, the programmer will be provided with compile-time facilities that will allow him to access compile-time functions and variables in order to create efficient syntactic extensions.

6.1 Overview of the Base Language

6.1.1 Basic Philosophy
The LSD base language is a procedure-oriented language, without full block structure, whose appearance and syntax are similar to PL/I. The decision to avoid the conventional block structure approach was based on its potential run-time disadvantages. This will be discussed in more detail in Section 6.3 on program segmentation. LSD can be viewed as a dialect of PL/I that is specifically designed for systems programmers. Although much of the syntax is the same (and most of the differences are merely a matter of personal taste), the basic
philosophies of the two languages are considerably different. (Some specific improvements of LSD over PL/I include more powerful pointer chasing methods, more flexible structure definitions, far more powerful character facilities, better debugging facilities, and coroutines.) The LSD approach is to let the programmer "get at" the machine, rather than to hide the machine from him. One example of this is the ability to explicitly specify the general registers of the machine in source code. Another example is the ability of the programmer to exercise considerable control over the code generated by the compiler. The language is to be used by a sophisticated and knowledgeable class of programmers, who should be aided and not restricted by the compiler.
Control over generated code is manifested in two principal ways. The most obvious way is provided by the compile-time commands and constructs that allow the programmer to aid the compiler in generating the most efficient code. These will be discussed in Section 6.7 on compile-time facilities. A more subtle, but no less powerful, facility is provided by the handling of "error" conditions at compile time. In general, any specification of a construct will be valid as long as the compiler can generate code. Thus, it is not an error to use a fixed point variable as an argument of a character string function. The compiler will generate code to perform the function as if the argument were a character string. (Note that this is far different from generating implicit conversions.) On the other hand, since even systems programmers are subject to carelessness and keypunch errors, all "unusual" conditions, such as the above example, will generate "warnings," so that if the condition were a mistake, the programmer could correct it.
The base language constructs were chosen on the basis of their usefulness to systems programmers. A conscious effort was made to adopt the best characteristics of other languages while avoiding their drawbacks (from the viewpoint of systems programming). Thus, for example, the storage allocation and data structuring facilities are patterned after those of PL/I, whereas the character string facilities are influenced more by SNOBOL. The definition of each of the constructs was designed to provide as straightforward and efficient an implementation as possible. For example, the arguments for the string operations cannot be general complex patterns as in SNOBOL, but must be simple text strings (or a special function which selects any one of several designated characters). In this way, simple but powerful facilities can be provided in an efficient manner. If the programmer requires more complex functions, he can use the extension mechanisms to define them.
6.1.2 General Format of the Language
Input to the LSD compiler is free form; each statement is terminated by a semicolon. A statement label is specified by placing a colon immediately following it. Variable names can be as long as 24 characters and may include '-' as well as alphanumerics. Constants are assumed to be decimal, but can also be specified as character ('ABCDE'), hexadecimal (X'09A45'), or binary (B'00111000'). Comments are delimited by placing them between double quotes (") and can be placed anywhere. The machine's registers are explicitly accessible to the programmer. He need only specify the name of the register as if it were any other variable. (In the IBM/360 implementation, the general registers are named R0, R1, ..., R15, and the floating point registers are F0, F2, F4, and F6.)
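As a brief illustrative sketch (the names and values are hypothetical, but each construct follows the format rules just described):

SCAN-NEXT: CHAR-COUNT = 0;      "a label followed by a colon; '-' is legal in a name"
   LINE-MASK = B'00111000';     "a binary constant"
   ID-FIELD = 'ABCDE';          "a character constant"
   R1 = CHAR-COUNT;             "a general register used like any other variable"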
6.2 Variables and the Declaration Statement

The declare statement (indicated by the keyword DCL) is used to define variables and their attributes before their use. The following attributes can be specified for any variable: storage type, data type, scope, and length. Furthermore, additional declaration fields can specify initial value, alignment, basing parameters, and DEFINED parameters. (These will be described in more detail below.) Undeclared variables are flagged with a warning but are given default attributes which may be set by the programmer.

6.2.1 Variable Attributes
The basic data types provided by LSD are FIXED, FLOAT, CHARACTER, BIT, POINTER, and AREA. FIXED refers to an integer variable which may have a length of BYTE, HWORD (halfword), WORD, or DWORD (double word). FLOAT refers to a floating point (real) number which may be of WORD or DWORD length. The length attribute will also align the variable on an appropriate boundary. CHARACTER refers to any number of consecutive character positions (or bytes). A CHARACTER variable may be of fixed or VARYING length. A VARYING character string may have a maximum size specification in order to conserve run-time storage allocation, but if no maximum is specified, the variable will be reallocated at run time as needed. BIT refers to any number of consecutive bits which may be defined to start at any position within another length class. (For example, a BIT string of 24 bits may be defined to be located beginning at the ninth bit of a WORD.) A POINTER specifies that the variable is to be used as an address and is included primarily to facilitate symbolic debugging (since a variable of any type can
be used as a basing parameter) and to aid program clarity. AREA represents a variable that is to be treated as a logical data space. That is, space allocation can be performed within the AREA, and addressing is performed with respect to the beginning of the AREA.
LSD variables may have one of three scope attributes: LOCAL, GLOBAL, or EXTERNAL. A LOCAL variable is accessible only within the procedure in which it is allocated. A GLOBAL variable may be known to all procedures in the system. EXTERNAL indicates that the variable has been defined as a GLOBAL in some other procedure which is external to the current procedure. The distinction between GLOBAL and EXTERNAL allows the programmer to associate a variable that is known to the entire system with a particular procedure or, in the case of an AUTOMATIC variable, with a particular activation of a procedure. Thus, a GLOBAL AUTOMATIC variable is known throughout the system but will be reallocated upon entry to the procedure in which it is defined. Any reference to the variable will refer to the most recent allocation.
LSD variables can have one of seven different storage classes: AUTOMATIC, STATIC, ENTRY, CONSTANT, BASED, STACKED, or FILE. As in PL/I, STATIC variables are allocated at compile time and AUTOMATIC variables are allocated at run time on entry to a procedure. (AUTOMATIC allocation is not exactly the same as in PL/I. For more information, see Section 6.3 on program segmentation.) The storage type ENTRY specifies that the variable is a procedure name and the associated attributes indicate the characteristics of the value returned by a procedure invoked as a function. The storage type CONSTANT allows the user to name a compile-time constant symbolically. (A CONSTANT must have an initialization value specified.) LSD BASED variables are a generalization of PL/I based variables. In PL/I a variable may be based on only a single variable that must be of type POINTER. (This is not true of Multics PL/I [10], which allows an offset to be specified in addition to a pointer; however, their notation and implementation are specifically designed as an adjunct of their hardware paging.) In LSD a variable may be based on any number of variables which can be of any type. The STACKED storage class is more like a real pushdown stack than PL/I's CONTROLLED. (BASED and STACKED variables will be discussed in more detail later.) The FILE storage class defines a variable that represents data on secondary storage (see Section 6.6 on I/O facilities).
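As a rough sketch of how several of these attributes might be combined (the variable names are hypothetical; only the attribute keywords described above are used, and the declaration format itself is described in the next subsection):

DCL IN-LINE    GLOBAL   STATIC     CHAR(132),
    FLAG-BITS  LOCAL    AUTOMATIC  BIT(8),
    FREE-PTR   EXTERNAL            POINTER,
    REC-COUNT  LOCAL    STATIC     FIXED HWORD;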
6.2.2 The Declaration Statement

A single declaration statement can be used to declare any number of variables. The declaration for each variable contains three basic parts:
the name field, the attribute vector, and the modifying parameters. The attribute vector defines the scope, the storage type, the data type, and the length of the variable whose name precedes it. The elements of the attribute vector can be placed in any order and any or all of them may be omitted. If an attribute for a variable is missing, the corresponding element of the preceding variable (in the same DCL statement) is used. For example, in the statement

DCL A LOCAL AUTOMATIC FIXED WORD,
    B STATIC FLOAT;

since the length attribute for B is missing, the length attribute for A (WORD) will be used as the length of B. Thus, B will be a local, static, floating point variable whose length is WORD. The assumed values for the first variable in each DCL are LOCAL, AUTOMATIC, FIXED, and WORD. (These may be changed by the programmer.) Thus, the previous example could have been simplified to

DCL A, B STATIC FLOAT;

In order to make this scheme of variable default options useful and clear, the declaration statement has to be formatted so that the attributes for each variable can be quickly and easily discerned. Thus, the compiler will re-format the user's DCL statement in the listing so that each attribute will appear in its own column. Furthermore, the initial assumed attributes will be printed at the beginning of each DCL. Hence, the following input to the compiler:

DCL A, B HWORD, C, D FLOAT DWORD;

would appear in the listing as:

      LOCAL  AUTOMATIC  FIXED
DCL A                          WORD,
    B                          HWORD,
    C,
    D                   FLOAT  DWORD;
6.2.3 Defined Variables
The storage attribute DEFINED specifies that the location of the variable is defined at some other point within the scope of the procedure. This can be useful in three different situations.
(1) A variable may be defined by the programmer in assembler language code (see the CODE and ENDCODE instructions), but used in an LSD statement. Hence, the variable and its attributes must be made known to the compiler.
(2) A variable may be defined to be located at the same physical position as another variable (or as part of another variable, such as the sign bit).
(3) A variable may have to be located at an absolute core address. For example, in OS location 16 contains the communications vector table pointer, from which all the system control blocks can be referenced.

In the last two cases the programmer must additionally specify the defining variable or constant. This is one of the modifying parameters that is placed to the right of the name in the declaration statement. The DEFINED parameter is indicated by the at symbol, @, which precedes the defining term. For example, consider the following statement:

DCL A       AUTOMATIC FIXED  DWORD,
    B@A     DEFINED          HWORD,
    C@A.2                    HWORD,
    CVT@16                   WORD;
A is a fixed double word; B is a fixed halfword located at the same address as A; C is a fixed halfword located two halfwords beyond the start of A; and CVT is located at the machine address 16. In the declaration of C, the number to the right of the period specifies an offset that is a quantifier for the length associated with the variable being defined and not the variable on which it is defined. Thus, if C had been declared as having a length of BYTE, the expression

C@A.2

would mean that C is to be located two bytes beyond the start of A. Unless a '.' appears in the expression to the right of the '@', the expression is assumed to refer to a machine address. Thus, an alternative method of writing the definition of C is

DCL C@A+4 DEFINED FIXED HWORD;

A+4 refers to the fourth addressing unit beyond A. (In the IBM/360 this is, of course, 4 bytes beyond A.) The first notation is more machine independent than the second, since it will refer to the same location regardless of the addressing unit of the machine. However, the second notation may be more meaningful to a programmer.

6.2.4 Initialization and Alignment
Other modifying parameters can be specified to initialize variables and to align them. Initialization is specified by either INIT= or simply =
followed by the initializing value. STATIC variables are initialized at compile time and AUTOMATIC and BASED variables are initialized upon every allocation of the variable. Alignment parameters are placed after the keyword BDRY=. Parameters can be BYTE, HWORD, WORD, and DWORD. Some examples of initializing and aligning specifications are

DCL A = 0          FIXED WORD  BDRY = DWORD,
    B       STATIC HWORD       BDRY = WORD,
    C       INIT = 5,
    D = 12  CONSTANT;

A is aligned on a double word boundary and is initialized to zero, B is aligned on a word boundary, C is initialized to 5, and D is a compile-time constant whose value is 12. (Note that modifying parameters are defined only for a single variable; the specifications are not inherited by subsequent variables.) Since all types of modifying parameters have a unique keyword or symbol, they can appear in any order to the right of the name field in the DCL statement. However, it is expected that the BASED or DEFINED parameters will usually appear first, since this will create a format which is easier to read.

6.2.5 Arrays and Structures
There are facilities in LSD for defining structures and arrays. Arrays consist of a collection of data elements with the same attributes and are defined simply by placing the maximum bounds of indices in parentheses following the name in the declaration statement. An example of the declaration of an array is

DCL MATRIX(10) STATIC FIXED WORD;

which defines a one-dimensional array named MATRIX whose index may range from one to ten. A variable bound may be specified as the index and is implemented by allocation of storage upon entry to the procedure in which the array is declared.
The ability to group data of unlike types into logical entities is an essential requirement for a powerful systems language. LSD provides this facility by means of structure definitions similar to those of PL/I. However, unlike PL/I, the levels of data definition represent a true hierarchy, regardless of the respective attributes. In other words, a variable defined on level n + 1 is located within the variable defined at level n above it. Thus the programmer can use the structure facilities to subdivide his data fields as well as to associate a group of data fields. For example, consider
the following statement:

DCL 1 PAGE-HEAD                *,
      2 PAGE-NAME              CHAR(8),
      2 PAGE-SIZE              FIXED WORD,
      2 PAGE-CONTROL-INFO      *,
        3 PAGE-FLAGS           FIXED BYTE,
          4 IN-CORE-FLAG       BIT(1),
          4 FULL-FLAG          BIT(1),
          4 ANOTHER-FLAG       BIT(1),
        3 IN-CORE-INDEX        FIXED BYTE,
        3 PROTECTION-KEY       FIXED HWORD;
The programmer can access and manipulate any of the fields. A field such as PAGE-FLAGS which is not at the lowest level will simply be treated by the compiler as a variable of whatever type and length it has been declared. Note that the * in a length field specifies that the compiler is to calculate the correct length. If a structure similar to this had been declared in PL/I, the byte allocated for PAGE-FLAGS and the bits for the flags would not be overlapped. In order to specify in PL/I that PAGE-FLAGS is to group the elements below it, a * is used in place of data attributes. However, an assignment to PAGE-FLAGS would be interpreted by the PL/I compiler as a separate assignment to each of the subelements. Thus, the assignment of a simple variable to PAGE-FLAGS will set each of the subelements to that value, and assignment of another structure or substructure will make assignments between subelements that have the same name. In LSD the following statement:

PAGE-HEAD.PAGE-CONTROL-INFO.PAGE-FLAGS = B'10100000';

will set IN-CORE-FLAG and ANOTHER-FLAG to one, and will set FULL-FLAG to zero. Furthermore, the code to do this is a single assignment to PAGE-FLAGS rather than three separate assignments to each of the subfields. If a subfield name is unique in the program, the programmer can eliminate the preceding qualifiers since the specification is unambiguous. Thus, in the above example, we could have written

PAGE-FLAGS = B'10100000';
6.3 Program Segmentation and Storage Allocation
As mentioned previously, LSD does not have PL/I-like block structure. There is no facility corresponding to PL/I BEGIN...END blocks in
which local variables are allocated on block entrance and de-allocated at block exit. Automatic variables declared in LSD BEGIN blocks will be allocated on the procedure level as described below. However, the compiler will reuse space that is allocated but no longer active whenever possible.

6.3.1 External and Internal Procedures
The principal instrument for program segmentation in LSD is the procedure, which may be either internal or external. An external procedure is the basic independent unit within a system. In other words, each external procedure in a system can be compiled separately from all others with no loss of information. With the exception of the names of other external procedures, the only variables that are known within an external procedure must be defined within that procedure or declared to be EXTERNAL. An internal procedure is defined within another procedure. The internal procedure can only be called from within the outer procedure and can access any variables that are known to the outer procedure. Since the outer procedure can itself be an internal procedure, arbitrary levels of procedure nesting can be defined. For example, the following might represent the input for compilation:

PROCEDURE A;
   PROCEDURE A1;
   END A1;
END A;

PROCEDURE B;
   PROCEDURE B1;
      PROCEDURE B11;
      END B11;
      PROCEDURE B12;
      END B12;
   END B1;
END B;

Procedures A and B are external procedures. B1 is an internal procedure of B and all variables known (i.e., accessible) in B are also known in B1. B11 and B12 are internal procedures of B1; therefore all variables known
in B1 are known in B11 and B12. However, variables defined in B11 are not known in B12 and vice versa. The term external procedure will be applied only to those procedures that comprise the outermost level of a system.
If a procedure is to be reentrant or recursive, it must be specifically denoted as REENTRANT by the programmer. (As far as code generation is concerned, there is little difference between reentrant and recursive routines; both require that data areas be allocated upon every activation of the routine. However, a reentrant routine in an OS system further requires that there be no modification of the permanent data or program. If a reentrant routine does such modification, the LSD compiler will print a warning message, but will compile the requested code.) Thus, the overhead required for a reentrant environment will not be generated unless it is actually requested. The basic requirements for a reentrant procedure are that the program part should not be self-modifying and the data areas should be allocated at run time for each activation (call) of the procedure. The allocation of the data area is done upon entry to the procedure, and the area is freed on exit from the procedure.

6.3.2 Allocation of Automatic Storage
In LSD all STATIC variables are allocated at compile time and AUTOMATIC variables are allocated at run time. The basic program segment for run-time storage allocation is the reentrant procedure. In other words, run-time storage is allocated only upon entry to a reentrant procedure. Hence, a non-reentrant internal procedure will have its AUTOMATIC storage allocated whenever allocation is done for its outer procedure. For example, consider the following program segment:

PROCEDURE(REENTRANT) A;
   PROCEDURE B;
      PROCEDURE(REENTRANT) C;
      END C;
      PROCEDURE D;
      END D;
   END B;
END A;
Storage allocation for the AUTOMATIC data declared in A, B, and D will be performed upon entry to A, but allocation of data for C will not occur until entry to C. (Note that declaring the internal procedure C to be reentrant means that C will be reentrant or recursive with respect to B.) All AUTOMATIC variables will be freed on exit from the procedure in which they were allocated. The purpose of this scheme is to minimize run-time allocation by performing it only when it is absolutely necessary (i.e., whenever the programmer's algorithm requires recursion or reentrancy).
This approach differs considerably from that taken by PL/I, where AUTOMATIC storage is allocated upon entry to every block as well as every procedure. Thus, PL/I provides more overlapping of data areas since space is reserved for variables only when the variables are actually known. Furthermore, the programmer is provided with a convenient means for defining the logical scope of temporary variables. However, every level of block or procedure nesting requires its own base register, or some other form of addressing pointer, to address the data defined in that level. Furthermore, every level of nesting requires a certain amount of execution time to perform allocation (and freeing) of data areas. Thus, a systems programmer using PL/I would probably organize his procedures and blocks in such a way as to minimize this overhead. He would declare variables of small internal procedures in an outer procedure, would not declare variables inside his blocks, etc. However, he has now lost much of the logical organization that PL/I's block structure is supposed to provide. (For example, although the programmers in the MULTICS project were provided with block structure in EPL, they did not use it [11].) The LSD approach allows for the logical correlation between procedures and their variables without forcing a run-time correlation, unless this is required by the program. However, in the situation where a significant sized work area is needed for a non-reentrant procedure, the programmer could define the work area as BASED or the procedure as REENTRANT in order to conserve space.

6.3.3 Based Variables
As was indicated above, a BASED variable may be based on more than one variable. A modifying parameter in the DCL statement specifies which variables are to compose the base. A period following the variable name indicates that the basing parameters are to follow. One example of the usefulness of multiple bases occurs in most data structure manipulation, where a typical operation is to access a data element that is located in a
certain block on some page. In this case, the address of the data element is determined by three values: the address of the page, the relative displacement of the block from the top of the page, and the offset of the element from the top of the block. This situation can be described in LSD by making the following declaration:

DCL 1 STRUCT. (PAGE,BLOCK) BASED  *,
      2 FWD                       HWORD FIXED,
      2 BACK                      HWORD FIXED,
      2 DATA                      WORD FIXED;
This declaration specifies that the structure, STRUCT, will be based on the sum of the values of PAGE and BLOCK. Thus, the components of STRUCT will also be based on PAGE and BLOCK, as well as their displacement from the beginning of the structure. A reference to the variable DATA will sum the contents of PAGE and BLOCK and the constant four in order to access the value. (Note that, if this construct is used on the /360 and if the values of PAGE and BLOCK happen to be located in registers, only a single assembler language instruction may be needed in order to access the value of any of the elements of the structure.)
The basing parameters specified in the DCL statement are the assumed parameters used for accessing that variable. However, at any time in the source program, the programmer can override some or all of the assumed parameters. Thus, if the variable BACK were declared as above, the following statement:

BACK = BACK. (,BLOCK2);

would use the assumed parameters, PAGE and BLOCK, to locate the left-hand variable, but would use the parameters PAGE and BLOCK2 to locate the right-hand side.
The multibasing concept can offer significant object code improvement in many common situations. For example, consider a doubly linked list located in an AREA (data page). Suppose each element of the list is represented in LSD by the structure declared above and that PAGE will contain the address of the AREA. Assume BLOCK contains the offset within the AREA of the first of three elements. In order to delete the second element of the list, the following two LSD statements are required:

FWD = FWD. (PAGE,FWD);
BACK. (PAGE,FWD) = BLOCK;
The following six assembler instructions will be generated for these statements:
*     FWD = FWD.(PAGE,FWD);
      LH   R5,BLOCK
      L    R6,PAGE
      LH   R4,FWD(R6,R5)
      LH   R3,FWD(R6,R4)
      STH  R3,FWD(R6,R5)
*     BACK.(PAGE,FWD) = BLOCK;
      STH  R5,BACK(R6,R3)

FWD   EQU  0
BACK  EQU  2
In order to perform this same operation in PL/I, the following statements are required:
DCL 1 STRUCT BASED(P),
      2 FWD  OFFSET(A),
      2 BACK OFFSET(A),
      2 DATA FIXED BIN(31,0);
DCL A AREA(2000) BASED(PAGE);
DCL BLOCK OFFSET(A);
DCL (P1,P2) POINTER;

P = BLOCK;
P1 = FWD;
FWD = P1 -> FWD;
P2 = P1 -> FWD;
P2 -> BACK = BLOCK;

When compiled by the IBM F compiler, these would generate the following assembler instructions:
L    8,BLOCK
A    8,PAGE
ST   8,P
L    8,P
L    7,0(0,8)
A    7,PAGE
ST   7,P1
L    8,P
MVC  0(4,8),0(7)
L    8,P1
L    7,0(0,8)
A    7,PAGE
ST   7,P2
L    8,BLOCK
L    7,P2
ST   8,4(0,7)

The LSD compiler produces six assembler language instructions, whereas PL/I produces 17, in addition to introducing two temporary pointers and using full words for offsets. Even if the PL/I compiler had a register history and allocation scheme comparable to that of LSD, enabling it to eliminate extraneous loading of P, P1, and P2, twelve instructions would be required to perform this operation. This indicates that in this case a significant portion of the inefficiency of IBM's PL/I (F) compiler is due to source-level restrictions (e.g., allowing only a single level of pointer qualification and not allowing variables to be based on offsets).

6.3.4 Stacked Variables
The storage class STACKED denotes that the variable is to be located in a pushdown stack and is to be referenced via pointers as BASED variables are. The basing pointers may be explicitly specified and referenced by the programmer. Declaration of a STACKED variable might look like

DCL A. (P) STACKED(10) FIXED HWORD;

This statement declares the variable A to be a halfword and indicates that it is to be allocated in blocks of ten halfwords. The specification of the basing pointer P is optional. If it is not specified, the compiler will generate an implicit basing pointer. The variable A is not allocated until the execution of

ALLOCATE A;

which will first allocate a stack and then set P to point at the first element of the stack. Subsequent allocations of A will simply update P until the stack is full. If an allocation is performed when the stack is full, another stack will be allocated and linked to the old one, and P will be set to point at the first element of the new stack.
Freeing A will move the pointer back up the stack. If the FREE command is executed when there is only one element left in the stack, the stack
will be freed and P will be set to the address of the last element in the previous stack. If no previous stack exists, P will be set to a null value. A special stack function, PREVPTR, can be used to access previous allocations of a STACKED variable without changing the structure of the stack. PREVPTR takes an allocation of the variable and returns an address that points to the previous allocation. For example, the statement

P = PREVPTR(A. (Q));
will set P to the address of the allocation of A that precedes the allocation of A that is pointed at by Q.
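To pull these pieces together, a short usage sketch is given below; it is only an illustration assembled from the constructs just described (the values and comments are hypothetical, and the bare FREE form is assumed to default to the declared basing pointer just as ALLOCATE does):

DCL A. (P) STACKED(10) FIXED HWORD;
DCL Q LOCAL AUTOMATIC POINTER;

ALLOCATE A;             "allocate a stack of ten halfwords; P addresses the first element"
A = 1;                  "the current allocation of A"
ALLOCATE A;             "push: P is advanced to the next element"
A = 2;
Q = PREVPTR(A. (P));    "Q now addresses the earlier allocation holding 1"
FREE A;                 "pop: P moves back up the stack"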
6.3.5 The Allocate Statement

BASED and STACKED variables are allocated by executing an ALLOCATE statement of the following form:

ALLOCATE PAGE. (POINTER);

This statement will cause PAGE to be allocated in main storage and will set POINTER to the address of the allocated space. If no basing parameter is specified, the compiler will use the basing parameters specified in the declaration of the variable. If a variable is based on more than one basing parameter, the allocation is assumed to be requested from an AREA. The last non-constant basing parameter will be set to the offset within the AREA of the allocated space minus the sum of all constant parameters. The rest of the non-constant basing parameters specify the "frame of reference" of the allocation. That is, they point at the AREA in which the allocation is to be made. For example,

ALLOCATE A. (P,Q,R);

will allocate space for A from the AREA pointed to by P + Q and will set R to the offset of this space from the top of the AREA. (This would be especially useful for allocation in an AREA that is defined within another AREA. P would be an absolute pointer to the outer AREA and Q would be the offset of the inner AREA.) Furthermore, the assumed "frame of reference" can be overridden by using the IN parameter on the ALLOCATE statement. Thus, the statement

ALLOCATE A. (P,Q,R) IN B. (S,T);

specifies that the allocation of A is to be made in the AREA pointed to by S + T. In order to address A subsequently, the allocation will also set P to the value of S and Q to the value of T, as well as setting R to the displacement in the AREA of the allocated space. In this example there is one more basing parameter in the allocated variable than in the AREA
variable. This is the most useful and meaningful situation in that each basing parameter specifies another level of AREA nesting. (However, other combinations may also be useful. Hence, the following algorithm will be used for implementing such allocations: Let n + 1 be the number of non-constant basing parameters for the variable being allocated. Let m be the number of basing parameters for the AREA in which the allocation is being made. If n < m, the first n - 1 pointers of the allocated variable will be set to the first n - 1 pointers of the AREA variable and the nth pointer will be set to the sum of the m - n + 1 remaining AREA pointers. If n > m, the first m pointers of the variable will be set to the first m pointers of the AREA and the last n - m pointers of the variable will be set to zero.)
Multiple basing parameters for a STACKED variable designate that the variable (and therefore its stacks) are to be allocated in an AREA, or nested AREAs. The decision process for determining the AREA for allocation is the same as for BASED variables. Allocations of based variables can later be freed by the following statement:

FREE PAGE. (POINTER);
If the variable to be freed is based on more than one parameter, the FREE operation will assume that the last parameter specifies the offset into an AREA of the space which is to be freed. Thus,

FREE A. (P,Q,R);

will assume that P + Q points at the AREA in which A was allocated and that R contains the offset of A in the AREA. The FREE operation will return the space for A to the AREA's free area queue. The IN clause can also be used with the FREE command in order to explicitly specify the AREA in which the variable to be freed is located, as illustrated below.
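For instance (a sketch only, formed by analogy with the IN parameter on the ALLOCATE statement shown above, and reusing the names of the earlier examples):

FREE A. (P,Q,R) IN B. (S,T);

would release the space occupied by A from the AREA designated by B rather than from the AREA implied by A's own basing parameters.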
6.4 Procedures and Their Invocation

Procedures can be conventional procedures, function procedures, coroutines, or ON procedures. A conventional procedure must be explicitly invoked (e.g., by the execution of a CALL statement), and upon completion of the procedure, control is returned to the statement following the CALL. A function procedure is called by using the name of the procedure (along with its parameters) in the place of a variable in an expression. Execution of the procedure produces a result which is then used to continue evaluation of the expression. Function procedures have the same format as conventional procedures except that the RETURN statement has a parameter which specifies the value to be returned.
ON procedures are similar to the ON-units of PL/I in that they are called implicitly (and asynchronously) whenever the associated condition
occurs. In addition, the SIGNAL statement can be used to provide a simulated, but synchronous, condition invocation from within the program. Furthermore, there are facilities within an interactive environment to simulate asynchronous interrupts from outside the program (i.e., from a terminal). This is especially useful in debugging programs that manage the handling of asynchronous interrupts. No explicit parameters may be passed to ON procedures since there is no way of knowing when or where the procedure will be invoked. However, many types of ON conditions will have implicit parameters associated with the procedure, so that the programmer can make effective use of the information associated with the condition. ON conditions will be implemented for both hardware interrupts and software error conditions (e.g., invalid subscript or substring specification), with special attention paid to effective handling of attentions generated by real-time on-line devices such as graphics terminals. At the conclusion of processing of an ON procedure, control may be returned to the "point of call" (i.e., the statement that was being executed when the interruption occurred).

6.4.1 Dynamic Calls
If a program is to be loaded dynamically when it is needed, the programmer can specify one of three commands: LINK, LOAD, or EXCHANGE. The command LINK followed by a routine name and parameter list will search the program job libraries for the indicated routine (which, for the /360, must be the name of a load module), will load the routine into core obtained from free storage, and will begin execution of the routine. An optional parameter on the statement allows the programmer to explicitly specify the file from which to obtain the module. This would be written as follows:

LINK P(A,B) FILE(NAME);

Furthermore, in order to specify references at run time, the programmer can use a character variable instead of a routine name. In this case the LINK will use the value of the variable as the name of the desired routine. The LOAD command has the same format as LINK and indicates that the specified routine (load module) is to be loaded into core, but control is not transferred until the execution of a CALL statement. EXCHANGE also has the same format as LINK but makes use of the /360 facility XCTL (transfer exclusive control). The routine being called will not return control to the routine issuing the EXCHANGE. If the calling routine was also loaded dynamically, the core it occupies may be returned to free storage.
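As a sketch of the run-time form of LINK (the module and variable names here are hypothetical; only the features described above are used):

DCL NEXT-PHASE STATIC CHAR(8) = 'PASS2';   "name of the module to be loaded, chosen at run time"
LINK NEXT-PHASE(A,B) FILE(NAME);           "the value of NEXT-PHASE, not the identifier itself, names the routine"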
6.4.2 Coroutines
In conventional program structure a procedure begins execution (via a call from a higher level program), executes until completion (possibly calling lower level routines), and then returns. A called routine must complete execution before the calling routine may continue. There is a clear hierarchy of procedures in which any given routine can be considered dynamically nested within its calling procedure.
The relationship between coroutines has no such hierarchy. A coroutine may continue execution of (resume) another coroutine (even the calling routine) before execution is completed. The allocated storage for the coroutine is not returned to free storage, but is retained for use when execution is continued. When the coroutine is resumed, execution will continue following the last instruction to be executed. Consider the following sequence: Coroutine A calls coroutine B, B resumes A, A calls coroutine C, and C resumes B. At this point the dynamic sequence of execution is A, B, A, C, B, where the two A's and the two B's refer to the same activations of the routines A and B, respectively. There is no hierarchy among the three routines. None of the routines needs to complete execution before any of the others can continue.
A coroutine is specified in LSD by placing the option COROUTINE in parentheses following the keyword PROCEDURE. A coroutine may also be specified as REENTRANT. There are three LSD commands for invoking coroutine procedures: START, RESUME, and RESET. The START command is identical to a CALL of a conventional procedure. The specified coroutine is entered at the beginning, AUTOMATIC storage is allocated, initialization is done, and execution proceeds. (A START may also be accomplished by a CALL, LINK, or EXCHANGE.) The RESUME command causes the specified coroutine to continue execution following the last instruction to be executed. The argument for RESUME can take one of three forms. The standard statement is

RESUME A;

where A is a coroutine name. This statement will invoke the most recent activation of routine A, which will use the activation record (data area) associated with the last START of A and continue execution. Any parameters used in A will be those passed to the routine when it was STARTed. However, the RESUME command may also have the parameters explicitly specified, as in

RESUME A(P,Q,R);
In this case, the routine A will replace the old parameter environment with the new one before execution continues. With the above statements the user can only resume the most recent activation of a given coroutine. However, if the user would like to resume a previous activation of a routine, he can explicitly specify the activation record pointer for that activation as the argument of the RESUME. (The activation record pointer of any routine is located in register thirteen whenever the routine is executing. The programmer could save this value in a global variable or pass it as a parameter to the routine containing a RESUME of this form.) A parameter list may also be specified in this type of RESUME. The third form of RESUME does not take any arguments. It specifies that the coroutine or procedure that last invoked the current routine (via a CALL, START, or RESUME) is to continue execution. This command is the same as the conventional RETURN except that the data area for the routine is not returned to free storage.
The RESET command causes the last activation of the specified coroutine to be terminated. That is, its activation record is returned to free storage and the previous activation (if one exists) is established as current. This command may also take a variable as its argument. The variable must point at the activation record of a coroutine.
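The calling sequence described earlier (A calls B, B resumes A, A calls C, C resumes B) might be sketched as follows; this is only an illustration built from the constructs described above, not a complete program:

PROCEDURE(COROUTINE) A;
   CALL B;        "start a new activation of B"
   CALL C;        "reached when B resumes A"
END A;

PROCEDURE(COROUTINE) B;
   RESUME A;      "give control back to A without terminating this activation"
                  "execution continues here when C resumes B"
END B;

PROCEDURE(COROUTINE) C;
   RESUME B;      "continue B following its RESUME A"
END C;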
6.4.3 Multiprogramming

LSD provides the same multiprogramming facilities as are available in OS/360. (These facilities are therefore also similar to those provided by PL/I, except that PL/I has no resource sharing commands.) They are divided into three basic categories: activation of processes, synchronization of processes, and allocation of resources. Any process (task) may activate another process (which becomes a subtask) by means of the ATTACH command. The ATTACH command specifies the name of a load module (with parameters) that is to be loaded into core and executed. Optional parameters allow specification of task and event variables, an alternate entry point, an explicit file specification, and a priority decrement. Consider the following example:
ATTACH P(A,B) EVENT(E) TASK(T) FILE(LSD) ENTRY(LABEL) PRIORITY(I);

This command specifies that the load module P should be loaded from the file specified by LSD and execution should begin at the entry point LABEL with parameters A and B. The subtask created by this command will have an execution priority equal to the current priority (the priority of the
attaching task) plus I or the current priority, whichever is lower. (A subtask may not have a priority greater than that of its attaching task.) The EVENT parameter defines an event variable E (any full word) that will be used to indicate the completion of the subtask. The first byte of the variable will be set to zero before the subtask is activated, and the variable will be posted by the system when the subtask is terminated. (A variable is posted if the highest bit is zero and the next bit is one.) The rest of the event variable will be set to the completion code of the subtask. If an EVENT parameter is specified, then a TASK parameter must also be specified. The task variable T will be set to the address of the task control block (TCB) for the subtask. Any subtasks that are attached with an explicit event parameter must be detached (removed from the system) before the attaching task terminates. This is done by using the task variable as the argument for the DETACH command as follows:

DETACH T;
If no event variable is specified, the subtask will be automatically detached upon its termination.
Process synchronization is accomplished by means of the WAIT and POST commands. The argument of the WAIT can be either a single full word variable or a list of full word variables. For example, WAIT A; specifies that processing is to continue only when the variable A is posted. If more than one variable is specified, all of them must be posted before processing continues. However, the user may optionally specify that processing may continue as soon as a number of the specified variables is posted. For example, WAIT 2 OF A, B, C, D; specifies that processing will continue as soon as any two of the variables are posted.
If a variable in a WAIT statement has been used as an EVENT parameter in an I/O command or in an ATTACH statement, it will be posted upon completion of the specified event. Otherwise, the variable must be explicitly posted. This can be done using the POST command, which has the following two formats:
POST A WITH N;
POST A;
The first example specifies that the variable A is to be posted with a
completion code of N which is stored in the lower order thirty bits of A. The completion code for the second example is zero.
The resource sharing commands, ENQ and DEQ, are available to take advantage of the OS facilities for sharing of serially reusable resources. ENQ requests that the specified resources be allocated to the current task. If all the resources are not immediately available, the task will enter a wait state until they are. Additional parameters allow the ENQ to be conditional on the immediate availability of all the requested resources, or simply to test their availability. DEQ deallocates the specified resources from the current task, making them available for other tasks.
6.4.4 Parameters
The flexibility of the LSD language allows the user to specify his own parameter passing conventions. For example, the user can use registers to pass parameters, or he can set up his own parameter area. However, conventional use of parameters (i.e., specifying them in a calling statement) will result in a parameter list made up of addresses of the parameters (pass by reference). An expression passed as a parameter will be computed and stored in a temporary location, and the address of this location will be passed to the routine.
Parameters may be omitted in the calling statement and then tested in the called routine. The test is performed by the predefined function PARM which returns zero or one depending on whether the specified parameter was omitted or not. For example:
CALL P(A,,C);
PROCEDURE P(P1,P2,P3);
   IF PARM(P2) THEN TEMP = P2;
   ELSE TEMP = 0;
   RETURN;
END P;
If this section of code were executed, PARM(P2) would return a value of 0, causing the ELSE clause to be executed. Furthermore, the user can specify explicit access to the storage location that contains the address of the parameter and test or modify that location in any way he desires. This is
done simply by declaring the parameter to be based as follows:
PROCEDURE P(P1,P2,P3);
DCL Q DEFINED FIXED WORD;
DCL P1.(Q) BASED FIXED WORD;
Since P1 is a parameter, the use of the variable Q will simply name a location that is known to the compiler (this location is pointed at by register one on entrance to the program).
6.4.5 Procedure Linking Requirements
It is not always necessary to create a full save area for the register values of the calling program and save these registers when entering a subroutine. Optional parameters for the PROCEDURE statement can specify which of the general registers are to be saved, whether the floating point registers are to be saved, and whether a save area is to be generated. Thus the lowest level subroutine could save space by not generating a save area, and a small internal subroutine may not need to save many registers if the programmer knows that the subroutine will use only a few registers. If these parameters are omitted by the programmer, a save area will be generated and all registers, except floating point registers, will be saved.
The RETURN statement also has an optional parameter that allows the compiler to restore only a specified set of registers. (If the RETURN statement does not include this parameter, the compiler will generate code to restore those registers saved on entry to the procedure.)
6.4.6 The RETURN Statement
The conventional RETURN statement is executed at the end of processing of a procedure to return control to the calling program. This statement will free all storage that was allocated on entry to the procedure and will restore the general registers to their values when the procedure was called. If the procedure is a function procedure, the value being returned, or a pointer to values being returned, will be loaded into an appropriate register. Furthermore, another parameter of the RETURN statement specifies that register fifteen should be loaded with a particular value (e.g., a return code). A second type of RETURN statement has the form: RETURN TO LABEL; where LABEL may be a program label or an expression whose value is a program address. The statement specifies that control is to be returned to
a particular location in some routine that dynamically encloses the procedure containing the statement. An important aspect of this form of return is that all storage allocated since the last activation of the procedure containing the label will be freed before processing is resumed at the specified location. For example, consider the following program segment:
PROCEDURE A;
LABEL: PROCEDURE (REENTRANT) B;
   CALL C;
   PROCEDURE (REENTRANT) C;
      RETURN TO LABEL;
   END C;
END B;
END A;
Execution of the RETURN statement will free the storage allocated in C and B, restore the registers to their contents at the call to B, and will begin execution at LABEL. Furthermore, as in the standard RETURN, a register may be set to a particular value (a return code) which can be tested at LABEL. This facility can be especially useful in handling error conditions, since the detection of an error in a low-level subroutine does not require a return path which passes bad return codes back up many levels.
It should be noted here that although the LSD GOTO statement can take an expression as an argument, the LSD compiler assumes that the expression will represent an address within the procedure being executed. In other words, there will be no attempt to establish the "correct" environment before the branch is taken, as is done in PL/I. However, the RETURN TO statement provides this facility while preventing extra overhead for the GOTO when it is usually not needed.
6.5 Statements and Operators
6.5.1 Loop Control
There are three forms of looping commands in LSD: FOR, WHILE, and DO. The FOR statement is of the form:
FOR A = B TO C BY D statement;
and specifies that the given statement will be executed for A = B, B + D, B + 2*D, . . ., B + n*D, where n is the largest number such that B + n*D
is less than or equal to C. B, C, and D may be constants, variables, or even expressions and may be either positive or negative. If the BY clause is not present, the increment is assumed to be one.
The WHILE statement is of the form:
WHILE A < B statement;
and specifies that the statement is to be executed repeatedly as long as the value of A remains less than that of B. Any of the relational operators <=, >, >=, =, ¬<, ¬>, or ¬= can be used in place of the < in the example.
To provide logical segmentation within a procedure, DO . . . END may surround any collection of statements so that they may be treated syntactically as a single statement. Thus DO . . . END is used to define a compound statement. However, it is also a looping statement of the following form:
DO n TIMES . . . END;
which specifies that the statements represented by the ellipses will be executed n times. Since the assumed number of iterations is one, omitting the TIMES parameter will serve simply to group statements. Another form of the DO statement can specify that the enclosed statements are to be executed continuously. This command has the form:
DO FOREVER . . . END;
There are two situations that may arise during the execution of any DO group. The programmer may wish to discontinue the execution of the current iteration of the group, or he may wish to exit from the group entirely. These situations are conventionally coded using GOTOs. However, a GOTO in this case may obscure the true meaning of the action. Hence, LSD provides two commands, ENDOF and OUTOF, to perform these functions.
To terminate the current iteration of any DO group the ENDOF command can be used. This command specifies that the next END statement (i.e., the one which terminates the immediately enclosing DO group) is to be executed. (Execution of the END statement implies that the test controlling the loop, if one exists, will be performed. The result of the test will determine whether the loop is executed again.) The ENDOF command can take a parameter which is the label of any enclosing DO statement. In this case the END statement of that DO group will be executed.
To exit from a DO group, the OUTOF statement can be used. This command specifies that the statement following the next END statement
is to be executed. OUTOF may also take a parameter that specifies the name of any enclosing DO group. With these commands the programmer can terminate an iteration or exit from a DO group without using a GOTO. The resulting program may be more meaningful, easier to follow, and easier for the compiler to optimize.
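For example, the following sketch (hypothetical; NEXTITEM is an invented user function and SCAN an invented label) skips zero values and leaves the loop entirely when a negative value is seen, all without a GOTO:

SCAN: DO FOREVER;
   X = NEXTITEM(LIST);
   IF X = 0 THEN ENDOF SCAN;
   IF X < 0 THEN OUTOF SCAN;
   TOTAL = TOTAL + X;
END;

ENDOF SCAN transfers to the END of the labeled group, so the next iteration begins, while OUTOF SCAN transfers to the statement following that END.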
6.5.2 Selection Statements
There are two selection statements in LSD, the IF statement and the CASE statement. Depending on the Boolean value of an expression, the IF statement will cause execution of either a THEN clause or an ELSE clause, if it is present. The CASE statement uses the value of an arithmetic expression as an index to choose one statement to execute. For example, the statement
CASE N OF
   statement 1;
   statement 2;
   statement 3;
ELSE
   statement 4;
ENDCASE;
specifies that if N has the value one, two, or three, statements one, two, or three will be executed, respectively, and if N has any other value, statement four will be executed. As in the IF statement, the ELSE clause is optional.
6.5.3 Arithmetic Operations
LSD provides the conventional arithmetic operators +, -, *, and / as well as //, which is the remainder (or modulus) operator. LSD also provides the following logical operators: negation (¬), or (|), and (&), exclusive or (X), binary left shift (<<), binary right shift (>>), and comparison (:). The comparison operator returns one, two, or three depending on whether its left operand is less than, equal to, or greater than its right operand.
There are also two special system functions that allow the programmer considerable flexibility in specifying the value of an expression. In the normal evaluation procedure a variable name in an expression indicates that the contents of the location defined by the name is to be used in the evaluation. The VALUE function specifies that the value of the argument is to be used as an address to locate another value. This provides an extra level of indirectness. The ADDR function specifies that the address of the argument is to be used as the value of the function.
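A minimal sketch of the two functions used together (the variable names are illustrative):

P = ADDR(A);
B = VALUE(P);

The first statement stores the address of A in P; the second uses the contents of P as an address, so B receives the value of A through one explicit level of indirection.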
6.5.4 Character Manipulation
LSD has a variety of basic character manipulation facilities. Whenever feasible, these functions are performed by code generated in-line. The name of any character string can be modified to specify a particular substring. Thus if A is a character string, A(3,6) refers to the substring of A beginning at the third character and having a length of six. (If the length of A is less than eight, this substring will be undefined.) The first character of any string is located at character position one. Either of the parameters may be a run-time expression, allowing for dynamic specification of substrings. The second parameter may be omitted, implying that the rest of the string is to compose the substring.
The basic character string functions provided by LSD are FIND, DELETE, INSERT, and ||. FIND, DELETE, and INSERT can be either infix operators or commands. The concatenation operator, ||, is an infix operator. Furthermore there are two special functions, ANY and NOTANY, which can be used as arguments of the FIND or DELETE operations, and will be explained in more detail later. The FIND command is of the form:
FIND A IN B;
where A and B are string variables. If the string A occurs in the string B, then R1 will be set to the index into B of the beginning of the first occurrence of the string A. If the scan is not successful, R1 is set to zero. The infix format is simply
A IN B and must be used as an expression. It returns the index of the beginning of the substring as its value. The infix form may also be used as a Boolean expression (zero False, nonzero True), as in the following:
IF A IN B THEN . . . ;
The DELETE command has the form:
DELETE A FROM B;
where A and B are once again treated as character strings. Normally, the DELETE operation includes a FIND to determine if A is, in fact, a substring of B. However, by specifying the first operand to be a substring of the second, FIND will not be invoked. For example,
DELETE B(4,3) FROM B;
will simply delete the three-character substring beginning at the fourth character from B. Note that this operation will "squeeze" B together, rather than replacing a substring with blanks. Thus if B is a fixed length
string, blanks will be padded at the end of B to replace the deleted characters. If the FIND operation is required, R1 will be set to the index of the successful deletion or to zero. The infix form of DELETE returns a new string whose contents is the original string with the deletion made; neither of the argument strings is changed. For example, if A = 'CD' and B = 'ABCDE' then the statement
C = A FROM B;
will set C to 'ABE' if C has a fixed length of three, or is of varying length. If C has fixed length greater than three, the operation will pad the string with blanks and if less than three, will truncate it. The INSERT command has the form:
INSERT A INTO B(I);
and will insert the character string A into B beginning after the Ith character. Note that this insert function will not overlay the characters of B, but will "spread" B in order to fit the new characters. Of course, if B is of fixed length, an insertion of three characters will cause the three trailing characters to be lost. Thus if A = 'AB' and B = 'CDEF' are fixed length strings, then
INSERT A INTO B(2);
will cause B to become 'CDAB'. The infix form of INSERT is similar to that of DELETE since it creates a new string and does not affect the original strings. Thus, if A = 'AB' and B = 'CDEF',
C = A INTO B(2);
will set C to 'CDABEF' if C is a fixed length string of six characters.
LSD has special ANY and NOTANY functions for use with the FIND command. They take as parameters a function string, which is either a literal character string or a character variable of length 256 that must represent a TRT table. If a literal string is specified, it will be called the function string. The function string is not treated as a single string but as a collection of single characters. When ANY is the first operand of a FIND operation, the FIND will be successful if the string being searched (the argument string) contains any of the characters of the function string. The index returned in R1 will refer to the first character in the argument string that matches any of the characters in the function string. Thus, for example, the following statement
FIND ANY(', ') IN STRING;
will return the index of the first comma or blank in STRING. Furthermore,
R2 will be set to the index of the character that was matched in the function string. Thus, if STRING = 'TYPE A,B' in the above example, R1 would be set to five and R2 would be set to two. The NOTANY function with a literal function string is similar to ANY, except the FIND will return the index of the first character of the argument string that is not contained in the function string.
If a character string variable is used as the parameter for either ANY or NOTANY, then the compiler will assume that the variable is already in the form of a TRT table. In order to build a TRT table dynamically, LSD provides two other functions, TRT and NOTTRT, each of which accepts a single parameter and produces a fixed character string of length 256. The TRT function will produce a table whose entries are all zero, except for those bytes that correspond to characters in the parameter. In other words, for each character in the parameter, the index of that character in the parameter string will be stored as the function byte in the TRT table. NOTTRT will produce a table whose entries are all X'FF', except that the bytes that correspond to characters in the parameter will have zero function bytes. (For the sake of program clarity, it is expected that the ANY function will be used with tables built by TRT and NOTANY for tables built by NOTTRT, although this is not necessary.)
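The following sketch (hypothetical; it assumes DELIMS has been declared as a character variable of length 256 and that STRING is the string to be scanned) builds a delimiter table once and then reuses it:

DELIMS = TRT(', ;');
FIND ANY(DELIMS) IN STRING;

Since DELIMS is a character variable rather than a literal, the compiler treats it as a ready-made TRT table; R1 receives the index of the first comma, blank, or semicolon in STRING.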
6.5.5 Conversions
Another characteristic of the LSD data manipulation facilities is the minimization of implicit (i.e., unspecified by the user) commands. For example, implicit conversions are sometimes very convenient, but very often they are unintended, unwanted, or unneeded. In LSD no implicit conversions will be performed by the compiler. Hence, assigning a floating point number to a fixed point variable will generate a compile-time warning rather than a run-time subroutine call. Conversion routines will be supplied to the user, but he must specifically request their invocation. Thus if A is a fixed point variable and B is a floating point variable, B could be assigned to A by the following statement:
A = FIX(B);
Other routines will convert a fixed number to a floating point number (FLOAT), a decimal number in character format to a fixed number and vice versa (FIX and CHAR), and a hexadecimal number in character format to a fixed point number and vice versa (XFIX and XCHAR).
6.5.6 Machine Dependent Facilities
In every machine there are specialized low-level commands that are normally used only by operating system programmers. These commands
usually are associated with the lowest level control of I/O devices. An operating systems programmer should not be forced to use assembler language simply because of the machine dependency of these commands. LSD provides facilities for accessing such commands of the IBM /360 machine. The facilities are provided by seven statements along with six predefined functions. Because of the specialized nature of the commands and of the caliber of the programmers who will be using them, it was decided to retain the /360 assembler mnemonics. The statements provided by LSD are LPSW (load program status word), SPM (set program mask), SSM (set system mask), SSK (set storage key), WRD (write direct), RDD (read direct), and SVC (supervisor call). The pre-defined functions (which will generate in-line code) are ISK (insert storage key), TS (test and set), SIO (start I/O), TIO (test I/O), HIO (halt I/O), and TCH (test channel).
ISK takes one parameter which is the address of a 2048-byte block of core and returns a byte containing its storage key. TS tests the first bit of the byte pointed at by its argument, sets the byte to all ones, and returns the value (0 or 1) of the test. SIO, TIO, and HIO take arguments that specify the channel (and subchannel) address and device address of the indicated I/O event. The TCH function takes a single argument indicating the address of the channel to be tested. These four functions will return a value from one to four indicating the disposition of the specified operation. For example, for SIO the returned value indicates that
(1) The I/O operation was successfully initiated.
(2) The channel status word (CSW) has been stored and may be tested to determine why the operation was not successful.
(3) The channel or subchannel addressed by the operation is busy.
(4) The channel, subchannel, or device is not operational.
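Since the returned value is always between one and four, it combines naturally with the CASE statement described earlier. The fragment below is a hypothetical sketch: the argument form shown for SIO (a channel and a device address) and the four handler routines are invented for illustration.

CASE SIO(CHAN, DEV) OF
   GOTO STARTED;
   CALL CHECKCSW;
   CALL CHANNELBUSY;
   CALL NOTOPERATIONAL;
ENDCASE;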
If all else fails, the programmer can drop directly into assembler language to specify exactly what code he wants generated. This is done by enclosing the desired assembler language inside the keywords CODE and ENDCODE. The compiler will generate this code almost as it is written. However, the user has the option of not specifying the exact registers he wishes to use and letting the compiler choose them. Thus, he might write
CODE;
L  $1,VAR1
L  $2,VAR2
ST R7,0($1,$2)
ENDCODE;
A $n indicates that the compiler is to furnish any free register wherever
that symbol occurs. The compiler might transform the above code to
L  R3,VAR1
L  R6,VAR2
ST R7,0(R3,R6)
Furthermore, the compiler will scan the code for the occurrence of variables that are defined within the compiler. Thus, the compiler can substitute the internal name for any variable whose name is not valid in assembler language. For instance, if the variable VERY-LONG-NAME is to be used in the assembler code, the compiler will substitute its internal name (perhaps VERY0001) at every occurrence. Of course, compiler-defined variable names will not be changed if the internal representation is the same as the compiler representation.
6.6 I/O and Debugging Facilities
6.6.1 I/O Facilities
The LSD base language has only very basic I/O facilities since these are perhaps the most machine- and system-dependent functions. However, since most systems programmers work in an environment with very specialized I/O requirements, the LSD I/O operations are designed specifically to facilitate user definition of unsupported I/O methods.
Specification of data on secondary storage is done by means of FILE variables. A FILE variable is actually a pointer to a file control block (FCB) that contains all the system information required for that file.³² Since the programmer can use the FILE variable just like any other variable (e.g., as a basing parameter), he has the ability to access the FCB for any file.
³² The name of the FILE variable must correspond to the name of a DD card (DDNAME) included in the job step in which the file is used.
A file may be explicitly declared or it may be declared contextually. An explicit declaration is specified in a DCL statement where FILE is the storage type and BSAM, QSAM, BDAM, or any other character string can be the data type. (The default conventions of DCL statements do not apply to FILE variables.) The three names BSAM, QSAM, and BDAM represent the types of I/O that are supported in the language. Any other character string following the keyword 'AM=' will specify a user-defined I/O method. (This will be explained in more detail later.) All file declarations define an EXTERNAL variable so that all references to a given file name identify the same file.
The three I/O methods that are supported by LSD are all record-
oriented access methods. QSAM (queued sequential) and BSAM (basic sequential) refer to files whose records must be processed sequentially. The system provides buffering for a QSAM file, but not for a BSAM file.³³ A direct access file (BDAM) may be processed randomly, but the user must explicitly specify the record (block number) he wishes to access. BDAM file processing is unbuffered.
³³ Buffering means that the system will perform the physical I/O operation on the next record before (input) or after (output) the user actually requests it. The data is stored in a system buffer until the user executes a read command (input) or until the I/O device can perform the output. There may be several system buffers so that the I/O operations can usually stay ahead of the user's processing, thereby reducing wait time.
There are two pairs of I/O commands: GET/PUT and READ/WRITE. For an explicitly declared file the GET and READ commands and the PUT and WRITE commands are interchangeable. For undeclared files a GET or PUT command (with no block number specification) will imply a contextual declaration of the designated file as QSAM, whereas READ or WRITE will imply BSAM. A block number specification (in parentheses following the file name) will imply a BDAM file for either type of command. For example, if the variables FILE1, FILE2, and FILE3 have not been declared, the following set of statements will specify the files as QSAM, BSAM, and BDAM, respectively:
PUT A INTO FILE1;
READ A FROM FILE2;
GET A FROM FILE3(2);
If no file names are specified for either type of command, the system input and output files will be assumed.
Any I/O command may have an optionally specified EVENT variable which will be used to signal the completion of the requested I/O operation (event). This is written as follows:
READ A FROM FILE2 EVENT(E);
In order to ensure that the I/O event is completed before using the variable A, the user must use the WAIT command with E as a parameter. (However, the system will also perform a WAIT if it is asked to do another I/O operation on a file that has not completed a previous operation. This will prevent an I/O error, but will not ensure the integrity of the data in the area used for the operation.) If no event variable is specified, the system will not return control to the user until the I/O event is completed.
Before any file can be accessed, the operating system must logically connect the specified data set to the program. This is called opening the
file and involves allocating buffers, positioning the volume, etc. A file may be opened implicitly or explicitly. If an I/O operation is performed on an unopened file, the system will automatically perform the open. However, the user may explicitly open a file by using the OPEN command, which specifies the file, or files, to be opened and the type of processing to be performed on them. For example,
OPEN FILE1(INPUT);
specifies that FILE1 is to be opened for input only. In order to use the file for output, it must be closed (using the CLOSE command) and opened again. Other processing options are OUTPUT, UPDATE, OUTIN, INOUT, and RDBACK. (Not all of these are applicable to all types of files.) A second option allows the user to specify volume positioning for sequential files when processing is completed, or when volume switching occurs.
In normal processing in the OS environment, a filename refers to a DD card on which the user has specified the data set he wishes to process and its location. This situation makes it easy to change data sets from one run to another, but it is not convenient for changing data sets on-line. Since the latter situation is a common necessity in systems programming, such facilities are provided in the LSD OPEN statement.³⁴ For example,
OPEN FILE2(UPDATE, DSN=NAME, VOLUME=VOL);
will open FILE2 for updating, but will first modify the system control block for the file to specify the data set whose name is contained in the character string NAME and that is located on the volume whose name is in VOL.
³⁴ This is done in OS by executing a RDJFCB macro instruction that allows the programmer to access (and thus change) the system control block that is associated with any DD card.
As was mentioned earlier, a user can specify his own type of access method by using the name of the method after 'AM=' in the declaration of a FILE variable. This name must be the name of a user-written routine to which control will be passed whenever an I/O operation is requested for a file of that type. An optional parameter on the DCL statement allows the user to specify the characteristics for the file explicitly. This parameter, which follows the keyword 'DCB=', may contain any specification that is valid in an OS Assembler Language DCB macro instruction except for the DDNAME parameter. For example,
DCL USER FILE AM=BPAM DCB=(DSORG=PO,RECFM=FB);
will declare the file USER to use the access method BPAM and to have the characteristics specified in the DCB parameter. If the DCB option is used for a user-defined access method, the LSD system routines will perform the OPEN operation whenever it is needed. When the file is used as the argument for an I/O operation, the routine named BPAM will be given control with a pointer to the FCB for the file as well as any other parameters specified by the command. The FCB contains all the pertinent information for the file including the type of operation just specified (READ, WRITE, PUT, GET, or OPEN) and space for user-supplied information. If no DCB parameter is specified in the file declaration, the user's routine will also be given control when the file needs to be opened (either by an explicit or implicit OPEN).
Since many users require some kind of paging system, but are not particularly constrained to rigid requirements, a software paging mechanism patterned after that of DSPS [35] is available as an optional part of the language system.³⁵
³⁵ Note the distinction here between the base language and the language system. The paging and debugging facilities are actually a set of pre-defined extensions to the base language, which are optionally available to the user. Thus, they are not part of the base language, but they are part of the language system.
6.6.2 Debugging Facilities
Since debugging is a prime occupation of all programmers, the language system provides sophisticated debugging facilities. These debugging facilities fall into two primary categories: those designed for batch processing and those designed for interactive environments. In both environments completely symbolic dumps can be produced. This means that all characters are dumped in their character format, FIXED variables as decimal numbers, PTR variables as the name of the variable to which they are pointing, and so forth. There is also an option which permits the user to specify that some or all variables are to be dumped in their machine format (hexadecimal).
Symbolic debugging is accomplished by retaining the symbol tables for each module included in a system. The symbol tables are kept on disk and thus represent no run-time overhead. When the system abnormally terminates, the debug monitor saves the core image at the time of the crash, and symbolically dumps the programs and data specified by the programmer. Furthermore, the core image will not be immediately deleted, so that the programmer may later request further dumping. Hence, he could specify that all his modules are to be dumped upon error or that only crucial or
newly changed programs are to be dumped. The first case would occur most frequently in the early stages of debugging, and the second case would be more frequent in later stages. However, since the core image is not deleted, the programmer can later dump more modules if the ones he originally specified prove to be insufficient.
Another facility that is available in both batch and interactive mode is the ability to specify checkpoints at an arbitrary statement of any program, without recompiling the program. A checkpoint indicates that before the statement is executed a core image of some portion of the system is to be saved on disk, so that the debug monitor may later dump it symbolically, or restart the system at that point, or perform a number of other tasks. (Checkpoints can only be specified for a program that has been compiled in debug mode.³⁶)
³⁶ The debug compiling mode generates no run-time overhead. Its primary functions are to save the symbol table and to maintain and save a table delimiting the bounds of each statement. In this way a preprocessor can locate the designated statement in the load module and insert a supervisor call (saving the contents of the location) that invokes the debug system. Thus, different actions can be specified for different statements at each execution of the program, without recompiling.
In an interactive mode the user has several other commands that cannot be performed in batch mode. For instance, he can specify a pausepoint (similar to the specification of a checkpoint) which will stop execution of the program and give control to the debug monitor. The debug monitor will then request a command from the user. The user can specify selective dumping (symbolic), can alter the value of any variable, can request single stepping through the statements of the program with a pause after every statement, can request a backup to the previous checkpoint, a trace of subroutine calls, etc. An interactive user can also simulate synchronous or asynchronous interrupts from his console. Thus, he has the power to enter data from his console, or to enter a simulated interrupt. In other words, he can make his console "look like" a card reader, CRT, or any other physical device.
6.7 Compile-Time Facilities
A significant aspect of the LSD base language is the amount of control the programmer can exert over the code generation of the compiler. This can be done in a variety of ways including register allocation, specification of linking requirements on entry to a subroutine, and control over generation of run-time error checking. Other compile-time facilities will be available to actually specify the code to be generated. These facilities constitute the extensibility mechanisms. A simplified mechanism, ON ACCESS, is implemented in order to allow minimal facilities for simple
extensions. ON ACCESS will be discussed later in this section, and the principal extensibility facilities will be discussed in the next section.
6.7.1 Register Allocation
The principal code optimization in the LSD compiler is done by means of register optimization. The contents of each register are remembered and re-used if possible. However, register allocation by the compiler can be improved by the use of USING and DROP statements. The USING statement tells the compiler that for the next sequential section of the program (until a corresponding DROP statement is encountered) a certain variable or expression should be retained in a register. The programmer may specify a particular register, or he may let the compiler choose any available one. For example, the following statement specifies that VAR1 is to be retained in R8, but that VAR2 is to be retained in whatever register the compiler would like to use (indicated by the keyword REG):
USING R8 AS VAR1, REG AS VAR2;
Furthermore, if the user wants to use a register within his code he may specify
USING R10;
which tells the compiler not to use R10 as a working register. By issuing a DROP command, the programmer can tell the compiler to discontinue reserved use of the register. Thus,
DROP R8, R10, REG AS VAR2;
indicates that VAR1 is no longer to be retained in R8, that R10 is now available for use by the compiler, and that VAR2 is no longer to be retained in the register that had been allocated to it.
Another facility provided by the compiler is the ability to specify if and when code should be generated to perform run-time checking for such conditions as exceeding a subscript or substring range. The checking can be generated for an entire procedure, for some segment of a procedure, or only for certain variables. This is specified by the use of the CHECKON and CHECKOFF statements. Each of these may take as parameters a list of variables which are to be checked (or no longer checked). If no parameters are specified, all applicable variables will be checked.
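For instance (a sketch; the variable names are illustrative), checking could be confined to two suspect variables in the region being debugged:

CHECKON TABLE, BUF;
. . .
CHECKOFF TABLE, BUF;

Between the two statements the compiler generates run-time range checks for every reference to TABLE and BUF; elsewhere no checking code is produced for them.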
6.7.2 The ON ACCESS Compile-Time Facility
A useful compile-time facility is the ability to specify to the compiler that certain action should be taken whenever a given variable is accessed
by the compiler. This is done by the ON ACCESS command which has the following format:
ON ACCESS VAR statement;
Whenever the variable VAR is encountered by the compiler, the code specified in the statement will be generated. Like other compile-time commands the ON ACCESS has sequential scope. That is, all occurrences of the variable following the statement will be affected. To discontinue the ON ACCESS, the programmer can write
ON ACCESS VAR;
The parameter VAR could be replaced by a list of variable names, or the keyword ALL followed by a type name (such as BASED). In these cases each variable that is specified in the list, or has the type specified, will have the ON ACCESS statement generated every time the compiler accesses the variable. For example, the statement
ON ACCESS ALL FIXED DO . . . END;
will be invoked upon access to all FIXED variables. Within the definition of the statement in the ON ACCESS command, the programmer must use the compiler symbol $VAR to reference the variable being accessed. Thus, the programmer might write
ON ACCESS A,A1,A2 DO
   IF $VAR = 0 THEN CALL FIXUP($VAR);
   $VAR = $VAR + 1;
END;
Then, if the ON ACCESS command is in effect, the occurrence of the statement,
B = A;
would generate the equivalent of
IF A = 0 THEN CALL FIXUP(A);
A = A + 1;
B = A;
Note that code generation will not always correspond to such a straightforward interpretation since the ON ACCESS statement will usually be generated in the middle of the generation of another statement. For example, consider the following statement:
B = C + D.(E+A) + 16;
The ON ACCESS code will be generated in the middle of the code to sum E and A, which is generated in the middle of the code to evaluate the main expression. To avoid infinite looping, the ON ACCESS statement will not be invoked for code which it has generated, although it may invoke other previously defined ACCESS commands.
After generating the ON ACCESS statement, the compiler will normally generate code to load the value of the accessed variable so that it can continue processing the original statement. Since the ON ACCESS statement may have generated code that altered the value of the variable, this facility can be useful in checking for and correcting special error conditions (and perhaps printing a run-time error message). However, it is also useful to let the ON ACCESS statement specify that a different value is to be loaded for the accessed variable. This can be done by assigning the desired value to the system variable $REP within the ON ACCESS statement. For example, suppose the variable PTR is used to store the disk address of a file; the programmer might write
ON ACCESS PTR $REP = PAGE($VAR);
where PAGE is the name of a paging routine that reads a file into core (if it is not already there) and returns the core address. Thus, the statement
C = D.(PTR);
will use the core address of the file referred to by PTR as the basing parameter and not the value of PTR itself.
6.8 LSD Extensibility
The scope of this paper precludes a detailed presentation of the facilities for extending the LSD language. Therefore, this section presents only a brief overview of the LSD extension facilities. The intention is to provide some insight into the nature of these facilities rather than to give specific examples.
6.8.1 Basic Orientation of the Extension Mechanism
The goal of the extensibility system proposed for the LSD language is to provide the user with an open-ended, or extensible, compiler. The novelty of this approach lies not in the syntactic specification of new constructs, but in the types of facilities available for semantic specification. The extension mechanisms will allow the user to access, utilize, and even augment the facilities of the base language compiler in order to define the semantics of new constructs. Furthermore, the programmer will be able to write extensions with a minimal knowledge of the compiler itself.
6.8.2 The Compiler and Extensibility
The base language compiler consists of two major phases. The first phase generates a tree structure which represents an analysis of the input statements, and the second phase generates object code from this tree. Each node of the tree contains syntactic and semantic information that identifies a construct of the language. In order to simplify the syntactic definition of the base language, the LSD compiler parses an input string into entities which fall into only six basic syntactic types: operator, identifier, primary, expression, statement, and list. The semantic information associated with each piece of the parsed string further specifies the exact construct recognized.
The system will provide commands to invoke extension mechanisms at any of three different times: before the parse, after the parse, and at code generation time. For example, a new infix operator and its precedence must be introduced into the compiler before the first, or "analysis," phase because expressions containing that operator must be properly parsed into the tree. However, the extension mechanisms defining the semantics associated with this new operator will be applied after the analysis phase so that these mechanisms can have access to the semantic information associated with its operands (e.g., data types).
6.8.3 Pre-Parse Extensions
Extensions applied before the first phase will serve merely to insure the proper parsing of the input. These include specification of infix operators, changes and additions to the nature of the symbol table or any of its fields, and the definition of new kinds of delimiters, parentheses, or keywords.
6.8.4 Post-Parse Extensions
Extensions applied after the analysis phase will allow the user to modify the generated tree without requiring knowledge of its exact format. He specifies syntactic and semantic information that will define the construct he wishes to process as well as the process to be performed. His specification will be parsed into tree format and compared with the tree representing the program. For each node of the tree that satisfies the proper syntactic and semantic characteristics, the compiler will execute the specified process, passing it the current node of the tree as a parameter. The process to be executed is specified in LSD itself with additional facilities for symbolically accessing the symbol table, for testing or using any portion of the recognized construct, for testing the context in which the construct occurred,
and for accessing compiler functions, such as temporary storage allocation. Furthermore, the computation may specify that all or part of the current node is to be replaced. This replacement, written in LSD code, is also parsed into tree format before being added to the tree. The replacing LSD code can make use of subnodes of the "current" node by way of a formal parameter notation. The syntax for a typical extension might be as follows:
A(id) '(' B(expr) ')' IN ((stmt:DCL))
This statement specifies both a construct to be recognized and a context (specified by the "IN" clause) in which the construct must appear. The construct consists of an identifier followed by a parenthesized expression. (The apostrophes around the parentheses indicate that they are literals that must appear in the construct.) Upon recognition of this construct, the element recognized as (id) and the subtree recognized as (expr) may be referenced in the associated semantic process by A and B, respectively. The context clause specifies that the associated semantic process will be invoked only if this construct occurs within a declaration statement.
6.8.5 Code Generation Extensions
The extensions applied at the time code is generated will allow the user to define a code generator associated with a unique type of node in the tree (i.e., a construct). When a node of the proper type is encountered in the generation phase, the appropriate code generator will be given control. The computations allowed in a code generator are similar to those described above. However, there will also be facilities for accessing compiler functions, such as register allocation and history, and other code generators that are active only at code generation time. Furthermore, it will not be permissible to replace portions of the tree at this time, so the code generator must specify constructs for which assembler language can be directly generated. (These "terminal constructs" must either be part of the base language or defined by means of user-written code generators.) The advantage of the code generation extensions over the tree modification extensions is that with a little more effort, the system designer can make the code generated for the extensions considerably more efficient.
6.9 Implementation
The first implementation of the LSD base language was written in PL/I using APAREL [2] for syntax recognition. The use of PL/I (IBM's F-compiler) proved to be very unsatisfactory because of the poor code it generated. The resultant LSD compiler was much too large. However, the code
generated by the compiler is very satisfactory. In a benchmark study conducted at Brown University for the Army [36], LSD typically outperformed PL/I (F compiler version 5 with OPT=2) by almost two to one in storage requirements and better than three to one in execution time. Currently a second version is being bootstrapped from the first. The significance of these results is somewhat clouded, however, because the investigators who coded the LSD version of the benchmark program were allowed to use only those constructs in LSD that have parallels in PL/I. Preliminary studies have suggested that had the investigators been able to make use of various constructs peculiar to LSD, significantly better code would have been generated by the LSD compiler.
ACKNOWLEDGMENT
The research and writing of this paper and the design of LSD were partially supported by a National Science Foundation grant (No. GJ-181) and by Brown University. We would like to acknowledge the help and invaluable suggestions given us by Richard Wexelblat of Bell Laboratories, Robert Balzer of the RAND Corporation, John Brackett and Douglas Ross of SofTech, Inc., and Robert Rosin of the State University of New York at Buffalo. Special thanks should be given to Paul Knueven of Digital Equipment Corporation, who helped initiate the entire effort and has been a source of help and encouragement through many iterations, and to John Guttag of Brown University, whose help and guidance has had a profound effect both upon this paper and upon the design and implementation of the current version of the LSD compiler. We would also like to thank Gregory Lloyd, Ken Magel, Elaine Rich, and Joan Zahorjan, all of whom played vital roles in the development and implementation of the current LSD compiler.
ANNOTATED BIBLIOGRAPHY
It is naturally impossible to do justice to a language by describing it in one paragraph based on a single paper or manual. This list therefore constitutes only a bibliography and not a definitive review.
1. MOL940
Hay, R. E., and Rulifson, J. F., MOL940: Preliminary specification for an ALGOL-like machine-oriented language for the SDS 940. SRI Project 5890, Stanford, California, 1968.
MOL940 is an ALGOL-like language with provisions for bit manipulation. In addition
to the usual logical operations between variables, the language provides for "common bits" or "no common bits" tests between variables. No floating point or character string manipulation facilities have been provided. The only data structure allowed is the array. Other features may be added to a program by insertion of SDS940 assembler instructions. The language has been used to write a CRT-display service system.
2. MOL-360
Book, E., and Schorre, D. V., A user’s manual for MOL-360. System Development Corp., Santa Monica, California, 1967.
MOL-360 is an ALGOL-like language implemented on the /360. Like MOL940, it provides only arrays for structuring data and has no floating point or character manipulation facilities. Access to the registers of the /360 and insertion of assembler code are allowed in the language. One of its principal uses has been to implement a time-sharing system.
3. EULER
Wirth, N., and Weber, H., EULER: A generalization of ALGOL and its formal definition. Commun. ACM 9(1), 13-23 (1966), and 9(2), 89-99 (1966).
EULER is a generalization of ALGOL with facilities for handling data structures called lists. These groupings are not rectangular like arrays and may change when operated upon by the LISP-like operators: list, which creates a list; tail, which is analogous to the LISP cdr operation; "&", which is used to concatenate lists; and length, which is used to determine the length of a list. Elements of a list may be arrays, or even lists, thus enabling the programmer to create a complex data structure. No character or bit manipulation facilities are provided by the language and it is not possible to insert assembler language within the source code.
4. XPL
McKeeman, W. M., Horning, J. J., Nelson, E. C., and Wortman, D. B., The XPL compiler generator system. Proc. AFIPS Fall Joint Computer Conf. 33, 617-636 (1968).
XPL is a subset of PL/I which was implemented on the /360. Since the language was developed as a translator writing aid, emphasis has been placed on character and bit manipulation rather than data structures and control operations. In fact, the only data structure permitted is a single dimensional array. No floating point facilities are available, but it is possible to insert assembler language into the program. A significant drawback of XPL is that routines cannot be compiled separately and then merged into one system.
5. SABRE PL/I
Hopkins, M., SABRE PL/I. Datamation 14(12), 35-38 (1968).
SABRE PL/I is a subset of PL/I which was used for implementing an airline reservation system on the /360. It has fixed length bit and character strings with the standard PL/I operations: SUBSTR, INDEX, and concatenation. In addition, there is a function called ALPHA which scans for the first non-alphabetic character in a string. Variables may have the storage attribute BASED, but may have only a single level of pointer qualification. It is not possible to insert assembler language within the source code.
6. VERS
Earley, J., VERS: An extendible language with an implementation facility. Computer Science Dep., Univ. of California, Berkeley, California, 1969.
VERS was designed as a base language with facilities for data structure manipulation like that of *1 [12]. Data is intended to be organized into cells which will be combined into a structure called a V-graph. This machine independent representation of data can be used to write and debug programs. A production version of the program may be implemented by including declarations for a number of cell types and code which specifies how the cells should be accessed. At the time of writing this language is being implemented on the SDS940 and CDC-6400.
7. ALGOL68
van Wijngaarden, A. (ed.), Mailloux, B. J., Peck, J. E. L., and Koster, C. H. A., Report on the Algorithmic Language ALGOL68. Mathematisch Centrum, Amsterdam, 1969. van der Meulen, S. G., and Lindsey, C. H., Informal introduction to ALGOL68. Mathematisch Centrum, Amsterdam, 1969.
ALGOL68 is both a revision of ALGOL60 and an extensible language. It provides for the type attributes of ALGOL60 and for character, complex, and pointer types as well. New type attributes can even be defined with the mode statement. Data may be grouped into either structures or arrays. Although the language allows for an arbitrary level of pointer chasing, the declaration of a pointer must include the type attribute of the data that it references. Character and bit manipulation facilities are weak, but it is possible to define new operators. Storage can be allocated from a free storage area (the "heap") independent of block entry. However, there is neither a statement for freeing this storage nor a reasonable garbage collection scheme.
8. IMP
Irons, E. T., Experience with an extensible language. Commun. ACM 13(1) (1970).
IMP, a language system for the Control Data 1604 and 6600, is an extensible systems programming language in use at the Communications Research Division of the Institute for Defense Analyses at Princeton. The base language provides for the type attributes of ALGOL60. The only data structure defined is the vector, but the language is purported to have basic list processing capabilities. There is no block structure and no character or bit manipulation.
9. SAL
Lang, C. A., SAL: Systems Assembly Language. Proc. AFIPS Spring Joint Computer Conf. 34, 543-556 (1969). Brown, H., SAL user's manual. Univ. Mathematical Lab., Cambridge, England, 1968.
SAL is a systems programming language developed by the Cambridge Computer-Aided Design group for the Titan Atlas II computer. SAL provides arithmetic, Boolean, and shift operations for integers, but has no other data types. Hence, there is no character or bit manipulation and no I/O facilities. SAL has AED-like components and list processing facilities and permits the specification of multiple levels of indirectness. It has no block structure, has made machine registers explicitly available to the user, and allows embedded assembler code. SAL requires no run-time system and can run within mixed language systems.
10. BCPL
Richards, M., BCPL: A tool for compiler writing and system programming. Proc. AFIPS Spring Joint Computer Conf. 34, 557-566 (1969). Kelly, J., Wilcox, T., and Richards, M., Temporary BCPL/360 reference manual. Cornell Univ., Ithaca, New York, 1969.
BCPL (Basic CPL) was originally developed as a compiler writing tool. It has only one data type, the binary bit pattern; and since there is no bit or character data type there are no bit or character manipulation facilities. There are no statement level allocation or data structure definition facilities, but AED-like components can be referenced
with the pointer operator, "l". Machine independence is an important criterion of BCPL and versions have been implemented on the /360, 7094, GE 635, and the KDF 9 at Oxford.
11. ESPOL
ESPOL reference manual. Burroughs B5500 Information Processing Systems, Burroughs Corp., Detroit, Michigan, 1966.
ESPOL (Executive Systems Problem Oriented Language) is an ALGOL-like language with the addition of primitives to aid in such systems programming problems as interrupt handling, storage allocation, overlay, and full I/O facilities. It is a machine-dependent language for the Burroughs B5500 that allows embedded assembler language. It has full bit manipulation, but character manipulation facilities are limited to input and output. Data types in the language include integers and real numbers, Boolean values, and character strings which are restricted to a maximum length of six. ESPOL provides stacks and multidimensional arrays; there are no data structure facilities like those of PL/I or AED. Debugging facilities include monitoring and dumping variables whenever they are used as well as the monitoring of certain error conditions.
12. MAD/I
Bolas, B. J., Springer, A. L., and Srodawa, R. J., The MAD/I manual. Tech. Rep. 32, CONCOMP, Univ. of Michigan, Ann Arbor, Michigan, 1970.
MAD/I is a PL/I-like language developed at the University of Michigan as a general purpose programming language with facilities for user definition of extensions to the language. The base language provides constructs for fixed point, floating point, and decimal arithmetic, bit and character manipulation, and both stream and record I/O. MAD/I provides both fixed and varying length character strings, but has only minimal operations for them. The definitional facility allows the user to define new operators and data types. The MAD/I syntax is often inconvenient or awkward. All keywords must be quoted, structure definitions are awkward, and specification of a pointer for a BASED variable is not straightforward. There are no multiprogramming facilities and there is no equivalent to the PL/I CONTROLLED variables.
13. PL/S
No references publicly available.
PL/S (originally called BSL) is a simple subset of PL/I developed by IBM as a proprietary systems programming language for internal use. The compiler produces assembler language as output and the user may include assembler instructions as in-line code. The general registers may be explicitly specified by the user in instructions and he may control the compiler's use of them. The data types available to the user are register, pointer (which can have a length of 8, 15, 16, 31, or 32 bits), fixed (halfword or fullword), bit, and fixed length character strings. There are no character string operations defined and while there are based variables, there are no facilities for allocating and freeing storage. Structure levels define a true hierarchy; that is, a variable defined on level n + 1 is located within the variable defined on level n regardless of the respective attributes. There are no implicit type conversions done and the only run-time environment invoked involves allocating and freeing automatic variables of reentrant procedures.
REFERENCES
1. Arden, B., and Hamilton, J., A study of programming language effectiveness. U.S. Army Safeguard Systems Command, Contract No. DAHC60-70-C-0036, 1970.
2. Balzer, R. M., and Farber, D. J., APAREL-a parse-request language. Comm. ACM 11, 624-630 (1969).
3. Bates, F., and Douglas, M. L., Programming Language/One. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
4. Bell, J. R., The design of a minimal expandable computer language. Ph.D. dissertation, Stanford Univ., Stanford, California, 1968.
5. Boettner, D. W., Command (job control) languages for general purpose computing systems. Advanced Topics in Systems Programming, Univ. of Michigan Computing Center, Ann Arbor, Michigan, 1969.
6. Cheatham, T. E., The introduction of definitional facilities into higher level programming languages. Proc. AFIPS Fall Joint Computer Conf. 29, 623-638 (1966).
7. Cheatham, T. E., Fischer, A., and Jorrand, P., On the basis for ELF-an extensible language facility. Proc. AFIPS Fall Jt. Computer Conf. 33, 937-948 (1968).
8. Childs, D. L., Description of a Set-Theoretic Data Structure. Univ. of Michigan, Ann Arbor, Michigan, March 1968.
9. Christensen, C., and Shaw, C. J. (eds.), Proc. Extensible Languages Symp. SIGPLAN Notices 4(8) (1969).
10. Corbato, F. J., PL/I as a tool for systems programming. Datamation 15(5) (1969).
11. Donovan, J. J., 6.251 Systems programming. First draft, unpublished manuscript, MIT, Cambridge, Massachusetts, 1970.
12. Earley, J., Newell, A., and Haney, F., *1 Manual. Computer Science Dep., Carnegie-Mellon Univ., Pittsburgh, Pennsylvania, 1967.
13. Feldman, J. A., Aspects of associative processing. Lincoln Lab., Lexington, Massachusetts, 1965.
14. Feldman, J. A., A formal semantics for computer languages and its application in a compiler-compiler. Comm. ACM 9, 3-9 (1966).
15. Feldman, J. A., and Gries, D., Translator writing systems. Comm. ACM 11, 77-113 (1968).
16. Feldmann, C. G. (ed.), Ross, D. T., and Rodriguez, J. E., AED-0 Programmers Guide. M.I.T., Cambridge, Massachusetts, 1970.
17. Ferguson, D. E., Evolution of the meta-assembly program. Comm. ACM 9, 190-196 (1966).
18. Galler, B., and Perlis, A. J., A proposal for definitions in ALGOL. Comm. ACM 10, 204-219 (1967).
19. Garwick, J. V., GPL, a truly general purpose language. Comm. ACM 11, 634-638 (1968).
20. Halpern, M., XPOP: a metalanguage without metaphysics. Proc. AFIPS Fall Jt. Computer Conf. 26, 57-68 (1964).
21. Halpern, M., Toward a general processor for programming languages. Comm. ACM 11, 15-26 (1968).
22. Hopkins, M., SABRE PL/I. Datamation 14(12), 35-38 (1968).
23. IBM System/360 Operating System PL/I(F) Version 5 Planning Guide. IBM Systems Ref. Library, Form C28-3201, 1969.
24. IBM System/360 PL/I Optimizing Compiler, General Information. Order No. GC33-0001-0, White Plains, New York, 1970.
25. IBM System/360 PL/I Reference Manual. Form C28-6594, White Plains, New York.
26. Irons, E. T., Experience with an extensible language. Comm. ACM 13(1) (1970).
27. Knowlton, K., A programmer's description of L6. Comm. ACM 9, 616-625 (1966).
28. Leavenworth, B. M., Syntax macros and extended translation. Comm. ACM 9, 790-793 (1966).
29. McIlroy, M. D., Macro instruction extension of compiler language. Comm. ACM 3, 214-220 (1960).
30. Newey, M. C., An efficient system for user extendible languages. Proc. AFIPS Fall Jt. Computer Conf. 33, 1339-1349 (1968).
31. Ross, D. T., Introduction to software engineering with the AED-0 language. M.I.T., Cambridge, Massachusetts, 1969.
32. Rubey, R. J., A comparative evaluation of PL/I. Datamation 14(12), 35-39, Dec. 1968.
33. Standish, T. A., A data definition facility for programming languages. Computer Science Rep., Carnegie Inst. of Technol., Pittsburgh, Pennsylvania, 1967.
34. Standish, T. A., A preliminary sketch of a polymorphic programming language. Centro de Calculo Electronico, Universidad Nacional de Mexico, 1968.
35. Tompa, F. W., Data structures programming system for the System/360. Center for Computer and Information Sciences, Brown Univ., Providence, Rhode Island, 1969.
36. van Dam, A., Bergeron, R. D., Gannon, J. D., and Guttag, J. V., Programming language comparison study. U.S. Army Safeguard Systems Command, Contract No. DAHC60-71-C-0074, 1971.
37. van Wijngaarden, A. (ed.), Mailloux, B. J., Peck, J. E. L., and Koster, C. H. A., Report on the algorithmic language ALGOL 68. Mathematisch Centrum, Amsterdam, 1969.
38. Wirth, N., PL360, a programming language for the 360 computers. J. ACM 15(1), 37-74 (1968).
39. Wirth, N., Wells, J. W., and Satterthwaite, E., OS/360 PL360 Compiler. IBM Program Information Department, Contributed Program Library 360D-03.2.011, Kingston, New York, 1969.
40. Wulf, W., BLISS Manual. Carnegie-Mellon Univ., Pittsburgh, Pennsylvania, 1969.
41. ACM Boston Chapter and Special Interest Group/Real Time PL/I, A language for programming real time and timesharing systems? (SABRE: Martin Hopkins; MULTICS: Fernando Corbato), January 1970.
Parametric and Nonparametric Recognition by Computer: An Application to Leukocyte Image Processing

JUDITH M. S. PREWITT
Division of Computer Research and Technology
National Institutes of Health, Bethesda, Maryland
1. Introduction .  285
2. Image Articulation .  301
3. Image Description .  321
4. Discrimination: Logic and Rationale .  327
5. Linear Logic and the Evaluation of Performance: Logic and Rationale .  338
6. Feature Selection: Logic and Rationale .  347
7. Experimental Results: Parameter Variability .  364
8. Experimental Results: Parametric Recognition Using Decision Theory .  368
9. Nonparametric Pattern Detection: A Cytotaxonomy for Leukocytes .  383
10. The Inverse Problem: Human Visual Discrimination Using Computer-Oriented Image Properties .  393
11. Perspectives on Automatic Microimage Analysis .  400
12. Summary and Prospectus .  404
1. Introduction
The leukocytes of peripheral human blood are important indicators in the diagnosis and evaluation of disease, and the differential and total white cell counts have accordingly become vital routine medical laboratory procedures. As customarily performed, these tests require microscopic examination of specially stained blood smears by technicians who select an area of the microscope slide, and draw morphological and quantitative conclusions from a random sample of approximately 100 white cells. Purported visual cues include size, color, texture, granularity or local structure, relative intensity, and shape of the cells and their nuclei. Results, naturally, are limited by the peculiarities of the human visual information processing system. They are sensitive to human factors such as observer attention span, fatigue, sampling bias, self-consistency, and external consistency, and can therefore be both imprecise and inaccurate.
TABLE I
MORPHOLOGICAL AND PHOTOMETRIC CHARACTERISTICS OF NORMAL MATURE HUMAN LEUKOCYTES STAINED WITH WRIGHT'S STAIN

Cell type                  Normal range (%)   Cells/cu mm (Ave; Min-Max)   Cell diameter   Cytoplasm: color(a); granules              Nucleus: shape             Nucleus/cell ratio

Leukocytes (White Cells)
A. Nongranular
  1. Lymphocyte            25-33              2100; 1500-3000              7-8 μ(b)        Pale blue; none or few                     Round or ovoid             .7-.9
  2. Monocyte              3-7                375; 285-500                 12-15 μ(c)      Pale gray-blue; none or few                Ovoid or indented          .5
B. Granular
  1. Neutrophil            57-67              4300; 3150-6200              10-12 μ         Violet, blue-violet; fine, many, dense     Multilobed or segmented    .5
  2. Eosinophil            1-3                200; 50-250                  10-12 μ         Red or orange; large, many, dense          Bilobed                    .3-.4
  3. Basophil              0-.75              25; 15-50                                    Dark blue; large, many, dense              Bilobed or segmented       .5

Erythrocytes (Red Cells)                      4 x 10^6 - 6 x 10^6                          Red-orange; none                           Biconcave disk

Other Constituents
A. Platelets                                  2 x 10^5 - 3 x 10^5
B. Cell debris

(a) Color refers to cytoplasm in nongranular leukocytes, to granules in granular leukocytes.
(b) 8% are larger, about 12 μ.
(c) Some are about 20 μ.
FIG. 1. Graphic representation of the color values of the nucleus, cytoplasm, and granules of blood cells stained with Wright's stain. An almost continuous variation in color from red to purple to blue occurs in the nucleus, cytoplasm, and granules of blood cells processed with the conventional Wright's stain. Using the initials to denote the five types of mature leukocytes, the nucleus varies from purple to blue in the following order: M, B, L, N, and E. Cytoplasm varies from red in N, to purple in E and B, blue in M and L. Granules are red in E, red-purple in N, purple in M, blue in B, and absent in L. None of these visual cues were available with the stoichiometric reagents, gallocyanin chrome alum and Naphthol yellow S, used in the presently reported study of leukocyte discrimination. (Adapted from Daland, 1959.)
Supplanting these traditional visual diagnostic tests with automatic machine analysis offers the prospects of increased reliability and speed, and moreover, the possibility of recasting the qualitative and subjective basis of diagnosis into quantitative and objective terms. Beyond this important application to public health, leukocyte images possess many photometric and morphological properties with biofunctional correlates that make them suitable and challenging subjects in their own right for
image processing, pattern recognition, and ultimately, for machine intelligence as applied to science. The leukocytes fall into five morphological categories-lymphocytes, monocytes, neutrophils, eosinophils, and basophils-whose distinguishing characteristics and abnormal deviations have been copiously described and illustrated in medical texts and hematological atlases (Wintrobe, 1962; Page and Culver, 1961; Daland, 1959; Ham, 1969). Some of the more-often reported morphological and photometric characteristics of normal mature human blood cells have been collected in Table I. To facilitate observation through the light microscope, whole blood, containing individual leukocytes and erythrocytes (white and red cells), is smeared on glass slides and chemically treated.
FIG. 2. Average cell count for total leukocytes, granulocytes, lymphocytes, and monocytes from birth to young adulthood. The total leukocyte count and the relative proportions of the five cell types vary according to physical health, mental rest, and other factors as well as age. Unlike in adulthood, large and irregular count fluctuations and proportion reversals are common in infancy and early childhood even in the absence of demonstrable disease. The computer programs for leukocyte identification concentrated on recognition independent of these prior probabilities for the various white cell types. (Adapted from Kato, 1935.)
FIG. 3. Morphologic progressions in leukocytes. Aside from the gross between-type morphologic distinctions, a systematic but pronounced continuum of changes in nuclear shape and texture occurs within an individual cell as it progresses through its life cycle from the immature state in bone marrow to the mature state in peripheral blood. For example, subclassifications of neutrophils are based largely on nuclear morphology. Classifications due to several authorities agree in distinguishing immature or nonfilamented neutrophils from mature or filamented neutrophils, but differ in the attention given to finer details such as nuclear asphericity and number of lobes. The distribution of neutrophils as a function of nuclear morphology varies according to health state. The normal distribution and shifts characteristic of certain disease states are illustrated. (Adapted from Wintrobe, 1962.)
In particular, the specimens are stained to enhance cytological features, and to render the red cells and five prevalent types of white cells visually distinct with respect to color, texture, and morphology (Fig. 1). The five white cell types do not maintain fixed concentrations or ratios in either health or disease states (Fig. 2), and differences in number and proportion as well as anomalies in morphology are significant factors in diagnosis (Fig. 3). Visual evaluation is further complicated by the presence of debris and the introduction of artifacts and distortions during preparation. Thus a formal study of the leukocyte system confronts us with the realities common to many nontrivial biomedical data processing
FIG. 4a(i) and (ii).
FIG. 4. Blood cells treated with Wright's stain. The conventionally stained red and white blood cells were scanned at 800X by CYDAC using blue-green light (503 nm). The computer printouts represent paired gray-tone reconstitutions of (top) the original 256-level digitizations and (bottom) the corresponding computer-generated spatial derivatives. The printing technique utilizes 32 graded combinations of standard alphanumeric high speed printer characters to render quantized light intensity or gradient magnitude. (a) Erythrocyte (red cell). (b) Lymphocyte. (c) Monocyte. (d) Neutrophil. (e) Eosinophil. (f) Basophil.
FIG. 4b-4f(i) and (ii). See Fig. 4a, p. 290.
problems, and underscores the desirability of robust, and possibly distribution-free, methods of analysis. Our first paper (Prewitt and Mendelsohn, 1966b) introduced a new approach to automating leukocyte discrimination by coordinating (1) the use of a digital microscanner, CYDAC, for picture generation, (2) the use of quantitative cytochemistry for specimen preparation, and (3) the use of a general-purpose digital computer for analysis of cell images, extraction of characterizing descriptors, and implementation of discriminatory logic. In contradistinction to earlier conceptions of cell identification as particle analysis (Montgomery, 1962; Tolles, 1962), we emphasized the utilization of photometric information in digitized cell images, deemphasized the role of morphology, and essentially disregarded chromaticity. The present paper discusses the evolution and elaboration of these ideas, and subsequent advances in leukocyte discrimination by computer. Leukocytes for our experiments were obtained from samples of normal human peripheral blood by a flotation technique which eliminated the red cells (Bøyum, 1964). Smears on glass slides were then rapidly air-dried, and stained with special stoichiometric reagents, gallocyanin chrome alum and naphthol yellow S, which selectively and quantitatively react with certain cell constituents, namely nucleic acid and protein (Deitch, 1955; Sandritter et al., 1963). With this method, the spectral differentiation of leukocytes produced by the popular Wright's stain is absent, and all of the leukocyte types do not duplicate exactly their morphological appearance in the conventional preparation. However, cells on the slide are well-flattened, well-dispersed, and nonoverlapping, and morphological integrity is preserved. Important cytological features are displayed clearly and with high contrast, and although the cells are nearly uniformly colored, the various types maintain distinctive characteristics, even in monochromatic light. Figure 4 illustrates the appearance of a sample of typical blood cells, treated with Wright's stain, and scanned at the same magnification and wavelength as the sample of stoichiometrically stained cells illustrated in Fig. 5. The gray-tone printouts represent computer reconstitutions of 256 gray-level digitized images, using 32 graded combinations of alphanumeric printer characters to render the various quantized light intensity levels. By means of the CYDAC scanner, a microscope field containing a single leukocyte can be converted into a digitized achromatic facsimile (Prewitt et al., 1966). The scanner sweeps a focused blue-green light spot across a cathode ray tube in rectilinear raster, and directs it through the microscope in reverse path. Spot intensity is modulated by passage through a mounted specimen, then sensed by a photomultiplier, corrected for source variability, converted to optical density, sampled, and finally digitized.
FIG. 5. Blood cells treated with gallocyanin chrome alum and Naphthol yellow S. The stoichiometrically stained leukocytes were scanned at 800X by CYDAC using blue-green light (503 nm). The illustrations show a coordinated series of (top row) photomicrographs, (middle row) 32 gray-level reconstitutions of the original 256 gray-level scans, and (bottom row) compartmented images in which background, nucleus, and cytoplasm are rendered in a uniform gray tone which matches the corresponding actual mean densities. The photomicrographs were uniformly distorted in the horizontal direction by photographic means, to match the distortion in the computer reconstitutions which results from a 5:6 ratio for horizontal and vertical character-to-character spacing on the high speed printer. Morphological characteristics are essentially the same as with Wright's stain (see Fig. 4), but cytoplasmic granularity is greatly altered and no longer obscures the cell nucleus. Basophilic cytoplasm is totally transparent to the scanner. (Basophil not shown.) (a) Lymphocyte. (b) Monocyte. (c) Neutrophil. (d) Eosinophil. (From Prewitt, 1966b.)
The resultant facsimile takes the form of a matrix of approximately 200 X 200 picture elements or gray values, quantized on a 256-level photometric scale. Each gray value is linearly related to local optical density in the specimen, averaged over a circular area which corresponds to the instantaneous cross section of the scanning light spot in the plane of focus. Operating at 800X magnification, the effective radius of this radially symmetric but nonuniform spot is approximately 0.25 μ. The intersample distance both along and between raster lines is also 0.25 μ, permitting a 50 X 50 μ field of view. At the operating wavelength of 503 nm, the spot behaves essentially like an Airy disk. Thus, the digitized images so generated have high spatiophotometric resolution (Mendelsohn et al., 1968), approaching the theoretical limit imposed by light microscopy. As Figs. 4 and 5 show, structural and textural features of the five white cell types are faithfully recorded and rendered by means of distinct tonal patterns rather than by differences in color intensity and hue. Although this type of digital image analysis may not lead to the most practical automatic recognition system in the long run, our principal goal was to show that utilization of the full, hitherto untapped photometric potential of a high-performance scanning system, together with computer algorithms appropriate for the subject matter and its mode of presentation, can result in a high-performance recognition system for leukocytes. First to be shown was that descriptors oriented around patterns of light distribution rather than topology (shape) and color can be extracted, combined into a parsimonious vectorial representation of the cells, and used in a robust discriminant function model for recognition. To do this, we adopted the classical approach of statistical decision theory, applied it to samples of typical leukocytes selected and identified by a cytomorphological authority, and designed a reliable and efficient linear recognition logic (Prewitt and Mendelsohn, 1966a). Strategies of four types entered into the decision-theoretic approach: (1) strategies for classification, (2) strategies for affixing values to parameters of the decision process, (3) strategies for the selection and rejection of features or variables, and (4) strategies for the evaluation of performance. For each of these types of strategies, figures of merit and optimality criteria were needed. In reality, design and evaluation of a diagnostic scheme cannot properly be treated independently. In order to obtain good algorithms and have confidence in predictions of their behavior, the best use must be made of the available sample information. This means simultaneously optimizing the (1) number of subjects, (2) number of features or variables per subject, (3) measurement resolution of each variable, (4) apportionment of samples among the several categories in both design and test sets, and (5) complexity of the analysis-all this in the light of
(i) any structural assumptions about the underlying probability distributions of subjects and variables, (ii) risks or penalties of error, (iii) the cost of data acquisition and measurement, and (iv) the cost of computation. It is seldom feasible to implement all facets in a logically consistent manner, and justifiable compromises are required. The difficulties are in part computational and due to the combinatorics, and in part statistical and due to sampling variation as it affects parameter estimation. Thus, selection of variables for parsimonious characterization and optimal, efficient decision-making, and specific discriminatory achievements are meaningful only in the context of the admissible discriminant functions, the available learning samples, and the methods of statistical estimation which were utilized (Prewitt, 1970b). Second to be shown was that the photometrically oriented descriptors permit articulation of pattern classes corresponding to the five commonly accepted leukocyte types, on the basis of clustering phenomena in the feature space alone, without invoking parametric or probabilistic assumptions about the underlying signal structure. In addressing this second problem, we anticipated that structural relations among leukocyte-representing point sets in the multidimensional feature space, rather than the absolute numeric values of the image parameters, may be invariant from one sample of cells to another, and we devised a nonparametric cluster analysis to look for this type of consistency (Prewitt, 1971, 1972a,b). Third to be shown was that the photometrically oriented descriptors that make machine recognition so facile are no more obscure or abstract than the morphological and spectral characteristics on which we claim to rely. They can be assigned meaningful interpretations in natural language, and fall into descriptive families. Moreover, computer-synthesized pseudocell images, embodying the statistical properties of the photometrically oriented descriptors without simultaneously incorporating shape and textural characteristics, can readily be distinguished and sorted by eye, into classes corresponding to the cell types whose photometric qualities they emulate. The human observer can thus appreciate what the computer recognition accomplishes (Prewitt, 1972c). The digital computer clearly plays a decisive role in realizing these extensions of human perceptual and cognitive abilities. Indeed, without digital computation, it is unlikely that the current state of the art could have materialized. Large memory, high-speed arithmetic, and programming flexibility have been the key factors. First, current core capacities meet the requirement of storing, in convenient, randomly accessible form, the 32,000 bits extracted from the pictorial information resident in one field of view.
When computer-controlled scanning microscopes replace the current generation of fixed-raster digitizers, the microscope slide itself can resume the role of primary information store, and core requirements can be relaxed at the expense of repeated scanning. Second, because it can be reprogrammed again and again, without time consuming and expensive hardware modifications, the digital computer is a natural vehicle for experimenting with and improving image processing heuristics and recognition algorithms. In other words, the analysis and interpretation of images can be pursued with a minimum of ancillary distractions. Third, except for electrooptical implementation, only the high-speed digital computer can execute the vast number of logical and arithmetic operations that constitute a recognition algorithm with any practicality. In the application at hand, except for offline input/output, the 6.4 second scan time per field of view, and not the CELLS recognition program, is the rate limiting factor. Finally, a proven digital algorithm can be applied systematically to new data, thereby assuring a high degree of objectivity, reproducibility, and interobserver verifiability. These attributes of the digital computer-rapid, extensive, and consistent assimilative ability under program control-are primarily responsible for not only the progress to date, but also for foreseeable advances in automatic microimage analysis and interpretation.
FIG. 6. Fundamental relationship between scan line profiles and frequency distributions for optical density. Runs of relatively uniform optical density in line profiles of a scanned cell lead to modes in the frequency distribution. Boundary gradients involve abrupt but systematic changes in optical density values, and lead to anti-modes.
Although the computer cannot compete with our heuristic skills and knowledge of relevancy, it can be our scientific companion and assist us in optimally using our knowledge and extending its frontiers.

2. Image Articulation
Leukocyte images presented for computer characterization and identification consisted of individual components, which, because of the biochemical and biomechanical preprocessing, were noncontiguous and nonoverlapping, and dispersed against a generally uniform and structureless background.
FIG. 7. Optical density frequency distributions for the leukocytes of Fig. 5. The principal maxima, corresponding to background, have been abbreviated for purposes of display, and the curves have been smoothed by local least squares linear filters so that both the number of observations and the integrated optical density are conserved. Morphological features of the cells are paralleled by certain morphological features of the corresponding histograms. Stable differences in the histograms for the four cell types reflect significant differences in the quantity, variability, and contrast of absorbing material in the cells. For example, the skewness of the neutrophil distribution is attributable to a large low-density cytoplasm and a small dark nucleus. The separation of the two secondary maxima in all distributions is a measure of cytoplasmic-nuclear contrast. It is most pronounced for neutrophils, also consistently high for lymphocytes, lower for monocytes, and variable for eosinophils because of dispersed and usually dark refractile cytoplasmic granules. (From Prewitt, 1966b.)
FIG. 8(a)-(d). Optical density frequency distributions for five leukocyte types. Histograms for five arbitrarily selected cells of each of the five leukocyte types are shown. Each histogram was processed by a moving 5-point least squares linear filter to smooth out inconsequential, low-level fluctuations in count. The five histograms for each cell type have been aligned by superimposing the background modes. Distributions for cells of any one type are fairly consistent in shape and extent, but they are markedly different for cells of different types. (a) Lymphocyte. (b) Monocyte. (c) Neutrophil. (d) Eosinophil.
FIG. 8(e). See Fig. 8(a)-(d), p. 302. (e) Basophil.
This situation contrasts with the more intricate patterns of juxtaposed, overlapping, and perhaps folded cells in other exfoliative cytological samples and in solid tissue sections, on which little progress has been made to date. The blood smear is, nevertheless, a nontrivial natural image, seen under the microscope as a two-dimensional projection of semitransparent three-dimensional structures. Thus it is subject to the usual hazards of microscopy which may lead to visual ambiguity and errors of interpretation (Elias, 1971). Since segmentation of the visual scene into objects of interest was relatively uncomplicated, image processing efforts concentrated on the articulation and detailed analysis of discrete cell images. The initial tasks of image processing were (1) differentiation of the cell from background, and of the nucleus of the cell from the cell cytoplasm; and (2) establishment of a reference gray level on a per image basis, which would allow normalization and meaningful comparison of images and of their descriptors. Both of these functions were fulfilled by a method of image articulation based on machine inspection of histograms of gray values appearing in the image and density thresholding (Prewitt and Mendelsohn, 1966b). The method is suitable for pictures in which (1) objects appear as regions of fairly constant gray value on a fairly uniform ground of different average gray value;
(2) edges appear as regions of intermediate gray values and directionally consistent, rapid density flux; (3) objects occupy a greater proportion of image area than edges; and (4) object regions are either disjoint or satisfy set-theoretic inclusion relationships. Objects and background will generate histogram modes, while edges correspond to antimodes. The global threshold for delineating an object is given by the gray value at the antimode, and the background mode provides a natural baseline for anchoring the histogram and standardizing parameter values (Fig. 6). Leukocytes appear as dark objects on a light ground and have the aforementioned image characteristics. For four of the cell types-lymphocytes, monocytes, neutrophils, and eosinophils-the gray value histograms have a trimodal trend, reflecting the presence of background, cytoplasm, and nucleus (Fig. 7).
FIG. 9. Context-dependent threshold selection based on the optical density histogram. Computer selection of thresholds for segmenting a cell image into nucleus, cytoplasm, and background is illustrated for a typical monocyte. The graph reproduces the frequency distribution for the monocyte of Fig. 5, with the local minima and local maxima identified. Optical density values at the two minima were used for boundary determination. Comparison of the printouts with the photomicrographs of the same cell in Fig. 5 indicates the effectiveness of this method of boundary determination. (a) Density histogram, with arrows indicating machine-determined extrema. (b) Image of the monocyte, with 32 gray levels displayed. (c) Cell image segmented into background (.), cytoplasm ('), and nucleus (W) plus spurious granule, using the density values at the major relative minima as thresholds. (From Prewitt, 1966b.)
FIG. 10. Edge detection and spatial differentiation. The results of applying eight 3 X 3 oriented edge detectors to leukocyte images in parallel at each picture point are shown as picture pairs for (a) lymphocyte, (b) basophil (lobed), (c) neutrophil, (d) eosinophil, (e) basophil (segmented). On the left, each two-digit value indicates the maximum figure of merit obtained over all eight detectors at the corresponding picture point, provided that this value exceeds a threshold empirically selected to deemphasize noise. On the right, orientations of the maximally correlated edge detectors, quantized to four directions: N-S, E-W, NE-SW, and NW-SE, are encoded by the symbols I, -, /, and L, respectively. The illustrations indicate that gradient thresholding is not a self-contained method of image articulation. A contrast function, incorporating both density and gradient information simultaneously, is needed for computer judgments, so that contours are conveyed with a maximum of useful information and can be recovered with a minimum of auxiliary logic. Both thinning of thresholded gradients and semantic rules for determining well-formed and meaningful tracks are necessary.
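The parallel edge-detection scheme of Fig. 10 can be sketched in a few lines. The following Python fragment is illustrative only and is not the original CYDAC/CELLS code: it builds the eight 45-degree rotations of the 3 X 3 template reproduced later in the text, correlates each with the neighborhood of every interior picture point, and keeps the maximum response and its orientation, suppressing responses below the threshold of 11 mentioned in the text. The function names and the use of numpy are assumptions of this sketch.

# Illustrative sketch (not the original CELLS code): eight oriented 3 x 3
# edge detectors applied in parallel; the figure of merit at a point is the
# maximum correlation over the eight orientations.
import numpy as np

# Base "north" template; the other seven detectors are its 45-degree rotations
# (the full set of masks is reproduced later in the text).
NORTH = np.array([[ 1,  1,  1],
                  [ 1, -2,  1],
                  [-1, -1, -1]])

def compass_masks():
    """Return the eight oriented 3 x 3 edge-detector masks."""
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    masks, m = [], NORTH.copy()
    for _ in range(8):
        masks.append(m.copy())
        rotated = m.copy()
        for i, (r, c) in enumerate(ring):      # rotate the outer ring one step
            r2, c2 = ring[(i + 1) % 8]
            rotated[r2, c2] = m[r, c]
        m = rotated
    return masks

def best_edge_response(image, threshold=11):
    """Maximum detector response and quantized orientation at interior pixels."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    response = np.zeros_like(img)
    orientation = np.full(img.shape, -1, dtype=int)   # -1 = below threshold
    masks = compass_masks()
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = img[r - 1:r + 2, c - 1:c + 2]
            scores = [float((m * window).sum()) for m in masks]
            k = int(np.argmax(scores))
            if scores[k] >= threshold:
                response[r, c] = scores[k]
                orientation[r, c] = k % 4      # fold opposite directions together
    return response, orientation

The same loop structure applies to any family of oriented masks; only the base template changes.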
FIG. 11. Comparison of thresholded and differentiated cell images. Each triplet consists of (left) a 32-level reconstitution of the original 256-level digitization, (middle) the image segmented on the basis of machine-determined density thresholds obtained from histograms, with nuclear, cytoplasmic, and background compartments assigned the mean density of the corresponding structure in the 256-level image, and (right) a gray-tone display of gradient magnitudes, for which numerical values correspond to the gradient of the best local quadratic fit to the image. (a) Lymphocyte. (b) Monocyte. (c) Neutrophil. (d) Eosinophil. (e) Basophil.
For the fifth cell type-basophils-prepared as described above, the histogram is bimodal because the cytoplasm is transparent to the scanner at the wavelength of light being used. Histograms for five cells of each of the five types are shown in Fig. 8; the curves have been aligned by superimposing the background modes. By virtue of these regular relationships between density histograms and the digitized cell images, computer determination of gray value thresholds for background, nucleus, and cytoplasm is possible on a dynamic, per cell basis. The algorithm which implements this approach uses a syntactic definition of well-formed leukocyte histograms to control and iterate a cycle of local conservative smoothing, tracing, and mode seeking for the gray value distributions. By syntactic definition in this context we mean a structural specification of a leukocyte histogram as intrinsically bi- or tri-modal, and a background histogram as uni-modal. This specification is the basis of the stopping criterion for the algorithm. Modes, antimodes, and certain fiducials based on the shape and extent of the background peak in clear field scans determine bounding density isopleths for cell and nucleus. Figure 9 illustrates the histogram method of selecting thresholds for a typical monocyte.
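A minimal sketch of this histogram-directed articulation, written in Python for illustration, is given below. It is not the CELLS program: the syntactic well-formedness test is reduced to counting modes, the smoothing is a plain moving average rather than the local conservative smoothing described above, and the image is assumed to be an integer array of gray values in 0-255. The antimode gray values returned serve as the thresholds separating background, cytoplasm, and nucleus.

# Minimal sketch of threshold selection from the gray-value histogram:
# smooth, find modes and the antimodes between them, and use the antimode
# gray values as segmentation thresholds.
import numpy as np

def smooth(hist, width=5):
    """Moving-average smoothing; the window width is an illustrative choice."""
    kernel = np.ones(width) / width
    return np.convolve(np.asarray(hist, dtype=float), kernel, mode="same")

def modes_and_antimodes(hist):
    """Indices of local maxima (modes) and of the minima (antimodes) between them."""
    modes, antimodes = [], []
    for i in range(1, len(hist) - 1):
        if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]:
            modes.append(i)
    for a, b in zip(modes, modes[1:]):
        antimodes.append(a + int(np.argmin(hist[a:b + 1])))
    return modes, antimodes

def leukocyte_thresholds(gray_image, expected_modes=3, max_passes=20):
    """Smooth the histogram until it shows the expected number of modes
    (3 for most leukocytes: background, cytoplasm, nucleus), then return
    the antimode gray values as thresholds."""
    h = np.bincount(np.asarray(gray_image).ravel(), minlength=256).astype(float)
    antimodes = []
    for _ in range(max_passes):
        modes, antimodes = modes_and_antimodes(h)
        if len(modes) <= expected_modes:
            break
        h = smooth(h)
    return [int(t) for t in antimodes]

# Usage: thresholds = leukocyte_thresholds(scan)
# For a trimodal histogram the two returned thresholds delimit
# background/cytoplasm and cytoplasm/nucleus.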
FIG. 12. Contouring by gradient-tracking. The contour-tracing algorithm described in the text was applied to extracting the nucleus of a neutrophil. The locus of maximal gradient magnitudes was obtained by tracking motions of two basic types: (1) local search along the direction of the gradient for evidence of a larger spatial density flux in the neighborhood of the current candidate point for inclusion in the locus, and (2) orthogonal movement from the selected locus point to the next candidate. The algorithm forces good continuation. The tracing procedure begins by passing over the picture with a relatively large step size until an above-threshold gradient magnitude is found. The step size is then reduced, and the fine scan tracking motions replace the coarse scan steps. A suitable gradient threshold for initialization is obtained by inspection of gradient magnitude histograms. (a) Photomicrograph of the neutrophil. (b) Locus of maximal gradient magnitudes, indicated by 1's. (c) Boundaries established by density thresholding for comparison.
Informal visual comparison of algorithmically thresholded images with both unprocessed images and spatially differentiated images indicated concurrence of visually placed boundaries, computer-drawn boundaries, and the loci of peak gradient magnitudes. Letting p(x, y) represent the brightness or density at point (x, y) in the picture, the gradient ∇p(x, y) at that point is a vector in the direction of maximum spatial rate of change of p, and equal in magnitude to that maximum rate. This distinguished direction has direction cosines p_x/(p_x^2 + p_y^2)^(1/2) and p_y/(p_x^2 + p_y^2)^(1/2), and the gradient magnitude |∇p(x, y)| is equal to (p_x^2 + p_y^2)^(1/2), where p_x and p_y are the partial derivatives of p(x, y). Various digital masks or two-dimensional filters have been proposed for performing spatial differentiation. The derivative is obtained by convolving the mask with the original picture at each picture element. The appearance of the transformed image remains surprisingly constant for most of the masks that have been proposed, although they generate disparate numerical values (Prewitt, 1970a). For example, gradients can be approximated using a set of oriented edge detectors and searching sequentially around each image point for the best match. Gradient magnitude is equated with the maximum response, and gradient direction is taken parallel to the orientation of the corresponding detector. Figure 10a was obtained by applying the eight masks
     1  1  1      1  1  1     -1  1  1     -1 -1  1
     1 -2  1     -1 -2  1     -1 -2  1     -1 -2  1
    -1 -1 -1     -1 -1  1     -1  1  1      1  1  1

    -1 -1 -1      1 -1 -1      1  1 -1      1  1  1
     1 -2  1      1 -2 -1      1 -2 -1      1 -2 -1
     1  1  1      1  1  1      1  1 -1      1 -1 -1
to a 256-level white cell image. Detector responses below 11 were suppressed in the printout for clarity. In Fig. 10b, the symbols -, L, /, and I were used to encode approximate gradient directions at points with above-threshold edge detector responses.
FIG. 13a(i). Validation of tentative cell boundaries by inspection of averaged orthogonal density and gradient profiles. (a) Gray-tone renditions of "distance" inward and outward from tentative cell and nuclear boundaries of a neutrophil. These boundaries were established by thresholding the original digitized image at density levels corresponding to the relative minima in the histogram. In each picture, the grayness at each point is proportional to the distance to the nearest boundary point (cellular or nuclear). Distances were assigned by processing the tentatively segmented image with a bilateral distance algorithm, operating in forward and backward sequential passes.
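The forward/backward scheme mentioned in the caption is the classical two-pass distance transform of Rosenfeld and Pfaltz (1966). The Python sketch below shows the unsigned city-block version for a binary mask; the bilateral, signed variant used for Fig. 13 would combine an inward and an outward reading with opposite signs, as the helper at the end suggests. The function names and the simple sign convention are assumptions of this illustration, not the CELLS implementation.

# Two-pass city-block distance transform: distances to the nearest marked
# pixel are propagated in a forward raster pass and a backward raster pass.
import numpy as np

def distance_transform(mask):
    """City-block distance from each pixel to the nearest True pixel in mask."""
    mask = np.asarray(mask, dtype=bool)
    big = mask.size + 1                       # larger than any possible distance
    d = np.where(mask, 0, big).astype(int)
    rows, cols = d.shape
    # forward pass: propagate from the top and the left
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                d[r, c] = min(d[r, c], d[r - 1, c] + 1)
            if c > 0:
                d[r, c] = min(d[r, c], d[r, c - 1] + 1)
    # backward pass: propagate from the bottom and the right
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                d[r, c] = min(d[r, c], d[r + 1, c] + 1)
            if c < cols - 1:
                d[r, c] = min(d[r, c], d[r, c + 1] + 1)
    return d

def signed_distance(boundary_mask, inside_mask):
    """Positive distances inside the boundary, negative outside: one simple
    reading of the 'bilateral' extension described in the text."""
    d = distance_transform(boundary_mask)
    return np.where(np.asarray(inside_mask, dtype=bool), d, -d)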
However, when spatial differentiation is a precursor of further quantitative image transformation and processing, the method of computing gradients is no longer immaterial.
For this reason, we introduced a method of spatial differentiation which involved nonlinear combination of approximations to the partial derivatives p_x and p_y:
    p_x:   1  0  -1        p_y:    1   1   1
           1  0  -1                0   0   0
           1  0  -1               -1  -1  -1
The approximated gradient magnitude is |∇p| = (p_x^2 + p_y^2)^(1/2) and the approximate gradient orientation is specified by the direction cosines p_x/|∇p| and p_y/|∇p|. These masks were derived by fitting a quadratic surface to density values in a 3 X 3 neighborhood of a picture element by least squares, and then computing the true gradient to the fitted surface at the central point.
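In Python, the 3 X 3 approximation just described can be sketched as follows: correlate the picture with the two masks and combine the results nonlinearly into a gradient magnitude and direction cosines at each interior picture element. This is an illustrative reimplementation, not the original CELLS code; boundary elements are simply left at zero.

# Sketch of the 3 x 3 least-squares gradient approximation: correlate with the
# two masks, then form the magnitude and direction cosines pointwise.
import numpy as np

PX = np.array([[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]], dtype=float)
PY = np.array([[ 1,  1,  1],
               [ 0,  0,  0],
               [-1, -1, -1]], dtype=float)

def gradient(image):
    """Return (magnitude, cos_x, cos_y) arrays for the interior of the image."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    px = np.zeros_like(img)
    py = np.zeros_like(img)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = img[r - 1:r + 2, c - 1:c + 2]
            px[r, c] = float((PX * window).sum())
            py[r, c] = float((PY * window).sum())
    magnitude = np.sqrt(px ** 2 + py ** 2)
    with np.errstate(invalid="ignore", divide="ignore"):
        cos_x = np.where(magnitude > 0, px / magnitude, 0.0)
        cos_y = np.where(magnitude > 0, py / magnitude, 0.0)
    return magnitude, cos_x, cos_y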
Figures 4 and 11 contain gray-scale printer displays of gradient magnitudes obtained by this method. The connected contour of the neutrophil nucleus in Fig. 12 was constructed by the following gradient-tracking algorithm. The picture is searched for a sufficiently sharp change in gray value. The gradient threshold for this coarse scan is obtained by inspection of a histogram of gradient magnitudes for the corresponding digitized image. A fine scan is then initiated. Two types of tracking motions are used: (1) local search along the direction of the gradient for a possibly larger gradient magnitude, by means of a look-around procedure, and (2) movement orthogonal to the maximum gradient to the next candidate point. The algorithm enforces good continuation.
FIG. 13b(i). Validation of tentative cell boundaries by inspection of averaged orthogonal density and gradient profiles. (b) Mean gray values, averaged over all points of equal signed distance from the tentative boundary, as a function of this signed distance.
FIG. 13b(ii). Continued.
More formal tests which stressed analogs of our own perception of edges confirmed these informal judgments (Prewitt, 1970a). First, examination of individual density and gradient profiles taken orthogonal to histogram-generated cell and nuclear boundaries, and composites of such profiles, showed that these boundaries coincide with the inflection point or position of maximum gradient magnitude. Second, statistical testing of serial values in density and gradient profiles indicated more significant heterogeneity near a boundary than within background or within a major cell constituent. In Fig. 13b and c, the mean gray value and mean gradient are plotted as functions of distance outward and inward from the postulated nuclear boundary of a neutrophil. Signed distances were determined by a bilateral extension (Prewitt, 1970a) of the discrete distance transform (Rosenfeld and Pfaltz, 1966), positive distances being propagated inside the boundary and negative distances being propagated outside the boundary. The discrete distances are computed by modelling the propagation of unit waves from boundaries into free space until they interfere and cancel (Blum, 1967).
Figure 13a gives a gray scale rendition of the results of applying the bilateral distance transform. If the postulated boundary corresponds to the actual boundary as defined by the maximum derivative, then the plotted gradient magnitudes should peak at distance 0 (in optical terminology, the line spread function is the derivative of the edge spread function), as it does. Using these concepts, it is possible to iteratively readjust a tentative histogram-generated density threshold until the isophote (i.e., density isopleth) which best coincides with the average maximal gradient is found. Finally, assessment of the homogeneity of the several cell types, using boundary-dependent properties which are known independently to express cell invariants preserved under imaging, provided an external test of consistency for the boundary positions.
FIG. 13c(i). Validation of tentative cell boundaries by inspection of averaged orthogonal density and gradient profiles. (c) Mean gradient values, similarly obtained and plotted as a function of signed distance from the tentative boundary. The gradient peaks at the 0 value for distance, i.e., at the boundary position determined by density thresholding.
FIG. 13c(ii). Continued.
The rationale is provided by the statistical concept of unbiased, minimum-variance estimation. If the variable in question is a stable characteristic of cell type, then the boundary construction procedure is equivalent to a method of estimation: the closer the estimate is to the true value in probability, the better the method of boundary construction. The best boundary-construction procedure is equivalent to unbiased, minimum-variance estimation. As an illustration, coefficients of variation are given in Table II for integrated nuclear density or absorbance, a quantity which is proportional to DNA content of the cell in the present context, and according to current biological theories, should be relatively stable for the leukocytes. This parameter is calculated by the CELLS programs, using the histogram method. The low variability of nuclear absorbance observed for computer-selected thresholds was not significantly different from that observed for thresholds obtained manually from the histograms and confirmed by examination of the images.
TABLE II
COEFFICIENTS OF VARIATION FOR NUCLEAR ABSORBANCE (INTEGRATED DENSITY)

                                        Threshold obtained by
                              Visual inspection of histograms   Computer inspection of histograms
Cell type                     L    M    N    E    B             L    M    N    E    B
Parameter: nuclear absorbance 11   15   6    12   10            11   15   7    42   9
With the exception of the anomalously high value for eosinophils, the coefficients of variation ranged from 7 to 15% and they were considered reasonable. The measurement for eosinophils is complicated by the presence of absorbing cytoplasmic granules. The thresholding principle can be applied to image transforms as well as to originals in order to effect picture segmentation. In particular, it can be applied to spatial derivatives and other contrast functions or generalized gradients. As an example, Fig. 14 is based on the scanned autoradiograph of a leukocyte with constituents which had been selectively radioactively labeled while the cell was alive. Particles emitted from the fixed cell expose silver grains in a photographic emulsion on the microscope slide. When developed, this shows dark granulation superimposed on the cell image. If the cell is heavily labeled (i.e., takes up a large quantity of the radioisotope), or if exposure time is long, the granules will be very dark and densely packed. The number of grains per mitotic (reproducing) cell and the number of such labeled cells as a function of time since labeling gives valuable information about the cell cycle. Density thresholding will isolate clumps rather than individual grains, so that grain counting can only be approximate by this method. The derivative, on the other hand, contains dark annuli with very light, small centers marking individual grains. These blobs can be extracted by a thresholding sequence using a very high level first to isolate clumps, and then a very low threshold applied only to the clumps isolated in the first step. Propagation schemes may be used for segmentation of gray-scale pictures when uniform methods such as density or gradient thresholding are ineffective and inappropriate. We define two types of skeletons which are generalized axes of symmetry: (1) the endoskeleton, a conventional skeleton of an object or blob, and (2) the exoskeleton of the field of view, that is, the skeleton of the background. Assuming that objects are compact subsets in the plane, for each object point there exists at least one point of the complement which is closest to it.
FIG. 14. Leukocyte autoradiograph. Silver grains are superimposed on the image of the radioactively labeled blood cell. (a) 32-level computer reconstitution. (b) Image thresholded at a level suggested by the density histogram. (c) Spatial derivative shows thick dark annuli at grain boundaries and very light small centers at positions of maximum gray-level saturation in the grain images.
FIG. 15. Partitioning the image field (a) by means of exoskeletons. (Left) Thresholded version of neutrophil with spurious granule in the field of view. Only the entire cell and the nearby granule have been delineated. (Middle) Outward distance transform from these boundaries. Equidistant points are denoted by identical symbols. (Right) Exoskeleton of the field of view. The locus of pseudo-maxima of the distance transform is indicated by the retained distance values. (b) Characterizing nuclear shape by means of endoskeletons. (Left) Thresholded version of nucleus of neutrophil, based on density isopleth derived from the histogram. (Middle) Inward distance transform from this boundary. (Right) Endoskeleton of the nucleus. The locus of pseudo-maxima of the distance transform is indicated by the retained distance values.
If there are two such (equidistant) complementary points where this minimum point-to-complement distance is assumed, then the original object point is assigned to the skeleton of the object. The exoskeleton may be computed by a propagation method such as the medial axis transform (Blum, 1967), or by using a distance transform (preferably gray-weighted), and tracking the locus of local maxima. Figure 15 is based on a 256-level scan of a neutrophil and a dark granule in the same field of view. Starting from object approximations obtained by thresholding the original image "high," an exoskeleton separating spurious granule from cell is derived (Fig. 15a). The same procedure applied to the interior of the neutrophil, delineated by histogram-directed density thresholding, generates a (disconnected discrete) nuclear skeleton. This skeleton completely characterizes the shape of the nucleus and permits its reconstruction (Fig. 15b).
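A rough sketch of the endoskeleton construction, using an off-the-shelf city-block distance transform and a 4-neighbor test for the pseudo-maxima, is given below. It assumes scipy's ndimage.distance_transform_cdt and a binary object mask; the gray-weighted distance transform preferred in the text is not shown, and the resulting skeleton is, as noted above, generally disconnected.

# Sketch of endoskeleton/exoskeleton extraction: keep the points of the inward
# distance transform that are local (pseudo-) maxima over their 4-neighbors.
import numpy as np
from scipy import ndimage

def endoskeleton(object_mask):
    """Return a boolean mask of skeleton points of a binary object."""
    obj = np.asarray(object_mask, dtype=bool)
    dist = ndimage.distance_transform_cdt(obj, metric="taxicab")
    skel = np.zeros_like(obj)
    rows, cols = obj.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if not obj[r, c]:
                continue
            centre = dist[r, c]
            neighbours = (dist[r - 1, c], dist[r + 1, c],
                          dist[r, c - 1], dist[r, c + 1])
            if all(centre >= n for n in neighbours):   # pseudo-maximum of the transform
                skel[r, c] = True
    return skel

def exoskeleton(object_mask):
    """Skeleton of the background: the same construction applied to the complement."""
    return endoskeleton(~np.asarray(object_mask, dtype=bool))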
3. Image Description

Leukocyte parameters were derived in two ways: (1) by computations involving the partitioned density histogram, and (2) by processing the articulated gray-tone image and other image transforms. Although stable shape-oriented parameters, textural descriptors, and local detail can be obtained only by the second method, many relevant global or gestalt density-oriented parameters can be obtained by the first, more economical method. These include simple, intuitive functions such as cell and nuclear area, mean cytoplasmic optical density, integrated cell and nuclear optical density (absorbance), and nuclear-cytoplasmic contrast. They also include more complex, abstract functions such as the kurtosis, skewness, and entropy of the optical density histogram. Among the parameters which either cannot be obtained in this way at all, or else cannot be obtained with reliability or precision are cell and nuclear perimeter, mean diameters, asphericity or eccentricity, curvature-describing functions, local textural descriptors, the histogram of local gradients, and the corresponding mean and variance. These require direct operations on the image itself. The current set of 80 leukocyte parameters generated by the CELLS programs comprises 72 density-oriented parameters, derived from the histogram, and 8 shape-oriented features, derived from spatial differentiation and contour tracing (Prewitt, 1965a; 1967). These are presented in Table III. A partial list has been informally arranged into five parameter families which have been assigned the following interpretive names: (1) contrast and texture, (2) nuclear-cytoplasmic comparisons, (3) concentration, (4) size and content, and (5) shape, and is presented in Table IV.
TABLE III
CELLS PARAMETER LIST

A. Density-oriented parameters
   1-4.   Background:  range lowpoint; span; percent of full range; percent of cell range
   5-9.   Cell:        range lowpoint; span; mode; modal frequency; percent of full range
  10-14.  Cytoplasm:   span; mode; modal frequency; percent of full range; percent of cell range
  15-21.  Nucleus:     range lowpoint; range highpoint; span; mode; modal frequency; percent of full range; percent of cell range
  22-25.  "Granules":  range lowpoint; span; percent of full range; percent of cell range
  26-28.  Area:                              cell; cytoplasm; nucleus
  29-31.  Integrated density:                cell; cytoplasm; nucleus
  32-34.  Mean density:                      cell; cytoplasm; nucleus
  35-37.  Standard deviation, density:       cell; cytoplasm; nucleus
  38-40.  Coefficient of variation, density: cell; cytoplasm; nucleus
  41-43.  Skewness of density:               cell; cytoplasm; nucleus
  44-46.  Kurtosis of density:               cell; cytoplasm; nucleus
  47-49.  Entropy of density:                cell; cytoplasm; nucleus
  50-52.  Contrast differential:             cell; cytoplasm; nucleus
  53-54.  Area ratio:                        nucleus/cell; nucleus/cytoplasm
  55-56.  Integrated density ratio:          nucleus/cell; nucleus/cytoplasm
  57-58.  Mean density ratio:                nucleus/cell; nucleus/cytoplasm
  59-60.  Separation of means:               nucleus/cell; nucleus/cytoplasm
  61-62.  Separation of modes:               nucleus/cell; nucleus/cytoplasm
  63-64.  Modal frequency ratio:             nucleus/cell; nucleus/cytoplasm
  65-66.  Standard deviation ratio:          nucleus/cell; nucleus/cytoplasm
  67-68.  Contrast differential ratio:       nucleus/cell; nucleus/cytoplasm
  69.     First minimum
  70.     Second minimum
  71.     Syntactic type of histogram
  72.     Smoothness factor

B. Shape-oriented parameters
  73.     Perimeter (cell)
  74.     Asphericity (cell)
  75.     Diameter, average (cell)
  76.     Perimeter (nucleus)
  77.     Asphericity, lobedness (nucleus)
  78.     Diameter, average (nucleus)
  79.     Diameter ratio (nucleus/cell)
  80.     Asphericity ratio (nucleus/cell)
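All of the density-oriented entries in Table III are functions of a partitioned grey-level histogram. The following sketch, assuming NumPy, illustrates how a few of the global descriptors (area, integrated and mean density, standard deviation, skewness, kurtosis, entropy, mode) might be computed for one partition; the function name and the use of base-2 logarithms for the entropy are illustrative assumptions, not details taken from the CELLS programs.

```python
import numpy as np

def density_descriptors(histogram, levels=None):
    """Global density-oriented descriptors for one histogram partition
    (cell, cytoplasm, or nucleus); histogram[z] counts points of density z."""
    h = np.asarray(histogram, dtype=float)
    z = np.arange(h.size) if levels is None else np.asarray(levels, dtype=float)
    area = h.sum()                              # number of image points
    integrated = (z * h).sum()                  # integrated optical density
    mean = integrated / area
    var = ((z - mean) ** 2 * h).sum() / area
    sd = np.sqrt(var)
    skew = ((z - mean) ** 3 * h).sum() / (area * sd ** 3)
    kurt = ((z - mean) ** 4 * h).sum() / (area * sd ** 4)
    p = h[h > 0] / area
    entropy = -(p * np.log2(p)).sum()
    mode = float(z[h.argmax()])
    return dict(area=area, integrated_density=integrated, mean=mean, std=sd,
                skewness=skew, kurtosis=kurt, entropy=entropy, mode=mode)
```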
A partial list has been informally arranged into five parameter families which have been assigned the following interpretive names: (1) contrast and texture, (2) nuclear-cytoplasmic comparisons, (3) concentration, (4) size and content, and (5) shape; it is presented in Table IV. These families were assembled on the basis of (i) operational definitions of the parameters and their interpretation in visual terms, (ii) the magnitudes of their within-cell type correlations, (iii) the degree of their intersubstitutability in formal leukocyte discrimination by computer, and (iv) their independent participation in minimum-dimension best sets of discriminating parameters.
TABLE IV
LEUKOCYTE PARAMETER FAMILIES

A. Contrast and texture.
  16-18.  Standard deviation:        cell; cytoplasm; nucleus
  19-21.  Coefficient of variation:  cell; cytoplasm; nucleus
  22-24.  Skewness:                  cell; cytoplasm; nucleus
  25-27.  Kurtosis:                  cell; cytoplasm; nucleus
  28-30.  Entropy:                   cell; cytoplasm; nucleus
  35-36.  Mean density ratio:        nucleus/cell; nucleus/cytoplasm
  37-38.  Separation of means:       nucleus-cell; nucleus-cytoplasm
  39-40.  Separation of modes:       nucleus-cell; nucleus-cytoplasm

B. Nuclear-cytoplasmic relationships.
  31-32.  Area ratio:                nucleus/cell; nucleus/cytoplasm
  33-34.  Absorbance ratio:          nucleus/cell; nucleus/cytoplasm
  41-42.  Modal frequency ratio:     nucleus/cell; nucleus/cytoplasm

C. Concentration.
  1, 3, 5.     Mode of density:      cell; cytoplasm; nucleus
  13, 14, 15.  Mean of density:      cell; cytoplasm; nucleus

D. Size and content.
  2, 4, 6.     Modal frequency:      cell; cytoplasm; nucleus
  7-9.         Area:                 cell; cytoplasm; nucleus
  10-12.       Absorbance:           cell; cytoplasm; nucleus

E. Shape.
  43-45.  Cell:          perimeter; asphericity; diameter
  46-48.  Nucleus:       perimeter; asphericity (lobedness); diameter
  49.     Nucleus/cell:  diameter ratio
  50.     Nucleus/cell:  asphericity ratio
Estimation of the density-oriented descriptors from the histograms involved the computation of curve parameters such as moments through order four; modes and intermodal distances; antimodes and ranges; and certain algebraic combinations of these which represent nuclear-cytoplasmic comparisons. Parameters of the entire cell used portions of the histogram which are attributable to the cell. Parameters of nucleus and cytoplasm were computed from their corresponding subdivisions. We refer again to Fig. 9. Quantitative definition of shape required identification and labeling of boundaries and connected components in articulated images. The area of a component was obtained by counting image points which satisfied its definition, i.e., which had gray values between the defining density thresholds for that component. These values were identical to areas obtained from the histograms, as expected. Perimeter estimates were obtained by applying a cross-shaped edge detector at all points of a component. This detector declares a point of a component to be a boundary point of that component if there is a point of the set-theoretic complement above, below, to the left, or to the right of it relative to the rectilinear picture grid. The number of affirmative responses to the edge detector was compensated for diagonal juxtaposition of boundary points by a statistically derived factor, and yielded an approximate perimeter (Freeman, 1961). In addition to perimeter, the shape-oriented parameters included mean "diameter" and "asphericity" for nucleus and for cell, and their ratios. These had the following operational definitions for the cell (j = C) and the nucleus (j = N):
(a) Perimeter

    P_j = k p_j,

where p_j is the actual count of peripheral elements of component j and k is a conversion factor from counts to length.

(b) Area

    A_j = a_j,

where a_j is the actual count of elements of j.

(c) Diameter

    D_j = 4 a_j / (k p_j)

(d) Asphericity or lobedness

    E_j = 4 a_j / (k p_j)^2

(e) Diameter ratio

    R_1 = D_N / D_C = (a_N / a_C) / (p_N / p_C)

(f) Asphericity ratio

    R_2 = E_N / E_C = (a_N / a_C) / (p_N / p_C)^2
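A minimal sketch of the operational definitions (a)-(d) for a single binary component, assuming NumPy; the cross-shaped (4-neighbor) edge detector follows the description above, but the statistically derived diagonal-compensation factor is not applied separately here and must be regarded as folded into the conversion constant k.

```python
import numpy as np

def shape_parameters(mask, k=1.0):
    """Area, perimeter, diameter, and asphericity of one component,
    following definitions (a)-(d); k converts counts to length."""
    m = mask.astype(bool)
    a = int(m.sum())                                   # (b) area: count of points
    # Cross-shaped edge detector: a point is peripheral if any 4-neighbour
    # (above, below, left, right) lies in the complement.
    padded = np.pad(m, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    p_count = int((m & ~interior).sum())               # peripheral-element count
    P = k * p_count                                    # (a) perimeter
    D = 4.0 * a / (k * p_count)                        # (c) diameter
    E = 4.0 * a / (k * p_count) ** 2                   # (d) asphericity (lobedness)
    return a, P, D, E
```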
Perimeters found by the preceding algorithm were not significantly different from results of explicit boundary tracing. For the explicit trace, a boundary point of a component R was defined as any of its points which has points of R and points of the complement of R in its 3 x 3 square neighborhood. The following algorithm, which allows horizontal, vertical, and diagonal tracks, was used to extract the connected outline or boundary of the image component (a sketch follows this paragraph). The leftmost, uppermost point of R is taken as the initium and also as the current frontier point of the boundary. Its left neighbor is the current antecedent. The remaining points in the 3 x 3 neighborhood of the current frontier point are inspected in counterclockwise order, beginning with the antecedent, until another point of R, called the consequent, is located. The frontier point becomes the new antecedent, and the consequent becomes the new frontier point. The search is iterated until the algorithm returns to the initium and the next iteration would lead to a recycle (Prewitt, 1967; 1970a). In general, there was negligible observed difference in leukocyte parameter values and in discrimination performance with minor changes in background reference level or in boundary thresholds, and the best features from the point of view of discrimination were perturbed the least. Some cells with very low contrast due to preparation or to nature occasionally lead to degenerate bimodal rather than trimodal histograms. These included poorly flattened monocytes, very small lymphocytes with relatively large nuclei and extremely low nuclear-cytoplasmic contrast, and eosinophils with heavy cytoplasmic granulation. These cells, prorated for their occurrence in normal blood samples, would account for only 1% of a white cell sample.
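A sketch of the boundary-following procedure described above, assuming NumPy; the particular cyclic ordering of the eight neighbors and the stopping test are implementation choices, not details taken from the original programs.

```python
import numpy as np

# A fixed cyclic ordering of the eight neighbours in (row, col) offsets;
# the sense of rotation is a convention of this sketch.
OFFSETS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]

def trace_boundary(mask):
    """Connected outline of the single foreground component in a binary mask.
    The initium reappears at the end of the returned list."""
    rows, cols = np.nonzero(mask)
    # Leftmost, uppermost foreground point: the initium.
    c0, r0 = min(zip(cols, rows))
    frontier = (int(r0), int(c0))
    antecedent = (frontier[0], frontier[1] - 1)    # its left neighbour (background)
    boundary = [frontier]
    first_state = None
    while True:
        d = (antecedent[0] - frontier[0], antecedent[1] - frontier[1])
        start = OFFSETS.index(d)
        # Inspect the remaining neighbours, beginning just after the antecedent,
        # until another point of the component (the consequent) is found.
        consequent = None
        for step in range(1, 9):
            dr, dc = OFFSETS[(start + step) % 8]
            cand = (frontier[0] + dr, frontier[1] + dc)
            if (0 <= cand[0] < mask.shape[0] and 0 <= cand[1] < mask.shape[1]
                    and mask[cand]):
                consequent = cand
                break
        if consequent is None:                     # isolated single pixel
            return boundary
        antecedent, frontier = frontier, consequent
        state = (antecedent, frontier)
        if first_state is None:
            first_state = state
        elif state == first_state:                 # about to recycle: stop
            return boundary
        boundary.append(frontier)
```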
Simple density thresholding, unassisted by gradient or color information or by syntax- and semantics-directed image processing, will never give perfect articulation in these cells. In spite of this theoretical limitation, these difficult cells were characterized well using collections of parameters extracted by the CELLS programs from the histograms, and they were properly identified by the recognition logic. Furthermore, we have also found that for the problem of leukocyte identification, sets of CELLS parameters, dependent only on delineating unarticulated cells from their surround, are sufficient for recognition purposes. Thus the issue of recognition is to some extent separate from the issue of image elaboration for the case of the five unambiguous leukocyte classes which were studied.

4. Discrimination: Logic and Rationale
A description of leukocytes by numeric characterization of their images leads naturally to vectorial representation and a decision-theoretic approach to discriminatory logic. Pursuant to this formulation, a cell having measured values x_1, ..., x_p for p features j_1, ..., j_p taken in an arbitrary but fixed order was represented by the vector X = (x_1, ..., x_p) in a p-dimensional space Ω_p, in which each coordinate axis is associated with a specific feature. Since the coordinate values are measurements, X is a random vector of observations. The vector space Ω_p spanned by all possible random vectors constituted the feature space or parameter space for the discrimination problem. Further, for each of the g (= 5) leukocyte types, a corresponding cell image category C_k was postulated. The g categories were presumed collectively to account for all classificatory contingencies, and to exclude shared membership or overlap. In other words, they were assumed to be exhaustive and mutually exclusive. The vectorial representation maps cells into points and cell categories into sets, in a space of limited dimensionality. Different selections of variates clearly may lead to topologically diverse descriptions of the universe of discourse. From the point of view of computer-implemented discrimination, feature spaces in which the images of the various categories correspond to compact and well-separated clusters of points are desirable. Part of the motivation for developing feature selection algorithms is to generate such well-behaved configurations in feature spaces of low dimensionality. This most favorable disposition is seldom realized, and at best, points cluster in more or less disparate regions of the space in a configuration suggestive of the underlying taxonomy. The clusters generated by different categories overlap to various degrees, depending on the extent of natural
variation and measurement uncertainty in the selected features, and their total number and interrelationships. Perfect recognition is unattainable in principle, but we can instead look for the "best" performance possible under the circumstances. Toward this end, a measure of discriminatory achievement must be defined, and an optimal classification strategy relative to this measure must be determined. Statistical decision theory is used to partition the feature space or selected subspaces into g mutually exclusive and exhaustive regions (R_1, ..., R_g), corresponding to the g alternative leukocyte types (C_1, ..., C_g). Identification or recognition then consists in assigning a cell to the category C_i if, and only if, its vectorial representation X falls in the associated region R_i. The partition is deliberately constructed so that long-term recognition performance is optimized in some sense (Welch, 1939; Fisher, 1936, 1937; Rao, 1948, 1952; Anderson, 1951; Nilsson, 1965). The optimal strategy is a function not only of the selected performance measure and the constraints of optimization, but of the particular combination of probability distributions peculiar to the problem at hand and of the relative seriousness of unacceptable decisions as well. The introduction of probabilistic concepts is motivated by several considerations. First, they provide a vehicle for expressing the consequences of measurement variability that is unavoidable in the extraction of numeric descriptors. Second, they allow for differences in the degree to which members of a class exhibit the various features and for correlations between features. Third, although some of our preformal or intuitive criteria for recognition require precise adherence to a membership formula, more flexible specifications are appropriate when continuous rather than discrete parameters are involved, and probabilistic concepts render this flexibility better than do set-theoretic concepts. Finally, and perhaps most important, the statistical point of view acknowledges that there may be an inherent ambiguity or indeterminacy attached to the discrimination process which sometimes makes unique categorization relative to a given taxonomic system unattainable. Desirable discriminatory performance was construed to mean consensus between the computer and an acknowledged cytological expert, with respect to the complete set of preconceived classificatory alternatives. This goal was interpreted mathematically as an optimization problem, directed toward minimizing the adverse consequences of "erroneous" or divergent decisions, or more specifically, toward minimizing the probability or expected cost of "misclassification." It is possible and sometimes desirable to constrain the optimization by prescribing tolerances for the likelihood that a decision will be withheld or that a critical misclassification will be made. This modified approach has favorable implications for the control
of error rate, but obviously at the expense of introducing a nonzero rejection rate (Rao, 1952; Chow, 1957; Chu, 1965; Chu and Chueh, 1967). For the leukocyte problem, we adhered to the simpler and more demanding approach of associating just one category rather than several or none at all with a single decision region. Bayesian decision strategies are suitable formalizations for this because by design they maximize the probability of concordance or else minimize the penalties of discordance (computer "error"). These decision strategies optimize (maximize or minimize) an objective which is a function of (1) the prior probabilities, p_i and p(X | i), and (2) a loss or cost matrix, L = [L(D_j | C_i)], specific to the application and stipulated in advance. The prior information consists of (1) the relative frequencies P = {p_1, ..., p_g}, where p_i is the prior probability that C_i is the "correct" category for a randomly selected cell, and (2) the relative frequencies {p(X | i): i = 1, ..., g} of the possible observations within each category [i.e., p(X | i) is the conditional density for X when it does in fact represent a cell belonging to C_i]. The unconditional probability density p(X) of the random vector X, independent of its origin, is the same as the joint density of the p parameters x_1, ..., x_p, and it is a linear combination of the preceding two probabilities:
    p(X) = Σ_{i=1}^{g} p(X | i) p_i,                                  (1)

    p_i > 0,   i = 1, ..., g,                                         (2)

    Σ_{i=1}^{g} p_i = 1.                                              (3)
Bayes' inversion formula relates these probabilities to p(i | X), the conditional probability that the specific vector X represents a member of C_i, that is, to the posterior probability of C_i:

    p(X) p(i | X) = p_i p(X | i)    for all X and i = 1, ..., g.      (4)
Element L(D_j | C_i) = L(j | i) ≥ 0 of the loss matrix expresses the penalty (+) or benefit (-) incurred by deciding D_j in favor of category C_j, when the subject actually belongs to category C_i. If a uniform penalty is assigned for erroneous decisions, a uniform bonus is assigned for correct decisions, and indecision is prohibited, then the loss function is represented by a square symmetric matrix:

    L(D_j | C_i) = 1 - δ_ij = 0 if i = j,  1 if i ≠ j.                (5)
Anticipated discriminatory achievement depends not only on the loss matrix L, which reflects the utilitarian importance attached to the various classification outcomes and is independent of any decision rules, but also on an event matrix E[R], which gives the probabilities of the various classification possibilities under the yet to be selected decision rule R. Corresponding to the element L(j | i) of L, which stipulates the penalty or benefit associated with assigning a member of C_i to C_j, is the element E(j | i, R) of E[R], which gives the probability of the same event. The event matrix E[R] is a function of the category regions {R_k}_{k=1}^{g} induced in Ω_p by the decision rule R, as well as a function of the conditional probabilities p(X | k):
    E(j | i, R) = ∫_{R_j} p(X | i) dX.                                (6)
The expected risk under the decision rule R is defined as the long-term overall cost of classification

    C[R] = Σ_{i=1}^{g} Σ_{j=1}^{g} p_i L(j | i) E(j | i, R).          (7)

In terms of the conditional average classification cost for an individual decision on the vector X favoring C_k,

    L_X(k) = Σ_{i=1}^{g} L(k | i) p(i | X),                           (8)

this overall cost is equal to

    C[R] = Σ_{k=1}^{g} ∫_{R_k} L_X(k) p(X) dX.                        (9)

Note that C[R] is a function of the vector of prior probabilities, P = (p_1, ..., p_g). By definition, the Bayesian optimal classification rule R* minimizes the (overall) expected cost of classification C[R]. This is ensured by favoring the category with the smallest conditional cost L_X(k) for an individual decision about X. Equivalently, it is ensured by maximizing a discriminant D_k(X) which is a strictly decreasing function of L_X(k), for example,

    D_k(X) = ln[1 - L_X(k)]                                           (10)

over all k. Thus the optimal Bayes' strategy is given by the following rule
    R*: assign X to C_k where D_k(X) ≥ D_i(X) for i = 1, ..., g,      (11)

and the corresponding system of best regions {R_k*} is given by

    R_k* = {X | D_k(X) ≥ D_i(X),  i = 1, ..., g}.                     (11')
The corresponding minimum cost C[R*] is obtained by using the best regions R_k* in the formula for C[R], and represents the theoretically ideal or optimal performance. For the special case of symmetric cost, algebraic simplification shows that the average loss is also the probability of misrecognition, so that the "error rate" is minimized:

    Err[R] = 1 - Σ_j p_j E[j | j, R] = C[R].                          (12)
Bayes' strategy favors the most likely category:

    R**: assign X to C_i   if p_i p(X | i) ≥ p_j p(X | j),
         or equivalently, if p(i | X) ≥ p(j | X),   for j = 1, ..., g.   (13)

The best regions are

    R_k** = {X | p_k p(X | k) ≥ p_j p(X | j),  j = 1, ..., g},        (14)

and appropriate discriminants are

    D_k(X) = ln p(k | X)   or   ln(p_k p(X | k)).                     (15)
This is the unconditional maximum likelihood decision rule. Ignoring or equating prior category probabilities (p_i = p_j for all i, j), the Bayes rule becomes the well-known (conditional) maximum likelihood strategy:

    R***: assign X to C_i   if λ(i, j) = p(X | i)/p(X | j) ≥ 1.       (16)

It favors assignment of X to the most likely source, independent of category frequencies in the composite population, and maximizes the average success rate over the g groups, (1/g) Σ_i E(i | i, R). The discriminants reduce to

    D_k(X) = ln p(X | k).                                             (17)
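The rules R*, R**, and R*** can be written compactly once the class-conditional densities are available. A minimal sketch assuming NumPy; the array layout (rows indexing observations, columns indexing categories) is an assumption of this illustration rather than anything prescribed by the text.

```python
import numpy as np

def bayes_decision(cond_densities, priors, loss=None):
    """Assign each row of cond_densities (n x g, entry = p(X | k)) to a category.
    loss=None implies the symmetric 0-1 loss, so the rule reduces to the
    maximum-posterior rule R**; equal priors then give the rule R***."""
    cond = np.asarray(cond_densities, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = cond * priors                                  # p_k p(X | k)
    post = joint / joint.sum(axis=1, keepdims=True)        # p(k | X)
    if loss is None:
        return post.argmax(axis=1)                         # R** (R*** for equal priors)
    # General rule R*: minimise the conditional expected cost
    # L_X(k) = sum_i L(D_k | C_i) p(i | X), with loss[i, k] = L(D_k | C_i).
    expected_cost = post @ np.asarray(loss, dtype=float)
    return expected_cost.argmin(axis=1)
```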
The optimal strategy is also equivalent to a sequence of g(g - 1)/2 pairwise Bayesian comparisons, each culminating in rejection of the class corresponding to the smaller discriminant. At each step, the space Ω_p is partitioned into two regions, according to the sign of B_kj(X) = D_k(X) - D_j(X). The region of positive sign, B_kj^+, is associated with C_k, and the region of negative sign, B_kj^-, with C_j. They are separated by a decision boundary B_kj^0 on which B_kj(X) vanishes. R_k is simply the common portion of all positive regions B_kj^+ associated with C_k:

    R_k = ∩_{j≠k} B_kj^+.                                             (18)
Its boundary B_k consists of all points on pairwise decision boundaries B_kj^0 which are not in any negative region B_kj^-. Variants in the optimality criteria and in the prior probabilities clearly may lead to very different systems of decision regions and boundaries. The Bayesian rules can be formulated in terms of a thresholding operation applied to the likelihood ratio λ(i, j) = p(X | i)/p(X | j) for all category pairs. Thus,

    R*: assign X to C_i   if λ(i, j) ≥ t(i, j) = p_j/p_i.             (19)
Other useful though suboptimal (non-Bayesian) classification strategies result from varying the conditions on the likelihood threshold, t(i, j). These and similar alternative criteria may be better suited for some discrimination problems than the classical approach. For although Bayes' rule ensures the lowest ideal collective error rate, and therefore the highest overall success rate, it provides no control over the margins between the success and error probabilities for each category: p_i E_ii - Σ_{j≠i} p_j E_ij or p_i E_ii - p_j E_ij, where

    E_ij = ∫_{R_i} p(X | j) dX,    i, j = 1, ..., g.                  (20)
Merely the prevalence of correct assignments into each category is guaranteed by the nature of the best regions:

    p_i E_ii = p_i ∫_{R_i*} p(X | i) dX ≥ p_j ∫_{R_i*} p(X | j) dX = p_j E_ij.   (21)
However, unless the success margin for a particular category is substantial, little confidence can be attached to an assignment into it. This inherent deficiency of Bayes rule may make it totally inadequate for some applications, even though it is optimal. Two “suboptimal” decision strategies, a generalized Neyman-Pearson rule and a minimax rule, merit consideration as alternatives.
While the maximum likelihood strategy assures only that E_ii ≥ E_ij, the generalized Neyman-Pearson method (Rao, 1952) strives to obtain the highest success rates consistent with constraints (upper bounds) on the probabilities of misassignment to individual groups. Control over the error rates is obtained by interspersing a doubtful zone among the category regions, and broadening the classification options to include no assignment at all. Confidence in individual decisions is enhanced and the error rate is reduced by the expensive artifact of permitting deferred judgment or rejection. The regions S_i which maximize the success rates E_ii when upper bounds a_ij are placed on the probabilities of incorrect assignments, E_ij ≤ a_ij or E_ij = a_ij (j ≠ i), have the form

    S_i = {X | p(X | i) ≥ Σ_{j≠i} b_ij p(X | j)},                     (22)
where b_ij is chosen to enforce the bounds a_ij. If R_i* denotes the best region {X | p(X | i) ≥ p(X | j) for all j}, then the doubtful regions are D_i = R_i* - S_i. Analogously, the region T_i which maximizes E_ii while maintaining a preselected probability of misclassification into C_i has the form

    T_i = {X | p_i p(X | i) ≥ b_i Σ_{j≠i} p_j p(X | j)}.              (24)
The success rate is bounded away from the error rate as follows:

    E_ii ≥ b_i a_i   or   p_i E_ii ≥ b_i Σ_{j≠i} p_j E_ij.            (25)
These Neyman-Pearson regions are hard to determine analytically. The more conservative minimax strategy (von Mises, 1945) provides for the worst category error rate by minimizing the maximum probability of mistaken assignment into any one class. The total error rate for an arbitrary decision rule R is bounded by the maximum and minimum conditional category error rates:

    min_i (1 - E_ii) ≤ 1 - Σ_i p_i E_ii ≤ max_j (1 - E_jj).           (26)
The largest individual error rate is minimal for the decision regions R_i*** which equalize the error rates. This minimax strategy R*** has the defining property

    max_i (1 - E_ii[R***]) ≤ max_i (1 - E_ii[R])    for all R,        (27)
and the corresponding regions satisfy

    ∫_{R_i***} p(X | i) dX = E_ii[R***] = 1 - a,   constant for i = 1, ..., g.   (28)

The total minimax error rate is independent of the prior probabilities and equals the uniform individual rate:

    1 - Σ_i p_i E_ii[R***] = 1 - (1 - a) Σ_i p_i = a.                 (29)
Only in exceptional cases is the minimax strategy also Bayesian, yielding a minimum overall error rate. The minimax approach can be generalized further (Rao, 1952) so that the error rates for each group are in preassigned ratio r_1 : r_2 : ... : r_g. The regions are modified Bayesian regions:

    R_i*** = {X | λ_i p(X | i) ≥ λ_j p(X | j),  j = 1, ..., g},       (30)
where λ_i and λ_j are selected to ensure the ratio (1 - E_ii)/(1 - E_jj) = r_i/r_j. The minimax strategy grants equal status to all categories, and in doing so, it generally sharpens the Bayesian discrimination of poorly defined or diffuse categories, while degrading the discrimination of well-defined categories. For this reason, it does not seem appropriate for critical machine perception. Since the success margins for leukocytes were demonstrated by experiment to be wholly adequate, the need for alternative approaches to optimization was obviated. As the universe of discourse is expanded, however, it is likely that a Neyman-Pearson type alternative or sequential decision procedures will prove useful. The analytic form of the discriminants is determined by the specific probabilistic assumptions about the cell populations and the cost matrix. Linear and quadratic discriminants are technically optimal when the conditional densities p(X | i) are unimodal, ellipsoidally symmetric multivariate densities, and the loss is symmetric (i.e., when a decision is always forced and the loss function assigns a standard penalty for erroneous computer decisions but none for correct decisions) (Cooper, 1964; Day, 1969). The discriminants are monotonic functions (logarithms, for example) of the quantity

    a_p |Σ_i|^{-1/2} f[(X - U_i)^T Σ_i^{-1} (X - U_i)] p_i,           (31)
where a_p is a positive constant depending only on the dimensionality of the feature space, U_i is the multivariate population mean for C_i, Σ_i is the corresponding dispersion matrix, and f is a monotonically decreasing function of its argument. Using the notation E[z] for the expectation of a random
variable z, and E[M] = (E[M_ij]) for the expectation of a matrix M = (M_ij), the population parameters for C_i are
    U_i = E[X | X ∈ C_i] = ∫_{Ω_p} X p(X | i) dX,                     (32)

    Σ_i = E[(X - U_i)(X - U_i)^T | X ∈ C_i]
        = ∫_{Ω_p} (X - U_i)(X - U_i)^T p(X | i) dX.                   (33)
Multinormal probability densities fall into this category:

    p(X | i) = (2π)^{-p/2} |Σ_i|^{-1/2} exp[-½ (X - U_i)^T Σ_i^{-1} (X - U_i)].   (34)
Probability isopleths for such distributions are hyperquadrics in Ω_p. Linear discriminants and hyperplane decision boundaries are optimal for populations which differ in location U_i but not in dispersion Σ_i. Quadratic discriminants and hyperquadric decision boundaries are optimal when the dispersions are not identical. The discriminants for general multinormal discrimination are the family of functions

    Q_i(X) = (-½ p ln 2π) + (ln p_i - ½ ln |Σ_i|) - ½ (X - U_i)^T Σ_i^{-1} (X - U_i),   (35)

which specialize to linear functions in the homoscedastic (equal variance) case.
The robustness of these discriminant functions is well documented (Cooper, 1964), and they are often used in situations for which they are not strictly optimal. In particular, the insensitivity of linear discriminants to heterogeneous dispersion and moderate departures from normality had been noted earlier (Rao, 1952), and they perform impressively well even under significant departures from the less demanding hypotheses which guarantee their optimality. For the linear case, the discriminants simplify to

    D_i(X) = (X - ½ U_i)^T Σ^{-1} U_i + ln p_i.                       (36)
The space Ω_p is partitioned by g(g - 1)/2 decision hyperplanes:

    H_kj(X) = (U_k - U_j)^T Σ^{-1} (X - ½(U_k + U_j)) + ln(p_k/p_j) = 0
              for k ≠ j;  k, j = 1, ..., g.                           (37)
Deleting or equating the prior probabilities p_i, as we did in our application, leads to a simple geometric configuration. The hyperplane H_kj(X) = 0 bisects the line of centers (U_k - U_j) at ½(U_k + U_j), and is orthogonal to
the vector Σ^{-1}(U_k - U_j). H_kj(X) itself is the projected position of X, relative to the midpoint of the line of centers, in this direction. In general, the hyperplane is skew to both the line of centers and the principal axis of the two distributions, that is, the direction of maximum internal variance of each category, given by the eigenvectors of Σ. However, it is perpendicular to the line of centers for uncorrelated parameters with unit variance. This linear discriminant is also the population analog of Fisher's function (Anderson, 1958; Morrison, 1967; Rao, 1952), which maximizes between-class to within-class sample variance for any possible linear combination a^T X of parameters for two categories, or equivalently, maximizes the projected between- to within-sample variance over all possible spatial orientations. The ratio of projected between-class contrast, [a^T(U_k - U_j)]^2, to within-class (common) variance, a^T Σ a,

    F(a) = [a^T(U_k - U_j)]^2 / (a^T Σ a),                            (38)

attains its maximum value in the direction a* = Σ^{-1}(U_k - U_j), which is normal to the decision hyperplane. This maximum F(a*) equals the Mahalanobis distance between the two populations (Mahalanobis, 1936):
    D²_kj = (U_k - U_j)^T Σ^{-1} (U_k - U_j).                         (39)
This, in turn, is the projected intercenter distance (U_k - U_j) perpendicular to the decision hyperplane, and represents the effective intergroup separation after adjustment for the internal covariance of the parameters. The Mahalanobis distance D²_kj is invariant under nonsingular linear transformations and translations of the feature space and is therefore a joint characteristic of the two classes. In pairwise linear discrimination, all the relevant information lies in the direction of the normal vector Σ^{-1}(U_k - U_j) to the decision hyperplane H_kj(X) = 0. The theoretically optimal error probabilities are completely determined by the univariate conditional distributions of the test statistic H_kj(X). When the distributions p(X | k) are normal, H_kj(X) is also normally distributed as X ranges over C_k and over C_j, with mean values m_k and m_j equal to ±½(U_k - U_j)^T Σ^{-1}(U_k - U_j) and constant variance equal to the Mahalanobis distance D²_kj. The pairwise probabilities of misclassification are simple functions of D_kj, since
    E_kj = Prob{H_kj(X) < 0 | X ∈ C_k}
         = ∫_{-∞}^{0} (2π D²_kj)^{-1/2} exp[-(h - m_k)²/(2 D²_kj)] dh   (40)

becomes, by change of variable,

    E_kj = (2π)^{-1/2} ∫_{-∞}^{-½ D_kj} exp(-½ u²) du = E_jk.         (41)
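A small numerical illustration of (39) and (41), assuming NumPy and SciPy: the ideal pairwise error of two equiprobable normal categories with common dispersion depends only on their Mahalanobis distance.

```python
import numpy as np
from scipy.stats import norm

def mahalanobis_sq(U_k, U_j, sigma):
    """Squared Mahalanobis distance between two class means, equation (39)."""
    d = np.asarray(U_k, dtype=float) - np.asarray(U_j, dtype=float)
    return float(d @ np.linalg.solve(sigma, d))

def ideal_pairwise_error(U_k, U_j, sigma):
    """Ideal confusion probability E_kj = Phi(-D_kj / 2), equation (41)."""
    D = np.sqrt(mahalanobis_sq(U_k, U_j, sigma))
    return norm.cdf(-0.5 * D)
```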
Thus, the greater the Mahalanobis distance, the smaller the ideal probability of misclassification. The pairwise decision is both Bayesian and minimax, with an overall error rate of E_kj + E_jk = 2 E_kj. Even when p(X | k) is not normal, the central limit theorem leads us to expect that H_kj(X), being a linear combination of many independent measurements, will be asymptotically normal, regardless of the underlying individual distributions. Although the ideal error for the composite decision can be expressed analytically, it leads to a difficult computation, even for normal distributions, because of the generally asymmetric polygonal decision regions. For this purpose, corresponding to each category C_k, we define a (g - 1)-vector V_k(X) with jth component equal to H_kj(X), j = 1, ..., g, j ≠ k. V_k(X) is a linear transformation of X given by
    V_k(X) = δ^T Σ^{-1} X - γ,                                        (42)

where

    δ = [(U_k - U_i)]_{i≠k}                                           (43)

is the p x (g - 1) matrix of mean differences, and γ is the vector with components

    γ_i = ½ (U_k - U_i)^T Σ^{-1} (U_k + U_i),   i ≠ k.                (44)
V_k(X) is therefore multinormal in C_k with mean ½ δ^T Σ^{-1} δ and dispersion δ^T Σ^{-1} δ if X is multinormal. The success probability for C_k is the joint probability of nonnegative discriminants H_kj(X), j = 1, ..., g, with k fixed:

    E_kk = ∫_{H_kj(X) ≥ 0, all j≠k} (2π)^{-(g-1)/2} |δ^T Σ^{-1} δ|^{-1/2}
           exp{-½ [V_k(X) - ½ δ^T Σ^{-1} δ]^T (δ^T Σ^{-1} δ)^{-1} [V_k(X) - ½ δ^T Σ^{-1} δ]} dV_k(X),   (45)

an integral which is difficult to evaluate directly for g > 3.
Homogeneity of covariance (H_0: Σ_i = Σ for all i), which justifies the use of linear discriminants for multivariate normal populations, was tested against the alternative (H_1: Σ_i ≠ Σ_j for some i and j) by a multivariate analog of the univariate test for homogeneity of several variances (Bartlett, 1934, 1938; Box, 1949). The test compares maximum likelihood estimates of the generalized variances |Σ| and |Σ_i| and is not a significant computational problem. If the estimate Σ̂_i of Σ_i is based on n_i = N_i - 1 degrees
of freedom in a sample of size N_i, then the pooled estimate Σ̂ of Σ when H_0 is true is Σ̂ = Σ_i n_i Σ̂_i / Σ_{i=1}^{g} n_i. The test statistic is Box's M:

    M = n ln |Σ̂| - Σ_{i=1}^{g} n_i ln |Σ̂_i|.                          (46)

The smaller the value of M, the more likely the null hypothesis. The distribution of M(1 - A_1) is asymptotically chi-square with ½(g - 1)p(p + 1) degrees of freedom, when g and p both are less than 4 or 5 and each N_k - 1 is at least 20. For more groups, higher dimensionality, and fewer samples N_k, an F approximation is better. Letting

    n = Σ_i n_i = N - g;     N = Σ_k N_k;     n_i = N_i - 1;

    A_1 = [Σ_i (1/n_i) - 1/n] (2p² + 3p - 1) / [6(p + 1)(g - 1)];
    A_2 = [Σ_i (1/n_i²) - 1/n²] (p - 1)(p + 2) / [6(g - 1)];

    f_1 = ½(g - 1)p(p + 1);     f_2 = (f_1 + 2)/|A_2 - A_1²|;

the F-approximations are

    if A_2 - A_1² > 0, then F(f_1, f_2) = M/b, where b = f_1/(1 - A_1 - f_1/f_2);   (47)

    if A_2 - A_1² < 0, then F(f_1, f_2) = f_2 M/[f_1(b - M)], where b = f_2/(1 - A_1 + 2/f_2).   (48)
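A sketch of the homogeneity test, assuming NumPy and SciPy; the chi-square approximation quoted above is used, and the correction term A_1 is written in the standard Box (1949) form, which is an assumption where the original formula is not legible.

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(samples):
    """Box's M for homogeneity of covariance over a list of (N_k x p) arrays,
    with the chi-square approximation M (1 - A_1) ~ chi2 on (g-1)p(p+1)/2 df."""
    g = len(samples)
    p = samples[0].shape[1]
    n_k = np.array([s.shape[0] - 1 for s in samples], dtype=float)
    n = n_k.sum()
    covs = [np.cov(s, rowvar=False) for s in samples]          # unbiased estimates
    pooled = sum(nk * c for nk, c in zip(n_k, covs)) / n
    M = n * np.log(np.linalg.det(pooled)) - sum(
        nk * np.log(np.linalg.det(c)) for nk, c in zip(n_k, covs))
    A1 = ((1.0 / n_k).sum() - 1.0 / n) * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))
    df = 0.5 * (g - 1) * p * (p + 1)
    return M, chi2.sf(M * (1 - A1), df)
```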
However, the test is very sensitive to even moderate heterogeneity among some of the groups, and adherence to conventional confidence levels would have led us to be overly conservative and favor quadratic discriminants in the leukocyte problem when, in fact, linear functions are quite adequate [see Prewitt (1972a) for a discussion of significance levels in discrimination problems].

5. Linear Logic and the Evaluation of Performance: Logic and Rationale
The Bayesian approach lends itself readily to the problems of leukocyte discrimination. Clinical applications as diverse as the differential counter and the rare cell detector can be accommodated by manipulating the cost
matrix to emphasize certain discriminations and de-emphasize others. Our own goal was not to pursue such specific applications, but to achieve machine discrimination of leukocytes that was as compatible as possible with our own discrimination, independent of cell type and diagnostic intent. Our own discriminatory ability is reflected in the decision-theoretic model when we specify and estimate the conditional probability distributions, p(X | k), and the priors, p_i. By insisting on indifference of machine performance to cell type, we meant first, that all divergent decisions (computer "errors" or "misclassifications") are equally undesirable, and that all convergent decisions ("correct" classifications) are equally meritorious; second, that compatibility of individual decisions was more important than long-term consensus for sequences of decisions; and third, that recognition performance should be uniform for all cell types. These subjective goals were best rendered by selecting a symmetric cost matrix and equating prior probabilities. This is tantamount to assuming the greatest uncertainty about the composition of the source population in the information-theoretic sense. The resultant maximum likelihood strategy is conservative. It is appropriate in the absence of any prior knowledge, or when all categories are to receive equal emphasis. Were it not for the facts that the probabilities p_k and p(X | k) are unknown, and that the penalties and benefits L(D_j | C_i) are difficult to quantitate, discriminants could be constructed, constants could be evaluated, and automated Bayesian recognition would become a straightforward computational exercise. In reality, compromise with the strict definition of optimality is unavoidable. We have followed the practice of using an actual decision rule which approximates the ideal in a statistical sense, and retains the same hypothesized analytic form. Population parameters or discriminant constants are then estimated from prototypes for each category which have been selected and "correctly" labelled by an accepted authority. The acquisition of the requisite information is called "learning," and the initial data base is called the "learning set" or "design set," in contrast to subsequent data bases or "test sets" which are used for evaluation and validation. Learning has been variously classified as (i) parametric or nonparametric, (ii) supervised or nonsupervised, and (iii) adaptive or nonadaptive, according to whether or not (i') the form of the probability density or optimal discriminant is postulated, (ii') the design samples are pre-identified by an expert, and (iii') the data are used simultaneously to construct the decision rule, or serially, in the hope of generating a sequence of decision rules which gradually improve and asymptotically approach optimal performance (Abramson and Braverman, 1962). The method of choice for leukocyte recognition was nonadaptive, supervised linear discrimination. The linear discriminants and polygonal
decision regions were those which would have resulted from minimizing the composite error rate for equally likely multivariate normal (or ellipsoidally symmetric) populations with common dispersion under symmetric loss. For probability distributions of this type, the parameters of the optimal decision boundaries are independent of the precise form of the distribution, but performance is not likewise independent. A decision rule constructed from samples is called an actual rule, in contrast to the ideal or optimal decision rule which is based on (infinite) populations rather than on sample statistics. In the nonadaptive linear discrimination of two equiprobable Gaussian or ellipsoidally symmetric categories with common dispersion, the actual rule is usually the sample analog of the Bayes rule (Fisher, 1936), wherein the population means and common dispersion matrix, U_j, U_k, and Σ, have been replaced by their unbiased, maximum likelihood estimators, Û_j, Û_k, and Σ̂:
    R* ideal:    D_jk(X) = X^T Σ^{-1}(U_j - U_k) - ½(U_j + U_k)^T Σ^{-1}(U_j - U_k),   (49)

    R* actual:   D_A,jk(X) = X^T Σ̂^{-1}(Û_j - Û_k) - ½(Û_j + Û_k)^T Σ̂^{-1}(Û_j - Û_k).   (50)
If N_k samples X_ki of C_k are observed, these ML estimators are given by

    Û_k = (1/N_k) Σ_{i=1}^{N_k} X_ki,                                 (51)

    Ŝ_k = Σ_{i=1}^{N_k} (X_ki - Û_k)(X_ki - Û_k)^T,                   (52)

    Σ̂ = (1/(N - g)) Σ_{k=1}^{g} Ŝ_k,                                  (53)

    N = Σ_{k=1}^{g} N_k.                                              (54)
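A minimal sketch of the plug-in construction (49)-(54), assuming NumPy: the labelled design samples supply Û_k and the pooled Σ̂, and test vectors are then assigned by the equal-prior linear rule.

```python
import numpy as np

def fit_plugin_lda(design_sets):
    """Sample means and pooled dispersion from a list of (N_k x p) design sets."""
    means = np.array([s.mean(axis=0) for s in design_sets])
    N = sum(s.shape[0] for s in design_sets)
    g, p = len(design_sets), design_sets[0].shape[1]
    scatter = np.zeros((p, p))
    for s, m in zip(design_sets, means):
        d = s - m
        scatter += d.T @ d
    sigma_hat = scatter / (N - g)                  # pooled dispersion estimate
    return means, sigma_hat

def classify(X, means, sigma_hat):
    """Equal-prior actual rule: largest plug-in linear discriminant wins."""
    inv = np.linalg.inv(sigma_hat)
    scores = np.array([X @ (inv @ m) - 0.5 * m @ inv @ m for m in means])
    return scores.argmax(axis=0)
```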
These estimates are consistent and approach true population values with increasing accuracy as sample size increases. Whereas decision rules of the preceding type are easy to implement, it is hard to predict their performance. The classification rule R* actually constructed involves estimation of the category probability densities or discriminant parameters, and therefore depends on the idiosyncrasies of the available design sample. Furthermore, actual discriminatory performance will obviously change from test sample to test sample, and
predictions of long-term optimal performance, based on sample statistics, will change similarly with the design set. Although an actual decision rule is inferior to Bayes' rule by definition, performance estimates based on single experiments tend to be optimistically misleading because of the technically improper but common reuse of the original design sample as a test sample, or because of reliance on too small or too homogeneous a test set. For example, even in the aforementioned simple case of two Gaussian populations with common covariance, the actual linear discriminants D_A,kj(X) are not optimal. The true error rate of the ideal rule R* is often estimated by replacing the interpopulation deviance Δ²_kj = (U_k - U_j)^T Σ^{-1}(U_k - U_j) in the normal error formula (41) with its maximum likelihood estimate, D²_kj = (Û_k - Û_j)^T Σ̂^{-1}(Û_k - Û_j). The true error rate of the actual rule is a random variable, subject to sampling fluctuation due to choice of the design sample. It is determined by the very complicated distribution of the discriminant
    V_kj(X) = [X - ½(Û_k + Û_j)]^T Σ̂^{-1}(Û_k - Û_j),                 (55)
which, by virtue of being a function of the random vector X and the design sample statistics Û_k, Û_j, and Σ̂, is, in turn, a function of the unknown population means and dispersions, U_k, U_j, and Σ (Wald, 1944; John, 1961; Lachenbruch and Mickey, 1968; Dunn and Varday, 1966). However, for a specific actual rule, Û_k, Û_j, and Σ̂ are constant as far as test sets are concerned, and V_kj(X) is normally distributed in each category C_i, with mean
    ν_i = [U_i - ½(Û_k + Û_j)]^T Σ̂^{-1}(Û_k - Û_j)                    (56)

and variance

    σ_v² = (Û_k - Û_j)^T Σ̂^{-1} Σ Σ̂^{-1} (Û_k - Û_j).                 (57)
Thus, if Φ denotes the cumulative normal distribution function, then the success probability of the optimal discriminant, p_1 = Φ(½ Δ_kj); the estimate, p_2 = Φ(½ D_kj); the true success rates for the actual rule,

    p_3k = Prob[V_kj(X) ≥ 0 | X ∈ C_k; Û_k, Û_j, Σ̂] = Φ(ν_k/σ_v),
    p_3j = Prob[V_kj(X) ≤ 0 | X ∈ C_j; Û_k, Û_j, Σ̂] = Φ(-ν_j/σ_v);    (58)
and the observed success rates p_4j and p_4k for the actual rule are only approximately equal, especially for small samples. Discrepancies therefore exist in general between (1) the optimal error rate (i.e., the true error rate for the optimal discriminant); (2) the estimated optimal error rate (i.e., the estimated error rate for the optimal discriminant, obtained from the Bayes' risk by replacing population parameters
with maximum likelihood estimates); (3) the actual error rate (i.e., the true error rate of the actual discriminant); and (4) the observed error rate (i.e., the error rate found in practice with the actual rule). From our point of view, since the underlying population distributions are unknown, it is impossible to compare observed discriminatory achievement with the theoretical potential of the ideal discriminator. Under these circumstances, the effectiveness of the entire recognition system is best evaluated by the observed error rate of a specific actual rule. These rates can be summarized in the form of a confusion or scoring matrix, in which the (k, j)th entry gives the number of members of C_k which are classified as members of C_j by the actual rule. The observed error rate of a specific actual rule follows a binomial distribution. Confidence intervals for observed values can therefore be obtained, their lengths, of course, depending on the size of the test sample (Highleyman, 1962). The observed error rate for class C_k is the relative frequency of members misclassified, m_ek, to the total number of samples for that class, N_k. These relative frequencies are binomially distributed random variables for a fixed actual rule. The overall theoretical error rate E is a function of the g individual theoretical error probabilities E_k for the actual rule:

    E = Σ_{k=1}^{g} p_k E_k.                                          (59)
Using test sets that are random samples of the composite population, the theoretical error rate E has the maximum likelihood estimator Ê = m_e/N, where m_e is the observed number of misclassifications in a test set of size N. The relative frequency is binomially distributed with mean value E and variance E(1 - E)/N = (Σ_k p_k E_k)(1 - Σ_k p_k E_k)/N:
    Prob[m_e | N] = C(N, m_e) E^{m_e} (1 - E)^{N - m_e}.              (60)
Confidence limits for binomial proportions and 95% confidence intervals for a binomially distributed variable as a function of sample size are readily available from charts and tables and can be used for evaluating the experimental results. Because of the relationship between the F distribution and the binomial, lower and upper confidence limits are given explicitly by the formulas:

    p_L(N, M) = M / [M + (N - M + 1) F(γ; 2N - 2M + 2; 2M)],
    p_U(N, M) = 1 - p_L(N, N - M),                                    (61)

where F(γ; n_1; n_2) stands for the γth critical value of the F distribution with n_1 numerator degrees of freedom and n_2 denominator degrees of freedom (Abramowitz and Stegun, 1964).
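A sketch of the limits in (61), assuming SciPy; the critical value is taken here as the upper 1 - α/2 quantile of the F distribution for a confidence coefficient 1 - α, which is an interpretive assumption about the γ in (61).

```python
from scipy.stats import f as f_dist

def binomial_limits(N, M, confidence=0.95):
    """Lower and upper confidence limits for an observed proportion M/N,
    using the F-distribution form of equation (61)."""
    alpha = 1.0 - confidence

    def lower(n, m):
        if m == 0:
            return 0.0
        q = f_dist.ppf(1.0 - alpha / 2.0, 2 * (n - m + 1), 2 * m)
        return m / (m + (n - m + 1) * q)

    return lower(N, M), 1.0 - lower(N, N - M)
```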
Figure 16 reproduces such confidence intervals for moderate-size samples. Tighter confidence intervals can be obtained by random sampling of the component categories rather than the undifferentiated population, provided that the prior probabilities are known. The N test samples should be distributed so that each category is sampled in proportion to its prior probability, N_i/N = p_i, and the observed error frequencies m_ei should be tallied separately. The ML estimate of the composite error rate is Ê = (1/N) Σ_i m_ei, and its variance, (1/N²) Σ_i N_i E_i(1 - E_i) = (1/N) Σ_i p_i E_i(1 - E_i), is smaller than the variance for unselective random sampling by the amount Σ_i p_i(E_i - E)²/N. The category error rates are binomially distributed with unlike means, while the composite error rate is approximately Poisson when the individual error rates are small, and approximately binomial when, in addition, the overall error frequency is small. As supplementary measures of discriminatory achievement, tests of significance based on sample estimates of Mahalanobis distances and on likelihood ratio tests for equality of multivariate group means were used, but with caution. Since overall tests of significance can mask details of discrimination, especially when the groups are not symmetrically dispersed in the feature space, they were accompanied by more sensitive pairwise tests. Interest in these supplements to the error rate centers around their mathematical additivity, which makes them potentially useful for feature selection. The appropriate statistic for judging equality of two multivariate sample means is Hotelling's (1931) T², which is proportional to the sample intergroup distance D²_kj, and is equal to Fisher's critical ratio (1936, 1938):
    T² = [N_k N_j/(N_k + N_j)] (Û_k - Û_j)^T S_kj^{-1} (Û_k - Û_j)
       = [N_k N_j/(N_k + N_j)] D²_kj,                                 (62)

where S_kj is a pooled covariance estimate based on (N - m) degrees of freedom. Under the null hypothesis, H_0: U_k = U_j, specifying no real difference in the mean values of the p features, the following quantity has the F distribution with p and (N_k + N_j - p - 1) degrees of freedom:

    F = [(N_k + N_j - p - 1)/((N_k + N_j - 2) p)] T².                 (63)

When only two groups are involved, S_kj is calculated from the unbiased estimates of the group covariances S_k and S_j:

    S_kj = [(N_k - 1) S_k + (N_j - 1) S_j]/(N_k + N_j - 2).           (64)
When more than two groups are involved, S_kj is based on a weighted, unbiased sum of all g within-group sums of squares of products and cross-products matrices, using N = Σ_k N_k - g degrees of freedom. For a probability α of erroneously rejecting H_0 and accepting the alternative, H_1: U_k ≠ U_j, of significant differences (detecting a false distinction), the critical region is F ≥ F_α. The probability of wrongly accepting H_0 (ignoring true distinctions) is given by a noncentral F distribution (Bose and Roy, 1938; Hsu, 1938).
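A sketch of the two-sample test in the standard form reconstructed above as (62)-(64), assuming NumPy and SciPy.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2(Xk, Xj):
    """Two-sample Hotelling T^2, its F transform, and the tail probability."""
    Nk, p = Xk.shape
    Nj = Xj.shape[0]
    dk, dj = Xk - Xk.mean(axis=0), Xj - Xj.mean(axis=0)
    S = (dk.T @ dk + dj.T @ dj) / (Nk + Nj - 2)          # pooled covariance (64)
    diff = Xk.mean(axis=0) - Xj.mean(axis=0)
    T2 = (Nk * Nj) / (Nk + Nj) * diff @ np.linalg.solve(S, diff)   # (62)
    F = (Nk + Nj - p - 1) / ((Nk + Nj - 2) * p) * T2               # (63)
    return T2, F, f_dist.sf(F, p, Nk + Nj - p - 1)
```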
FIG. 16. Confidence intervals for binomial proportions. (a) Confidence coefficient 0.99. (b) Confidence coefficient 0.95. Confidence bands are plotted for the observed proportion of favorable events for outcomes following the binomial probability distribution with theoretical probability of occurrence p. N is the total number of events observed and X is the number of favorable events observed among these.
Specifically, if Y is normally distributed with mean U and covariance Σ, estimated by S on n degrees of freedom, then the quadratic form T² = Y^T S^{-1} Y has the noncentral F distribution with p and (n - p + 1) degrees of freedom and noncentrality parameter δ² = U^T Σ^{-1} U. Significant values indicate real differences between groups, i.e., that the functions have discriminatory potential. When comparison of a sample mean Û_k is to be made to some preselected quantity U* rather than to another sample mean, the same formulas apply with U* replacing Û_j, 1 replacing N_j, and S_k replacing S_kj:

    T² = N_k (Û_k - U*)^T S_k^{-1} (Û_k - U*).                        (65)
Construction of confidence regions for population means follows directly
from this. The 100(1 - α)% region is an ellipse centered at Û:

    N (Û - U)^T S^{-1} (Û - U) ≤ [p(N - 1)/(N - p)] F_α(p, N - p).    (66)

An individual observation may also be compared to sample means in this way. Considering it to be a sample of size one, the variance of (X - Û) is (1 + 1/N) S, and the test quantity is

    T² = [N/(N + 1)] (X - Û)^T S^{-1} (X - Û).                        (67)
The T² test is robust for equal or large N_k and N_j, even when the covariance matrices Σ_k and Σ_j are unequal, but not necessarily otherwise (Ito and Schull, 1964). If the covariance matrix and means are known or have been estimated on a large number of degrees of freedom, the posterior probabilities of the classes for an individual observation can be estimated. For this purpose, the quadratic form D²_Xk = (X - U_k)^T Σ^{-1} (X - U_k) is distributed as chi-squared with p degrees of freedom when X is normally distributed in Ω_p with mean U_k and covariance Σ. In principle, an approximate distribution for the posterior probabilities can be computed from the distribution of D²_Xk, and a measure of confidence in the classificatory decision can be based on this:

    p(k | X) = p(X | k) / Σ_{j=1}^{g} p(X | j)
             = exp(-½ D²_Xk) / Σ_{j=1}^{g} exp(-½ D²_Xj).             (68)

Other tests of hypotheses also simplify under these assumptions. For two samples, when H_0 is true, the difference in sample means, Û_k - Û_j, is N(0, (1/N_k + 1/N_j) Σ), and the statistic when H_0 is true is

    χ²(p) = [N_k N_j/(N_k + N_j)] (Û_k - Û_j)^T Σ^{-1} (Û_k - Û_j).   (69)
For testing a preassigned value, H_0: U_k = U*, Û_k - U* is N[0, (1/N) Σ] and
the test statistic when H_0 is true is

    χ²(p) = N (Û_k - U*)^T Σ^{-1} (Û_k - U*).                         (70)
Large values of χ² are counterindicative for H_0. The Mahalanobis distance can be generalized to several groups, and tests of significance for composite discrimination can be based on the extended form. The statistic is

    D²_g = Σ_{k=1}^{g} N_k (Û_k - Û)^T Σ^{-1} (Û_k - Û),              (71)
where N_k samples of C_k are available, Σ is either exact or has been estimated on a large number of degrees of freedom, and Û is the overall mean. D² can be used as a chi-square with p(g - 1) degrees of freedom to test the compound hypothesis that the mean values based on p features are the same for all populations. The pairwise tests of significance can be extended in yet another way to test the composite hypothesis H_0: U_k = U_j under the assumption of common covariance, against the alternative H_1: U_k ≠ U_j. This extension proceeds from the interpretation of Fisher's discriminant as the maximum attainable ratio of between- to within-class variance for linear compounds of the original variables x_1, ..., x_p. The heart of this method is the analysis of scatter or dispersion into within-class and between-class components, and the introduction of the generalized variance ratio of the determinants of these matrices. The test statistic, the Wilks-Bartlett lambda, Λ (Wilks, 1932; Bartlett, 1934, 1938), varies inversely with the proportion of total dispersion attributable to category differences, and furnishes a formal compound test of equality for the g group centroids, under the hypothesis of common covariance. Lambda is discussed in detail in connection with feature selection.

6. Feature Selection: Logic and Rationale
Since the number of features we consider is limited only by our own ingenuity and patience, we are often confronted with more variables than we would like or can accommodate. It is therefore appropriate to ask (1) which, if any, variables in the current set can be discarded without impairing the discrimination significantly, and (2) can subsets of features surpass the diagnostic performance of the complete set. Feature selection refers to the systematic screening of the original set
of variables for effective and computationally tractable combinations. This is a multiphasic endeavor, involving: (1) feature selection per se, the inclusion of variables which facilitate discrimination; (2) feature reduction, the exclusion of variables which are redundant or irrelevant, and hence possibly unwanted; (3) feature combination, the formation of new variables with desirable mathematical and statistical properties from the original variables; and (4) feature ordering, the serial arrangement of (possibly independent or uncorrelated) variables according to their individual discriminatory merit. The impetus for utilizing multiple features is many-fold. First, the theoretical Bayesian error rate cannot be affected adversely when a feature set is augmented, although definite theoretical benefits have been demonstrated only for some special cases. If the additional variable is at all relevant and not totally redundant, the error rate must decrease. Similarly, between-group contrast, as measured by the Mahalanobis distance, cannot decrease and most likely will increase. If these trends are rapid enough, improving discrimination by adding variates is reasonable to the extent that it remains practical. Finally, the contribution of features to discrimination may be indirect because of correlations. Retention of relevant but covarying features is seemingly inefficient, yet intuition suggests that redundancy might provide a corroborative effect. The outcome is determined by the relationship of interspecific to intraspecific variability and correlation. In individual features, a high ratio of interspecific to intraspecific variability is desirable. In pairs of features, a high intraspecific correlation with interspecific correlation of opposite sign is desirable. As Figs. 17a and 17c show, relatively poor discrimination with a given set of features sometimes can be dramatically enhanced by the addition of suitably correlated features which by themselves are not diagnostic. However, correlation and not dimensionality is the primary factor operating. Uncritical addition of variables with reinforcing rather than contrasting patterns of intragroup and intergroup correlations can be self-defeating, as Fig. 17b shows. Practice and principle are not in harmony, however. The confluent experience of many investigators indicates that recognition does not necessarily improve and often deteriorates when arbitrarily more features are included. This apparent loss of information takes the form of an increase in observed error rate, or else a decline in the significance level of achieved discrimination, as measured by multivariate test statistics such as Hotelling's T² and Wilks' Λ (Rao, 1952). Although these contrary patterns may occasionally be sampling artifacts, theory also suggests that, in some
FIG. 17. Covariance as an element in improving discrimination. Identical spherically and ellipsoidally symmetric probability densities have been represented by probability isopleths in various configurations to illustrate the influence of covariance on discrimination. Relatively poor discrimination with a given set of features can be dramatically enhanced by inclusion of suitably correlated features which by themselves are not diagnostic. Correlation and not dimensionality is the primary operant. (a) Inclusion of variables with internal correlation patterns contradistinct to between-class differences enhances discriminability by reducing overlap of the distributions. (b) Inclusion of variables with reinforcing rather than contrasting patterns of intra-group and intergroup correlations can increase overlap, especially for highly eccentric distributions, and hence be self-defeating. (c) Identical marginal distributions along the vertical axis render Y useless as a solitary variable for discriminatory purposes. Marginal distributions along the horizontal axis are not identical, but overlap to an extent that likewise precludes adequate univariate discrimination using X alone. However, two-dimensional orientation, scale, and location for these two distributions are favorable for bivariate linear discrimination using both X and Y. The dotted line indicates the optimal discriminant in two dimensions.
special cases, feature selection may lead to obvious economies in data collection and processing, without compromising discriminatory resolution. Seemingly, but not actually, conflicting claims have been cited. On the one hand, theoretical investigations based on infinite sample sizes (populations) have led to the optimistic result of asymptotically perfect discrimination: the error probability becomes arbitrarily small as the variables become sufficiently numerous, provided that appropriate (but unrealistic) statistical hypotheses are satisfied. To see that, in any case, by adding another variable we may fare no better, but we shall fare no worse, let X_n be the random observation vector (x_1, ..., x_n) in Ω_n, and let X_{n+1} be the extension (x_1, ..., x_n, x_{n+1}) in Ω_{n+1}. For class C_i, let p_i be the prior probability; p(X_{n+1} | i) and p(X_n | i) = ∫ p(X_{n+1} | i) dx_{n+1} the conditional densities in Ω_{n+1} and Ω_n, respectively; and R_i(n + 1) and R_i(n) the corresponding Bayesian decision regions.
Then the error rates in Ω_n and Ω_{n+1} compare as follows:

    1 - Σ_i p_i ∫_{R_i(n)} p(X_n | i) dX_n ≥ 1 - Σ_i p_i ∫_{R_i(n+1)} p(X_{n+1} | i) dX_{n+1}.   (72)

Thus the error rate in Ω_{n+1} is always bounded above by the error rate in Ω_n. Equality holds when x_{n+1} is essentially irrelevant, and all cylinders in Ω_{n+1} based on the decision regions R_i(n) are also Bayesian decision regions in Ω_{n+1}. The n-dimensional probability distribution degenerates; it is not distributed throughout the n-space, but is concentrated in a manifold of lower dimensionality (Lukomski, 1939). This occurs, for example, when some variables are connected by a functional dependence. Stronger results on asymptotically perfect discrimination have been obtained. Gaffey (1951) derives a sufficient condition for two Gaussian categories with common covariance. Certain sequences of functions of the class means U^(1) and U^(2), and covariance matrix Σ, must be uniformly bounded away from zero, and a related series must diverge:
    Σ_{k=1}^{n} Σ_{j=1}^{n} (Σ^{-1})_{kj} Δ_k Δ_j   diverges as n increases.   (73b)
Albrecht and Werner (1964) obtain error bounds for independent Gaussian variables or dependent variables for which the central limit theorem applies. Chu and Chueh (1967) derive upper bounds for the Bayesian error probability of independent (but not necessarily Gaussian) characteristics in terms of the differences between the two marginal probability distributions f_1(x_k) and f_2(x_k) of single characteristics x_k. The sufficient condition is divergence of Σ_k d_k, where

    d_k = ∫ |f_1(x_k) - f_2(x_k)| dx_k.                               (74)
On the other hand, pessimistic results have been obtained for more general models making no assumptions about probability structure, statistical independence, or continuity. Hughes (1968) showed the existence
of an optimal measurement complexity (related, but not identical, to dimensionality) for the average over all individual Bayesian pattern recognition problems, when conditional category probabilities are discrete. He considers both finite and infinite samples, and derives formulas relating the optimal measurement complexity to sample size and prior probabilities. Specifically, he studied the mean accuracy (error probability) of all two-category Bayesian classifiers as a function of prior probabilities p_1 and p_2, design sample size m, and measurement complexity K (equal to the total number of resolvable measurements):

    K = Π_{k=1}^{n} r_k,                                              (75)
where $r_k$ represents the number of measurement states for $x_k$, the $k$th of $n$ discrete variables. Thus there are $K$ states $S_i$ for each class. For unrestricted design samples ($m$ infinite), given the observed state $S_j$, the Bayes rule favors the $C_i$ corresponding to the maximum value of $p_i\, p(S_j \mid C_i)$. When the categories are equiprobable, the probability of correct classification approaches an upper bound of 0.75 as measurement complexity becomes infinite. For finite samples, state probabilities are estimated by the appropriate relative frequencies, which are maximum likelihood estimators, and a relative frequency (suboptimal) recognition rule is adopted. Assuming (unrealistically) that such discrimination problems are equiprobable (i.e., that the $p(S_j \mid C_i)$ are uniformly distributed in problem space), the mean accuracy over all possible discrimination problems peaks at a level of measurement complexity determined by the size of the design set and the priors.
Abend et al. (1969) amended and clarified the results for finite design sets, using true Bayesian estimation [see also Lebo and Hughes (1967)], and found qualitatively similar optimality behavior. In finite samples of sizes $m_1$ and $m_2$, complexity $K$, priors $p_1$ and $p_2$, and occurrences $r_{1i}$ and $r_{2i}$ for state $S_i$ in classes $C_1$ and $C_2$, respectively, the true Bayes strategy (maximum mean accuracy) favors $C_1$ or $C_2$ for an observation $S_k$ according to the maximum value of

$$p_j\,(r_{jk} + 1)/(m_j + K), \tag{76}$$

whereas the naive maximum likelihood relative frequency rule favors the class corresponding to the maximum of

$$p_j\, r_{jk}/m_j, \tag{77}$$

provided that the $m$ samples are distributed in proportion to the prior probabilities: $m_j = p_j m$, $j = 1, 2$, and $m = m_1 + m_2$.
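The two finite-sample rules (76) and (77) differ only in how the state probabilities are estimated. A minimal sketch of the comparison, with hypothetical counts (the function and variable names are mine, not Hughes' or Abend's):

```python
def bayes_choice(counts, m, priors, K):
    """Eq. (76): favor the class maximizing p_j * (r_jk + 1) / (m_j + K)."""
    scores = [p * (r + 1) / (mj + K) for p, r, mj in zip(priors, counts, m)]
    return scores.index(max(scores))

def relative_frequency_choice(counts, m, priors):
    """Eq. (77): favor the class maximizing p_j * r_jk / m_j."""
    scores = [p * r / mj for p, r, mj in zip(priors, counts, m)]
    return scores.index(max(scores))

# Hypothetical observed state S_k: seen 0 times in class 1, twice in class 2.
counts = (0, 2)          # r_1k, r_2k
m = (20, 20)             # design sample sizes m_1, m_2
priors = (0.5, 0.5)
K = 64                   # measurement complexity (number of resolvable states)

print(bayes_choice(counts, m, priors, K))             # 1, i.e., class C_2
print(relative_frequency_choice(counts, m, priors))   # 1; unlike (77), the smoothed rule never gives an unseen state zero weight
```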
For equal priors and fixed sample size, there exists an optimum measurement complexity for which the mean correct recognition probability is a maximum; for an infinite number of measurements, the performance is no better than it would be if based on knowledge of the priors alone. The optimum measurement complexity increases with sample size. In these cases, performance never falls below max($p_1$, $p_2$), but unusual "clamping" effects are noted for unequal priors, indicating a sample size threshold which must be exceeded for sampling to have any effect at all. Similar conclusions apply to several equiprobable classes and infinite design samples, as Kain (1969) shows. Performance curves are convex and asymptotic, and are lower bounds for the general case of unequal priors. For a given complexity level, the mean accuracy deteriorates as the number of classes increases, in harmony with intuition.
Along similar lines, Allais (1964) had earlier explained the analogous phenomenon for the related problem of multivariate normal prediction or regression, and conjectured its extension to pattern recognition. Using the mean square error criterion, he compared the performance of the ideal predictor and the maximum likelihood (ML) predictor derived from it by ML estimation in finite samples, and he found an optimal dimensionality for the average over all linear regression problems of a given sample size. The expected value of the error $e$ in $M$-space, using $m$ samples to establish the predictor, is
$$E(e) = \begin{cases} \rho^2\,\dfrac{m+1}{m}\left[1 + \dfrac{M}{m - M - 2}\right], & M < m - 2 \\[4pt] \text{infinite}, & M = m - 2 \\[2pt] \text{undefined}, & M \geq m - 1 \end{cases} \tag{78}$$
where $M$ is the number of variables, $m$ is the number of design samples, and $\rho^2 = \rho(M)^2$ is the minimum mean square error when all parameters are known. Depending on the rate at which the (nonzero) ideal error $\rho(M)^2$ decreases with additional measurements, this expected error will be minimized for some optimal number of measurements $M_{\mathrm{opt}} < m - 2$. Although the criterion in classification is error rate rather than mean square error, and the multinormal analysis does not apply to the corresponding discrete predictor equation, similar optimality results were conjectured for individual pattern recognition problems and supported by actual data.
Lest these conclusions be misinterpreted and misapplied, we re-emphasize that they address themselves to mean recognition accuracies, obtained by assuming that one pattern environment is as likely to be encountered as
any other. This assumption is patently unrealistic for patterns occurring in nature. Individual classification problems may show a higher or lower Bayes accuracy than the mean, depending on their probabilistic structure.
The preceding remarks suggest that the development of practical, effective discriminant functions requires a judicious and parsimonious selection of features. From the point of view of methodological consistency, the appropriate criterion for choosing among alternative sets of features is the system performance measure: the best set of features, relative to the totality available, is the one with the minimum expected cost of classification, or alternatively, the one with the minimum expected error rate. In other words, "best" feature subsets correspond to "best" optimal discrimination. Accordingly, the outcome depends on the behavior of the Bayes risk and, in practice, the behavior of the actual risk, when the $M$-dimensional feature space is restricted to an $M'$ ($< M$)-dimensional subspace.
It is evident that the effectiveness of feature selection is context-dependent. That is, the theoretical effectiveness is limited by the hypothesized probability structure or the family of admissible discriminant functions. The measured effectiveness will be influenced by the same factors which attenuate the accuracy of estimated error rates, namely, the number, relevance, and interrelationships of the available features, and the peculiarities of the design sample. Even for three or more normal distributions with known population parameters, the necessary computations for feature selection to comply with optimality criteria are formidable. Since, in general, there is no assurance that a best set of $M$ variables will have members in common with a best set of $M'$ variables, the entire screening procedure must, in any case, be repeated when new parameters become available.
Short of exhaustive search (that is, consideration of all combinations of features), there is no simple, comprehensive feature selection algorithm. The other extreme, random selection, is too casual. A practical alternative is a limited, but systematic search which avoids the unwieldy combinatorics of exhaustive search, and incorporates a modified performance criterion which is responsive to the inherent resolving power of the variables. One way to obtain figures of merit which are intuitively appealing, yet manageable, is to generalize easy-to-compute sample statistics which are either mathematical functions or empirical correlates of the error rate for the simple case of two multinormal populations with common covariance. The relationship of error rate to Mahalanobis distance in the pairwise normal case provided a rationale for such an alternative figure of merit: maximization of the between-class to within-class variance ratio, which is equal to the Mahalanobis distance in this case.
The specific figure of merit which we adopted (Prewitt and Mendelsohn, 1966a) was its multivariate multicategory extension; more precisely, we used the total to within-class generalized variance ratio (Wilks, 1932). Both sample divergence (Kullback, 1959; Marill and Green, 1963) and average information or conditional entropy (Lewis, 1962) are also closely related to optimal error rate for the two-category Gaussian case, and have been used with the same purpose.
One way to reduce search time is to substitute stepwise, accumulative procedures which are exact in well-known special cases, and generate a nest of maximally effective features in these instances. The essence of the sequential algorithm which we followed is to compute, at each stage, a context-dependent measure of effectiveness for each of the available but not yet selected variables, and retain that feature which gives the largest value of this measure in the presence of the already selected subset of features (Prewitt and Mendelsohn, 1966a). Hopefully, by incorporating the most informative feature at each stage, an acceptable performance level will quickly be achieved and the selection process terminated in a low-dimensional feature space. Similar empirical studies of linear discrimination by others (Weiner and Dunn, 1966) reinforce our finding that relatively naive selection rules based on simple but plausible figures of merit are superior to random search for good features, and almost as effective as exhaustive search, especially as the number of included features increases.
The stepwise search derived successive sets of features, each extending the capability of its predecessor as much as possible relative to the figure of merit. The subset is developed by beginning with the single most powerful feature and iterating a two-phase rule for introducing or eliminating features one at a time. The first or additive phase tries to improve discrimination: the feature which leads to the greatest increase in total to within-class generalized variance in the presence of the already selected features is added, provided that this fractional increase exceeds a stipulated level of statistical significance; otherwise, nothing is added. The second or subtractive phase tries to control redundancy: the feature which contributes least to the generalized total to within-class variance in the presence of the remaining features is deleted, provided that the fractional decrease does not exceed a stipulated level of statistical significance; otherwise, nothing is deleted. The algorithm terminates when either the list of available features has been exhausted or tolerance levels preclude further changes. The computations at each step automatically compensate for discriminatory power already achieved by the previously selected features by the method of covariance adjustment (Rao, 1952; Sampson, 1965; Efroymsen, 1962).
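A compact sketch of the additive phase of such a stepwise search, using the total to within-class generalized variance ratio as the figure of merit. It is a schematic reading of the procedure described above (the significance-test stopping rule is reduced to a fixed number of steps), not the CELLS implementation itself:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Sample total (T) and pooled within-class (W) scatter matrices."""
    grand_mean = X.mean(axis=0)
    T = (X - grand_mean).T @ (X - grand_mean)
    W = np.zeros_like(T)
    for c in np.unique(labels):
        Xc = X[labels == c]
        W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    return T, W

def forward_select(X, labels, n_keep):
    """Greedy additive phase: add the feature giving the largest
    generalized variance ratio |T_S| / |W_S| of the selected subset S."""
    T, W = scatter_matrices(X, labels)
    selected, remaining = [], list(range(X.shape[1]))

    def ratio(S):
        idx = np.ix_(S, S)
        return np.linalg.det(T[idx]) / np.linalg.det(W[idx])

    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda j: ratio(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with synthetic data: 3 classes, 6 features, 30 samples per class.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 6))
X = np.vstack([rng.normal(m, 1.0, size=(30, 6)) for m in means])
labels = np.repeat([0, 1, 2], 30)
print(forward_select(X, labels, n_keep=3))
```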
Linear discriminants based on the current selection of variables were computed for each doublet of cell types at each stage, and the confusion matrix giving the corresponding classification results was also derived. The confusion matrix plays no formal role in the feature selection algorithm; it provides an independent check of its efficiency. Experience indicates that the resultant parameter sets are near optimal. However, features selected in the opening steps may not always yield the lowest possible error rates for the corresponding dimensionality. This occurs when interspecific differences are unevenly distributed among the groups; the variance ratio, which is used to identify strong features, is sensitive to the larger intergroup contrasts, and it biases the selection rule in favor of the variables which make the easiest pairwise discriminations. Features which are accepted early in the development because of their discriminatory strength may later prove ineffective in combination with others, and will be discarded, not because they are weaker than subsequently selected features, but because they are redundant. The continual critical reappraisal of accepted features and rejection of useless ones is thus an important safeguard in the selection algorithm.
The stepwise feature selection heuristic is justified by the mathematical fact that, as a feature set is enlarged, increases in the Mahalanobis distance and in the generalized variance ratio are accompanied by a decline in the optimal error rate, provided that the categories are multinormal and homoscedastic. To verify this, let the observation vector $X$ be partitioned into two fixed subvectors $X_1$ and $X_2$ with $p$ and $q$ components, respectively:
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}. \tag{79}$$
This induces a conforming partition of the group centroids $U_k$, the universal centroid $U$, and the (common) dispersion matrix $\Sigma$, as follows:

$$U_k = \begin{bmatrix} U_{k1} \\ U_{k2} \end{bmatrix}, \qquad U = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}. \tag{80}$$

Here, $\Sigma_{11}$ and $\Sigma_{22}$ are the dispersions of $X_1$ in $\Omega_p$ and $X_2$ in $\Omega_q$, and $\Sigma_{12} = \Sigma_{21}^T$ is the $p \times q$ matrix of sums of cross products. The inverse dispersion matrix is also partitioned:

$$\Sigma^{-1} = \begin{bmatrix} \alpha^{-1} & -\alpha^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}\alpha^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}\alpha^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{bmatrix}, \tag{81}$$

where

$$\alpha = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. \tag{82}$$
The values of the elements of $\Sigma_{12}$ relative to those of $\Sigma_{11}$ and $\Sigma_{22}$ describe the pattern and extent of dependence between the two sets of observations, $X_1$ and $X_2$.
When the subvector $X_2$ is augmented by the subvector $X_1$ to form $X$, the intergroup distance increases and reflects a gain in discriminatory power. Because of correlations between $X_1$ and $X_2$, the net improvement is restricted to the gain produced by the so-called residual vector $X_{1 \cdot 2} = X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2$, which is uncorrelated with $X_2$. This residual has the mean value $U_{k1} - \Sigma_{12}\Sigma_{22}^{-1}U_{k2}$ in class $C_k$, and a constant within-group covariance $\alpha = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, consisting of two parts: the covariance $\Sigma_{11}$, and a subtractive adjustment for the interaction of $X_1$ and $X_2$. We introduce the term "deviance" as an abbreviation for "squared Euclidean distance." The deviance $D_{kj}(p+q)^2$ between groups $C_k$ and $C_j$ in $\Omega_{p+q}$ can be analyzed as the sum of their deviance $D_{kj}(q)^2$ in $\Omega_q$ and a covariance-adjusted deviance $D_{kj}(p \mid q)^2$ in $\Omega_p$ which is measured for the residual vector. Letting $\Delta U_m = (U_k - U_j)_m$, $m = 1$ or $2$, stand for the vector group difference, the deviance in $\Omega_{p+q}$ is given by

$$D_{kj}(p+q)^2 = (U_k - U_j)^T \Sigma^{-1} (U_k - U_j) = D_{kj}(q)^2 + \bigl[(\Delta U_1 - \Sigma_{12}\Sigma_{22}^{-1}\Delta U_2)^T\, \alpha^{-1}\, (\Delta U_1 - \Sigma_{12}\Sigma_{22}^{-1}\Delta U_2)\bigr] = D_{kj}(q)^2 + D_{kj}(p \mid q)^2. \tag{83}$$
If the two groups $C_k$ and $C_j$ are multinormal, the decline in the optimal pairwise error rate as we pass from $\Omega_q$ to $\Omega_{p+q}$ is a nonlinear function of the corresponding intergroup distances:

$$E_{kj}(q) - E_{kj}(p+q) = \frac{1}{\sqrt{2\pi}} \int_{\frac{1}{2}D_{kj}(q)}^{\frac{1}{2}D_{kj}(p+q)} \exp\!\left(-\tfrac{1}{2}u^2\right) du. \tag{84}$$
For a given base error rate $E_{kj}(q)$, the larger the increment in distance, the greater the decline in error rate. The effect of a fixed increment in distance is, however, greater, the lower the initial error rate $E_{kj}(q)$. Stepwise feature selection introduces only unit changes in dimensionality at each stage. The associated incremental distance, $D_{kj}(1 \mid q)^2$, can be interpreted as the ($F$-) ratio of the squared between-class difference to (common) within-class variance after covariance adjustment.
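Relation (84) is easy to evaluate: the decline in the optimal pairwise error rate is the standard normal probability mass between half the old and half the new intergroup distance. A minimal sketch (distances chosen for illustration; equal priors and common covariance are assumed, as in the text):

```python
from scipy.stats import norm

def error_rate_decline(d_old, d_new):
    """Eq. (84): E_kj(q) - E_kj(p+q) for equal-prior multinormal classes with
    common covariance, given Mahalanobis distances before and after augmentation."""
    return norm.cdf(d_new / 2.0) - norm.cdf(d_old / 2.0)

print(error_rate_decline(1.0, 2.0))   # about 0.150
print(error_rate_decline(3.0, 4.0))   # about 0.044: same increment, smaller gain when the base error rate is already low
```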
Thus, for pairwise decisions and unit changes in dimensionality, the feature producing the greatest change in this variance ratio is necessarily the one producing the greatest increase in Mahalanobis distance and the greatest decline in error rate. Although any decline in pairwise error rate inevitably produces a decline in the multicategory error rate, the exact relationship is complicated and certainly not additive. On the other hand, any change in between-group distances produces a determinable additive change in the average Mahalanobis distance:

$$\bar{D}^2 = \sum_k p_k\, (U_k - U)^T \Sigma^{-1} (U_k - U).$$
The feature selection algorithm maximizes the increase in $\bar{D}^2$. This, in turn, is equivalent to maximizing the increase in the generalized variance ratio, a population parameter defined in terms of the total and within-class dispersion, and to maximizing the divergence. We turn to this next.
The total dispersion $T$ of the population can be partitioned into between-class and within-class dispersions $B$ and $W$. These matrices are defined as follows:

$$T = E\bigl[(X - U)(X - U)^T\bigr] \tag{87}$$
$$W_k = E\bigl[(X - U_k)(X - U_k)^T \mid X \in C_k\bigr] \tag{88}$$
$$W = \sum_k p_k W_k \tag{89}$$
$$B = \sum_k p_k S_k \tag{90}$$

where $U_k$ is the centroid of class $C_k$, $U$ is the universal centroid, and $S_k$ is the deviance of $C_k$ from the universal centroid:

$$U_k = E\bigl[X \mid X \in C_k\bigr] \tag{91}$$
$$U = \sum_k p_k U_k \tag{92}$$
$$S_k = (U_k - U)(U_k - U)^T \tag{93}$$

and it is easy to show that

$$T = \sum_k p_k W_k + \sum_k p_k S_k = W + B. \tag{94}$$
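The decomposition (94) can be verified directly on data by forming the population-style (maximum likelihood) moment matrices with the observed class proportions playing the role of the $p_k$. A minimal numpy sketch:

```python
import numpy as np

def dispersion_partition(X, labels):
    """Return total, within-class, and between-class dispersions (eqs. 87-94),
    using observed class proportions as the prior probabilities p_k."""
    N = len(X)
    U = X.mean(axis=0)                                  # universal centroid
    T = (X - U).T @ (X - U) / N                         # total dispersion
    W = np.zeros_like(T)
    B = np.zeros_like(T)
    for c in np.unique(labels):
        Xc = X[labels == c]
        p_k = len(Xc) / N
        U_k = Xc.mean(axis=0)
        W += p_k * (Xc - U_k).T @ (Xc - U_k) / len(Xc)  # p_k * W_k
        B += p_k * np.outer(U_k - U, U_k - U)           # p_k * S_k
    return T, W, B

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4)) + np.repeat(rng.normal(size=(3, 4)), 40, axis=0)
labels = np.repeat([0, 1, 2], 40)

T, W, B = dispersion_partition(X, labels)
print(np.allclose(T, W + B))                            # True
print(np.linalg.det(T) / np.linalg.det(W))              # generalized variance ratio, >= 1
```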
The determinants $|T|$ and $|W|$ are population parameters called the generalized total variance and generalized within-class variance, respectively. The generalized variance ratio $|T|/|W|$ is a population parameter also; it varies directly with the proportion of the total dispersion attributable to category differences, and is invariant under linear transformation of the original space. If the vector $X$ is partitioned into subvectors $X_1$ and $X_2$ as before, any generalized variance for $\Omega_{p+q}$ factors into the product of the corresponding
quantities for $X_2$ in the $q$-subspace and $X_{1 \cdot 2}$ (i.e., $X_1$ after covariance adjustment for $X_2$) in the adjusted subspace:

$$|\Sigma| = |\Sigma_{22}|\,\bigl|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr|. \tag{95}$$

The generalized variance ratio factors analogously into the product of two such ratios: one for $X_2$ in the $q$-subspace and one for the covariance-adjusted residual $X_{1 \cdot 2}$. For a unit step in dimensionality ($p = 1$), the second factor reduces to a scalar ratio of covariance-adjusted total to within-class variance, which can never be less than one, since the total variance always equals or exceeds the within-class variance.
By induction, the generalized variance ratio will always increase as a feature set is augmented. The largest single-step gain in the generalized variance ratio is given by the residual variable with the largest between- to within-class variance ratio.
To apply these results to actual data, for assessing the significance of achieved discriminatory power and the significance of additional features, the Wilks-Bartlett lambda (Wilks, 1932; Bartlett, 1938)

$$\Lambda = |\hat{W}| \,/\, |\hat{T}|, \tag{99}$$

which is the sample counterpart (reciprocal) of the generalized variance ratio, provides a suitable test statistic. For a sample of size $N$, originating from $g$ groups in a $p$-dimensional feature space, $\Lambda$ has the degrees of freedom $(p,\ g-1,\ N-g)$. $\hat{W}$ and $\hat{T}$ are maximum likelihood estimates of the corresponding population parameters $W$ and $T$, adjusted for bias, and computed as follows. Suppose a representative sample from the $g$ categories $(C_1, \ldots, C_g)$ is given. Let $x_{mki}$ represent the $i$th feature of the $k$th member of class $C_m$; $\bar{X}_{m \cdot i}$ the averaged $i$th feature for $C_m$; $\bar{X}_{\cdot \cdot i}$ the universal average of the $i$th feature; $N_m$ the number of samples from $C_m$; and $N$ the total number of samples from all $g$ groups. The total sample covariance $\hat{T}$ and the pooled within-class covariance $\hat{W}$ are matrices with the elements:
$$\hat{T}_{ij} = \sum_{m=1}^{g} \sum_{k=1}^{N_m} (x_{mki} - \bar{X}_{\cdot \cdot i})(x_{mkj} - \bar{X}_{\cdot \cdot j}) \,/\, (N - 1) \tag{100}$$
$$\hat{W}_{ij} = \sum_{m=1}^{g} \sum_{k=1}^{N_m} (x_{mki} - \bar{X}_{m \cdot i})(x_{mkj} - \bar{X}_{m \cdot j}) \,/\, (N - g) \tag{101}$$
for $i, j = 1, \ldots, M$, and $N = \sum_m N_m$. Degrees of freedom are apportioned $(N - 1)$ to $\hat{T}$, $(N - g)$ to $\hat{W}$, and $(g - 1)$ to $\hat{B} = \hat{T} - \hat{W}$. If these are computed after covariance adjustment for $r$ of the $p$ original variables, $r$ degrees of freedom are removed from both $\hat{T}$ and $\hat{W}$. The exact distribution of $\Lambda$ is complicated, but either $F$- or $\chi^2$ approximations can be used in most circumstances. The statistic
$$F(2r;\ ms + 2\lambda) = \frac{1 - \Lambda^{1/s}}{\Lambda^{1/s}} \cdot \frac{ms + 2\lambda}{2r}, \tag{102}$$

where
$m = n - \tfrac{1}{2}(p + g)$
$p$ = number of variables
$g$ = number of groups
$n = \sum_{k=1}^{g} N_k - 1 = N - 1$, the degrees of freedom for $\hat{T}$
$b = g - 1$, the degrees of freedom for $\hat{B}$
$r = pb/2$
$\lambda = -(pb - 2)/4$
and

$$s = \left[\frac{p^2 b^2 - 4}{p^2 + b^2 - 5}\right]^{1/2} \quad \text{provided } p^2 + b^2 \neq 5 \tag{103}$$

($s = 1$ for $p^2 + b^2 = 5$), is distributed approximately as an extended variance ratio with $2r$ and $(ms + 2\lambda)$ degrees of freedom. The extended variance ratio is the random variable $F_0$ with the distribution (Abramowitz and Stegun, 1964)
$$P[F_0\colon \nu_1, \nu_2] = \mathrm{Prob}\,[0 \leq F \leq F_0], \qquad Q[F_0\colon \nu_1, \nu_2] = 1 - P[F_0\colon \nu_1, \nu_2] = I_z\!\left(\tfrac{1}{2}\nu_2,\ \tfrac{1}{2}\nu_1\right) \tag{104}$$

wherein

$$z = \frac{\nu_2}{\nu_2 + \nu_1 F_0}, \qquad B(z, w) = \frac{\Gamma(z)\,\Gamma(w)}{\Gamma(z + w)} = B(w, z),$$

the beta function, $I_z$ being the corresponding incomplete beta function ratio. For $p = 1$ or $2$, and for $g = 2$ or $3$, the distribution is exactly that of $F$.
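For routine use, the approximation (102)-(103) reduces to a short computation. The sketch below follows the reconstruction of those equations given here and should be checked against a standard reference (e.g., Rao, 1952) before being relied upon:

```python
import math
from scipy.stats import f as f_dist

def wilks_to_F(lam, p, g, N):
    """Approximate F test for Wilks' lambda with degrees of freedom (p, g-1, N-g),
    following eqs. (102)-(103) as reconstructed above."""
    b = g - 1
    n = N - 1
    m = n - 0.5 * (p + g)
    r = p * b / 2.0
    lam_shift = -(p * b - 2) / 4.0
    s = 1.0 if p**2 + b**2 == 5 else math.sqrt((p**2 * b**2 - 4) / (p**2 + b**2 - 5))
    df1 = 2 * r
    df2 = m * s + 2 * lam_shift
    F = (1 - lam**(1 / s)) / lam**(1 / s) * df2 / df1
    return F, df1, df2, f_dist.sf(F, df1, df2)   # statistic, degrees of freedom, tail probability

# Illustrative values only: lambda = 0.4, four variables, five groups, 92 cells.
print(wilks_to_F(lam=0.4, p=4, g=5, N=92))
```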
Alternatively, for large $m$, the statistic $V = -m \ln \Lambda$ is approximately distributed as chi-squared, with $p(g - 1)$ degrees of freedom.
The statistic $\Lambda$ actually furnishes a formal simultaneous test for the equality of the $g$ group centroids in the $p$-space, under the assumption of equal covariance. A more informative procedure is to make $g(g - 1)/2$ pairwise tests of equality for means with $\Lambda$, which, in this pairwise situation, reduces to an $F$-test. The conditional significance of $p$ features in the presence of $q$ others is tested using $\Lambda(p \mid q) = |\hat{W}(p \mid q)| \,/\, |\hat{T}(p \mid q)|$, with $N$ reduced by $q$ degrees of freedom. For a unit change in dimensionality, $p = 1$, this $\Lambda$-test reduces to the likelihood ratio test of equality of the between-class and within-class variances for the conditional distribution of the single variable being added or eliminated, in the context of the $q$ already accepted features. If $X$ is partitioned as before into a vector $X_2$, containing the $q$ already selected features, and $X_1$, containing the residue of $p$ as yet unselected features, then the diagonals of the matrices $R$ and $S$ below contain the required partial variances:
$$R = \begin{bmatrix} W_{11} - W_{12}W_{22}^{-1}W_{21} & W_{12}W_{22}^{-1} \\ W_{22}^{-1}W_{21} & W_{22}^{-1} \end{bmatrix}, \tag{105}$$

with $S$ defined analogously in terms of $T$.
If $x_j$ is a current candidate for inclusion, then the partial variance ratio is the $F$-statistic

$$F_j(g - 1;\ N - q - g) = \frac{N - q - g}{g - 1}\left[\frac{S_{jj}}{R_{jj}} - 1\right]. \tag{106}$$

The smaller the value of $F_j$, the less influential $x_j$ is for sharpening discrimination. If $x_j$ has already been selected, but is a candidate for deletion, then the partial variance ratio relative to the other $(q - 1)$ selected variables is

$$F_j(g - 1;\ N - (q - 1) - g) = \frac{N - (q - 1) - g}{g - 1}\left[\frac{S_{jj}}{R_{jj}} - 1\right]. \tag{107}$$
To see this, consider that $x_j$ is the first element of $X_2$. Applying the matrix inversion formula for partitioned matrices (81) to $W^{-1}$ and to $T^{-1}$ shows that the leading elements are the reciprocal within-class and total variances of $x_j$, adjusted for covariance with the already selected features. The smaller the value of $F_j$, the less influential $x_j$ is for distinguishing between the groups in the presence of the other features.
Canonical analysis, a by-product of the feature selection algorithm, determines the linear transformation to the smallest subspace which accounts adequately for data variability, and assigns proportions of the variability to the selected dimensions. By definition, the new coordinate axes are in the directions of greatest interclass to intraclass dispersion, and the transformed parameters are uncorrelated and standardized (have unit variance). The ratio of total generalized variance to within-class generalized variance in the canonical space is partitioned so that the first axis in the new coordinate system has the largest quantum, the second axis has the next largest quantum, and so on. Although most of the cumulative variability can generally be accounted for in a canonical space of lower dimensionality than the original space, all of the original variables are needed to compute the new coordinates. Thus the canonical analysis is not a genuine method of dimensionality reduction; it is merely a guide to what could be achieved with linear discriminants and uncorrelated parameters. The ordered set of eigenvalues of the product of the total and inverse within-class covariance matrices gives an empirical indication of the number of linearly independent factors operating. We show below that these eigenvalues are related to Mahalanobis intergroup distance, and hence to the ideal error rate for the two-category Gaussian discrimination problem.
The canonical axes are generated by the $p$ eigenvectors $v_i = (v_{i1}, \ldots, v_{iM})^T$ of

$$(W^{-1}B - \lambda_i I)\, v_i = 0 \tag{110a}$$

which maximize the projected between-class variance $v_i^T B v_i$ along the $i$th canonical axis, subject to ortho-normalizing constraints which ensure that they are uncorrelated and have unit variances:

$$v_i^T W v_j = \delta_{ij} \quad \text{for all } i \text{ and } j. \tag{110b}$$
Letting $A$ denote the canonical transformation matrix consisting of the column vectors $v_i$:

$$A = [v_1, \ldots, v_M] \tag{110c}$$

and $\Lambda$ denote the square diagonal matrix whose nonzero elements are the eigenvalues $\lambda_i$, these relations can be summarized by

$$A^T W A = I \tag{110d}$$
$$(W^{-1}B)\,A = A\,\Lambda. \tag{110e}$$
$A$ is orthogonal only when the variables are uncorrelated and standardized initially. The eigenvalues $\lambda_i$ are best computed from the related eigenvalues $\omega = 1/(1 + \lambda)$ of

$$(T^{-1}W - \omega I)\, v = 0, \tag{110f}$$

since $\omega$ is confined to the interval $[0, 1]$. $\lambda_i$ turns out to be the maximized generalized variance ratio in the direction of $v_i$:

$$\lambda_i = v_i^T B v_i \,/\, v_i^T W v_i = v_i^T B v_i. \tag{110g}$$
These eigenvalues $\lambda_i$ are also related to Wilks' lambda $\Lambda(p)$ as follows:

$$\Lambda(p) = \prod_{i=1}^{p} \frac{1}{1 + \lambda_i}, \tag{110h}$$

or

$$\Lambda(p) = \prod_{i=1}^{p} \omega_i. \tag{110i}$$
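The whole canonical construction (110a)-(110i) amounts to a single generalized symmetric eigenproblem. A minimal sketch using scipy, with $B$ and $W$ taken from any labeled sample (for instance via the dispersion sketch following eq. (94)); the stand-in matrices below are illustrative only:

```python
import numpy as np
from scipy.linalg import eigh

def canonical_analysis(B, W):
    """Solve B v = lambda W v (eq. 110a).  eigh normalizes the eigenvectors so
    that v^T W v = 1, i.e., the constraint (110b)."""
    lam, A = eigh(B, W)                 # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]       # largest dispersion first
    lam, A = lam[order], A[:, order]
    omega = 1.0 / (1.0 + lam)           # related eigenvalues, eq. (110f)
    wilks = np.prod(omega)              # Wilks' lambda, eq. (110i)
    return lam, omega, wilks, A

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
W = M @ M.T + np.eye(4)                 # a positive definite stand-in for W
d = rng.normal(size=4)
B = np.outer(d, d)                      # a rank-one stand-in for B

lam, omega, wilks, A = canonical_analysis(B, W)
print(lam)                                    # only one root is (numerically) nonzero here
print(np.allclose(A.T @ W @ A, np.eye(4)))    # True: A^T W A = I, eq. (110d)
```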
The best $t$ linear combinations of the $p$ original variables correspond to the $t$ largest eigenvalues and therefore maximize the dispersions $\lambda_i$ in the corresponding directions. The total normalized between-class variance or $\bar{D}^2$ for the $p$ features is the sum of all roots:
$$\tau_p = \sum_{i=1}^{p} \lambda_i \tag{110j}$$
and the same quantity for the best $t$ linear functions is the sum of the $t$ largest roots:

$$\tau_t = \sum_{i=1}^{t} \lambda_i, \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_t \geq 0. \tag{110k}$$
The adequacy of a t-dimensional representation is judged by the relative
size of the residual, $\tau_p - \tau_t$. The quantities

$$\left[N - \tfrac{1}{2}(p + g)\right] \ln\,(1 + \lambda_a), \tag{110l}$$

which are distributed approximately as $\chi^2$ (Bartlett, 1947), can be used to test the statistical significance of canonical "reduction in dimensionality." The $p(g - 1)$ degrees of freedom are distributed among the roots $\lambda_a$ according to the pattern

$$(p + g - 2) + (p + g - 4) + \cdots,$$

so that each successive root has two degrees of freedom less than its predecessor, the next largest root. The quantity $v_a^T B v_a$ is also proportional to the sum $D^2$ of all $g(g - 1)/2$ intergroup Mahalanobis distances calculated with the variate $v_a^T X$:
$$D_{ij} = \bigl[v_a^T (U_i - U_j)\bigr]^2, \tag{111}$$

which is also maximized. The canonical axes are therefore discriminants in the directions of minimum group overlap.
Linear orthogonal transformations such as factor analysis, canonical analysis, and Karhunen-Loève expansions have repeatedly been advocated for reducing dimensionality (Rao, 1952; Morrison, 1967; Karhunen, 1947; Watanabe, 1967; Chien and Fu, 1967, 1968). The criteria for determining the particular linear transformation vary from method to method, but all can be expressed as eigenvalue problems. Other cluster-enhancing techniques (Sebestyen, 1962) are also of this type. Although the terminology misleadingly suggests that genuine economies in data collection may result in these cases, as classically used, all of these transformational methods merely re-present the original variables, albeit in the form of new (and fewer) linear combinations.
A final remark about the practical tradeoff between feature space dimensionality and design sample size, as it relates to recognition performance, is warranted. Even when it is certain that increasing the number of variables will improve discriminatory performance, it may be more expedient to sharpen the decision rule by increasing the size of the design sample, and in so doing, amplifying the number of degrees of freedom available for multivariate tests of significance. A related issue is the optimal apportionment of sampling effort in the design set: how many samples should there be in each learning class? The question of optimally partitioning a given sample into design and test sub-samples has recently been treated by several investigators. The popular and misleading method of using the same sample for design and validation is biased toward optimistic estimates of recognition performance.
Highleyman (1962) proposed a "hold-out" design in which a large available sample is split into disjoint design and test sets, so that the variance of the estimated error rate is minimized. The analysis is valid only for large samples. A less biased method of Lachenbruch and Mickey (1968) estimates the error rate from the $m$ partitions of a sample of $m$ objects into $m - 1$ design samples and 1 test sample. For the two-category case, in dealing with the mean minimum accuracy, Hughes (1969) derives the optimal measurement complexity, the optimum partition of the design sample, and the resultant mean accuracy for given priors and total sample size. Structural assumptions, the number and quality of the variables, and the sample size affect both actual and predicted performance. Kanal and Chandrasekaran (1971) suggest that the optimum dimensionality for a given sample size increases with the increased structure in problem formulation, and is accompanied by an increase in the maximum probability of correct classification. There are no definitive simple answers to date.
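The Lachenbruch-Mickey idea is the now-familiar leave-one-out estimate. A minimal sketch for a pooled-covariance (linear) classifier; the classifier here is a generic stand-in with equal priors, not the CELLS discriminant:

```python
import numpy as np

def pooled_linear_classify(X_train, y_train, x):
    """Assign x to the class with the nearest centroid in the metric of the
    pooled within-class covariance (equal priors, common covariance assumed)."""
    classes = np.unique(y_train)
    pooled = np.zeros((X_train.shape[1], X_train.shape[1]))
    centroids = {}
    for c in classes:
        Xc = X_train[y_train == c]
        centroids[c] = Xc.mean(axis=0)
        pooled += (Xc - centroids[c]).T @ (Xc - centroids[c])
    pooled /= (len(X_train) - len(classes))
    inv = np.linalg.inv(pooled)           # pooled covariance assumed nonsingular
    d2 = {c: (x - m) @ inv @ (x - m) for c, m in centroids.items()}
    return min(d2, key=d2.get)

def leave_one_out_error(X, y):
    """Lachenbruch-Mickey style estimate: m fits on m - 1 samples, 1 test each."""
    wrong = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        wrong += pooled_linear_classify(X[keep], y[keep], X[i]) != y[i]
    return wrong / len(X)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (25, 3)), rng.normal(1.5, 1.0, (25, 3))])
y = np.repeat([0, 1], 25)
print(leave_one_out_error(X, y))
```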
7. Experimental Results: Parameter Variability

Parameter distributions for samples of the different leukocytes overlapped considerably (Fig. 18), and parameter pairs exhibited statistically
FIG. 18. Parameter distributions for typical leukocyte samples. Means and two-standard-deviation intervals are shown for three parameters: (a) mean optical density, (b) absorbance, (c) nuclear-cytoplasmic contrast. The triple letters represent class centroids and the single letters represent the end-points of the two-standard-deviation ranges according to the convention: L = lymphocyte, M = monocyte, N = neutrophil, E = eosinophil, and B = basophil. There is considerable univariate overlap.
TABLE V
CORRELATION COEFFICIENTS FOR VARIOUS MEASURES OF CONTRAST IN LEUKOCYTE IMAGES (a)

Correlate                                Correlation coefficient
Mean OD nucleus / mean OD cell           .826
Mean OD nucleus - mean OD cell           .746
Skewness, cell                           .689
Standard deviation, cell                 .627
Mean OD nucleus - mean OD cytoplasm      .608

(a) Parameter family: nuclear-cytoplasmic contrast. Best representative: coefficient of variation for cell density. Mean OD = mean integrated optical density (mean density).
significant within-class correlations (Prewitt and Mendelsohn, 1966a). Correlation coefficients for various measures of nuclear-cytoplasmic contrast are given in Table V. These values, which were obtained from pooled within-class correlation coefficients, illustrate the experimental evidence for the construct "contrast." High correlations were manifest even for parameters which collectively proved to be good discriminators, although their within-class variation tended to be relatively low.
Sample means, standard deviations, and coefficients of variation for the parameters which formed the best discriminating quartet in an initial sample of approximately 100 cells are given in Table VI for each cell type. Within-class variation ranges from 1 to 25%, and the across-class average of within-class variation ranges from 8 to 13%. Sample correlation matrices are given in Table VII, along with the corresponding within-group generalized variances, which measure group homogeneity in terms of individual variances and covariances. The larger the generalized variance of a group, the more diverse are the members. Using this as a measure, the cell types in order of increasing variability are: basophils, neutrophils, eosinophils, lymphocytes, and monocytes. The corresponding generalized variances stand in the ratios 2.9:4.8:8.6:9.6:9.7.
F values assessing the merits of individual variables as pairwise discriminators are unusually high, and cannot be regarded in the classical way.
TABLE VI
MEAN, STANDARD DEVIATION, AND COEFFICIENT OF VARIATION FOR THE BEST FOUR LEUKOCYTE PARAMETERS
A. Within-class variation: mean, S.D., and C.V. of total absorbance, mean density, contrast, and absorbance ratio, tabulated for each cell type (L, M, N, E, B).
B. Total variation: the same statistics computed over all cell types combined.
Note: S.D. = standard deviation; C.V. = coefficient of variation.
TABLE VII
CORRELATION MATRICES AND GENERALIZED VARIANCES BASED ON THE FOUR BEST LEUKOCYTE PARAMETERS
For each cell type, and for the pooled sample, the table gives the within-group correlation matrix of the four parameters (total absorbance, mean density, contrast, absorbance ratio) together with the within-group generalized variance:

Cell type       Generalized variance
Lymphocyte      9.64
Monocyte        9.65
Neutrophil      4.88
Eosinophil      8.63
Basophil        2.87
Pooled          --

TABLE VIII
MODIFIED TERMINOLOGY FOR STATISTICAL SIGNIFICANCE OF PARAMETERS USED IN CLASSIFICATION

Level of significance for F values    Parameter designation
0.100                                 Relevant
0.050                                 Important
0.025                                 Significant
0.010                                 Highly significant
0.005                                 Determining
This often-occurring phenomenon of extremely high statistical significance for variables in spite of poor classification performance suggested the modified terminology of Table VIII for interpreting tests of significance in the context of discrimination.

8. Experimental Results: Parametric Discrimination Using Decision Theory
Several typical experiments illustrating the decision-theoretic approach to parametric discrimination of leukocytes are summarized in Table IX. For each experiment, a performance matrix gives the proportion of correctly classified cells. Under dimensionality is listed (1) the size of the original unscreened parameter set, (2) the size of the reduced parameter set at the stage of the minimum number of misclassifications, and (3) the minimum dimensionality needed to account for data variability. The latter dimensionality would be achieved if we had available as primary features the linear combinations of features formed in the canonical analysis. Also given is the apportionment of variance by canonical analysis, indicating the relative importance of each dimension of the "best" space. The data were derived algorithmically from a sample of 92 white cells in which the five categories (lymphocytes, monocytes, neutrophils, eosinophils, and basophils) were represented approximately equally.
The parameter space of Experiment I involved 42 tonally oriented descriptors (see Table IV). Stepwise linear discrimination selected a diagnostic set of four parameters. Ranked in order of importance, they are (1) coefficient of variation of optical density for the whole cell, (2) ratio of nuclear to cell absorbance or integrated optical density, (3) mean cell density, and (4) nuclear absorbance. The first three of these are responsible for almost all the discriminatory capability of the ensemble, and the canonical analysis likewise indicates a dimensionality of three. The corresponding parameter families, ranked accordingly, are (1) contrast, (2) nuclear-cytoplasmic comparison, (3) concentration, and (4) content.
The three parameters reported in The Analysis of Cell Images (Prewitt and Mendelsohn, 1966b) as good discriminators for the four most prevalent leukocyte types (lymphocytes, monocytes, neutrophils, and eosinophils) were reexamined in Experiment II for the larger population of five white cell types (basophils now included). Although they were chosen on the basis of intuition and manual review of the data rather than by a formal feature selection rule, comparison shows them to be virtually as good as the dominant trio of Experiment I.
Experiment III dealt with the discrimination of selected cell categories. Neutrophils were ignored in the development of the discriminant functions, thereby emphasizing the discrimination of the more difficult cell types.
TABLE IX
CLASSIFICATION PERFORMANCE IN PARAMETRIC DISCRIMINATION OF LEUKOCYTES
For each of Experiments I-V the table lists, step by step, the parameter added by the stepwise selection rule and the resulting percentage of correctly classified cells for each leukocyte type (L, M, N, E, B) and overall; it also lists the number of cell types considered, the parameters available, the dimensionality of the unanalyzed space, the selected subspace, and the canonical space, and the distribution of dispersion among the canonical dimensions. In summary: Experiment I (5 types, parameters 1-42): dimensionality 42, 4, and 3, with canonical dispersion apportioned 48%, 35%, 13%, 4%. Experiment II (5 types, parameters 8, 9, 15): dimensionality 3, 3, and 3, with dispersion 69%, 18%, 13%, 0%. Experiment III (4 types, parameters 1-42): dimensionality 42, 3, and 3, with dispersion 66%, 23%, 11%, 0%. Experiment IV (5 types, shape-oriented parameters 7, 9, 43-50) and Experiment V (5 types, parameters 1-50) each reduce to a canonical dimensionality of three.
Parameters are numbered as in Table IV.
Results confirmed the conjecture that the coefficient of variation of density, a measure of contrast and the single best parameter of Experiment I, is successful because it effectively makes the two easiest discriminations: neutrophils and basophils from all other white cells and from each other. This parameter is not selected in Experiment III. A similar study, not summarized in Table IX but illustrated in Fig. 23, concentrated on the four most difficult discriminations (Fig. 23e), and on the separation of granulocytes from non-granulocytes (Fig. 23d).
The shape-oriented parameters were analyzed by themselves in Experiment IV. Canonical analysis again indicates a dimensionality of three, but the discriminatory achievement of the shape descriptors in the absence of density information is markedly inferior to the capability demonstrated in Experiment I. The best shape parameters are concerned with nuclear attributes, as expected.
FIG. 19(a). Linear discrimination of the five white cell types based on the two best parameters. The coordinates are (1) contrast (as measured by the coefficient of variation for cell density) and (2) relative nuclear absorbance or content (as measured by the absorbance ratio of the nucleus to the entire cell). Bayesian linear discriminants were constructed, but prior probabilities were ignored and population parameters were replaced by maximum likelihood estimators. Dots indicate class centroids. Names of the five cell types are abbreviated by the appropriate initials. (a) One decision hyperplane is shown for each of the ten necessary Bayesian pairwise comparisons. Each hyperplane bisects the space into half-planes associated with two specific cell types, indicated by the adjacent letter pairs.
FIG. 19(b). See Fig. 19(a). (b) The set-theoretic intersection of all half-spaces corresponding to a single cell type defines the decision region associated with that cell type. The regional boundaries are denoted by heavy solid lines. The final system of polygonal regions is delimited by two physical and mathematical constraints: the nonnegativity of the contrast measure, and the fractional nature of relative nuclear absorbance. Thus the actual decision space corresponds to the semi-infinite rectangle indicated by the dotted lines.
All fifty parameters used in both Experiments I and IV were analyzed together in Experiment V. The addition of shape descriptors to the tonally oriented parameters had little effect on performance, and the best three parameters remained the same as in Experiment I, reiterating the significance of the information contained in density patterns, and the redundancy in the large parameter set.
The hyperplane decision boundaries for the five white cell types based on the two best parameters of Experiment I (contrast, as measured by the coefficient of variation of density for the cell, and the absorbance ratio of nucleus to the entire cell) are shown in Figs. 19a and 19b. In the first of these illustrations, one hyperplane is shown for each of the 10 possible pairwise comparisons. Each hyperplane bisects the space into regions associated with two specific cell types. The intersection of all half-spaces corresponding to a single category defines the decision region for that category. Darkened lines indicate these final decision boundaries, and group centroids are shown by dots. The final partition of the feature space is abstracted in the second of the figures.
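The construction just described (maximum likelihood plug-in estimates, priors ignored, one hyperplane per pair of classes, and decision regions formed as intersections of the winning half-spaces) can be sketched in a few lines; the centroids, covariance, and class labels below are illustrative only:

```python
import numpy as np

def pairwise_linear_decision(x, centroids, pooled_cov):
    """Classify x by the class that wins every pairwise linear comparison.
    Each pairwise boundary is the equal-covariance Gaussian discriminant with
    priors ignored: the hyperplane through the midpoint of the two centroids."""
    inv = np.linalg.inv(pooled_cov)
    classes = list(centroids)
    for k in classes:
        wins_all = True
        for j in classes:
            if j == k:
                continue
            w = inv @ (centroids[k] - centroids[j])
            midpoint = (centroids[k] + centroids[j]) / 2.0
            if w @ (x - midpoint) <= 0:          # x falls on C_j's side of the hyperplane
                wins_all = False
                break
        if wins_all:
            return k                             # x lies in C_k's polygonal region
    return None                                  # on a boundary / no unanimous winner

# Illustrative 2-D set-up: contrast and absorbance ratio, three of the cell types.
centroids = {"L": np.array([0.25, 0.85]),
             "N": np.array([0.75, 0.55]),
             "B": np.array([0.95, 0.90])}
pooled_cov = np.array([[0.02, 0.0], [0.0, 0.01]])
print(pairwise_linear_decision(np.array([0.3, 0.8]), centroids, pooled_cov))   # 'L'
```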
Additional superimposed heavy dotted lines indicate further boundary restrictions imposed by physical and mathematical constraints of the problem: the absorbance ratio by definition must lie between 0 and 1, and the coefficient of variation must be positive.
The relative locations of group centroids in the best 3-space of Experiment I are shown in Fig. 20, the parameters being contrast (coefficient of variation of density for the cell), absorbance ratio of nucleus and cell, and mean cell density. Each parameter is expressed in units of the group standard deviation. Centroids have been connected by lines, the lengths of which give some indication of the intertype heterogeneity. The disposition of actual measurements around these centroids is given in Fig. 21a, and after canonical transformation, in Fig. 21b. These displays make clear why linear functions of the CELLS parameters are effective discriminants, in spite of departures from both normality and a common covariance structure. In terms of the CELLS parameters, interclass dispersion far outweighs intraclass dispersion, and masks the effects of both heterogeneous covariance and asymmetries in the category probability densities. The canonical linear transformation merely enhances the between-group separability and within-group similarities.
FIG. 20. Relative locations of class centroids for the leukocytes in the best three-space. The parameters are contrast (coefficient of variation for cell density), relative nuclear content (absorbance ratio of nucleus to cell), and mean cell density. Each parameter has been scaled by the corresponding standard deviation computed for all leukocytes. The lengths of the lines connecting the centroids convey these scaled interclass distances, and give some indication of the degrees of similarity among leukocytes by type.
Figure 22, in conjunction with Table IX, confirms the validity of the CELLS approach by disclosing the discriminatory completeness of the best four descriptors and the corresponding parameter families. Experiment I showed that one member from each of these families is included in the best linear discriminant in $E^4$. Figure 22 plots distance in a 33-space versus distance in this 4-space, between all points in one sample of cells and all points in another sample of cells for all categories, as well as the mean square diameter of each cell cluster. Pairs of letters correspond to the pair of cell
FIG. 21(a). Dispersion of leukocytes in the best three-space. Individual white blood cells in a sample of 92 are indicated by the letters L, M, N, E, and B, according to their type: lymphocyte, monocyte, neutrophil, eosinophil, and basophil. (a) A perspective view in the space of the three best parameters (contrast, nuclear to cell absorbance ratio, and mean cell density) indicates distinct clustering tendencies by cell type. Lines connecting group centroids and marking their projections onto the coordinate planes have been introduced as visual aids. (From Prewitt, 1972a.)
FIG. 21(b). Continued. See Fig. 21(a). (b) The canonical linear transformation to uncorrelated and standardized coordinates (see Table X) enhances intratype clustering and intertype separability, since the canonical space is the space of maximum between- to within-class generalized variance ratio (up to linear transformations). (From Prewitt, 1972a.)
types involved in the computation. The relationship between distances in these two spaces is predictable, in fact linear, indicating that the addition of variables beyond the fourth contributes no new information in the sense of between- to within-group dispersion, and no change in the already excellent classification performance.
The final set of illustrations, Figs. 23a through 23e, demonstrates the effectiveness of the canonical transformation in securing dense within-type clustering and dramatic between-type separation. A two-dimensional projection of the configuration of cells in the best canonical space is shown in each case. Along the coordinate axes, those two independent linear combinations of the original variables which account for almost all the data
FIG. 22. Redundancy of discriminatory information in high dimensional feature spaces for leukocytes. Intertype distance (root mean squared distance between all points in one sample of cells and all points in another sample of cells) and intratype root mean squared radius in the best four-space are plotted versus the same quantities in the best thirty-three-space. Letters L, M, N, E, and B identify the leukocyte categories, according to our previous convention. The almost linear relationship indicates stabilization of information content, i.e., no genuine gain in information as the number of variables is increased beyond four. Variables were selected sequentially, following the stepwise procedure described in the text, which maximized the covariance-adjusted increment in between- to within-class generalized variance. (From Prewitt, 1972a.)
variability are plotted in normalized form. Individual letters represent one or more actual cell observation vectors, group centroids are marked by asterisks, and coinciding observations by "$". Figure 23a gives the canonical space for Experiment I, a two-dimensional projection of the best 4-space. Two cell types, N and B, are well separated from each other and from the remaining types, L, M, and E. These are better resolved in the projection of the best 5-space, shown in Fig. 23b, and linearly separable in the
projection of the best 33-space of Fig. 23c. The result of focusing the discriminatory effort on a subproblem of the complete recognition task is illustrated by Figs. 23d and 23e. In the first of these, discriminants were developed for distinguishing the granulocytes (N, E, and B) from the non-granulocytes (L and M). In the second, neutrophils were not included in the determination of discriminants for separating the other four types, but the canonical projection was constructed for all five types. The neutrophils, though distributed loosely through the space, nevertheless can be delineated from the other cell types.
The canonical variables derived from the four parameters of Experiment I act independently to a large degree, in the sense that each separates one or two cell types from the remaining groups at the four standard deviation level or better. The first canonical variable characterizes basophils, the third lymphocytes, and the fourth monocytes, while the second distinguishes neutrophils and eosinophils from each other and from all other cell types. Canonical coordinates for the five group centroids are plotted individually in Fig. 24. Examination of the matrix for transforming from original to canonical variables, scaled by the mean of each original variable as appropriate to render the entries or transformation weights commensurate, suggests that each canonical variable is connected principally with a single descriptor family (Table X).
TABLE X
CANONICAL ANALYSIS OF LEUKOCYTE PARAMETERS (a)
Effective weights of the four original parameters (total absorbance, rank 4; mean density of cell, rank 3; contrast, rank 1; absorbance ratio, nucleus:cell, rank 2) in each of the four canonical variables I-IV.
Predominant parameter families: I, contrast; II, nuclear-cytoplasmic relationship; III, concentration; IV, content.
Distribution of dispersion in canonical space: I, 48%; II, 35%; III, 13%; IV, 4%.
(a) The effective weight of each original parameter in the canonical variables was obtained by adjusting the corresponding canonical coefficients by the mean value of the parameter.
FIG. 23(a), (b). Linear separability of leukocyte samples as shown by canonical projections. In each case, the canonical transformation was applied to the original feature space under consideration, and the corresponding best two-dimensional canonical subspace is shown. Coordinates consist of those two independent linear combinations of the original variables which individually account for the greatest and next-to-greatest proportion of between-class to within-class data variability. Individual letters indicate actual single or multiple observations from the same cell category, "$" indicates coincident observations corresponding to different cell categories, and "*" denotes class centroids. Although the canonical linear transformation visually enhances between-type separation and compresses within-type dispersion, the probabilities of correct classification are unaffected, provided that the leukocytes follow similar multivariate ellipsoidally symmetric probability densities. The sequence of parts (a)-(c) shows that separability is enhanced as dimensionality is systematically increased in the feature selection procedure.
The sequence (d) and (e) concerns discrimination of some, rather than all, leukocyte types. (a) Canonical projection for the best four-space. Neutrophils and basophils form discrete clusters set well apart from the other groups. (b) Canonical projection for the best five-space. Separation of lymphocytes, monocytes, and eosinophils is improved. (From Prewitt, 1972a.)
In pursuing the complete automation of the CELLS programs, a second set of experiments was conducted, with two purposes in mind: (1) to test the entire algorithmic sequence, including computer histogram analysis, feature extraction, computation of linear discriminants, feature selection, and ultimate classification; (2) to examine further the impact on cell discrimination of two contrasting points of view: formal articulation of cell images and informal appreciation of nuclear-cytoplasmic distinctions. A subset of these experiments, using approximately 200 cells, is given in
Table XI. The specimens were a more varied collection, coming from different donors and prepared and scanned at different times. Several points were clearly made.
(1) A residual weakness in the histogram preprocessing of markedly granular eosinophils, attributable to histogram degeneracy, confounds the articulation of the corresponding images, but in no way impedes the correct recognition of these cells.
(2) The density-oriented parameters are a rich source of cell descriptors, and many combinations are effective for characterization and discrimination.
(3) Canonical analysis of between- to within-class generalized variance indicates that in all cases, virtually all discriminatory information is contained in the first four dimensions.
FIG. 23(c). See Fig. 23(a), (b). (c) Canonical projection for the best 33-space. There is complete linear separability for all five cell types in this 2-space canonical projection. (From Prewitt, 1972a.)
FIG. 23(d). See Fig. 23(a), (b). (d) Discriminants for distinguishing the three granulocytes (N, E, and B) from the non-granulocytes (L and M, considered as a single group) were developed and canonically transformed. There is complete linear separability of the four groups.
(4) These four dimensions continue to be an admixture representing four feature families: contrast, content, concentration, and absolute or relative nuclear size.

(5) Recognition with the fewest features (five) is obtained when the selected parameters explicitly or implicitly refer to nuclear-cytoplasmic distinctions.

(6) The first single parameter selected by the computer under free choice is coefficient of variation of density. This has the highest formal measure of average discriminatory power because of its ability to make a
FIG. 23(e). See Fig. 23(a), (b), p. 378. (e) Discriminants for distinguishing all types except neutrophils (the easiest group to identify) were developed and transformed. Lymphocytes, monocytes, eosinophils, and basophils are compactly clustered in different regions; neutrophils are widely dispersed in terms of the canonical variables developed for discriminating the other four types only, but nevertheless, they are distinct and separable.
perfect tripartite decision: monocyte-lymphocyte-eosinophil vs. neutrophil vs. basophil.

(7) The articulation-dependent parameter, mean density ratio of nucleus to cell, correlates with the coefficient of variation of cell density at the 0.9 level, but has a lower formal measure of discriminatory power. Its discriminatory capacity is more uniform among cell types, however, and it actually improves recognition performance by 20%.

(8) When selection is confined to features of the unarticulated cell, perfect recognition occurs in 7-space.
FIG. 24. Canonical coordinates for the five leukocyte group centroids (lymphocyte, monocyte, neutrophil, eosinophil, basophil). Coordinate values were obtained from the four best original cell parameters by linear canonical transformation (see Table X). Together, the four canonical variables account for all the between- to within-type variability in the sample of cells; they are rank ordered according to the proportions of the generalized variance ratio accounted for. The canonical variables act largely independently to separate and characterize individual leukocyte types by simple linear inequalities. At the four standard deviation level: (i) the first canonical variable characterizes basophils; (ii) the third characterizes lymphocytes; (iii) the fourth, monocytes; and (iv) the second distinguishes both neutrophils and eosinophils. (In the figure, a circle marks a group separated from the others at the 4 standard deviation level.)
(9) When selection is limited to 32 articulation-dependent parameters, perfect recognition again occurs in 7-space.

(10) When selection is restricted to the 14 exclusively nuclear parameters, recognition falls just short of being perfect.
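The canonical coordinates referred to in points (3) and (4) and in Fig. 24 are linear combinations of the original cell parameters chosen to maximize the ratio of between-class to within-class scatter. The sketch below shows one conventional way of computing such canonical variates; it is an illustration only, not the original CELLS code, and the function name, the use of NumPy, and the pseudo-inverse are assumptions.

```python
import numpy as np

def canonical_variates(X, labels, n_components=4):
    """Canonical (discriminant) variates: directions maximizing the ratio of
    between-class to within-class scatter.
    X: (n_samples, n_features) array of cell parameters; labels: cell-type labels."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    k = X.shape[1]
    W = np.zeros((k, k))   # pooled within-class scatter
    B = np.zeros((k, k))   # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        d = (mc - grand_mean)[:, None]
        B += len(Xc) * (d @ d.T)
    # Directions solve the generalized eigenproblem  B v = lambda W v;
    # the eigenvalues give the proportion of the generalized variance ratio
    # accounted for by each canonical variable.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(W) @ B)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    A = eigvecs[:, order].real
    return A, X @ A          # projection matrix and canonical coordinates
```

Projecting the five class centroids onto the leading four columns of A gives coordinates of the kind plotted in Fig. 24; in practice the variates would also be scaled to unit within-class variance so that separations can be stated in standard deviation units.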
9. Nonparametric Pattern Detection: A Cytotaxonomy for Leukocytes

The five white cell types were represented as well-dispersed, compact clusters even in the space of the three best features generated by the CELLS program (Fig. 21). The intense unimodal clumping and the high degree of separability in the original feature space, and especially in the canonical coordinate system of standardized and uncorrelated variables, make apparent the reasons for the success of linear functions, although technically they are not the optimal discriminants. These demonstrations strongly suggested that the five cell types might be recovered from the
TABLE XI
A. Linear discrimination with thresholds supplied

Experiment A1. Parameter universe: 68 parameters (set 1-68),(a) all free; background reference: mode.
  Step  Variable                   % Correct
   1    38 C.V. density               60
   2    55 Absorbance ratio           94
   3    32 Mean density               97
   4    31 Absorbance                 99
   5    65 S.D. ratio                 99
   6    10 Span                       99
   7    27 Area                      100

Experiment A2. Parameter universe: 49 parameters (set 26-28; 7, 8, 11, 12, 18, 19),(a) all free; background reference: mode.
  Step  Variable                   % Correct
   1    38 C.V. density               60
   2    55 Absorbance ratio           94
   3    32 Mean density               97
   4    31 Absorbance                 99
   5    65 S.D. ratio                 99
   6    36 S.D.                       99
   7    39 C.V. density              100

Experiment A3. Parameter universe: 49 parameters (set 26-28; 7, 8, 11, 12, 18, 19),(a) all free; background reference: mean.
  Step  Variable                   % Correct
   1    38 C.V. density               58
   2    55 Absorbance ratio           92
   3    32 Mean density               98
   4    31 Absorbance                100
   5    65 S.D. ratio                 99
   6    36 S.D.                       99
   7    39 C.V. density              100

Experiment B4. Parameter universe: 14 nuclear parameters,(a) all free; background reference: mode.
  Step  Variable                   Articulation dependence(b)   % Correct
   1    16 Range highpoint           N                              61
   2    37 S.D.                      N                              79
   3    43 Skewness                  N                              88
   4    31 Absorbance                N                              92
   5    40 C.V. density              N                              92
   6    19 Modal frequency           N                              92
   7     7 Entropy                   N                              93
   8    34 Mean density              N                              95
   9    15 Range lowpoint            N                              93
  10    17 Span                      N                              93
  11    18 Mode                      N                              93
  12    52 Contrast differential     N                              93
  13    28 Area                      N                              93
  14    46 Kurtosis                  N                              94

Experiment B5. Parameter universe: 32 articulation-dependent parameters,(a) all free; background reference: mode.
  Step  Variable                   Articulation dependence(b)   % Correct
   1    60 Separ. means              N/CYT                          70
   2    34 Mean density              N                              89
   3    59 Separ. means              N/CE                           92
   4    31 Absorbance                N                              95
   5    53 Area ratio                N/CE                           95
   6    55 Absorbance ratio          N/CE                           98
   7    65 S.D. ratio                N/CE                          100
   8    67 Contr. diff. ratio        N/CE                           99
   9    54 Area ratio                N/CYT                         100
  10    28 Area                      N                             100

(a) Parameters numbered as in Table IV.
(b) N = Nucleus, CYT = Cytoplasm, CE = Cell.
raw data without the need for a learning set or for an analysis based on distributional assumptions, e.g., by a clustering technique.

There are also other reasons for seeking a nonparametric or distribution-free approach. Although linear discrimination was effective for small, unambiguous leukocyte samples, the learning machine approach may well prove deficient in dealing with harder-to-distinguish examples and more variable cell populations. This possibility is further complicated by the fact that the joint probability densities for the several cell types did in fact differ in covariance structure and symmetry, and did respond differently to variations in the preparation and presentation of specimens. In this broadened universe, it may be difficult to apply classical parametric decision theory, since it requires the estimation of presumably stable subpopulation parameters from test cases.

A nonparametric cluster synthesis was therefore developed, using similarity or proximity in feature space rather than maximum likelihood as the basis for class membership (Prewitt, 1967, 1971, 1972a, b). Unlike the statistical approach, neither the nature nor the number of categories need be prescribed. The notion of cluster is explicated in terms of quantitated statistical, psychological, and mathematical notions such as resemblance, proximity, tendency, connectivity, and significance. Pattern articulation proceeds de novo on the basis of the heuristics embodied in the algorithm rather than on the basis of externally supplied information. Thus a taxonomy responsive to structure detected in the raw data is generated in preference to a superimposed, preconceived codification.

The clustering algorithm accepts object parameters or similarity relations, and generates groups by (1) examining a dynamic dissimilarity function defined for all pairs of objects and clusters and constituting an objective function, (2) applying a merger strategy in accordance with iterative optimization of the dynamic objective function, and (3) evaluating a dynamic objective measure which can be thresholded to provide a halting criterion. Program options permit variation of data normalization, measures of similarity, and clustering strategies.

There are three principal phases in cluster synthesis. In the first or preprocessing phase, parameters of the candidates can be transformed by one or more mappings which fall in the category of linear transformations and translations. If X is the original p-vector, and Y the corresponding q-vector of transformed variates (q < p), these transformations are described by
Y = A(X - B)

where A is a q x p rotation matrix and B is a p-dimensional translation vector. Among the possibilities are pure translation, pure rotation, feature
selection, suppression, or accentuation. Variables typically are centered about the universal centroid, and scaled by their standard deviations. In particular, the original variables may also be replaced by a subset which retains their discriminatory information but is less redundant. For example, uncorrelated, standardized variates can be derived by means of special variance-minimizing or cluster-enhancing linear, but nonorthogonal, mappings. These mappings can be learned by feature selection in a similar, but independent design data set, and later applied to unknowns. The canonical transformation to uncorrelated, standardized variables for which the ratio of between-class to within-class generalized variance is maximized is potentially useful for preprocessing since it is cluster enhancing but not ortho-normal and distance preserving.

The second phase consists of computing a measure of dissimilarity or distance for all candidate pairs, using one of a variety of metrics or similarity coefficients. These include Euclidean and other Minkowski metrics, including the limiting case, various correlation and matching coefficients, and a measure of nonrandom association. Alternatively, the synthesis may proceed directly from similarities or dissimilarities rather than object parameters.

In the final phase, clusters are synthesized. Initially, each object is a singleton cluster. The "most similar" or "closest" clusters are fused to form a new cluster which replaces them. Dissimilarities or distances between revised clusters and unperturbed clusters are computed in accordance with the user-elected clustering strategy, and the entire process is iterated until the supply of candidates for merger is exhausted. The dissimilarity matrix thus constitutes an objective function which is minimized at each iteration and drives the formation of clusters (cf. Ward, 1963). This progressive modification of the dissimilarity matrix involves one of six variants of a single transformation which can direct the mergers so as to emphasize either compactness, connectedness, or central tendency of the resultant clusters. The algorithm can thus efficiently accommodate a broad range of cluster shapes and topologies within a parsimonious program structure.

Since cluster formation involves successive merger of preexisting clusters and isolated objects until all objects have been subsumed by a universal cluster, and since a membership assignment, once made, cannot be revoked, a hierarchy or tree of nested clusters encompassing the entire sample is generated. The root corresponds to the conjoint partition, branch tips correspond to the disjoint partition, and nodes represent intermediate arrangements. The tree is equivalent to a nest of increasingly coarser partitions of the universe of discourse into equivalence classes.

The status of the clustering is summarized by successive values of an objective measure which is a generalization of the multivariate within-
cluster to total variance ratio, computed either in the space of transformed variates or else in Euclidean space. Within-cluster "variance" increases continually as clusters are fused. Since the total "variance" or sum of squared dissimilarities is fixed, there is an inevitable reduction in between-cluster "variance." The objective measure is therefore bounded and ranges from a value of 0 for the disjoint partition (no internal heterogeneity) to a value of 1 for the conjoint partition (no external heterogeneity). Changes in the objective measure are computed at the same time that intercluster differences are revised. Both objective function and objective measure are monotonically increasing functions of the number of clustering events. In general, clusters which are merged because they are currently "the closest" or "most alike" do not correspond to the smallest possible decrement in between-cluster "variance." However, for one of the merger strategies, dissimilarity between two clusters is measured by the incremental internal variance of the hypothetical conglomerate; minimizing the objective function at each iteration in this case coincides with minimizing the objective measure, and cluster formation tends to preserve within-cluster homogeneity or internal structure.

Values of the objective measure are useful for evaluating the significance of clusters. The sequence of values of the objective measure at the completion of each iteration corresponds one-to-one to the sequence of clustering events by definition, and these values can be used for determining a halting criterion. The cutoff threshold is at the user's discretion. In general, the level which seems to correspond to psychological significance is a function of the underlying probability structure of the data, as well as of the cluster merger strategy. The selection of a cutoff threshold is analogous to the selection of significance levels in statistical hypothesis testing. In our diversified experience, clusters existing at the 0.01 level of the objective measure were persistently in harmony with independent human judgments (Prewitt, 1971, 1972a). For multivariate normal populations with common covariance which are analyzed in Euclidean space, the objective measure equals Wilks' lambda statistic, and it leads to an exact statistical test of significance of global discriminatory achievement.

The application of clustering to leukocyte data had two purposes: first, the de novo construction of a cytotaxonomy based on machine rather than human perception and cognition; second, validation of the traditional leukocyte taxonomy which formed the basis of the statistical approach by prescribing the learning sets. Using the algorithm as described above, recovery of the five white cell types was completely effective at the 0.01 level of the objective measure (Fig. 25). The five categories were stable over a broad range of normalization procedures, metrics, and cluster merger
FIG. 25(a)-(c). Validation of the traditional leukocyte taxonomy by cluster synthesis. The farthest-neighbor clustering strategy was applied in the four-dimensional Euclidean space defined by the parameters which lead to perfect linear discrimination: (1) contrast, (2) relative nuclear content or absorbance, (3) mean cell density, and (4) total cell content or absorbance. This strategy merges clusters at each iteration of the algorithm so that the resultant has the smallest possible cluster "diameter" over all unions possible. The algorithm produces a nest or hierarchy of compact clusters. The five leukocyte classes were recovered from the raw data without human intervention or assistance. (a) The objective function: intercluster distance at each merger. (b) The objective measure: generalized intracluster to total "sum of squares" or generalized variance ratio at each merger. The function rises abruptly just several steps before the conjoint partition (all cells included in the universal cluster) is generated. (c) Cluster formation tree with the objective function (intercluster distance at merger) as ordinate. Terminal dots represent individual cells, and confluent lines represent cluster mergers. (From Prewitt, 1972a.)
strategies. For Minkowski spaces with exponents 1.0, 1.5, and 2.0, classification errors on a sample of approximately 100 cells were only 1%, 1%,
FIG. 25(d), (e). See Fig. 25(a)-(c), p. 390. (d) Tree representation of clusters rescaled by the objective measure (generalized variance ratio). The cell types of the clustered sample were labeled independently of the computer analysis by a cytologist using other, unverbalized criteria for class membership. (e) Cluster size tree for the leukocytes showing radius at cluster formation. (From Prewitt, 1972a.)
and 4%, respectively. The nodal positions marking merger of the main cell types varied with the metrics, but the categories maintained a regular pattern of relationship (Fig. 26).
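The cluster synthesis just described, with the farthest-neighbor merger strategy and a Minkowski metric, can be sketched in a few lines. The sketch below is illustrative only: it is not the original program, it recomputes cluster dissimilarities instead of updating the dissimilarity matrix incrementally, and the particular objective measure (within-cluster to total sum of squared dissimilarities) and the names used are assumptions.

```python
import numpy as np
from itertools import combinations

def minkowski(x, y, p=2.0):
    """Minkowski distance with exponent p (p = 2 gives the Euclidean metric)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def farthest_neighbor_clustering(X, p=2.0, stop_level=0.01):
    """Agglomerative cluster synthesis with the farthest-neighbor
    (compactness-emphasizing) merger strategy.  Starts from singleton
    clusters, fuses the two 'closest' clusters at each iteration, and
    tracks an objective measure: within-cluster sum of squared
    dissimilarities divided by the total sum of squared dissimilarities
    (0 for the disjoint partition, 1 for the conjoint partition)."""
    n = len(X)
    d = {(i, j): minkowski(X[i], X[j], p) for i, j in combinations(range(n), 2)}
    total_ss = sum(v * v for v in d.values())
    clusters = {i: [i] for i in range(n)}
    within_ss, merges = 0.0, []

    def link(a, b):
        # farthest-neighbor dissimilarity: largest member-to-member distance
        return max(d[tuple(sorted((i, j)))]
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: link(*ab))
        within_ss += sum(d[tuple(sorted((i, j)))] ** 2
                         for i in clusters[a] for j in clusters[b])
        measure = within_ss / total_ss
        merges.append((a, b, link(a, b), measure))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        if measure > stop_level:          # user-chosen halting level
            break
    return clusters, merges
```

With stop_level = 0.01 and a four-dimensional feature space of the kind named in the caption of Fig. 25, the clusters remaining at the halt play the role of the recovered cell types discussed in the text; substituting p = 1.0 or 1.5 in minkowski corresponds to the alternative spaces of Fig. 26.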
FIG. 26. Stability of the leukocyte taxonomy. Taxonomic trees for leukocytes are shown for three different Minkowski four-spaces, based on contrast, relative nuclear absorbance, mean cell density, and total cell absorbance. The spaces differed in choice of the exponent p in the distance formula d(X, Y) = [Σi |xi - yi|^p]^(1/p), where X = (x1, . . . , xn) and Y = (y1, . . . , yn) are two points in the space and n was taken to be 4. The farthest-neighbor or compactness-emphasizing merger strategy was used in each case. All five cell types were detected in Minkowski spaces with exponents p = 1.0, 1.5, and 2.0, as well as in many other spaces not illustrated. Nodal positions marking merger of the main cell types (abstractly defined as the autonomous clusters existing at the 0.01 level of the objective measure) varied with the metric, but the categories can be arranged in the same terminal order. Individual misclassifications by the computer, as judged by a cytologist, were less serious for p = 1.0 and p = 1.5 (1%) than for Euclidean space, p = 2.0 (4%). Aside from these insignificant departures, the basic cytotaxonomy was reconstituted by the computer cluster synthesis, using photometric quantities rather than explicit morphological and textural qualities. (From Prewitt, 1972a.)
In this way, taxonomic consistency was established between the unverbalized qualitative discriminatory criteria used by the morphologist, and the feature space based on quantitative computer-generated, tonally oriented leukocyte parameters. By this we mean that the morphologist's classification of white blood cells and the numerical taxonomy agreed in extension, although they were based on fundamentally different criteria. This, we felt, was a fundamental test not only of the clustering algorithm, but also of the CELLS approach to characterizing leukocytes.

10. The Inverse Problem: Human Visual Discrimination Using Computer-Oriented Image Properties
The experimental results reported have settled affirmatively the two questions: Can normal white blood cell images be analyzed and described by computer? and Can the white blood cells be discriminated algorithmically, without human intervention or participation in decision making? In the preceding approaches to designing the recognition logic, performance standards were defined externally by a cytologist who used unverbalized
FIG.27. Visualization of dissimilarities in leukocytes attributable to photometric rather than morphologic features. Stylized synthetic cell images, with no shape information, were generated for a “neutrophil” by the method described in the text. Compartmented constructions, corresponding to the articulated cell, and noncompartmented constructions, corresponding to the unstructured cell, are shown in paired sets, according to the photometric properties they illustrate. (a) Size and mean optical density (hence also integrated optical density or absorbance) match those of an actual neutrophil, and individual density values have the same frequency distribution as in the actual cell. (b) Size is preassigned. (c) Size and mean optical density are preassigned.
qualitative spectral, morphological, and textural cues. To perform the same visual task, the computer algorithm used quantitative tonal image features which were mathematical constructs. The decision-theoretic experiments then demonstrated that the computer could emulate the cytologist under whose tutelage the critical recognition program parameters were determined, albeit using different, non-morphological criteria. Cluster synthesis demonstrated further that the computer, proceeding de novo, without authoritative assistance, could derive the traditional leukocyte classification. Hence, very likely, it can be entrusted with meaningful unsupervised recognition.

Additional experimentation has addressed the two complementary questions: Can leukocyte images be synthesized by computer on the basis
of relatively simple generative rules incorporating a stochastic element? and Can a human observer emulate computer recognition, that is, visually discriminate leukocytes using only the successful computer-generated, machine-oriented features as visual cues? To answer these challenges, a computer program, SYNTEX, was written for synthesizing stylized cell abstractions which embody the simple discriminatory features. Using Monte Carlo techniques, populations of leukocyte-like images with specific morphological features and tonal characteristics were synthesized (Prewitt, 1967; Prewitt et al., 1968). These computer-constructed images can have arbitrarily shaped boundaries, content, area, mean optical density, mean contrast, optical density dis-
FIG.28a. Size and mean density have been equalized (hence also integrated density) for synthesized “cells” corresponding to the five leukocyte types. Thus visual Merences are attributable only to the unconditional probability distribution of optical density (that is, to the shape of the histogram). The top row of images is patterned after granulocytes E, N, B (left to right), and the bottom row after nongranulocytes L, M (left to right). a, Compartmented images.
FIG.28b. See Fig. 28a, p. 395. Noncompartmented images.
tributions, and nuclear placement. Such images have not only those properties which would allow correct classification by the recognition algorithms, but also present realistic visual characteristics such as lobulation and rudimentary textural patterns. A set of synthesized images intended to display only those tonal properties which are utilized in the optimal discrimination scheme, in isolation of other features present in the natural context, is given in Figs. 27-30. Morphological cues were suppressed by the artifice of using only circular nuclear and cell contours. First-order leukocyte facsimiles were obtained by assigning individual brightness values to image points so that characteristic histogram-derived properties were statistically matched. Specifically, to construct an image, an actual leukocyte optical density histogram was selected arbitrarily, and appropriate cell and nuclear density thresholds were extracted. The histogram was next converted to cumulative distributions of optical density for background, cell, cytoplasm, and nucleus. The cumulative distributions were further transformed to conform with constraints on areas, contrast, mean density, and integrated density, as desired. To generate a single-compartment stylized image (corresponding
to a nonarticulated cell), cell size was specified, and a conforming digitized circular boundary was drawn. Density values for individual picture elements within the contour were determined by random sampling of the transformed cumulative distribution for cell optical density. Similarly, picture elements outside the contour were assigned random values from the cumulative density distribution for background. For two-compartment constructions (corresponding to articulated cell images), nuclear and cell circumferences were drawn as described, and density values were assigned by analogous sampling procedures. The technique is illustrated for a stylized neutrophil in Fig. 27. Compartmented and noncompartmented views are paired. In the first pair, size and
FIG.29a. Size has been equalized for synthesized “cells” of the five types, so that mean density can be compared. The top row is patterned after granulocytes E, N, B (left to right), and the bottom row after nongranulocytes L, M (left to right). a, Compartmented images.
FIG. 29b. See Fig. 29a, p. 397. Noncompartmented images.
mean optical density (hence also integrated optical density) match those of the parent histogram, and individual density values represent random samples from the cumulative distributions. The second and third pairs have preassigned sizes; the third pair also has preassigned mean optical density. Visual dissimilarities, attributable to differences in size, content, and first-order optical density statistics, are clearly manifest in synthesized abstractions corresponding to the five leukocyte types. In Figs. 28a and 28b, size and mean density have been equalized (hence also integrated density), so that the only operating factor is the relative unconditional distribution of optical density values (i.e., only the shape of the histogram is effective). In Figs. 29a and 29b, size and integrated density have been equalized, so that mean density can be compared. Finally, in Figs. 30a and 30b, size, mean density, and hence integrated density are free variables, so that the synthesized images reflect the natural tonal properties upon which recognition was based, up to first-order statistics. A set of 25 stylized images, based on five randomly selected white blood cells of each of the five leukocyte types, was constructed for use in an
informal psychological experiment. Unlabeled, unsorted photographic reproductions were presented to observers with instructions to group them according to subjective impressions of visual similarity. Five groups were consistently formed, with five facsimiles per group which were attributable to prototypes from the same leukocyte category. Knowledgeable observers could also usually identify the "leukocyte" type.

Finally, Fig. 31 shows a rudimentary granulocyte facsimile which incorporates not only the tonal characteristics, but also the nuclear morphology of a two-lobed neutrophil. The nuclear contour was generated by a trigonometric series for radial distance from an origin inside the nucleus to the periphery as a function of angle. Assignment of individual density values to "nuclear" and "cytoplasmic" elements followed the preceding
FIG.30a. Size, mean density, and hence integrated density are free variables, so that the synthesized ‘‘cell’’ images portray those tonal properties of the actual cells upon which computer recognition was based. Only first-order statistics are involved. The top row is patterned after granulocytes E, N, B (left to right), and the bottom row after non-granulocytes L, M (left to right). a, Compartmented images.
FIG.30b. See Fig. 30a, p. 399. Noncompartmented images.
density-placement algorithm. The synthesizer can also include randomized placement rules for granules and other cellular structures, and higher than first-order optical density statistics.

11. Perspectives on Automatic Microimage Analysis
The history of microscopic image analysis has been marked by vigorous application of mathematical, statistical, and engineering techniques to problems of visual inspection, decision-making, and interpretation by machine. The most effective research to date has centered around the use of the digital computer as image processor and pattern recognizer. Early attempts at automation in microscopy were concerned with measuring physical and biochemical characteristics of cells that could be determined in a fairly direct way and could serve as standards for com-
parison or as statistical templates. Scanning microscopes and electronic devices for measuring absorption and for counting, sizing, and sorting cells in tissue sections or in smears were under development in many laboratories simultaneously over a decade ago. Berkley (1959) described several devices for the detection, counting, and sizing of objects. These generally used pre-set density threshold logic to define areas of interest in the microscopic field. Simplifying assumptions about particle shape were commonly made, and led to inadequate handling of reentrant contours. By 1962, many techniques in scanning microscopy were available for application to high resolution quantitative cytochemistry (Montgomery, 1962; Caspersson and Lomakka, 1962).
FIG. 31. Synthesized granulocyte facsimile. The synthesized image incorporates both characteristic nuclear morphology of a two-lobed neutrophil, and tonal properties up to first-order statistics. The nuclear contour was generated by a trigonometric series for radial distance from an origin inside the nucleus to the periphery, as a function of angle. An alternate algorithm uses arc length along the closed nuclear contour rather than polar angle as the parameter. Assignment of individual density values to "nucleus" and "cytoplasm" proceeded according to the placement rule based on random sampling of appropriate portions of an actual optical density histogram for a neutrophil, as described in the text.
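The placement rule and the contour generation just described can be sketched as follows. This is not the SYNTEX program itself; the function names, the circular two-compartment geometry, and the use of inverse-transform (histogram) sampling are illustrative assumptions consistent with the description in the text.

```python
import numpy as np

def sample_from_histogram(values, counts, size, rng):
    """Draw `size` optical densities at random from an empirical histogram
    (bin values and their frequencies), i.e. from its cumulative distribution."""
    probs = np.asarray(counts, dtype=float)
    probs /= probs.sum()
    return rng.choice(np.asarray(values), size=size, p=probs)

def lobed_radius(theta, base=20.0, lobes=2, depth=6.0):
    """Radial distance of a nuclear contour as a trigonometric function of
    angle; lobes=2 gives a rudimentary two-lobed 'neutrophil' outline."""
    return base + depth * np.cos(lobes * theta)

def synthesize_cell(shape, cell_radius, nucleus_radius,
                    bg_hist, cyt_hist, nuc_hist, seed=0):
    """Two-compartment stylized cell image: a circular nucleus inside a
    circular cell, with every picture element assigned a density drawn from
    the histogram of its compartment (background, cytoplasm, or nucleus).
    Each *_hist argument is a pair (values, counts)."""
    rng = np.random.default_rng(seed)
    rows, cols = shape
    y, x = np.mgrid[:rows, :cols]
    r = np.hypot(y - rows / 2.0, x - cols / 2.0)
    image = np.empty(shape, dtype=float)
    regions = ((r >= cell_radius, bg_hist),                        # background
               ((r < cell_radius) & (r >= nucleus_radius), cyt_hist),  # cytoplasm
               (r < nucleus_radius, nuc_hist))                     # nucleus
    for mask, (values, counts) in regions:
        image[mask] = sample_from_histogram(values, counts, int(mask.sum()), rng)
    return image
```

A lobed facsimile of the kind shown in Fig. 31 would be obtained by testing each picture element's polar coordinates against lobed_radius(theta) instead of the fixed nucleus_radius; single-compartment (nonarticulated) constructions simply omit the nuclear region.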
Concurrent with these developments, several major investigations sought statistical information on the physiochemical properties of normal and abnormal cells that had diagnostic potential. Hoffman (1962) used a flying-spot microscope for making measurements of intact cells, with the purpose of studying DNA content and nuclear-cytoplasmic volume ratio, particularly as a function of cell age. In a manual study of 10,000 abnormal cells of cervical squamous cell cancer, Reagan et al. (1957) compared several indicators of cell shape, texture, and size for normal and cancerous cells. Tolles (1954) gathered statistics on four quantitative features of exfoliated cells in vaginal smears: nuclear density, nuclear diameter, cell diameter, and cytoplasmic color, and constructed two- and three-dimensional scatter plots. The study was broadened (Tolles et al., 1961), and cell population distributions for cell and nuclear diameter and nuclear extinction considered pairwise were obtained. Many of these design principles and cytological findings were incorporated into special-purpose discrimination devices, but at a time too early in the state of the art of image processing to have led to permanent instrumentation solutions.

Screening criteria for detecting abnormalities in routine clinical diagnosis were selected on the basis of the forementioned and related studies, and were finally incorporated into the CYTOANALYZER (Bostrom et al., 1959; Tolles, 1955), an experimental, fixed logic machine for screening cytological smears. Spencer and Bostrom (1962) were pessimistic about field use of the instrument and pointed out the disparity between smear preparation (which tended to produce dense aggregates of leukocytes, for this purpose inconsequential) and the discrimination criteria (which attributed high density only to malignant or suspicious cells). Diacumakos et al. (1962) further stressed the biological, cytochemical, and electronic complexities attendant to the problem, and criticized the inflexible logic that precluded the CYTOANALYZER from perceiving salient information, rejecting irrelevant information, and learning.

The original CELLSCAN system (Preston, 1962; Izzo and Coles, 1962; Ingram et al., 1969b), another example, employed a television microscope with adjunct special-purpose digital logic for processing scanned fields. The system utilized manually preset density thresholds to produce silhouettes (transformed or binary images with two intensity levels, "black" and "white") of blood cells, one cell at a time, and also utilized additional resolution-degrading pictorial compression. Using parameters derived from the size and number of light and dark areas in the field of view, the system was applied to distinguishing granulocytes, normal lymphocytes, and binucleated lymphocytes in conventionally prepared smears. The approach emphasized discrimination on the basis of gross cell morphology, as visible in the Wright's stain smear (Ingram et al., 1969a).
All of the forementioned efforts aimed toward ultimate real-time screening systems in which few properties would be involved and in which simple discrimination logic and decision rules would be embodied in analog circuitry and short-term memories. These early systems were seriously hampered by the use of manually preset thresholds, limited field of view, coarse sampling density, lack of focus and brightness stability, and suboptimal match of scan tube emission spectrum and specimen absorption spectrum (Mansberg and Segarra, 1962). Premature hardware implementation of preconceived and often unproved recognition criteria produced a generation of discrimination devices which not only were inadequate, but difficult and expensive, if not impossible to modify or improve. In their discussion of size and shape determination in microscopy, Tolles and Mansberg (1962) conclude by emphasizing the fact that many criteria would have to be reviewed critically to optimize parameter combinations for effective and efficient discrimination. They suggested that the coupling of a microscanner to a general-purpose digital computer afforded the only flexible yet practical mode of approach. The CYDAC system was the realization of this suggestion (Bostrom and Holcomb, 1963; Nadel, 1965). Many systems have since been designed along these lines. They incorporate either a scanning microscope or film reader for pictorial input, and a general-purpose large scale digital computer for image manipulation, structuring, and feature extraction. Often an on-line minicomputer manages the scanner controls, and locally edits, formats, and displays data (Stein et al., 1969). Qualitative and quantitative methods are being combined, mensuration being the foundation for both, and the tone of research has shifted from what we have called particle analysis of binary objects to image processing of gray-tone scenes.

The discrimination of white blood cells has persistently been a fruitful stimulus to research and development. The histogram method has proved to be reliable, accurate, and noise-insensitive. It is now standard in virtually every cell scanning system for threshold selection, and in many it is also the basis for rapid feature extraction. It is used for determining global thresholds when background is somewhat non-uniform, and these thresholds are then refined by local picture processing operations. Young (1970) and Bacus (1971), working with Wright's stain material and color, have used the CELLS parameters and density histograms developed for CYDAC, and Green (1970) has applied these to erythrocyte analysis. Investigators of cell systems other than leukocytes (Wied et al., 1968) have also followed suit and based picture processing around these ideas.

Interim revisions of CELLSCAN maintained the emphasis on cell shape as a distinguishing feature, but utilized a general-purpose digital computer for multivariate analysis of the sizing and counting parameters. The most recent CELLSCAN-GLOPR version (Ingram and Preston, 1970) uses a
mirror scanning system and reverts to high speed digital circuitry to examine the density histogram and perform thresholding, as well as to execute the Golay topological operations (Preston, 1971). Bourk (1970) has also stressed nuclear morphology. Finally, Lurie et al. (1969) have discussed the recognition of white cells by features based on transmittances at three different wavelengths, 535, 570, and 635 nm. Multispectral scanning complicates instrumentation somewhat, but obviously carries more information and would be expected to lead to more features, albeit redundant features, hence possibly to better discrimination. The use of color, however, is hardly essential but very helpful. A quadripartite discrimination rule is the minimal complexity required for five-category discrimination, and this has been achieved monochromatically.

Recently, Kamentsky et al. (1965), following the lead of Coulter (1955), have designed a spectrophotometer for rapid on-line mass cell analysis. The instrument measures light absorption and scattering of cells in suspension as they pass through a small aperture and generates a two-dimensional display of these parameters for entire cell populations. Patterns vary with pathology, and of course with cell type. In addition, analog electrooptics, in contrast to digital electrooptics, share some of the advantages of the throughput of flow systems, including speed and insensitivity to motion and to small departures in "focus."

Certain features of the granulocytic series of leukocytes should show specific spatial frequency responses corresponding to characteristic size of granules, and when combined with features based on spectral extinction, they should suffice to discriminate the mature leukocytes. Bowie (1970) has reported preliminary success in discriminating leukocytes on the basis of the ratio of extinction in annular samples of the diffraction pattern (an optical Fourier transform) of transparencies of white cells.

Stand-alone microscanners and optical filters cannot compete favorably in throughput with flow systems. On the other hand, stand-alone flow systems are ill-suited for the detailed, targeted analysis of individual cells. The next generation configuration will very likely consist of a hybrid arrangement, combining the advantages of both systems. The flow system could perform easy-to-make discriminations on the basis of "gestalt" image properties and cell sorting. The microscanner in conjunction with digital computers, both local and remote, and perhaps also interactive graphic terminals, could handle more difficult, detailed, or complex analyses.

12. Summary and Prospectus
Automation of the acquisition and interpretation of data in microscopy has been a focus in biomedical research for over a decade. In spite of many
serious attempts, mechanical perception of microscopic fields with a reliability and cost-benefit ratio that would inspire routine clinical application is not yet a reality. Nevertheless, it is also no longer speculation.

Many facets of the problem of automating cell image analysis are well within the grasp of present technology. Available histochemical techniques make it possible to prepare cytological material so that morphological integrity is preserved, key cell constituents are stained differentially and, if desired, also stoichiometrically, and specimens are favorably dispersed for viewing. Scanning microscopes have the requisite sensitivity, resolution, and stability to sample such objects and make photometric measurements over a wide range of magnifications and wavelengths. Computer-managed microscanners with selective attention and controlled illumination, wavelength, magnification, focus, and stage motion are actively under development. Current generation information processing facilities permit the rapid manipulation of the hitherto unmanageable quantities of optical information resident in all but the simplest microscopic images. However, the impact of very recent advances in high speed digital circuitry, electrooptics, and hybrid digital-analog flow/scanning systems on our ability to implement algorithmic approaches to biomedical picture processing efficiently is yet to be felt, and very likely will turn feasibility into practicality.

The automatic discrimination of human leukocytes or white blood cells has probably commanded more interdisciplinary effort and skill than any other biological picture processing problem, perhaps with the exception of chromosome karyotyping. In the process, computer discrimination of leukocytes has become a classic example of the successful application of computer "pattern detection and recognition" to natural imagery and uncontrived scientific problems.

The objective of our leukocyte study was to distinguish the five types of white cells found in normal human peripheral blood on the basis of computer processing of digitized microimages. Relevance to automating the differential white cell count and the detection of rare cell abnormalities is immediate, provided that more rapid electrooptical input of slide data is feasible. This research has indeed indicated that machine perception can contribute materially to microscopic diagnosis, not merely in performing lesser tasks, but in stimulating fundamentally new insights in cytomorphology and cytophotometry.

The densitometrically oriented approach to the analysis of cell images which we introduced (Prewitt and Mendelsohn, 1966b) emphasized: (1) finely discriminated gray-scale information taken at high optical resolution, (2) the use of optical density frequency distributions of images to perform the preliminary spatial organization of quantitative optical information, (3) the extraction and evaluation of large collections of
potential image-characterizing parameters within a decision-theoretic framework, and (4) the algorithmic construction of a taxonomy for mature leukocytes.

In the experiments reported herein, stained white blood cells were imaged monochromatically and digitized by a flying-spot microscanner, CYDAC, with high spatial and tonal resolution. Computer programs separated cell images from background, articulated the major morphological units, and generated approximately 80 numeric parameters for describing each white cell image. A leukocyte was represented by a point in a multidimensional vector space in which the coordinate axes were systematically identified with these parameters. Classification of specimens utilized the Bayesian linear discriminants which correspond to the maximum likelihood strategy.

The 80 cell parameters varied greatly in their discriminatory relevance and covariance properties, and were highly redundant. In fact, only a small subset was actually necessary for discrimination, and many different sets of four to seven parameters were satisfactory. A feature selection algorithm produced a parsimonious ensemble of four parameters capable of perfect linear discrimination of the five leukocyte types, that is, able to secure agreement between the computer analysis and the cytological authority who provided the labeled learning sample. Essentially no new information was gained with additional variables. The best-performing set of variables was not the one we would be inclined to pick on the basis of intuitive interpretation of the parameters.

Strategies of four types entered into the decision-theoretic approach: (1) strategies for classification; (2) strategies for affixing values to parameters of the decision process; (3) strategies for variable selection; and (4) strategies for implementing the chosen discriminatory logic. Optimization was a pervasive and unifying theme in both theoretical and pragmatic considerations. Image articulation itself was directed toward segmentation of the field of view into morphological units so that subsequent aspects of image processing would be facilitated. In other words, "optimal" placement of boundaries was sought. Parametric Bayesian discriminants with the maximum likelihood policy were used to minimize the long-term probability of misclassification. Constants of the linear discriminants were "optimally" estimated from exemplars labeled by a cytologist, using maximum likelihood (rather than Bayesian) estimation techniques. The linear discriminants are optimal in the sense of Bayes if the multivariate probability densities for the five cell categories are unimodal and ellipsoidally symmetric, and share a common covariance matrix. Although technically suboptimal for blood cell dis-
crimination because of heterogeneous covariance and possible asymmetry, the linear functions nevertheless were successful.

In order to condense and optimize the cell representation relative to the given variates and the linear form selected for the discriminant functions, a feature selection algorithm was used. This procedure directed a stepwise accumulation of predictors in a manner that tried to maximize the between-class to within-class generalized variance ratio (and, incidentally, the sample divergence). It would succeed in doing this precisely if we were dealing with multinormal probability densities and independent variates. At each step, the set of already selected parameters was augmented by the as yet unselected parameter which gave the maximum, above-threshold increment in the generalized variance ratio, or else was purged of the parameter which made the least, below-threshold contribution in the context of the remaining parameters.

Finally, to obtain good actual algorithms and realistic predictions of their behavior, the best use should be made of the available sample information. Specifically, apportionment of sampling effort should be optimized with respect to system design and evaluation goals concurrently (although this was not pursued in the leukocyte experiments). This includes simultaneously optimizing (1) the number of subjects, (2) the number of variables per subject, (3) the measurement resolution of each variable, (4) the apportionment of samples among the various categories, and (5) the complexity of the analysis, all in the light of (i) any structural assumptions about the underlying probability distribution of subjects and variables within categories, (ii) risks or penalties of error, (iii) cost of data acquisition, and (iv) cost of computation.
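The stepwise accumulation and purging just described can be sketched compactly. The sketch below is not the original selection program: the separability criterion used, trace(W^-1 B), is only one common surrogate for the between-class to within-class generalized variance ratio, and the thresholds and names are illustrative.

```python
import numpy as np

def separability(X, labels):
    """Between- to within-class scatter criterion trace(W^-1 B), used here as
    a stand-in for the generalized variance ratio of the text."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    k = X.shape[1]
    W, B = np.zeros((k, k)), np.zeros((k, k))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        d = (mc - grand_mean)[:, None]
        B += len(Xc) * (d @ d.T)
    return float(np.trace(np.linalg.pinv(W) @ B))

def stepwise_select(X, labels, add_threshold=0.01, drop_threshold=0.005):
    """Forward accumulation with backward purging: add the unselected variable
    giving the largest above-threshold gain in the criterion, then drop any
    selected variable whose contribution has fallen below drop_threshold."""
    selected = []
    remaining = list(range(X.shape[1]))
    score = 0.0
    improved = True
    while improved and remaining:
        improved = False
        best_score, best_j = max((separability(X[:, selected + [j]], labels), j)
                                 for j in remaining)
        if best_score - score > add_threshold:
            selected.append(best_j)
            remaining.remove(best_j)
            score, improved = best_score, True
        for j in list(selected):                      # purging pass
            rest = [i for i in selected if i != j]
            if rest and score - separability(X[:, rest], labels) < drop_threshold:
                selected.remove(j)
                remaining.append(j)
                score = separability(X[:, rest], labels)
    return selected
```

Applied to a collection of cell parameters of the kind generated by CELLS, a procedure of this general form corresponds to the step-by-step selections and percent-correct figures tabulated in Table XI.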
Experimental results have validated the density-oriented approach to the analysis of cell images, and have led us to the following position.

1. The CELLS algorithms for image articulation and feature extraction generate image parameters that are responsive to taxonomic differences among leukocyte types, yet relatively insensitive to within-type variability and to secondary factors attendant in specimen preparation and electrooptical scanning.

2. The performance criterion of maximizing the compatibility of man-made and machine-made leukocyte discriminations can be formulated in decision-theoretic terms, and leads to efficient implementation of discriminatory logic in the form of linear discriminant functions, operating in a multidimensional vector space of image representations.

3. The set of parameters is sufficiently rich to characterize the five leukocyte types: lymphocytes, monocytes, neutrophils, eosinophils, and basophils. Many subsets of as few as four parameters yield effective machine discrimination.

4. Statistical and quasistatistical or heuristic methods exist for feature selection, that is, for obtaining strong minimal subsets of characterizing parameters and for evaluating their absolute and comparative discriminatory effectiveness. This systematic assemblage is superior to parameter selection on the basis of intuition or random search, and is more efficient than exhaustive search and comparison. Furthermore, it provides insight into the variance-covariance structure of good feature sets.

5. Multivariate linear discrimination is approximately optimal for leukocyte discrimination with respect to the dual criterion of minimizing man-machine divergence in discrimination while minimizing the requisite dimensionality of the image-representing space.

6. The parameters can be given plausible morphologic and densitometric interpretations, and hence correlated with cell descriptions. These near-optimal subsets of cell parameters give insight into the salient global visual differences and similarities among blood cell images.

7. In the CYDAC system, densitometric information, as contrasted with morphological information, is totally adequate for cell identification, and indeed, is somewhat more powerful.

8. The ensemble of algorithms for cell articulation, characterization, and identification appears to be a robust solution to the problem of machine discrimination of normal mature leukocytes.

The success of the linear discriminatory logic is attributable in part to the appropriateness of the methods of image articulation to the subject matter at hand, in part to the plethora of computer-generated features, and in part to the robustness of the decision-theoretic and feature selection techniques. However, pronounced intrinsic differences among the five leukocyte types are apparent even to the untrained eye, and the relevance of the density histogram and the CELLS parameters to expressing these distinguishing characteristics is of primary importance.

Our success has encouraged others to try the methods on similar problems of cell identification. This subsequent corroborating research by others has indicated a broad field of application. Future research alone will tell whether the techniques will yield the discriminatory power to master more intricate and subtle discrimination problems in microscopy.

ACKNOWLEDGMENT
The author gratefully acknowledges the continuous support of the National Institutes of Health, Bethesda, Maryland. This research was sponsored under grant USPHS1-R01-GM-16913 from the National Institute of General Medical Sciences and contract PH43-62-432 from the National Cancer Institute, to the Department of Radiology, University of Pennsylvania, and the manuscript was written at the Division of Computer Research and Technology of the National Institutes of Health.
The author wishes especially to express appreciation to two colleagues: Dr. Mortimer
L. Mendelsohn, who initiated the leukocyte project, collaborated in the discrimination research, and encouraged the development of new analytical techniques; and Dr. Brian H. Mayall, who expertly prepared, selected, and scanned the specimens for optical discrimination.

Portions of this research were reported at the following scientific meetings: Conference on Advances in Biomedical Computer Applications, New York Academy of Sciences, 1965; Conference on Data Extraction and Processing of Optical Images in the Medical and Biological Sciences, New York Academy of Sciences, 1967; Symposium on Automatic Photointerpretation, Washington, D.C., 1967; IEEE Pattern Recognition Workshop, Delft, The Netherlands, 1968; NATO Summer School on Automatic Interpretation and Classification of Images, Pisa-Tirrenia, Italy, 1968; Psychopictorics Symposium, Arlington, Virginia, 1969; IEEE Symposium on Adaptive Processes, Austin, Texas, 1970.

REFERENCES

Abend, K., Chandrasekaran, B., and Harley, T. J., Jr. (1969). Comments on the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform. Theory IT-15, 420.
Abramowitz, M., and Stegun, I. A. (eds.) (1964). "Handbook of Mathematical Functions." National Bur. Stds. Appl. Math. Ser. 55. National Bur. Std., Washington, D.C.
Abramson, N., and Braverman, D. (1962). Learning to recognize patterns in a random environment. IRE Trans. Inform. Theory IT-8, 558.
Albrecht, R., and Werner, W. (1964). Error analysis of a statistical decision model. IEEE Trans. Inform. Theory IT-10, 34.
Allais, D. C. (1964). The selection of measurements for prediction. Rep. SEL-64-115 (TR No. 6103-9). Stanford Electronics Laboratory, Stanford, Calif.
Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16, 31.
Anderson, T. W. (1958). "An Introduction to Multivariate Statistical Analysis." Wiley, New York.
Bacus, J. W. (1971). An automated classification of the peripheral blood leukocytes by means of digital image processing. Ph.D. Thesis, Dept. of Physiology, Univ. of Illinois, Chicago, Ill.
Bartlett, M. S. (1934). The vector representation of a sample. Proc. Cambridge Phil. Soc. 30, 327.
Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Proc. Cambridge Phil. Soc. 34, 33.
Bartlett, M. S. (1947). Multivariate analysis. J. Roy. Stat. Soc. Suppl. 9, 76.
Berkley, C. (1959). Electronic counting, sorting, and sizing. In "Medical Electronics: Proc. 2nd Conf. Med. Electron., Paris, 1959" (C. N. Smyth, ed.), pp. 459-470. Iliffe, London.
Bledsoe, W. W. (1966). Some results on multicategory pattern recognition. J. Ass. Comput. Machinery 13, 304.
Blum, H. (1967). A transformation for extracting new descriptors of shape. In "Models for the Perception of Speech and Visual Form" (W. Wathen-Dunn, ed.). M.I.T. Press, Cambridge.
Bose, R. C., and Roy, S. N. (1938). The distribution of the studentised D²-statistic. Sankhyā 4, 19.
Bostrom, R. C., and Holcomb, W. G. (1963). CYDAC, a digital scanning cytophotometer. IEEE Int. Conv. Rec., Part 9, 110.
Bostrom, R. C., Sawyer, H. S., and Tolles, W. E. (1959). Instrumentation for automatically pre-screening cytological smears. Proc. IRE 47, 1895.
Bourk, T. (1970). Automated characterization of leukocyte nucleus morphology. S.M. Thesis, Dept. of Electrical Engineering, Mass. Inst. Tech., Cambridge, Mass.
Bowie, J. E. (1970). Differential leukocyte classification using an optical processing system. 23rd ACEMB, Washington, D.C., 1970, p. 20.1; S.M. Thesis, Dept. of Electrical Engineering, Mass. Inst. Tech., Cambridge, Mass.
Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika 36, 317.
Bøyum, A. (1964). Separation of white blood cells. Nature 204, 793.
Butler, J. J., and Wall, L. C. (1959). Photomicrography of blood specimens. In “Medical Radiography and Photography,” Vol. 35, pp. 20-34. Eastman Kodak, Rochester, N.Y.
Caspersson, T. O., and Lomakka, G. M. (1962). Scanning microscopy techniques for high resolution quantitative cytochemistry. Ann. N.Y. Acad. Sci. 97, 449.
Chien, Y. T., and Fu, K. S. (1967). On the generalized Karhunen-Loève expansion. IEEE Trans. Inform. Theory IT-13, 518.
Chien, Y. T., and Fu, K. S. (1968). Selection and ordering of feature observations in a pattern recognition system. Inform. Control 12, 394.
Chow, C. K. (1957). An optimum character recognition system using decision functions. IRE Trans. Electron. Comput. EC-6, 247.
Chu, J. T. (1965). Optimal decision functions for computer character recognition. J. Ass. Comput. Machinery 12, 213.
Chu, J. T., and Chueh, J. C. (1967). Error probability in decision functions for character recognition. J. Ass. Comput. Machinery 14, 273.
Cochran, W. G., and Hopkins, C. E. (1961). Some classification problems with multivariate qualitative data. Biometrics 17, 10.
Cochran, W. G. (1962). On the performance of the linear discriminant function. Bull. Int. Stat. Inst. 39, 435.
Cooper, P. W. (1964). Hyperplanes, hyperspheres, and hyperquadrics as decision boundaries. In “Computers and Information Sciences” (J. T. Tou and R. H. Wilcox, eds.), Chapter 4, pp. 111-138. Spartan, Washington, D.C.
Coulter, W. H. (1956). High speed automatic blood cell counter and cell size analyzer. Proc. 1956 Nat. Electron. Conf., Chicago, Ill.
Daland, G. A. (1959). “A Color Atlas of Morphological Hematology.” Harvard Univ. Press, Cambridge, Mass.
Day, N. E. (1969). Linear and quadratic discrimination in pattern recognition. IEEE Trans. Inform. Theory IT-15, 419.
Deitch, A. D. (1955). Microspectrophotometric study of the binding of the anionic dye, naphthol yellow S, by tissue sections and by purified proteins. Laboratory Invest. 4, 324.
Diacumakos, E. G., Day, E., and Kopac, M. J. (1962). Exfoliated cell studies and the Cytoanalyzer. Ann. N.Y. Acad. Sci. 97, 498.
Dunn, O. J., and Varady, P. (1966). Probabilities of correct classification in discriminant analysis. Biometrics 22, 908.
Efroymsen, M. A. (1962). Multiple regression analysis. In “Mathematical Methods for Digital Computers” (A. Ralston and H. Wilf, eds.), Vol. 1. Wiley, New York.
Einarson, L. (1951). On the theory of gallocyanin-chromalum staining and its application for quantitative estimation of basophilia. A selective staining of exquisite progressivity. Acta Pathol. Microbiol. Scand. 28, 82.
Elashoff, J. D., Elashoff, R. M., and Goldman, G. E. (1967). On the choice of variables in classification problems with dichotomous variables. Biometrika 54, 668.
Elias, H. (1971). Three-dimensional structure identified from single sections. Science 174, 993.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics, London 7, 179.
Fisher, R. A. (1937). The statistical utilization of multiple measurements. Ann. Eugenics, London 8, 376.
Freeman, H. (1961). On the encoding of arbitrary geometric configurations. IRE Trans. Electron. Computers EC-10, 260.
Fu, K. S. (1968). “Sequential Methods in Pattern Recognition and Machine Learning.” Academic Press, New York.
Gaffey, W. R. (1951). Discriminatory analysis: Perfect discrimination as the number of variables increases. Rep. No. 5, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
Green, J. (1970). Computer methods for erythrocyte analysis. Proc. IEEE Symp. Feature Extraction and Selection in Pattern Recognition, Argonne, Ill.
Ham, A. W. (1969). “Histology,” 6th Ed. Lippincott, Philadelphia.
Highleyman, W. H. (1962). The design and analysis of pattern recognition experiments. Bell Sys. Tech. J. 41, 723.
Hoffman, J. G. (1962). Flying spot methods for measuring DNA and the nucleocytoplasmic volume ratio in tissue cells. Ann. N.Y. Acad. Sci. 97, 380.
Hotelling, H. (1931). The generalization of Student's ratio. Ann. Math. Stat. 2, 360.
Hsu, P. L. (1938). Notes on Hotelling's generalized T². Ann. Math. Stat. 9, 231.
Hughes, G. F. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform. Theory IT-14, 55.
Hughes, G. F. (1969). Number of pattern classifier design samples per class. IEEE Trans. Inform. Theory IT-15, 615.
Ingram, M., and Preston, K., Jr. (1964). Importance of automatic pattern recognition techniques in the early detection of altered blood cell production. Ann. N.Y. Acad. Sci. 113, 1066.
Ingram, M., and Preston, K., Jr. (1970). Automatic analysis of blood cells. Sci. Amer. 223, 72.
Ingram, M., Norgren, P. E., and Preston, K., Jr. (1969a). Advantages of topology as a basis for automatic analysis of blood cell images. Ann. N.Y. Acad. Sci. 157, 275.
Ingram, M., Preston, K., Jr., and Norgren, P. E. (1969b). Automatic differentiation of white blood cells. In “Image Processing in Biological Science” (D. Ramsay, ed.), pp. 97-117. UCLA Press, Los Angeles, California.
Ito, K., and Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of variance when variance-covariance matrices are unequal. Biometrika 51, 71.
Izzo, N. F., and Coles, W. (1962). Blood-cell scanner identifies rare cells. Electronics 35, 52.
John, S. (1961). Errors in discrimination. Ann. Math. Stat. 32, 1125.
Kain, R. Y. (1969). The mean accuracy of pattern recognizers with many pattern classes. IEEE Trans. Inform. Theory IT-15, 424.
Kamentsky, L. A., and Melamed, M. R. (1969). Rapid multiple mass constituent analysis of biological cells. Ann. N.Y. Acad. Sci. 157, 310.
Kamentsky, L. A., Melamed, M. R., and Derman, H. (1965). Science 150, 630.
Kanal, L., and Chandrasekaran, B. (1971). On dimensionality and sample size in statistical pattern classification. Pattern Recognition 3, 225.
Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fenn., Ser. A 137.
Kato, K. (1935). Leucocytes in infancy and childhood. J. Pediat. 7, 7-15.
Kendall, M. G., and Stuart, A. (1961). “The Advanced Theory of Statistics,” Vol. 1. Griffin, London.
Kullback, S. (1959). “Information Theory and Statistics.” Wiley, New York.
Lachenbruch, P. A., and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics 10, 715.
Lebo, J. A., and Hughes, G. F. (1967). Data reduction using information theoretic techniques. Tech. Rep. TR-67-67, Rome Air Development Center, Rome, New York, pp. 45-46.
Lewis, P. M. (1962). The characteristic selection problem in recognition systems. IRE Trans. Inform. Theory IT-8, 171.
Lipkin, L. E., Watt, W. C., and Kirsch, R. A. (1966). The analysis, synthesis, and description of biological images. Ann. N.Y. Acad. Sci. 128, 984.
Lotto, G. (1961). On the generation of all possible stepwise combinations. Math. Comput. 16, 241.
Lukomski, J. (1939). On some properties of multidimensional distributions. Ann. Math. Stat. 10, 236.
Lurie, O. B., Bykov, R. E., and Popetchitelev, E. P. (1969). Automatic recognition of leukocyte cells in blood preparations. Proc. 8th Int. Conf. Med. Biol. Engineering, Chicago, 1969.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proc. Nat. Inst. Sci. (India) 12, 49.
Mansberg, H. P., and Segarra, J. M. (1962). Counting of neurons by flying spot microscope. Ann. N.Y. Acad. Sci. 99, 309.
Marill, T., and Green, D. M. (1963). On the effectiveness of receptors in recognition systems. IEEE Trans. Inform. Theory IT-9, 11.
Mendelsohn, M. L., Mayall, B. H., Prewitt, J. M. S., Bostrom, R. C., and Holcomb, W. G. (1968). Digital transformation and computer analysis of microscopic images. In “Advances in Optical and Electron Microscopy” (V. E. Cosslett and R. Barer, eds.), Vol. 2, pp. 77-150. Academic Press, London.
Montgomery, P. O'B. (ed.) (1962). Scanning techniques in biology and medicine. Ann. N.Y. Acad. Sci. 97, 329.
Morrison, D. F. (1967). “Multivariate Statistical Methods.” McGraw-Hill, New York.
Nadel, E. M. (1965). Computer analysis of cytophotometric fields by CYDAC and its historical evolution from the CYTOANALYZER. Acta Cytologica 9, 203.
Nilsson, N. T. (1965). “Learning Machines.” McGraw-Hill, New York.
Page, L. B., and Culver, P. J. (eds.) (1961). “A Syllabus of Laboratory Examinations in Clinical Diagnosis.” Harvard Univ. Press, Cambridge, Mass.
Preston, K., Jr. (1962). Machine techniques for automatic leukocyte analysis. Ann. N.Y. Acad. Sci. 97, 482.
Preston, K., Jr. (1971). Feature extraction by Golay hexagonal pattern transforms. IEEE Trans. Computers C-20, 1007.
Prewitt, J. M. S. (1965a). In “Progress Report for 1965, Scanning Cytophotometry and Digital Computers” (M. L. Mendelsohn, J. M. S. Prewitt, W. Kolman, and B. H. Perry, eds.). Dept. of Radiology, Univ. of Pennsylvania, Philadelphia, Pennsylvania.
Prewitt, J. M. S. (1965b). The selection of sampling rate for digital scanning. IEEE Trans. Biomed. Eng. BME-12, 14.
Prewitt, J. M. S. (1967). In “Progress Report for 1967, Scanning Cytophotometry and Digital Computers” (M. L. Mendelsohn, J. M. S. Prewitt, B. H. Mayall, B. H. Perry, and T. J. Conway, eds.). Dept. of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania.
Prewitt, J. M. S. (1968). Machine perception of white blood cell images. Rep. 2nd IEEE Pattern Recognition Workshop, The Hague, Netherlands, August.
Prewitt, J. M. S. (1970a). Object enhancement and extraction. In “Picture Processing and Psychopictorics” (B. S. Lipkin and A. Rosenfeld, eds.). Academic Press, New York.
Prewitt, J. M. S. (1970b). Selection of variables and prediction of performance in decision-theoretic approaches to diagnosis. Proc. Conf. Computer Applications Radiol., University of Missouri-Columbia, September 1970.
Prewitt, J. M. S. (1971). Diagnostic and predictive applications for cluster analysis. J. Inform. Med., IRIA (Institut d'Informatique et d'Automatique), Rocquencourt, France.
Prewitt, J. M. S. (1972a). Experiments with statistical and quasi-statistical methods in diagnosis. In “Computer Diagnosis and Diagnosis Methods” (J. Jacquez, ed.), Chapter 17. C. C. Thomas, Springfield, Ill.
Prewitt, J. M. S. (1972b). A versatile hierarchical clustering algorithm with objective function and objective measure. Comput. Programs Biomedicine, in press.
Prewitt, J. M. S. (1972c). Cytotaxonomics of leukocytes based on photometric properties: I and II, in preparation.
Prewitt, J. M. S., and Mendelsohn, M. L. (1966a). A general approach to image analysis by parameter extraction. Proc. Conf. Uses Computers Radiol., Univ. Missouri-Columbia, pp. A2-A41.
Prewitt, J. M. S., and Mendelsohn, M. L. (1966b). The analysis of cell images. Ann. N.Y. Acad. Sci. 128, 1035.
Prewitt, J. M. S., Mayall, B. H., and Mendelsohn, M. L. (1966). Pictorial data processing methods in microscopy. Filmed Data Comput., Boston, paper No. 15. Soc. Photo-Optical Instrumen. Eng., Redondo Beach, California.
Prewitt, J. M. S., Mayall, B. H., and Mendelsohn, M. L. (1968). A morphological interpretation of machine oriented parameters for leukocyte determination. 3rd Int. Congr. Histochem. Cytochem., New York.
Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. J. Roy. Stat. Soc., Ser. B 10, 159.
Rao, C. R. (1952). “Advanced Statistical Methods in Biometric Research.” Wiley, New York.
Reagan, J. W., Hamonic, M. J., and Wentz, W. B. (1957). Analytical study of cells in cervical squamous-cell cancer. Laboratory Invest. 6, 241.
Rosenfeld, A., and Pfaltz, J. L. (1966). Sequential operations in digital picture processing. J. Ass. Comput. Machinery 13, 471.
Sampson, P. (1965). Stepwise multivariate linear discrimination. In “BMD Biomedical Computer Programs” (W. Dixon, ed.). Univ. of California Press, Berkeley.
Sandritter, W., Kiefer, G., and Rick, W. (1963). Über die Stöchiometrie von Gallocyaninchromalaun mit Desoxyribonukleinsäure. Histochemie 3, 315.
Sebestyen, G. S. (1962). “Decision-Making Processes in Pattern Recognition.” Macmillan, New York.
Sokal, R. R., and Sneath, P. H. A. (1963). “Principles of Numerical Taxonomy.” Freeman, San Francisco.
Spencer, C. C., and Bostrom, R. C. (1962). Performance of the Cytoanalyzer in recent clinical trials. J. Nat. Cancer Inst. 29, 267.
Stein, P. G., Lipkin, L. E., and Shapiro, H. M. (1969). Spectre II: General-purpose microscope input for computer. Science 166, 328.
Tolles, W. E. (1954). A multi-dimensional analysis of some quantitative characteristics of exfoliated cells in Papanicolaou smears. Airborne Instrum. Lab., Cutler-Hammer, Deer Park, Long Island, New York.
Tolles, W. E. (1955). The Cytoanalyzer: An example of physics in medical research. Trans. N.Y. Acad. Sci. 2, 250.
Tolles, W. E. (ed.) (1962). Applications and methods of counting and sizing in medicine and biology. Ann. N.Y. Acad. Sci. 99, 231.
Tolles, W. E., and Mansberg, H. P. (1962). Size and shape determination in scanning microscopy. Ann. N.Y. Acad. Sci. 97, 516.
Tolles, W. E., Horvath, W. J., and Bostrom, R. C. (1961). A study of the quantitative characteristics of exfoliated cells from the female genital tract. I. Measurement methods and results. Cancer 14, 437; II. Suitability of quantitative cytological measurements for automatic prescreening. Cancer 14, 455.
Tou, J. T., and Heydorn, R. P. (1967). Some approaches to optimum feature extraction. In “Computers and Information Sciences II” (J. T. Tou, ed.), pp. 57-89. Academic Press, New York.
von Mises, R. (1945). On the classification of observation data into distinct groups. Ann. Math. Stat. 16, 68.
Wald, A. (1944). On a statistical problem arising in the classification of an individual in one of two groups. Ann. Math. Stat. 15, 145.
Wald, A. (1947). “Sequential Analysis.” Wiley, New York.
Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Stat. Ass. 58, 236.
Watanabe, S. (1967). Karhunen-Loève expansion and factor analysis: theoretical remarks and applications. Proc. 4th Conf. Inform. Theory, Stat. Decision Functions, Prague, 1965. Czechoslovak Academy of Sciences, Prague, 1967.
Weiner, J. M., and Dunn, O. J. (1966). Elimination of variates in linear discrimination problems. Biometrics 22, 268.
Welch, B. L. (1939). Note on discriminant functions. Biometrika 31, Pts. I, II, 218.
Wied, G. L., Bartels, P. H., Bahr, G. F., and Oldfield, D. G. (1968). Taxonomic intracellular analytic system (TICAS) for cell identification. Acta Cytolog. 12, 180.
Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika 24, 471.
Wintrobe, M. M. (1962). “Clinical Hematology.” Lea and Febiger, Philadelphia.
Young, I. T. (1970). Automated leukocyte recognition. In “Automated Cell Identification and Cell Sorting” (G. L. Wied and G. F. Bahr, eds.), pp. 187-194. Academic Press, New York.
Author Index
Numbers in parentheses are reference numbers and indicate that an author's work is referred to although his name is not cited in the text. Numbers in italics show the page on which the complete reference is listed.
A Abate, J., 166 Abend, K., 351,409 Abraham, C. T., 166, 169,173 Abramson, N., 339, 409 Ackoff, R. L., 48(1), 53(1), 60(1), 62(1), 63(1), 71 Albers, H. H., 39(2), 71 Albrecht, R., 350, 409 Allais, D. C., 352, 409 Amey, G. X., 166 Anderson, R., 166 Anderson, T. W., 328, 336, 409 Angell, T., 166 Anthony, R. N., 41(3), 71 Arden, B., 210(1), 283 Arora, S. R., 154(8), 166 Ashton, A. C., 92, 110 Asuar, J. V., 89, 110 B Baber, R. L., 166 Bacus, J. W., 409 Bachman, C. W., 151(10), 166 Bahr, G. F., 414 Bakeman, R., 94, 115 Baker, F. T., 173 Baker, R., 76, 112 Balzer, R. M., 278(2), 283 Baran, P., 34(1), 36 Barbaud, P., 77, 89, 110 Baroni, M., 88, 110 Bartee, T. C., 151, 166 Bartels, P. H., 414 Bartlett, M. S., 337, 347, 358, 363, 409 Baskin, H. W., 166 Bates, F., 283 Batstone, P., 83, 110
Bauer-Mengelberg, S., 88(5), 90, 110 Baum, C., 166 Bayes, A. J., 166 Beauchamp, J. W., 110 Becker, J., 166 Belar, H., 75, 11.3 Belady, 166 Bell, J. R., 239, 28.9 Benner, F. H., 166 Benningson, L. A., 174 Bergeron, R. D., 279(36), ,284 Berkley, C., 401, 409 Bernstein, L., 99, 100(7), 110 Bernzott, P., 89, 112 Berul, L., 132(17), 148(17), 166 Birkhoff, G., 151, 166 Black, W. L., 166 Bledsoe, W. W., 409 Bleier, R. E., 149(20), 166 Bloom, B. H., 166 Blum, H., 316, 321, 409 Blunt, C. R., 167 Boettner, D. W., 283 Boyurn, A., 296, 410 Bolas, B. J., 282 Book, E., 279 Bose, R. C., 344, 409 Bostrom, R. C., 298, 402, 403, 409, 410, 412,41.3, 414 Bourk, T., 404, 410 Bourne, C. P., 121, 132, 136(25), 140, 167 Bowie, J. E., 404, 410 Bowles, E., 73, 110 Box, G. E. P., 337, 410 Braverman, D., 339, 409 Brender, M., 95, 110 Brender, R., 95, 110 Bronson, B. H., 73, 110 Brook, B., 88(12), 89, 110
Brooks, F. P., Jr., 75, 111 Brown, H., 281 Bryman, M . R., 173 Buchhole, W., 167 Buckerfield, P. S. T., 167 Butler, J. J., 410 Bykov, R. E., 404,419 Byrd, D., 107, 111 Byrnes, C. J., 167 C Campi, A. V., 167 Carlos, W., 78, 111 Carr, J. W., 167 Carroll, J. M., 29(2), 33, 36 Caspersson, T. O., 401, 410 Cass, J. L., 167 Cau, A. L. C., 36 Chandrasekaran, B., 351, 364, 409, 411 Chandy, K . M., 167 Chapin, N., 167 Chapman, R. L., 167 Charnes, A., 48(4), 71 Cheatham, T. E., 225(6), 238(6), 239(6), 883
Chen, F. C., 167 Chien, Y. T., 363, 410 Childs, D. L., 167, 194(8), 283 Chow, C. K., 329, 410 Chu, J. T., 329, 350, 410 Chu, W. W., 167 Chueh, J. C., 329, 350, 410 Clampett, H. A., Jr., 167 Climenson, W. D., 167 Clough, J., 85, 111 Cochran, W. G., 410 Codd, E. F., 167 Coffman, E. G., Jr., 167 Cohen, J. E., 75,111 Coles, W., 402, 411 Collins, W., 98, 111 Collmeyer, A. J., 167 Comber, E. V., 7, 9, 36 Conger, C. R., 167 Cooper, P. W., 334, 335, 410 Cooper, W. W., 48(4), 71 Corbato, F. J., 178(10), 210(10), 243(10), 283
Coulter, W. H., 404, 410 Craig, J., 167
Craig, S. W., Jr., 167 Crane, F., 98, 111 Curtice, R. M., 168 D Daland, G. A., 287,288, 410 Daley, R. C., 168 Davidson, L., 121(49), 137, 168 Davison, G., 168 Day, E., 402, 410 Day, N. E., 334, 410 Dearden, J., 41(3), 43(5), 71 Deitch, A. D., 296, 410 de la Briandans, S. R., 168 Delgalvis, I., 168 Denning, P. J., 17(5), 25(5), 36 Dennis, J. B., 168 Dent, W. T., 154(8), 166 Derman, H., 404, 411 Diacumakos, E. G., 402,410 Dijkstra, E. W., 22, 35 D’Imperio, M. E., 168 diRoccaferrera, G. F., 38(6), 40(6), 49(7), 55(6), 71 Divilbis, J. L., 80, 111 Dodd, G. G., 117,168 Dodge, C., 83, 97(22), 111 Donovan, J. J., 250(11), 883 Dopping, O., 168 Dougherty, R. L., 167 Douglas, M . L., 283 Drake, R. J., 173 Driver, M . J., 43(24), 72 Dubner, H., 166 Dumey, A. I., 168 Dunn, 0. J., 341, 354, 410, 414 Dunn, R. M., 167 Dyckman, T . R., 52(8), 71 Deubak, B. J., 168 E Earley, J., 238(12), 280(12), 280, 283 Efroymsen, M. A., 354, 410 Einarson, L., 4 l O Elashoff, J. D., 411 Elashoff, R. M., 411 Elcock, E . W., 168 Elias, H., 304, 4 l l Elliot, C. O., 44(9), 71
AUTHOR INDEX Erickson, R., 97, 111 Evans, D., 168 Eve, J., 167 F Farber, D. J., 278(2), 283 Feigenbaum, E. A., 46(10), 71 Feingold, R. S., 36 Feldman, J., 46(10), 71 Feldman, J. A., 194(13), 237, 283 Feller, W., 168 Ferguson, D. E., 283 Ferris, R. J., 168 Fiehler, J., 98, 111 Fife, D. W., 168 Fiore, M., 103, I l l Fischer, A., 283 Fischler, M., 168 Fisher, F. P., 41(11),62(11),72 Fisher, R. A., 328, 340, 343, 411 Flores, I., 168 Ford, D. F., 121, 132, 136(25), 140, 167 Forte, A., 101, 102, 111 Fosaum, E. G., 168 Fredkin, E., 169 Freeman, H., 325, 411 Frer, E. H., 169 Friedman, T. D., 25, 36 Fu,K . S., 363,410,411 Fuller, R., 103, 111 Fuller, R. H., 169 G Gabrini, P. J., 169 Gabura, A. J., 102, 107, 111 Gaffey, W . R., 350, 411 Galler, D., 238(18), 239(18), 283 Gannon, J. D., 279(36), 284 Garner, H. L., 169 Garret, L. J., 40(12), 44(21), 72 Garwick, J. V., 239, 283 Gerlernter, H., 145, 169 Ghosh, S. P., 166, 169, 173 Glusa, B., 169 Goldberg, J., 169 Goldman, G. E., 411 Gotlieb, C. C., 169 Gould, M., 88(32), 89, 90, 91, 111 Graham, R. M., 26,36
Gray, B. H., 167 Gray, H. J., 172 Green, D. M., 354, 412 Green, J., 403, 411 Green, M . W., 169 Greenberger, C. B., 169 Grems, M., 169 Gries, D., 237, 883 Griffith, J. E., 169 Griffiths, T. V., 169 Gurk, H. M., 169 Guttag, J. V., 279(36), 284 H Hall, A. D., 41(13), 72 Halpern, M., 28.9 Ham, A. W., 288, 411 Hamilton, J., 210(1), 283 Hamonic, M . J., 402, 413 Hampton, D. R., 48(14), 72 Haney, F., 238(12), 280(12), 883 Harley, T . J., Jr., 351, 409 Harrison, A., 36 Hay, R. E., 279 Hayes, R. M., 166 Hays, D. G., 169 Head, R. V., 39(15), 72, 169 Heckler, C. H., Jr., 174 Heising, W. P., 169 Henke, W., 84, 111 Henry, W . R., 169 Herner, S., 169 Hess, H., 170 Hess, J., 172 Heydorn, R. P., 414 Hibbard, T . M., 170 Highleyman, W. H., 342, 363, 411 Hiller, L., 73, 75, 76, 77, 78, 84, 111, 11% Hoare, C. R., 170 Hoffman, J. G., 402, 411 Hoffmann, L. J., 36 Holcomb, W. G., 298, 403, 409, 41.8 Holt, A. W., 170 Hopkins, A. L., Jr., 75, 111 Hopkins, C. E., 410 Hopkins, M., 280,283 Horvath, W. J., 402, 414 Hotelling, H., 343, 411 Howe, H . S., Jr., 83, 103, 112 HSU, P. L., 344,411
Hudson, B., 99, 105, 112 Hughes, G. F., 350, 351, 364, 411, 412 Hultberg, W. E., 92, 112 Humphrey, T., 170
I Ingram, M., 402, 404, 411 Irons, E. T., 281, 284 Isaacson, L., 73, 75, 112 Ito, K., 346, 411 Izzo, N. F., 402, 411
J Jackson, R., 89, 102, 112 Jenkins, D. P., 174 John, S., 341, 411 Johnson, C. I., 170 Johnson, L. R., 170 Johnson, R. A., 41(16), 72 Johnson, T. G., 170 Jorrand, P., 285 K Kahn, D., 30,56 Kain, R. Y., 352, 4 l l Kamentsky, L. A., 404, 411 Kanal, L., 364, 411 Kapps, C. A., 170 Karhunen, K., 363, 412 Karp, T., 94, 112 Kaskey, G., 168 Kassler, J., 77, 112 Kassler, M., 90, 93, 112 Kast, F. E., 41(16), 72 Kato, K., 288, 4 f 2 Kay, M., 170 Kelly, J., 281 Kememy, J. G., 170 Kendall, M. G., 412 Kiefer, G., 296, 415 Kirsch, R. A., 412 Klem, L., 174 Knowlton, K., 225(27), 238(27), 284 Knowlton, P. H., 92, 112 Knuth, D. E., 122(107), 128(107), 143, 145, 151, 156(107), 160(107), 170 Koller, H. R., 170 Kopac, M. J., 402, 4 l O
Koster, C. H. A., 239(37), 284 Kostka, S., 73, 112 Kronmal, R., 170 Kronmal, R. A., 174 Kullback, S., 354, 412 Kurki-Suonio, R., i70
L Lachenbruch, P. A., 341, 364, 412 Lampson, B. W., 28, 29, 36 Landauer, W. I., 157, 170 Lang, C. A., ,281 LaRue, J., 101, 112 Leavenworth, S. M., 284 Lebo, J. A., 351, 412 Lee, C., 170 Lefkoff, G., 103, 112 Lefkovitz, D., 150(113),170 Levien, R. E., 150(115), 171 Lewin, M. H., 171 Lewis, P. M., 354, 412 Lieberman, F., 104, 112 Lin, A. D., 132(117), 139(117), 140(117), 1 ri Lincoln, H., 99, 106, 108, 115 Lind, R. A,, 175 Lindsey, C. H., 281 Ling, H., 175 Lipkin, L. E., 412, 414 Lockwood, L., 98, 113 Logemann, G., 88(32), 89, 100, 101, 111, 115 Lomakka, G. M., 401, 410 Lombardi, L., 171 Lotto, G., 412 Lowe, T. C., 123(120), 128(120), 1'71 Lubin, J. F., 174 Lukomski, J., 350, 412 Lum, V. Y., 171, 17s Lurie, 0. B., 404, 412 M MacKenzie, R. A., 38(17), 72 Mahalanobis, P. C., 336, 41%' Mailloux, B. J., 239(37), 284 Mandelbaum, D., 171 Mansberg, H . P., 403, 412, 414 Marill, T., 354, 412 Maron, M. E., 150(115), 171
AUTHOR INDEX Martin, J., 62(18), 72 Masher, D. P., 174 Mathews, M. V., 78, 79, 80, 84, 85, 86, 87,113 Maurer, W. D., 139(132), 140(132), 171 Mayall, B. H., 296, 298, 395, 412, 4 l S Mayer, R. R., 56(19), 72 McAdams, A. K., 52(8), 71 McCarthy, J., 145, 171 McDonough, A. M., 44(21), 45(20), 50(20), 72 McFarlan, F. W., 43(5), 71 McGee, W. C., 171 McIlroy, M. D., 171, 284 McKeeman, W . M., 36, 280 McLelland, P. M., 29(2), 33, 36 Meadow, C. T., 171 Meadow, H . R., 173 Mealy, G. H., 171 Mei, V., 172 Meisel, R. M., 171 Melamed, M. R., 404, 411 Mendelsohn, M. L., 296, 298, 304, 354, 365, 368, 395, 405, 412, 418 Meyer, B. C., 173 Meylan, R., 106, 113 Mickey, M. R., 341, 364, 4 l S Miller, D. W., 52(22), 72 Miller, L., 171 Miller, S. W., 171 Minker, J., 169, 171, 17% Mintz, H . K., 36 Molho, L. M., 36 Moore, F., 85, 86, 87, 113 Morris, R., 121(142), 132(142), 139(142), 140(142), 142, 172 Morrison, D. F., 336, 363, 412 Morse, S. P., 166 Moxham, J . G., 172 Munn, W . J., 172
N Nadel, E. M., 403, 412 Nagosky, J., 106, 113 Naman, J., 172 Nelson, E . C., 280 Neumann, P. G., 75,111,168 Newell, A., 73, 113, 238(12), 280(12), 283 Newey, M. C., 284
Nilsson, N. T., 328, 412 Nolan, J., 172 Norgren, P. E., 402, 411
0 Oldfield, D. G., 414 Olive, J., 100(7), 110 Olson, H., 75, 113 Opler, A., 172 Optner, S. L., 41(23), 72 Owens, P. J., 173 Owens, P. W., 173 P Patt, Y. N., 155, 172 Peck, J. E. L., 239(37), 284 P e r k , A. J., 172, 238(18), 239(18), 283 Peters, B., 7, 15(17), 36 Petersen, H. E., 2, 36 Peterson, W. W., 172 Petrick, S. R., 169 Pfaltz, J . L., 315, 413 Phillips, J. A., 172 Pinkerton, R. C., 75, 113 Poland, C. B., 172 Pomper, I . H., 173 Popetchitelev, E. P., 404, 412 Postley, J. A., 172 Preston, K., Jr., 402, 404, 411, 416 Prewitt, J . M. S., 296, 297, 298, 299, 301, 304, 305, 311, 315, 321, 338, 354, 365, 368, 374, 375, 376, 379, 380, 387, 389, 390, 392, 395, 405, 412, 413 Prywes, N. S., 170,172 R
Ramamoorthy, C. V., 167 Randall, J . K., 83, 113 Randell, B., 174 Randell, T . M., 166 Rao, C. R., 328, 329, 333, 334, 335, 336, 348, 354, 363, 413 Raskin, J., 107, 113 Raver, N., 172 Ray-Chaudhuri, D. K., 166 Reagan, J . W., 402, 413 Reed, W. G., 171 Reiter, A,, 168
Reynolds, J . C., 172 Richards, M., 281 Rick, W., 296,413 Ricour, D. H., 172 Rivett, P., 48(1), 53(1), 60(1), 62(1), 63(1), 71 Roberts, D. C., 171 Rogriguez, J . E., 283 Rosenboom, D., 85,113 Rosenfeld, A., 315, 413 Rosenzweig, J. E., 41(16), 72 Rosler, L., 84, 113 Ross, D. T., 216(31), 283, 284 Rothenberg, D. H., 172 Roy, S. N., 344, 409 Rubey, R. J., 210,284 Ruiz, P., 84, 112 Rulifson, J . F., 279 Rybak, F. M., 172
S Sable, J., 172 Sable, J . D., 172 Sampson, P., 354, 413 Sandritter, W., 296, 4 l S Satterthwaite, E., 233(39), 284 Savas, M . A., 172 Sawyer, H . S., 402, 410 Schay, G., 172,175 Schenken, J . D., 173 Schidt, N., 91, 113 Schorr, H., 145, 173 Schorre, D. V., 279 Schroder, H . M., 43(24), 72 Schull, W. J., 346, 411 Scidmore, A. K., 155, 173 Scott, W. G., 41(25), 72 Seaman, P. H., 173 Sebestyen, G. S., 363, 41s Segarra, J. M., 403, 41% Selleck, J., 94, 113 Senko, M. E., 169,172,173 Seppala, Y., 173 Shafer, P., 173 Shapiro, H . M., 403, 414 Sharma, R. L., 173 Shindle, W . E., 171 Shoffner, R. M., 173 Silver, M., 40(12), 72
Simon, H. A., 73,113 Skatrud, R. O., 29(19c), 36 Slawson, W., 81, 82, 113 Smith, J. L., 168 Smith, S., 52(8), 71 Sneath, P. H . A., 413 Snell, J . L., 170 Sokal, R. R., 413 Spencer, C. C., 402, 413 Sprague, R. E., 62(26), 72 Springer, A. L., 282 Spruth, W . G., 173 Srodawa, R. J., 282 Standish, T. A., 284 Stanfel, L. E., 155, 173 Starr, M. K., 52(22), 72 Steig, D. B., 167 Stein, P. G., 403, 414 Stelwagon, W. B., i7S Stevenson, D. A., 173 Stoney, W., 104, 114 Strenfert, S., 43(24), 72 Stuart, A., 412 Suchoff, B., 104(77), 105, 106, 114 Sussenguth, E. H., Jr., 129(183), 151, 153(183), 155(183), 173 Svejgaard, B., 91, 113 Swed, R. E., 173 Swindle, G. F., 41(11), 62(11), 72 Symonds, A. J., 173
T Tarter, M. E., 174 Tarter, M. F, 170 Taunton, B. W., 174 Taylor, R. L., S6 Taylor, S. M., 126(195), 151(195), 174 Teichrow, D., 174 Thompson, D. A., i74 Tolles, W. E., 402, 403, 410, 414 Tompa, F . W., 225(35), 272(35), 284 Tou, J. T., 414 Trowbridge, L., 106, 114 Turn, R., 2, 36
U Unk, J . M., 174
AUTHOR INDEX V Vancil, R. F., 41(3), 71 Van Dam, A., 168 van Dam, A., 279(36), 284 Vanderbilt, D. H., 26, 56 Vander Meulen, S. G., 281 VanTassel, D., 29, S6 Varady, P., 341, 410 Vermillion, W. H., 17.9 von M k s , R.,333, 414 Vorhaus, A. H., 149(20), 166
W Wagner, H. M., 39(28), 48(27), 52(28), 72 Waite, W. M., 145,17S Wald, A., 341, 414 Walker, T. P., 17.3 Wall, L. C., 410 Wallace, E. M., 174 Warburton, C. R., 168 Ward, J. H., Jr., 388, 414 Ware, W. H., 3, 9, S6 Wasley, R. S., 71 Wasserman, J. J., 31, S6 Watanabe, S., 363, 414 Watt, W. C., 412 Weber, H., 280 Weinberg, P. L., 155, 17s Weinberg, S.,166 Weiner, J. M., 354, 414 Weingarten, A,, 174 Weisaman, C., 11, 27, S6 Weizenbaum, J., 174 Welch, B. L., 328, 414 Wells, J. W., 233(39), 284
Wenker, J., 89, 92, 114 Wentz, W. B., 402, 4lS Werner, W., 350, 409 Weston, P., 126(195), 151(195), 174 Whitman, D., 174 Wied, G. L., 403, 414 Wilcox, T., 281 Wilkes, M. V., 174 Wilks, S. S.,347, 354, 358, 414 Williams, S.B., 151(10), 16'6 Willis, T. C., 107, ll4 Wilson, T. L.,173 Wmdley, P. F., 174 Wintrobe, M. M., 288, 289, 414 Wirth, H., 232(38), 235(38), 884 Wirth, N., 233(39), 280, 284 Witkin, N., 174 Woodward, P. M., 174 Wortman, D. B., 280 Wright, W., 75, 111 Wulf, W., 231(40), 284
X Xenakis, Y., 77, 114 Y
Yarborough, J., 174 Yntema, D. B., 174 Young, I. T., 403, 414 Younker, E. L., 174 Z Ziehe, T., 170 Zurcher, F. W., 174
Subject Index
A Abbreviation, in file organization, 121 Access controls descriptors and, 26 hierarchical, 27-28 Achorripsis, 77 ADEPT-50 system, 27 AED system, 211, 220-221, 224 AED-0 language, 211 Airline reservation systems, computers in, 66 Algol compiler, 8 Algorithms, coding of, 211 ALMA system, in music representation, 88-89 Analysis of Cell Images, The (Prewitt-Mendelsohn), 368 Anti-intrusion alarms, 33 Arithmetic operations, in LSD base language, 264 Assembler language, optimizing of, 182-184 Audit trails, in information security, 31-32 Authorized users, in computer security program, 18, 24 Automated Engineering Design (AED) System, 211 Automatic storage, in LSD base language, 249-250 Automatized procedures, computers in, 65 Avalanche for Pitchman, 76-77 B Back-up data, in computer security, 5 4 Banks, computer use in, 65-69 BASIC language, computer security and, 8 Basophils, 288 histograms for, 301-309 Batch systems, computer security and, 7 Bayesian rules, in leukocyte studies, 330-332, 338-339, 406 Benign environment, in computer security, 21
Blood cells, Wright’s stain for, 286-287, 290-297, 402-403 see also Leukocytes Boolean operators, 219 Brainwaves, music generated by, 85 Brown University, software technology group at, 176, 183-184, 279 Buchla Synthesizer, 78 Burroughs B5500 system, 8 Business games, algorithms for, 59 Byzantine chant, computer analysis of, 91,94 C CALCOMP plotter, 76-77 Calculation addressing methods, 132 California, University of, 104 Card identification, computers in, 67 CELLS recognition program, in leukocyte studies, 300, 321, 373-374, 380-383, 393, 403, 407408 CELLSCAN system, 402403 CELLSCAN-GLOPR system, 403404 Cellular multilist, 126 Change rining, problems of, 77-78 Chant melody, “cells” in, 94 Character manipulation, in LSD base language, 265-267 Cheatam-Sattley compiler, 97 Chicago Linear Music Language (CLML), 100 Circuit, defined, 151 Classical music, computer analysis of, 101-102 see also Music CLML (Chicago Linear Music Language), 100 Cluster synthesis, in leukocyte taxonomy, 390-392 CMS program, 188 Cobol compilers, computer security and, 8 COBOL language, file definitions in, 117 Collision, in file t,ransformation, 121
422
SUBJECT INDEX COMMON society, for programming knowledge, 54 Communications problems, information security and, 33-35 COMPASS, 84 Competition problems, 58-59 Compiler facilities in LSD base language, 273-278 programmer control, 181-182 Compression technique compressed keys in, 135-136 in random file structures, 121, 132-138 selective dropout in, 133 surnames in, 138 Computer (s) in airline reservation system, 66 in automatized procedures, 65 in banking operations, 6 5 4 9 in bibliographies and information retrieval systems, 106-107 in blood cell studies, 285409 decision making and, 39-40, 68-71 depersonalization through, 69 deterministic models in, 37-71 in education, 65 future utilization of, 6 7 4 8 invasion of privacy by, 71 mathematical models of, 62 memory storage and retrieval in, 68 in music composition and research, 73-110 parametric and nonparametric recognition by, in blood cell studies, 285-409 power failure and, 70 in push button factory, 68 quantified problems and, 70 real-time operative system and, 62 in thematic indexing, 105-106 in ticket reservations and sales, 67-68 Computer Cantata, 75-76 Computer failure, effects of, 70 Computer-generated music, 73-88 Computer plotter, music printing by, 108 Computer room, location of, 5 Computer security, 2-10, 30-33 authorized users in, 18 back-up data for, 5 4 basic assumptions in, 6-7 bening environment in, 21 data categorizing for, 5-6
423
direct control systems and, 8 equipment protection and, 4-5 hardware in, 14 input-output machines and, 16-17 privacy issue in, 9 privileged instructions for, 14 program/user isolation mechanics in, 14-15 sabotage and, 4-5 software failure and, 3 sources of problems in, 21-23 supervisory service calls in, 19-20 system access control in, 9-13 technical aspects of, 2 4 in time-shared programs, 7-9 two-state operations in, 15-16 user authentication in, 13 wiretapping and, 3, 33-34 Computer systems see also Computer(s) h e protection for, 4-5 real-time time-sharing in, 64 Computer utilization, expansion of, 63-67 Conditional expressions, 219 “Confidential” classifications, 9 CONHAN (Contextural Harmonic Analysis), 103 Conjunction-disjunction operations, on inverted lists, 148 Consonant Coding, 136 Control, in systems approach, 41 Coroutines, in LSD base system, 257-258 Covariance, discrimination and, 349 CPM (Critical Path Method), 56 Credit cards, computers and, 67 Critical Path Method, 56 Cryptographic “keys,” information security and, 29 Current Musicology, 77 CYDAC scanner, in blood cell studies, 296-297, 403, 406408
D DARMS (Digital Alternate Representation of Music), 90, 107 Data access, in computer security programs, 18-19 Data attributes, systems programming languages and, 196-202
424
SUBJECT INDEX
Data categorizing, for computer security, 5-6 Datacoder DC 110, 34 Data management system, file organization in, 117 Data sequestor JJC-3, 34-35 Data set organization, types of, 161-164 Data storage, in computer utilization, 64 DC 110 Datacoder, 34 Debugging facilities in LSD base language, 269-273 in systems programming languages, 186-187, 195-196, 207-209, 230 Decision making computer and, 39-40, 63-71 information systems and, 4 4 4 5 management systems and, 43-46 by manager, 38-39 Decision theory, in leukocyte parameter studies, 368-383 Declaration Statement, in LSD base language, 242-247 Dendrogram, chansons in, 99 Descriptors, in information security, 25 Deterministic models computer and, 37-71 management science and, 46-50 Dictionary lookup, 119-120 random file structures and, 131 Digital-analog computer, real-time composition by, 85-88 see also Computer(s) Digital-analog converter in MVSIC 4BF program, 83 sampling by, 78-81 Digital-Analog Representation of Music Symbols (DARMS), 90, 107 Digital computer see also Computer(s) in blood cell analysis, 290-300 mathematical models of, 48 Digital Equipment Corporation, 85 Digital information, conversion of t o sound, 79 Direct address, 119 Direct data set organization, 164 Directory trees, 152-155, 160 Discrimination, covariance in, 349 Dolby A-301 noise reduction system, 83 Dynamic calls, in LSD base language, 256
E Education, computer use in, 65 8 M’s, as resources in managerial decision, 39-40,43, 50, 62 Electroencephalographic data, music generated from, 85 Electronic synthesizer, 78-88 Elementary data item, 117 Encryption, in information security, 2930, 34 Eosinophils, 288 EPL language, 178 Erythrocytes, 289-295 Ethnomusicology, computer in, 92, 104105 Example file directionary for, 120 inverted organization and, 124 list organization of, 123 multilist organization of, 125 ring organization of, 127 symbol trees in, 129 Extensible Language for Systems Development (Extensible LSD), 238240
F Feature selection and ordering, in leukocyte studies, 347-364 Feedback, in systems approach, 41 FIFO queue, 122 File calculation method in, 120-121 in data management system, 117 dictionary lookup and, 119-120 direct address in, 119 example, see Example file inverted, 123-126, 148-150 random organization of, 118-121 sequential organization in, 117-1 18 File design, procedure for, 164-166 File encryption, 29-30 File organization, 115-166 example file in, 116 implementation of file structures in, 160-166 list file structures in, 143-151 list organization in, 121-128 minimizing search time in, 155-157
SUBJECT INDEX random list and file structures in, 130143 terminology in, 117 tree structures in, 128-130, 151-160 File protection see also Computer security access control and, 26-28 problems of, 24-28 techniques of, 28-30 File structures implementation of, 160-166 sequential data set organization in, 161-162 Filial set, 152 Fire protection, of computer systems, 4-5 Fischer’s function, in leukocyte study, 336 Ford-Columbia Representation, 88, 90, 97, 100-101, 103, 106 FORTRAN in list file structures, 144-145 in music analysis studies, 95 in music composition, 83-84, 88 pitch structures and, 103 in ring organizations, 150-151 in systems programs, 177-178 as “universal programming solvent,’’ 176, 189 Fortran compilers, computer security and, 8 French chanson, ordering in, 105-106 G Garbage collection, in list file structures, 145 GE-645 system, 23 GECOS I11 system, 19-20 Gradient-tracking algorithms, in leukocyte studies, 314415 Granulocyte, 288 synthesized facsimile of, 401 Graph, defined, 151 Gravesaner Bliitter, 77 Gregorian chant, computer analysis of, 91, 93-94 GRIPHOS program, 106 GROOVE system, 84, 86-88 GUIDE society, 54 H Harmonic analysis, computer in, 95-96 Harmonicity, degree,s of 101
425
Hardware malfunction, in computer security, 3 Hardware system, vs operating system, 23 Hashing, in file organization, 121, 138-139 Histograms, for leukocytes, 301-309 HPSCHD, 76
I IBM-CMS program, 188 IBM System 360, 17 IDS (Integrated Data Store), 150-151 Ill& Suite for String Quurtet, 73, 75 IML (Intermediary Music Language), 90 Immediate decoding trees, 152, 154 double-chained, 159 Incipits, ordering of, 106 Indexed sequential data set organization, 162-164 Information manager’s needs in, 45-46 technical threats to, 2-4 Information network, marginal decisions and, 43 Information processing, shared, 24-27 Information security see also Computer security access controls in, 26 audit trails and, 31-32 authorizations in, 24 basic assumptions in, 6-7 communicationsproblems and, 33-35 completeness of design in, 22-23 computer characteristics in, 13-18 descriptors in, 25 file encryption in, 29-30, 34 file protection and, 24-30 incomplete program design in, 21-22 in multi-user computer environment, 1-35 program validation in, 32-33 operating system functions relating to, 18-23 output routing in, 20-21 software failure and, 3, 22 virtual machines in, 17-18 wiretapping countermeasuresand, 33-34 Information structures, identifying of, 165 Information systems characteristics and performance of, 40-43
SUBJECT INDEX
426
decision making and, 4 4 4 5 “noise” in, 44 Input-output characteristics in comput.er security, 16-17 in systems approach, 41 Input-output routines debugging and, 207-208, 222, 230, 234 LSD base language and, 269-273 Integrated Data Store (IDS), 150-151 Interactive Real-Time Music Assembler, 85 Interrupt system, in computer security, 14 Inventory problems, 54-55 Inverted file, 123-126, 148-150 timing comparison with multilist organization, 150 updating of, 149 IOCS centralized operating service, 19-20 IRMA (Interactive Real-Time Music), 85 Isolation mechanisms, in computer security, 14-15
J Javanese gamelan music, 104 JJC-3 data sequestor, 34 Josquin Project, 89-90, 98 JOSS languages, computer security and, 8 JOVIAL language, 178
K Key, symbols in, 153 see also Trees L Learning methods, computers in, 66 Leukocyte image processing, 285-409 Leukocytes automatic microimage analysis for, 400-404 autoradiograph for, 319 Bayesian rules for, 330-332, 338-339, 371, 406 in best three-space, 374 boundary-construction procedure for, 316-318 canonical analysis of parameters for, 377 cell count for, 288 cells parameter list for, 322-323
cluster formation in, 388-389 cluster synthesis in, 390-392 computer-oriented image properties for, 393400 cytotaxonomy for, 383-393 discriminating logic for, 327-328 as disease indicators, 285 edge detection and spatial differentiation in, 306-309 exoskeleton technique in, 320 feature selection in, 347-364 five categories of, 288, 407 histograms for, 301-310 human visual discrimination in, 393-400 image art,iculation in, 301-321 image description for, 321-327 linear discrimination table for, 384386 linear logic and performance evaluation in, 338-347 linear separability for, 378-379 Mahalanobis distances in, 336-337, 343, 353-357, 363 microscope image analysis in, 400-404 morphologic progression in, 289 nonparametric pattern detection in, 383393 nuclear absorbance for, 318 optical density frequency distributions for, 301-309 parameter discrimination using decision theory for, 368-383 parameters for, 321, 324-325, 364-367 parameter variability in, 364-368 probabilist.ic concepts in, 328 redundancy of discriminatory information in, 376 spatial differentiation in, 312-313 SYNTEX program for, 395 Wright’s stain in, 286-287, 290-297, 402403 Liber Usualis, keypunching of, 93 Librascope L-3055 computer, 83 LIFO queue, 122 Linear list, in file organization, 122, 146-148 Linear probing, 140-141 Linked allocation, 122 List defined, 145-146 linear, 122, 146-148
SUBJECT INDEX List file structures, 143-151 List organization, 121-128 Logical record, file organization and, 130-131 Loop control, in LSD base language, 262-264 LSD (Language Systems Development) base language, 240-279 arrays and structures in, 246-247 arithmetic operations in, 264 automatic storage in, 249-250 basic orientation in, 276 basic philosophy of, 240-241 basic variables in, 250-253 character manipulation in, 265-267 code generation extensions in, 278 compiler and extensibility in, 277 compile-time facilities and, 273-278 conversions in, 267 declaration statement in, 242-247 defined variables in, 244-245 external and internal procedures in, 248-249 format of, 242 implementation of, 278-279 initialization and alignment in, 245-246 1/0 and debugging facilities in, 269-273 machine-dependent facilities in, 267-269 multiprogramming in, 258-260 parameters and flexibility of, 260-261 pre-parse and post-parse extensions in, 277-278 procedures and invocation in, 255-262 program segmentation and storage allocation in, 247-248 RETURN statement in, 261-262 stacked variables in, 253-254 statements and operators in, 262-269 Lymphocytes, 288
M Machine language, see Systems programming language MACRO facilities, AED system and, 222-223, 230-231, 238 Magnus liber organi, computer study of, 97 Mahalanobis distance, in leukocyte study, 336-337, 343, 353-357, 363 Management, systems approach to, 40-43 Management science
427
deterministic model in, 46-50 generalities in, 38 manager’s implementation of models in, 50-63 model building in, 48-53 problem-solution interpretation in, 5053 Management systems, 43-46 Manager data needed by, 4 5 4 6 as decision maker, 37-38 duties of, 38-39 informing of, 51 operative functions and, 52 problem solving by, 50-51 Managerial decisions computer elimination of, 63-67 8 M’s of, 39-40 Managerial problems, computer utilization in, 62-63 see also Problem solving Master mode-slave mode states, 15 Mathematical models, computers and, 48, 62 Medieval music, computer analysis of, 91-97 Memory, auxiliary, 70 Memory protect systems, 14-17 Memory storage destruction of, 70 subprograms and, 64 Microscopic image analysis, in leukocyte study, 400404 MIR (Music Information Retrieval) program, 90 MITSYN (Musical Interactive Tone Synthesis) program, 84 Mixed knowledge team concept, 63 Mixed problems, 59 Mnemonic system, in pitch designation, 89 Model computer utilization and, 50 deterministic, see Deterministic model management science and, 4 8 4 3 validation of, 49 Modern music, computer analysis of, 102-103 MOL 360, 178 Monocytes, 288 Moog synthesizer, 74, 78-82
SUBJECT INDEX MULTICS system, information security in, 23 Multilist, defined, 126 Multilist organizations, 148-150 Multiprogramming hardware facilities for, 13-15 in LSD base language, 258-260 Multi-user computer environment, security and, 1-35 Multi-user systems access to, 9-13 general security requirements for, 9-10 hardware in, 13-15 systems programming languages and, 190-191 types of, 7-9 wiretapping and, 2 4 Music computer-generated, 73 notation in, 88-92 traditional composition of, 75-78 twentieth-century, 100-101 Musical Oflring (Bach), 100 Music bibliographies, computer in, 106107 MUSICOMP, 76 Music composition automated typography in, 107 brainwave feedback in, 85 computer in, 73-110 stored functions in, 80 Music Information Retrieval (MIR) system, 90 Music notation Ford-Columbia Representation in, 8892 pitch and duration in, 89 Music research automated procedures in, 73, 107-110 computer in, 8S-107 problem of representation in, 88-92 Music style analysis, computer in, 92-104 Music theory, computer in, 92-104 Music typography, automated, 107-110 MUSIC IV program, 83 MUSIC 4B program, 83 MUSIC 4BF program, 83 MUSIC V program, 78, 80, 84
N National Fire Prevention Association, 4 NELIAC language, 178 Neutrophils, 288 exoskeleton technique and, 320 gradient-tracking algorithm for nucleus of, 314-315 stylized, 394, 397 Neyman-Pearson rule, in leukocyte studies, 332-333 Note concept, 80 Nuclear absorbance, in leukocyte studies, 318 0 ON ACCESS computer-time facility, in LSD base language, 274-276 Operating system functions, information security and, 18-23 Operations research algorithms in, 6 2 4 3 models in, 4&49 Operative system, performance of, 42-43 Opera Workshop Codex, 106 Optical density frequency distributions, for leukocytes, 301-309 Optimizing Computer, 210 Orchestra, in computer composition, 80 OS/360 system, file protection for, 28 Output routing, in information security, 20-21 P
Partitioned data set organization, 162 Password(s) in computer system access control, 11-13 design of, 11-12 incorrect, 21 system access control and, 12-13 Path, defined, 151 Pmcussion&t, 76 PERT (Program Evaluation and Review Technique), 56 Physical record, fde organization techniques and, 130-131 Pitch, mnemonic system for, 89 Plain bob major change ringing, 78
SUBJECT INDEX
Player Piano, 76 PL/1 language, 178, 182-183, 186 in LSD base language, 278-279 teleprocessing support and, 208 trouble spots in, 210 PLOTZ 9 program, 107 Pointer, in list organization, 121-122 Polyphonic music, in synthesizer, 81 Prerecorded Tape, 76 P r i m Donna, 76 Princeton University, Josquin Project at, 89-90,98 F’rivacy computer security and, 9 invasion of, 71 Privileged instructions, in computer security, 14 Problems categories of, 53 competition, 58-59 inventory, 54-55 mixed, 59 quantified, 70 queuing, 58 replacement, 57-58 routing, 56-57 search, 58 sequencing, 55-56 transportation, 55 Problem-solution interpretation, in management science model, 50-53 Problem solving computers in, 62-63 content in, 60 financial plans and, 60 marketing and, 61 overall planning in, 60 personnel in, 60-61 priorities in, 59-62 production and, 61-62 purchasing and, 61 research and development in, 61 system orientation in, 63 types of problems in, 53-59 Processor, in systems approach, 41 Program access, in computer security, 18-19 Program Evaluation and Review Technique (PERT), 56 Program identity, information security and, 22
429
Programming checks, computer security and, 16 Programming languages, see Systems programming languages Program segmentation, 215 Program time, in music composition, 86-87 Program/user isolation mechanisms, 14-15 Program validation, iuformation security and, 32-33
Q Quantitative expansions, decision making and, 63 Queuing problems, 58
R Random file structures, 130-143 calculation methods in, 131-142 compression techniques in, 132-138 consonant coding in, 136-137 direct address in, 131 dictionary lookup in, 131 hash addressing in, 13&139 scatter table construction in, 140-143 Randomized check code, 141-142 Random probing, 141 RCA Electronic Music Synthesizers, 75 “Real-time” time-sharing systems, 64 Record, in file organization, 117 Recurrent chord content, 102 Register allocation, in LSD base language, 274 Remote access, control of, 10-13 Remote batch systems, computer security and, 7 Renaissance music, notation and rhythm in, 97-98 Replacement problems, 57-58 Resources, 8 M’s of, 62 Rings, in file organization, 126-128, 150 Root, of symbol tree, 152 Routing problems, 56-57 Run-time error checking, 185
S Sabotage, computer security and, 4-5 Scatter table construction, in random file structures, 140-143
430
SUBJECT INDEX
Scope attributes, systems programming language and, 198-199 Search problems, 58 Sequential data set organization, 161-164 Sequencing problems, in file system, 55-56, 117-118, 122 “Secret” classifications, 9 Security assurance, techniques of, 30-33 see also Computer security; Information security Selection statements, in LSD base language, 264 Selective dropout, in compression technique, 133-134 SHARE (Society to Help Avoid Redundant Effort), 54 Shared informat,ion processing, models of, 24-27 Sliding pattern match, in music analysis, 95 Slovak Academy of Sciences, 104 SNOBOL programming language, 101, 103 Software failure, information security and, 3, 22 Sound, conversion of digital information to, 79 Soundex system, 136 Source-transfer function model, 81 Space efficiency, in systems programming, 179 Speech-oriented synthesizer, 81 Stack manipulation, in systems programming language, 219 Storage allocation, in systems programming language, 192-194,221-222 Storage classes, 199-202, 214 Storage locks, 15 Stored functions, in music composition, 80 Stream 1/0 debugging and, 195-196 systems programming languages and, 207-208 Students, computer use by, 65 Subprograms, in music composition, 80-81 Subroutines, use of, 64 Supervisor-user states, in computer security, 15 Supervisory service calls, 19-20 Switched-on Bach, 78
Symbols, strings of, 153 Symbol trees, 151-160 Synthesizers, 81-82 Syntax macros, 238 SYNTEX program, 395 System classification of, 42 components of, 4 1 4 2 defined, 176-177 System Development Corporation, 149 Systems access control computer security and, 10-13 passwords and, 11-13 Systems approach feedback in, 41 management systems and, 4 3 4 6 Systems development, languages for, 239-279 see also Systems programming languages System orientation concept, 63 Systems program characteristics and definition of, 177 run-time environment and, 184-185 space efficiency and, 179 Systems programming languages, 175-279 adaptability in, 188-189 control statements in, 204 criteria for, 180-191 data attributes in, 196-202, 212-214 data elements in, 224-227 data facilities in, 192 data manipulation in, 194, 205-207, 217-222, 230 debugging facilities in, 186-187, 207209, 222-223, 230, 234 designer’s problems in, 191 error checking in, 185-186 extensibility and systems programming in, 235-239 goal of, 178 machine dependence in, 190 MACRO facilities in, 209, 223, 230-231 multiple-user systems and, 190-191 parallel structures and, 228-229 program control and segmentation in, 194-195, 202-204, 215, 227 program modularity in, 189-190 program segmentation in, 194-195, 202-294, 215, 227 reviews of, 196-235
SUBJECT INDEX specific constructs in, 192-196 storage allocation in, 192-194, 220-222 syntax considerations in, 187-188 target code in, 181 trouble spots in, 210-212 universal language or processor in, 236-238 User’s Guide for, 181
431
structures of in file Organization, 128130, 151-160 Twelve-tone technique, 77 Two-state computer operation, security and, 15-16 TWS (translator writing system), 236-237 Typography, automated, in music composition, 107-110
T
U
Target code efficiency, of systems programming language, 180-181 Telegraph, wiretapping and, 33 Teleprocessing support, PL/I language and, 208 Thematic indexing, 105-106 Three Pieces for Solo Percussionist, 77 Ticket reservations, computers for, 67 Time-shared Data Management System (TDMS), 149 Time-sharing systems, computer security and, 7 Transfer function, in synthesizer, 81-82 Transformation algorithm, in random file structure, 139-140 Translator writing systems (TWS), 236237 Transportation problems, 55 Tree(s) defined, 151 directory, 152-157 immediate decoding, 152-157 representations of, 157-160
Universal language or processor, 236-238 User isolation, in computer security, 14-15
v Virtual machines, information security and, 17-18
W Waldorf list, 123 Warning message, in systems programming, 185 White cells, see Leukocytes Wilks-Bartlett lambda, in leukocyte studies, 358 Wiretapping computer security and, 3 countermeasures for, 33-34 file encryption and, 29 multi-user system and, 2-3 system vulnerability to, 33 Wright’s stain, blood cells and, 286-287, 290-297, 402403
Contents of Previous Volumes

Volume 1

General-Purpose Programming for Business Applications
CALVIN C. GOTLIEB
Numerical Weather Prediction
NORMAN A. PHILLIPS
The Present Status of Automatic Translation of Languages
YEHOSHUA BAR-HILLEL
Programming Computers to Play Games
ARTHUR L. SAMUEL
Machine Recognition of Spoken Words
RICHARD FATEHCHAND
Binary Arithmetic
GEORGE W. REITWIESNER

Volume 2
A Survey of Numerical Methods for Parabolic Differential Equations
JIM DOUGLAS, JR.
Advances in Orthonormalizing Computation
PHILIP J. DAVIS AND PHILIP RABINOWITZ
Microelectronics Using Electron-Beam-Activated Machining Techniques
KENNETH R. SHOULDERS
Recent Developments in Linear Programming
SAUL I. GASS
The Theory of Automata, a Survey
ROBERT MCNAUGHTON

Volume 3
The Computation of Satellite Orbit Trajectories
SAMUEL D. CONTE
Multiprogramming
E. F. CODD
Recent Developments of Nonlinear Programming
PHILIP WOLFE
Alternating Direction Implicit Methods
GARRETT BIRKHOFF, RICHARD S. VARGA, AND DAVID YOUNG
Combined Analog-Digital Techniques in Simulation
HAROLD F. SKRAMSTAD
Information Technology and the Law
REED C. LAWLOR

Volume 4
The Formulation of Data Processing Problems for Computers
WILLIAM C. MCGEE
All-Magnetic Circuit Techniques
DAVID R. BENNION AND HEWITT D. CRANE
Computer Education
HOWARD E. TOMPKINS
Digital Fluid Logic Elements
H. H. GLAETTLI
Multiple Computer Systems
WILLIAM A. CURTIN

Volume 5
The Role of Computers in Election Night Broadcasting
JACK MOSHMAN
Some Results of Research on Automatic Programming in Eastern Europe
WLADYSLAW TURSKI
A Discussion of Artificial Intelligence and Self-organization
GORDON PASK
Automatic Optical Design
ORESTES N. STAVROUDIS
Computing Problems and Methods in X-Ray Crystallography
CHARLES L. COULTER
Digital Computers in Nuclear Reactor Design
ELIZABETH CUTHILL
An Introduction to Procedure-oriented Languages
HARRY D. HUSKEY

Volume 6
Information Retrieval
CLAUDE E. WALSTON
Speculations Concerning the First Ultraintelligent Machine
IRVING JOHN GOOD
Digital Training Devices
CHARLES R. WICKMAN
Number Systems and Arithmetic
HARVEY L. GARNER
Considerations on Man versus Machine for Space Probing
P. L. BARGELLINI
Data Collection and Reduction for Nuclear Particle Trace Detectors
HERBERT GELERNTER

Volume 7
Highly Parallel Information Processing Systems
JOHN C. MURTHA
Programming Language Processors
RUTH M. DAVIS
The Man-Machine Combination for Computer-Assisted Copy Editing
WAYNE A. DANIELSON
Computer-Aided Typesetting
WILLIAM R. BOZMAN
Programming Languages for Computational Linguistics
ARNOLD C. SATTERTHWAIT
Computer Driven Displays and Their Use in Man/Machine Interaction
ANDRIES VAN DAM
Volume 8
Time-shared Computer Systems
THOMAS N. PYKE, JR.
Formula Manipulation by Computer
JEAN E. SAMMET
Standards for Computers and Information Processing
T. B. STEEL, JR.
Syntactic Analysis of Natural Language
NAOMI SAGER
Programming Languages and Computers: A Unified Metatheory
R. NARASIMHAN
Incremental Computation
LIONELLO A. LOMBARDI

Volume 9
What Next in Computer Technology?
W. J. POPPELBAUM
Advances in Simulation
JOHN MCLEOD
Symbol Manipulation Languages
PAUL W. ABRAHAMS
Legal Information Retrieval
AVIEZRI S. FRAENKEL
Large Scale Integration: An Appraisal
L. M. SPANDORFER
Aerospace Computers
A. S. BUCHMAN
The Distributed Processor Organization
L. J. KOCZELA

Volume 10
Humanism, Technology, and Language
CHARLES DECARLO
Three Computer Cultures: Computer Technology, Computer Mathematics, and Computer Science
PETER WEGNER
Mathematics in 1984: The Impact of Computers
BRYAN THWAITES
Computing from the Communication Point of View
E. E. DAVID, JR.
Computer-Man Communication: Using Computer Graphics in the Instructional Process
FREDERICK P. BROOKS, JR.
Computers and Publishing: Writing, Editing, and Printing
ANDRIES VAN DAM AND DAVID E. RICE
A Unified Approach to Pattern Analysis
ULF GRENANDER
Use of Computers in Biomedical Pattern Recognition
ROBERT S. LEDLEY
Numerical Methods of Stress Analysis
WILLIAM PRAGER
Spline Approximation and Computer-Aided Design
J. H. AHLBERG
Logic per Track Devices
D. L. SLOTNICK

Volume 11
Automatic Translation of Languages Since 1960: A Linguist's View
HARRY H. JOSSELSON
Classification, Relevance, and Information Retrieval
D. M. JACKSON
Approaches to the Machine Recognition of Conversational Speech
KLAUS W. OTTEN
Man-Machine Interaction Using Speech
DAVID R. HILL
Balanced Magnetic Circuits for Logic and Memory Devices
R. B. KIEBURTZ AND E. E. NEWHALL
Command and Control: Technology and Social Impact
ANTHONY DEBONS