Highly Available Storage for Windows® Servers (VERITAS Series)
Paul Massiglia
Wiley Computer Publishing
John Wiley & Sons, Inc.
NEW YORK • CHICHESTER • WEINHEIM • BRISBANE • SINGAPORE • TORONTO
Publisher: Robert Ipsen
Editor: Carol A. Long
Assistant Editor: Adaobi Obi
Managing Editor: Micheline Frederick
Text Design & Composition: North Market Street Graphics

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

This book is printed on acid-free paper.

Copyright © 2002 by Paul Massiglia. All rights reserved.

Published by John Wiley & Sons, Inc. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher is not engaged in professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.

Library of Congress Cataloging-in-Publication Data:
Massiglia, Paul.
Highly available storage for Windows servers / Paul Massiglia.
p. cm.
ISBN 0-471-03444-4
1. Microsoft Windows server. 2. Client/server computing. 3. Computer storage devices. I. Title.
QA76.9.C55 M394 2002
004.4'476—dc21
2001006393

Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
CONTENTS

Acknowledgments
Foreword

Part One  Disk Storage Architecture

Chapter 1  Disk Storage Basics
  Data Basics
    Transient Data
    Persistent Data
  Disk Basics
    Disks, Data and Standards
    Magnetic Disk Operation
    Pulse Timing and Recording Codes
    Error Correction Codes
    Locating Blocks of Data on Magnetic Disks
    Logical Block Addressing
    Zoned Data Recording
    Disk Media Defects
    Writing Data on Magnetic Disks
    Intelligent Disks
    Other Applications of Disk Intelligence: SMART Technology
  Disk Controller and Subsystem Basics
    External and Embedded Array Controllers
    Host-Based Aggregation

Chapter 2  Volumes
  The Volume Concept
    Virtualization in Volumes
    Why Volumes?
    The Anatomy of Windows Disk Volumes
    Mapping and Failure Protection: Plexes

Chapter 3  Volumes That Are Not Failure Tolerant
  Simple Volumes
  Spanned Volumes
    Spanned Volumes and Failure Tolerance
    Spanned Volumes and I/O Performance
  Applications for Simple and Spanned Volumes
  Striped Volumes
    Striped Volumes and Failure Tolerance
    Striped Volumes and I/O Performance
    Applications for Striped Volumes
    Why Striped Volumes Are Effective
    Striped Volumes and I/O Request-Intensive Applications
    Striped Volumes and Data Transfer-Intensive Applications
    Stripe Unit Size and I/O Performance
    A Way to Categorize the I/O Performance Effects of Data Striping
    An Important Optimization for Striped Volumes: Gather Writing and Scatter Reading

Chapter 4  Failure-Tolerant Volumes: Mirroring and RAID
  RAID: The Technology
    RAID Today
  Mirrored Volumes
    Mirrored Volumes and I/O Performance
    Combining Striping with Mirroring
    Split Mirrors: A Major Benefit of Mirrored Volumes
  RAID Volumes
    RAID Overview
    RAID Check Data
    The Hardware Cost of RAID
  Data Striping with RAID
    Writing Data to a RAID Volume
    An Important Optimization for Small Writes to Large Volumes
    An Important Optimization for Large Writes
    The Parity Disk Bottleneck
    A Summary of RAID Volume Performance
  Failure-Tolerant Volumes and Data Availability
    Mirroring and Availability
    RAID and Availability
  What Failure-Tolerant Volumes Don't Do
  I/O Subsystem Cache
    Disk Cache
    RAID Controller Cache
    Operating System Cache
    File System Metadata Cache
    Database Management System and Other Application Cache

Part Two  Volume Management for Windows Servers

Chapter 5  Disks and Volumes in Windows 2000
  The Windows Operating Systems View of Disks
    Starting the Computer
    Locating and Loading the Operating System Loader
    Extended Partitions and Logical Disks
    Loading the Operating System
  Dynamic Disks: Eliminating the Shortcomings of the Partition Structure
    Dynamic Volume Functionality
  Volumes in Windows NT Operating Systems
  Recovering Volumes from System Crashes
    Update Logging for Mirrored Volumes
    Update Logging for RAID Volumes
    Crash Recovery of Failure-Tolerant Volumes
  Windows Disk and Volume Naming Schemes
  Where Volume Managers Fit: The Windows OS I/O Stack
  Volume Manager Implementations
    Common Features of All Volume Managers
    Volume Manager for Windows NT Version 4
    Volume Managers for Windows 2000
    Windows 2000 Volume Manager Capabilities
  Array Managers
    Volumes Made from Disk Arrays
  Summary of Volume Manager Capabilities

Chapter 6  Host-Based Volumes in Windows Servers
  Starting the Logical Disk Manager Console
  Disk Management Simplified
  Creating and Reconfiguring Partitions and Volumes
    Invoking Logical Disk Manager Wizards
  Upgrading Disks to Dynamic Format

Chapter 7  Basic Volumes
  Creating a Simple Volume
  Management Simplicity
  Creating a Spanned Volume
  Creating a Striped Volume
  Creating a Mirrored Volume
    Splitting a Mirror from a Mirrored Volume
    Adding a Mirror to a Logical Disk Manager Volume
    Removing a Mirror from a Mirrored Volume

Chapter 8  Advanced Volumes
  The Volume Manager for Windows 2000
  Three-Mirror Volumes and Splitting
    Part I: Adding a Mirror
    Part II: Splitting a Mirror from a Mirrored Volume

Chapter 9  More Volume Management
  Extending Volume Capacity
    Volume Extension Rules
  Features Unique to Windows 2000 Volumes
    Mount Points
    FAT32 File System
  Mirrored-Striped Volumes
    The Volume Manager and Mirrored-Striped Volumes
    Dynamic Expansion of Mirrored Volumes
    Splitting a Striped Mirror
  Creating and Extending a RAID Volume
    RAID Volumes and Disk Failure
    Extending a RAID Volume (Volume Manager Only)
  Multiple Volumes on the Same Disks
  Monitoring Volume Performance
  Relocating Subdisks
  Disk Failure and Repair
  Volume Management Events
  Using Windows Command-Line Interface to Manage Volumes

Chapter 10  Multipath Data Access
  Physical I/O Paths

Chapter 11  Managing Hardware Disk Arrays
  RAID Controllers
    Embedded RAID Controllers
  Array Managers
    RAID Controllers and the Volume Manager
    Dealing with Disk Failures

Chapter 12  Managing Volumes in Clusters
  Clusters of Servers
    Cluster Manager Data Access Architectures
    Resources, Resource Groups, and Dependencies
    Clusters and Windows Operating Systems
    How Clustering Works
  Microsoft Cluster Server
    MSCS Heartbeats and Cluster Partitioning
    Determining MSCS Membership: The Challenge/Defense Protocol
    MSCS Clusters and Volumes
    Volumes as MSCS Quorum Resources
  Volume Management in MSCS Clusters
    Preparing Disks for Cluster Use
    MSCS Resource Types: Resource DLLs and Extension DLLs
    Using Host-Based Volumes as Cluster Resources
    Multiple Disk Groups
    Cluster Resource Group Creation
    Making a Cluster Disk Group into a Cluster Resource
    Controlling Failover: Cluster Resource Properties
    Bringing a Resource Group Online
    Administrator-Initiated Failover
    Failback
    Multiple Disk Groups in Clusters
    Making a Volume Manager Disk Group into an MSCS Cluster Resource
    Making Cluster Resources Usable: A File Share
  MSCS and Host-Based Volumes: A Summary
    Disk Groups in the MSCS Environment
    Disk Groups as MSCS Quorum Resources
    Configuring Volumes for Use with MSCS
  VERITAS Cluster Server and Volumes
    VCS and Cluster Disk Groups
    VCS Service Groups and Volumes
    Service Group Failover in VCS Clusters
    Adding Resources to a VCS Service Group
    Troubleshooting: The VCS Event Log
    Cluster Resource Functions: VCS Agents
  Volume Manager Summary

Chapter 13  Data Replication: Managing Storage Over Distance
  Data Replication Overview
    Alternative Technologies for Data Replication
    Data Replication Design Assumptions
    Server-Based and RAID Subsystem-Based Replication
  Elements of Data Replication
    Initial Synchronization
    Replication for Frozen Image Creation
    Continuous Replication
  What Gets Replicated?
    Volume Replication
    File Replication
    Database Replication
  How Replication Works
    Asynchronous Replication
    Replication and Link Outages
    Replication Software Architecture
    Replicated Data Write Ordering
    Initial Synchronization of Replicated Data
    Initial Synchronization of Replicated Files
    Resynchronization
  Using Replication
    Bidirectional Replication
    Using Frozen Images with Replication
  Volume Replication for Windows Servers: An Example
    Managing Volume Replication
    Creating a Replicated Data Set
    VVR Data Change Map Logs
    Initializing Replication
    Sizing the Replication Log
    Replication Log Overflow Protection
    Protecting Data at a Secondary Location
    Network Outages
    Using Replicated Data
    RVG Migration: Converting a Secondary RVG into a Primary
  File Replication for Windows Servers
    Replication Jobs
    Specifying Replication Sources and Targets
    Specifying Data to be Replicated
    Replication Schedules
    Starting Replication Administratively
    Troubleshooting File Replication

Chapter 14  Windows Online Storage Recommendations
  Rules of Thumb for Effective Online Storage Management
  Choosing an Online Storage Type
  Basic Volume Management Choices
    Just a Bunch of Disks
    Striped Volumes
  Failure-Tolerant Storage: RAID versus Mirrored Volumes
    RAID Volume Width
    Number of Mirrors
  Hybrid Volumes: RAID Controllers and Volume Managers
    Host-Based and Subsystem-Based RAID
    Host-Based and Subsystem-Based Mirrored Volumes
    Using Host-Based Volume Managers to Manage Capacity
    Combining Host-Based Volumes and RAID Subsystems for Disaster Recoverability
  Unallocated Storage Capacity Policies
    Determination of Unallocated Storage Capacity
    Distribution of Unallocated Storage
    Amount of Unallocated Capacity
    Spare Capacity and Disk Failures
    Disk Groups and Hardware RAID Subsystems
  Failed Disks, Spare Capacity, and Unrelocation
  Using Disk Groups to Manage Storage
    Using Disk Groups to Manage Storage in Clusters
    Using Disk Groups to Control Capacity Utilization
  Data Striping and I/O Performance
    Striping for I/O Request-Intensive Applications
    Striping for Data Transfer-Intensive Applications
    Rules of Thumb for Data Striping
    Staggered Starts for Striped Volumes
    Striped Volume Width and Performance

Appendix 1  Disk and Volume States
Appendix 2  Recommendations at a Glance
Glossary of Storage Terminology
Index
Acknowledgments
The title page bears my name, and it's true, I did put most of the words on paper. But as anyone who has ever written a book—even a modestly technical book like this one—is aware, it is inherently a team effort.
This project wouldn’t have come to fruition without a lot of support from a number of talented people. Pete Benoit’s Redmond engineering team was of immeasurable technical assistance. I single out Terry Carruthers, Debbie Graham, Pylee Lennil, and Mike Peterson, who all put substantial effort into correcting my mistakes. Philip Chan’s volume manager engineering team reviewed the original manuscript for accuracy, and met all my requests for license keys, access to documents, and software, server accounts and technical consulting. Particular thanks go to the engineers and lab technicians of VERITAS West, who made both their facilities and their expertise available to me unstintingly. Hrishi Vidwans, Vipin Shankar, Louis MacCubbin, T. J. Somics, Jimmy Lim, Natalia Elenina, and Sathaiah Vanam were particularly helpful in this respect. Karen Rask, the VERITAS product marketing manager for the Volume Manager described in this book saw value in the concept and drove it through to publication. Thanks, too, to other members of the VERITAS Foundation and Clustering engineering and product management teams who supported the project. Richard Barker, my manager, had the forbearance not to ask too often what I was doing with all my time. This book actually stemmed from an idea of xi
xii
Acknowledgements
Richard’s almost two years ago—although in retrospect, he may view it as proof of the adage, “Be careful what you wish for. You may get it.” Many other people contributed, both materially and by encouraging me when necessary. You know who you are. Errors that remain are solely my responsibility. Paul Massiglia Colorado Springs August, 2001
FOREWORD
The Importance of Understanding Online Storage
In recent years, the prevailing user view of failure-tolerant storage has progressed from "seldom-deployed high-cost extra" to "necessity for important data in mission-critical applications," and seems to be headed for "default option for data center storage." During the same period, the storage industry has declared independence from the computer system industry, resulting in a wider range of online storage alternatives for users.
Today, system administrators and managers who buy and configure online storage need to understand the implications of their choices in this complex environment. A prerequisite for making informed decisions about online storage alternatives is an awareness of how disks, volumes, mirroring, RAID, and failure-tolerant disk subsystems work; how they interact and what they can and cannot do. Similarly, client-server application developers and managers must concern themselves with the quality of online storage service provided by their data centers. Understanding storage technology can help these users negotiate with their data centers to obtain the right cost, availability, and performance alternatives for each application. Moreover, volume management technologies are now available for the desktop. As disk prices continue to decline, widespread desktop use of these techniques is only a matter of time. Desktop users should develop an understanding of storage technology, just as they have done with other aspects of their computers.
Highly Available Storage for Windows Servers (VERITAS Series) was written for all of these audiences. Part I gives an architectural background, to enable users to formulate online storage strategies, particularly with respect to failure tolerance and performance. Part II describes how VERITAS volume management technologies apply these principles in Windows operating system environments.
PART ONE

Disk Storage Architecture
CHAPTER 1

Disk Storage Basics
Data Basics

Computer systems process data. The data they process may be transient, that is, acquired or created during the course of processing and ceasing to exist after processing is complete; or it may be persistent, stored in some permanent fashion so that program after program may access it.
Transient Data

The solitaire game familiar to Windows users is an example of transient data. When a solitaire player starts a new game, transient data structures representing a deck of cards dealt into solitaire stacks are created. As the user plays the game, keystrokes and mouse clicks are transformed into actions on virtual cards. The solitaire program maintains transient data structures that describe which cards are exposed in which stacks, which remain hidden, and which have been retired. As long as the player is engaged in the game, the solitaire program maintains the data structures. When a game is over, however, or when the program ceases to run, the transient data structures are deleted from memory and cease to exist.

In today's computers, with volatile random access memory, programs may cease to run and their transient data cease to exist for a variety of uncontrollable reasons that are collectively known as crashes. Crashes may result from power failure, from operating system failure, from application failure, or from operational error. Whatever the cause, the effect of a crash is that transient data is lost, along with the work or business state it represents. The consequence of crashes is generally a need to redo the work that went into creating the lost transient data.
Persistent Data

If all data were transient, computers would not be very useful. Fortunately, technology has provided the means for data to last, or persist, across crashes and other program terminations. Several technologies, including battery-backed dynamic random access memory (solid state disk) and optical disk, are available for storing data persistently; but far and away the most prevalent storage technology is the magnetic disk.

Persistent data objects (for example, files) outlast the execution of the programs that process them. When a program stops executing, its persistent data objects remain in existence, available to other programs to process for other purposes. Persistent data objects also survive crashes. Data objects that have been stored persistently prior to a crash again become available for processing after the cause of the crash has been discovered and remedied, and the system has been restarted. Work already done to create data objects or alter them to reflect new business states need not be redone. Persistent data objects, therefore, not only make computers useful as recordkeepers, they make computers more resilient in the face of the inevitable failures that befall electromechanical devices.

Persistent data objects differ so fundamentally from transient data that a different metaphor is used to describe them for human use. Whereas transient data is typically thought of in terms of variables or data structures to which values are assigned (for example, let A = 23), persistent data objects are typically regarded as files, from which data can be read and to which data can be written. The file metaphor is based on an analogy to physical file cabinets, with their hierarchy of drawers and folders for organizing large numbers of data objects. Figure 1.1 illustrates key aspects of the file metaphor for persistent data objects. The file metaphor for persistent computer data is particularly apt for several reasons:
■■ The root of the metaphor is a physical device—the file cabinet. Each file cabinet represents a separate starting point in a search for documents. An organization that needs to store files must choose between a smaller number of larger file cabinets and a larger number of smaller file cabinets.
Figure 1.1 The file metaphor for persistent data objects.
■■ File cabinets fundamentally hold file folders. File folders may be hierarchical: They may hold folders, which hold other folders, and so forth.
■■ Ultimately, the reason for file folders is to hold documents, or files. Thus, with rare exceptions, the lowest level of the file folder hierarchy consists of folders that hold documents.
■■ File folders are purely an organizational abstraction. Any relationship between a folder and the documents in it is entirely at the discretion of the user or system administrator who places documents in folders.
The file cabinet/file folder metaphor has proven so useful in computing that it has become nearly universal. Virtually all computers, except those that are embedded in other products, include a software component called a file system that implements the file cabinet metaphor for persistent data. UNIX systems typically use a single cabinet abstraction, with all folders contained in a single root folder. Windows operating systems use a multicabinet abstraction, with each “cabinet” corresponding to a physical or logical storage device that is identified by a drive letter.
Disk Basics

Of the technologies used for persistent data storage, the most prevalent by far is the rotating magnetic disk. Magnetic disks have several properties that make them the preferred technology solution for storing persistent data:

Low cost. Today, raw magnetic disk storage costs between 1 and 5 cents per megabyte. This compares with a dollar or more for dynamic random access memory.
Random access. Relatively small blocks of data stored on magnetic disks can be accessed in random order.1 This allows programs to execute and process files in an order determined by business needs rather than by data access technology.

High reliability. Magnetic disks are among the most reliable electromechanical devices built today. Disk vendors routinely claim that their products have statistical mean times between failures of as much as a million hours.

Universality. Over the course of the last 15 years, disk interface technology has gradually become standardized. Today, most vendors' disks can be used with most computer systems. This has resulted in a competitive market that tends to reinforce a cycle of improving products and decreasing prices.
Disks, Data and Standards

Standardization of disk interface technology unfortunately has not led to standardization of data formats. Each operating system and file system has a unique "on-disk format," and is generally not able to operate on disks written by other operating systems and file systems without a filter or adapter application. Operating systems that use Windows NT technology include three major file systems (Figure 1.2 shows the latter two):

File Allocation Table, or FAT. A DOS-compatible file system retained primarily for compatibility with other operating systems, both from Microsoft and from other vendors.

FAT32. A 32-bit version of the FAT file system originally developed to allow personal computers to accommodate large disks, but supported by Windows operating systems that use NT technology.

NTFS. The native file system for NT technology operating systems.

NT technology operating systems also include an Installable File System (IFS) facility that enables software vendors to install additional software layers in the Windows operating system data access stack to filter and preprocess file system input/output (I/O) requests.

The three Windows NT file systems use different on-disk formats. The format function of the Windows NT Disk Administrator prepares a disk for use with one of the file systems by writing the initial file system metadata2 on it. The operating system mount function associates a disk with the file system for which it is formatted and makes data on the disk accessible to applications. Windows operating systems mount all visible disks automatically when they start up, so the act of associating a disk with its file system is generally transparent to system administrators and users once the disk has been formatted for use with one of the file systems.

Figure 1.2 FAT and NTFS file systems on different disk partitions.

1 Strictly speaking, disks do not provide random access to data in quite the same sense as dynamic random access memory (DRAM). The primitive actions (and therefore the time) required to access a block of data depend partly upon the last block accessed (which determines the seek time) and partly upon the timing of the access request (which determines the rotational latency). Unlike tapes, however, each disk access specifies explicitly which data is to be accessed. In this sense, disks are like random access memory, and operating system I/O driver models treat disks as random access devices.
Magnetic Disk Operation

While magnetic disks incorporate a diverse set of highly developed technologies, the physical principles on which they are based are simple. In certain materials, called ferromagnetic materials, small regions can be permanently magnetized by placing them near a magnetic field. Other materials are paramagnetic, meaning that they can be magnetized momentarily by being brought into proximity with an electrical current in a coil. Figure 1.3 illustrates the components of a magnetic recording system. Once a ferromagnetic material has been magnetized (e.g., by being brought near a strongly magnetized paramagnetic material), moving it past a paramagnetic material with a coil of wire wrapped around it results in a voltage change corresponding to each change in magnetization direction. The timing of these pulses, which is determined by the distance between transitions in field direction and the relative velocity of the materials, can be interpreted as a stream of data bits, as Figure 1.4 illustrates.
2 File system metadata is data about the file system and the user data stored in it. It includes file names, information about the location of data within the file system, user access right information, file system free space, and other data.
Figure 1.3 General principle of magnetic data recording.
Magnetic disk (and tape) recording relies on relative motion between the ferromagnetic recording material (the media) and the device providing recording energy or sensing magnetic state (the head). In magnetic disks, circular platters rotate relative to a stationary read/write head while data is read or written.
Figure 1.4 Recovering data recorded on disks from pulse timing.

Pulse Timing and Recording Codes

Disk platters rotate at a nominally constant velocity, so the relative velocity of read/write head and media is nominally constant, allowing constant time slots to be established. In each time slot, there either is or is not a pulse. With constant rotational velocity and an electronic timer generating time slots, pulses could be interpreted as binary ones, and the absence of pulses could be interpreted as zeros. Figure 1.5 illustrates this simple encoding. Pulses peak in the third and twelfth time intervals and are interpreted as "1" bits. Other intervals lack pulse peaks and are interpreted as "0" bits.

Figure 1.5 Inferring data from a combination of voltage pulses and timing.

If rotational velocity were truly constant and the electronics used to establish time slots were perfect, this simple encoding scheme would be adequate. Unfortunately, minor variations in rotational speed and timer electronics can cause pulses to drift into adjacent time slots and to be interpreted incorrectly, as illustrated in Figure 1.6. Using the encoding scheme illustrated in Figures 1.5 and 1.6, an entire block of binary zeros would produce no pulses when read back. To guard against this, on-disk data is encoded using algorithms that guarantee the occurrence of frequent pulses independent of the input data pattern of ones and zeros. Pulses are input to a phase-locked loop, which in turn adjusts time slots. The constant fine-tuning maximizes the likelihood that pulses will be interpreted correctly.

Figure 1.6 Effect of timing on data recovery.

Encoding schemes that guarantee frequent pulses (or 1 bits) independent of the application data pattern are called run-length-limited, or RLL, codes. Figure 1.7 illustrates a very simple RLL code.

Figure 1.7 Example of a run-length-limited data encoding.

    Incoming Data    Resulting Code Bits
    00               001
    01               010
    10               100
    11               101

RLL codes are characterized by the smallest and largest possible intervals between 1 bits in the encoded bit stream. Thus, the code illustrated in Figure 1.7 would be characterized as a 0,4 code, because:
■■ Pulses in adjacent time slots can occur (e.g., user data 0010, which encodes into 001100).
■■ There can be at most four time slots between pulses (user data 1000, which encodes into 100001).
With this code, a pulse is guaranteed to occur in the encoded bit stream at least every fifth time slot. Thus, the maximum time that the timing generator must remain synchronized without feedback from the data stream itself is four time slots. Actual RLL codes are typically more elaborate than the one illustrated in Figure 1.7, sometimes guaranteeing a minimum of one or more time intervals between adjacent pulses. This is beneficial because it decreases the frequency spectrum over which the disk’s data decoding logic must operate.
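As an illustration only (not from the book), the 0,4 code of Figure 1.7 can be modeled as a simple lookup table that maps two user bits at a time onto three code bits. The sketch below reproduces the encodings cited in the bullets above (0010 encodes into 001100; 1000 encodes into 100001).

    # Sketch of the simple run-length-limited (0,4) code from Figure 1.7.
    # Illustration only; production disk RLL codes are more elaborate.
    ENCODE = {"00": "001", "01": "010", "10": "100", "11": "101"}
    DECODE = {code: bits for bits, code in ENCODE.items()}

    def rll_encode(user_bits: str) -> str:
        """Encode user data two bits at a time into three code bits."""
        return "".join(ENCODE[user_bits[i:i + 2]] for i in range(0, len(user_bits), 2))

    def rll_decode(code_bits: str) -> str:
        """Recover user data from the encoded bit stream."""
        return "".join(DECODE[code_bits[i:i + 3]] for i in range(0, len(code_bits), 3))

    print(rll_encode("0010"))    # 001100 -- pulses in adjacent time slots
    print(rll_encode("1000"))    # 100001 -- four empty time slots between pulses
    print(rll_decode("100001"))  # 1000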
Error Correction Codes

Even with run-length encoding and phase-locked loops, errors can occur when data is read from disks. Mathematically elaborate checksum schemes have been developed to protect against the possibility of incorrect data being accepted as correct; and in many instances, these checksums can correct errors in data delivered by a disk. In general, these error correction codes, or ECCs, are generated by viewing a block of data as a polynomial whose coefficients are consecutive strings of bits comprising the data block. As data is written, specialized hardware at the source (e.g., in a disk, just upstream of the write logic) divides the data polynomial by a smaller, fixed polynomial called a generating polynomial. The quotient of the division is discarded, and the remainder, which is guaranteed to be of limited size, is appended to the data stream as a checksum and written to disk media, as illustrated in Figure 1.8.

When data is read back from the disk, the read logic performs the same computation, this time retaining the quotient and comparing the computed remainder to the checksum read from the disk. A difference in the two is a signal that data and/or checksum have been read incorrectly. The mathematical properties of the checksum are such that the difference between the two remainders plus the coefficients of the quotient can be used to correct the erroneous data, within limits.

The specialized hardware used to compute checksums is usually able to correct simple data errors without delaying the data stream. More complex error patterns require disk firmware assistance. So when these patterns occur, data may reach memory out of order. Error-free blocks that occur later in the data stream may be delivered before earlier erroneous ones. Thus it is important for applications and data managers not to assume that data read from a disk is present in memory until the disk has signaled that a read is complete.
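To make the polynomial-division idea concrete, the following sketch treats a block of data as a polynomial over GF(2), divides it by a small generating polynomial, and appends the remainder as a checksum. This is an illustration of the principle only; it is not the ECC of any actual disk, which would use a much larger generating polynomial capable of locating and correcting errors as described above.

    # CRC-style checksum sketch: the remainder of dividing the data
    # polynomial by a generating polynomial is appended on write and
    # recomputed on read. (The 8-bit generator here is an example.)
    GENERATOR = 0x107      # x^8 + x^2 + x + 1
    CHECK_BITS = 8

    def remainder(data: bytes) -> int:
        """Bit-serial polynomial division over GF(2); returns the remainder."""
        reg = 0
        for byte in data:
            for bit in range(7, -1, -1):
                reg = (reg << 1) | ((byte >> bit) & 1)
                if reg & (1 << CHECK_BITS):   # degree reached: subtract (XOR) the generator
                    reg ^= GENERATOR
        for _ in range(CHECK_BITS):           # flush so the remainder covers the whole block
            reg <<= 1
            if reg & (1 << CHECK_BITS):
                reg ^= GENERATOR
        return reg

    def write_block(data: bytes) -> bytes:
        """On write: append the remainder to the data stream as a checksum."""
        return data + bytes([remainder(data)])

    def read_block(stored: bytes) -> bytes:
        """On read: recompute and compare; a difference signals a read error."""
        data, checksum = stored[:-1], stored[-1]
        if remainder(data) != checksum:
            raise IOError("data and/or checksum were read incorrectly")
        return data

    block = write_block(b"user data for one 512-byte block would go here")
    assert read_block(block) == b"user data for one 512-byte block would go here"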
Locating Blocks of Data on Magnetic Disks

For purposes of identifying and locating data, magnetic disks are logically organized into concentric circles called tracks, as illustrated in Figure 1.8. Read/write heads are attached to actuators that move them from track to track. On each track, data is stored in blocks of fixed size (512 bytes in Windows and most other systems). Each disk block starts with a synchronization pattern and identifying header,3 followed by user data, an error correction code (ECC), and a trailer pattern. Adjacent blocks are separated by servo signals, recorded patterns that help keep the read/write head centered over the track. An index mark at the start of each track helps the disk's position control logic keep track of rotational position.

Figure 1.8 Magnetic disk data layout.

Figure 1.8 illustrates one surface of one disk platter. All the blocks at a given radius comprise a track. On a disk with multiple recording surfaces, all of the tracks at a given radius are collectively known as a cylinder. The disk illustrated in Figure 1.8 has the same number of blocks on each track. The capacity of such a disk is given by:

    Disk capacity (bytes) = number of blocks per track
                            × number of tracks per surface (cylinders)
                            × number of data heads (data surfaces)
                            × number of bytes per block

Each block of data on such a disk can be located ("addressed") by specifying a cylinder, a head (recording surface), and a (relative) block number. This is called cylinder, head, sector, or C-H-S, addressing. Figure 1.9 illustrates C-H-S addressing.

3 In some newer disk models, the header is eliminated to save space and increase storage capacity.
Figure 1.9 Locating data on a disk.
Figure 1.9 illustrates the three distinct operations required to locate a block of data on a multisurface disk for reading or writing:

■■ Seeking moves the actuator to position the recording heads approximately over the track on which the target data is located.

■■ Selection of the head that will read or write data connects the head's output to the disk's read/write channel so that servo information can be used to center the head precisely on the track.

■■ Rotation of the platter stack brings the block to be read or written directly under the head, at which time the read or write channel is enabled for data transfer.
Logical Block Addressing

C-H-S addressing is inconvenient for disk drivers and file systems because it requires awareness of disk geometry. To use C-H-S addressing to locate data, a program must be aware of the number of cylinders, recording surfaces, and blocks per track of each disk. This would require that software be customized for each type of disk. While this was in fact done in the early days of disk storage, more recently the disk industry has adopted the more abstract logical block addressing model for disks, illustrated in Figure 1.10. With logical block addressing, disk blocks are numbered in ascending sequence. To read or write data, file systems and drivers specify a logical block number. A microprocessor in the disk itself converts between the logical block address and the C-H-S address, as illustrated in Figure 1.11.
Figure 1.10 The logical block disk data-addressing model.
Figure 1.11 Conversion between logical block and C-H-S data addressing. (Example disk geometry: 100 blocks per track, 4 surfaces, 1,000 cylinders. A host WRITE of 1,536 bytes starting at disk block 2001 becomes: seek to cylinder 5, select head 0, enable writing at sector 1, disable writing at sector 4. A host READ of 2,048 bytes starting at disk block 1002 becomes: seek to cylinder 2, select head 2, enable reading at sector 2, disable reading at sector 6.)
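The arithmetic behind Figure 1.11 is straightforward division and remainder. The sketch below is an illustration only (it is not actual disk firmware); it uses the figure's example geometry of 100 blocks per track, 4 surfaces, and 1,000 cylinders, and also evaluates the capacity formula given earlier for this geometry.

    # Logical block address (LBA) to cylinder-head-sector conversion,
    # using the example geometry from Figure 1.11. Illustration only.
    BLOCKS_PER_TRACK = 100
    HEADS = 4                 # data surfaces
    CYLINDERS = 1000
    BYTES_PER_BLOCK = 512

    def capacity_bytes() -> int:
        """Capacity formula from the text for a non-zoned disk."""
        return BLOCKS_PER_TRACK * CYLINDERS * HEADS * BYTES_PER_BLOCK

    def lba_to_chs(lba: int) -> tuple[int, int, int]:
        """Convert a logical block number into (cylinder, head, sector)."""
        blocks_per_cylinder = BLOCKS_PER_TRACK * HEADS
        cylinder, rest = divmod(lba, blocks_per_cylinder)
        head, sector = divmod(rest, BLOCKS_PER_TRACK)
        return cylinder, head, sector

    print(lba_to_chs(2001))   # (5, 0, 1): the WRITE starts at cylinder 5, head 0, sector 1
    print(lba_to_chs(1002))   # (2, 2, 2): the READ starts at cylinder 2, head 2, sector 2
    print(capacity_bytes())   # 204,800,000 bytes for this example geometry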
Desktop computer disks typically support both C-H-S and block data addressing, largely for reasons of backward compatibility. The SCSI and Fibre Channel disks typically used in servers and multidisk subsystems use block addressing exclusively.
Zoned Data Recording

Each block on the outermost track of the disk platter illustrated in Figure 1.9 occupies considerably more linear distance than the corresponding block on the innermost track, even though it contains the same amount of data. During the early 1990s, in an effort to reduce storage cost, disk designers began to design disks in which the longer outer tracks are divided into more blocks than the shorter inner ones. Although this increased the complexity of disk electronics, it also increased the storage capacity for any given level of head and media technology by as much as 50 percent. Today, this technique goes by names such as zoned data recording (ZDR). The cylinders of a ZDR disk are grouped into zones, each of which is formatted to hold a different number of 512-byte blocks. Figure 1.12 illustrates a platter surface of a ZDR disk with two zones. In Figure 1.12, each track in the inner zone contains 8 blocks, while each track in the outer zone contains 16. Compared to the disk illustrated in Figure 1.9, capacity is 50 percent higher, with little if any incremental product cost. (Figure 1.12 uses unrealistically low numbers of blocks per track for the sake of clarity of the diagram. Typical ZDR disks have 20 or more zones, with between 100 and 200 blocks per track. For 3.5-inch diameter disks, the outermost zone usually has about twice as many blocks per track as the innermost zone.)
Figure 1.12 One platter of a zoned data-recorded disk (inner zone: 8 blocks per track; outer zone: 16 blocks per track).
Because of the beneficial effect on cost per byte of storage, zone bit recording has essentially become ubiquitous.
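A quick calculation, not from the book, shows where the roughly 50 percent gain of Figure 1.12 comes from: if the cylinders are split evenly between an inner zone holding 8 blocks per track and an outer zone holding 16 (an assumption made only for this example), the average is 12 blocks per track versus 8 for the non-zoned disk of Figure 1.9.

    # Capacity comparison between a non-zoned disk and the two-zone disk
    # of Figure 1.12, assuming the cylinders are split evenly between zones.
    BYTES_PER_BLOCK = 512
    HEADS = 4
    CYLINDERS = 1000

    def capacity(zones: dict[int, int]) -> int:
        """Sum capacity over zones, given {blocks_per_track: cylinders_in_zone}."""
        return sum(blocks_per_track * cylinders * HEADS * BYTES_PER_BLOCK
                   for blocks_per_track, cylinders in zones.items())

    non_zoned = capacity({8: CYLINDERS})                        # every track holds 8 blocks
    zoned = capacity({8: CYLINDERS // 2, 16: CYLINDERS // 2})   # two zones, as in Figure 1.12

    print(zoned / non_zoned)   # 1.5 -- about 50 percent more capacity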
Disk Media Defects

With the high recording densities in use today, minuscule material defects can render part of a recording surface permanently unrecordable. The blocks that would logically lie in these surface areas are called defective blocks. Attempts to read or write data in a defective block always fail. The conventional way of dealing with defective blocks is to reserve a small percentage of a disk's block capacity to be substituted for defective blocks when they are identified. Correspondence tables relate the addresses of defective blocks to the addresses of substitute blocks and enable file systems to treat disks as if they were defect-free. Figure 1.13 illustrates such a correspondence table. These tables are sometimes called revectoring tables, and the process of converting a host-specified block number that maps to a defective block into the C-H-S address of a substitute block is called revectoring.

In the early days of disk technology, defective blocks were visible to hosts, and operating system drivers maintained revectoring tables for each disk. Like address conversion, however, revectoring is highly disk type-specific.
Figure 1.13 Defective block substitution.
Consequently, it became apparent to the disk industry that revectoring could be performed most effectively by the disks themselves. Today, most disks revector I/O requests addressed to defective blocks to reserved areas of the media using a correspondence table similar to that illustrated in Figure 1.13. Inside-the-disk bad block revectoring (BBR) allows host drivers as well as file systems to treat disks as if there were no defective blocks. From the host’s point of view, a disk is a consecutively numbered set of blocks. Within the disk, block addresses specified by file systems are regarded as logical. The disk translates them into physical media locations, and in so doing, transparently substitutes for defective blocks as necessary.
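The revectoring mechanism of Figure 1.13 amounts to a table lookup on every media access. The sketch below illustrates the idea only (it is not any vendor's firmware): a small correspondence table maps defective logical blocks to substitute blocks in a reserved area, so the host continues to see a defect-free, consecutively numbered block space.

    # Bad block revectoring sketch: substitute reserved blocks for
    # defective ones, transparently to the host. Illustration only.
    TOTAL_BLOCKS = 1_000_000      # blocks advertised to the host
    RESERVED_BLOCKS = 1_000       # spare blocks held back for substitution

    revector_table: dict[int, int] = {}    # defective block -> substitute block
    next_spare = TOTAL_BLOCKS              # spares live past the advertised range

    def mark_defective(block: int) -> None:
        """Record a newly discovered defective block and assign it a spare."""
        global next_spare
        if next_spare >= TOTAL_BLOCKS + RESERVED_BLOCKS:
            raise RuntimeError("no spare blocks left")
        revector_table[block] = next_spare
        next_spare += 1

    def media_block(host_block: int) -> int:
        """Translate a host-visible block number into the block actually accessed."""
        return revector_table.get(host_block, host_block)

    mark_defective(12345)
    print(media_block(12345))   # rerouted to a reserved block
    print(media_block(12346))   # unaffected blocks map to themselves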
Writing Data on Magnetic Disks

A little-recognized fact about magnetic disks is that there is very little physical feedback to confirm that data has been correctly written. Disk read/write logic and the read/write channel write each block's preamble, header, user data, ECC, and trailer, and disengage. Hardware in the read/write channel verifies signal levels, and head position is validated frequently with servo feedback, but data written to the media is not verified by, for example, immediate rereading, as is done with tape drives. Disks are able to reread and verify data after writing it, but this necessitates an extra disk revolution for every write. This affects performance adversely (single-block write times can increase by 50 percent or more), so the capability is rarely used in practice.

Fortunately, writing data on disks is an extremely reliable operation, enough so that most data processing can be predicated upon it. There is always a minuscule chance, however, that data written by a host will not be readable. Moreover, unreadable data will not be discovered until a read is attempted, by which time it is usually impossible to re-create. The remote possibility of unreadable data is one of several reasons to use failure-tolerant volumes for business-critical online data.
Intelligent Disks

Electronic miniaturization and integration have made it technically and economically feasible to embed an entire disk controller in every disk built today, making the disk, in effect, a complete subsystem, as the block diagram in Figure 1.14 illustrates. A subtle but important property of the architecture illustrated in Figure 1.14 is the abstraction of the disk's external interface. Today, host computers no longer communicate directly with disk read/write channels. Instead, they communicate with the logical interface labeled "Host I/O Bus Interface" in Figure 1.14. I/O requests sent to this logical interface are transformed by a microprocessor within the disk. Among its activities, this processor:

■■ Converts logical block addresses into C-H-S addresses and performs revectoring as necessary.
■■ Breaks down hosts' read and write requests into more primitive seek, search, and read and write channel enable and disable operations.
■■ Manages data transfer through the disk's internal buffers to and from the host.
All of this activity is transparent to hosts, which use simple read and write commands that specify logical block addresses from a dense linear space. Abstract host I/O interfaces allow disks to evolve as component technologies develop without significant implications for their external environment. A disk might use radically different technology from its predecessors, but if it responds to I/O requests, transfers data, and reports errors in the same way, the system support implications of the new disk are very minor, making market introduction easy. This very powerful abstract I/O interface concept is embodied in today's standard I/O interfaces such as SCSI, ATA (EIDE),4 and FCP. Disks that use these interfaces are easily interchangeable. This allows applications to use the increased storage capacity and performance delivered as technology evolves, with minimal support implications.

Figure 1.14 Block diagram of an intelligent disk with embedded controller.
Other Applications of Disk Intelligence: SMART Technology

One innovative use of disk intelligence that has emerged in recent years is disk self-monitoring for predictive failure analysis. With so many millions of samples upon which to base statistical analyses, disk manufacturers have developed significant bodies of knowledge about how certain physical conditions within disk drives indicate impending failures before they occur. In general, these physical conditions are sensed by a disk and are implementation-specific. Such factors as head flying height (distance between read/write head and disk platter), positioning error rates, and media defect rates are useful indicators of possible disk failure. A SCSI standard called Self-Monitoring, Analysis, and Reporting Technology (SMART) provides a uniform mechanism that enables disks to report normalized predictive failure information to a host environment. Disks that use the ATA interface report raw SMART information when polled by their hosts. The hosts then make any predictive failure decisions. Large system disks use built-in intelligence to analyze SMART information themselves and only report danger signals to their hosts when analysis indicates that a failure might be imminent.

Hosts that support SMART alert system administrators when a disk is in danger of failing. The system administrator can then take action to protect data on the failing disk—for example, by scheduling an immediate backup. Though SMART improves the reliability of data stored on disks, the technology is not without its limitations. It is chiefly useful to predict failures that are characterized by gradual deterioration of some measurable disk parameter. SMART does not protect against sudden failures, as are typical of logic module failures. Because SMART reports refer to conditions that are deteriorating with time, system administrators must receive and act on them promptly. SMART is thus most useful in environments that are monitored constantly by administrators.
4 ATA = AT (Advanced Technology) Attachment, also known as extended IDE (intelligent drive electronics); FCP = Fibre Channel Protocol.
SMART is indeed a useful technology for improving the reliability of data stored on disks, but it is most effective in protecting data against disk failures when used in conjunction with volume management techniques described later in Chapter 2.
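Conceptually, the host side of SMART reduces to comparing normalized attribute values against vendor-defined thresholds. The sketch below is purely illustrative: the attribute names, values, and thresholds are invented for the example, and real SMART data is obtained through ATA or SCSI commands rather than supplied as a dictionary. It shows only the kind of monitoring decision a host might make.

    # Illustrative SMART-style check: compare normalized attribute values
    # against thresholds and warn when a disk may be deteriorating.
    # Attribute names, values, and thresholds here are hypothetical.
    def attributes_at_risk(attributes: dict[str, tuple[int, int]]) -> list[str]:
        """Return attributes whose normalized value has fallen to or below
        its threshold; each entry is (normalized_value, threshold)."""
        return [name for name, (value, threshold) in attributes.items()
                if value <= threshold]

    sampled = {
        "head_flying_height":  (95, 25),   # healthy
        "seek_error_rate":     (60, 30),   # healthy
        "reallocated_sectors": (18, 36),   # deteriorating: at or below threshold
    }

    warnings = attributes_at_risk(sampled)
    if warnings:
        # e.g., alert the administrator to schedule an immediate backup
        print("disk may be failing:", warnings)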
Disk Controller and Subsystem Basics

As disks evolved during the 1980s into the self-contained intelligent subsystems represented in Figure 1.14, separate controllers were no longer required for low-level functions such as motion control, data separation, and error recovery. The concept of aggregating disks to improve performance and availability is a powerful one, however, so intelligent disk subsystems with aggregating disk controllers evolved in their place. Figure 1.15 illustrates the essentials of an intelligent disk subsystem with an aggregating controller.

The aggregating disk controller shown in Figure 1.15 has four disk I/O bus interfaces that connect to a memory access bus internal to the controller. The disk I/O buses connect intelligent disks to the controller. The disk controller coordinates I/O to arrays of two or more disks, and makes them appear to host computers over the host I/O bus interface as virtual disks. Aggregating disk controllers can:

■■ Concatenate disks and present their combined capacity as a single large virtual disk.
■■ Stripe or distribute data across disks for improved performance and present the combined capacity as a single large virtual disk.
Figure 1.15 Intelligent disk subsystem with aggregating disk controller.
■■ Mirror identical block contents on two or more disks or striped volumes, and present them as a single failure-tolerant virtual disk.
■■ Combine several disks using Redundant Array of Independent Disks (RAID) techniques to stripe data across the disks with parity check data interspersed, and present the combined available capacity of the disks as a single failure-tolerant virtual disk.
Each of the disk I/O bus interfaces in Figure 1.15 sends I/O requests to, and moves data between, one or more disks and a dynamic random access memory (DRAM) buffer within the aggregating controller. Similarly, a host I/O bus interface in the aggregating controller moves data between the buffer and one or more host computers. A policy processor transforms each host I/O request made to a volume into one or more requests to disks, and sends them to disks via the disk I/O bus interfaces. For example, if two mirrored disks are being presented to host computers as a single failure-tolerant virtual disk, the aggregating controller would:

■■ Choose one of the disks to satisfy each application read request, and issue a read request to it.
■■ Convert each host write request made to the volume into equivalent write requests for each of the mirrored disks.
Similarly, if data were striped across several disks, the aggregating controller's policy processor would:

■■ Interpret each host I/O request addressed to the striped volume to determine which data should be written to or read from which disk(s).
■■ Issue the appropriate disk read or write requests.
■■ Schedule data movement between host and disk I/O bus interfaces.
For RAID arrays, in which data is also typically striped, the aggregating controller’s policy processor would perform these functions and would update parity each time user data was updated.
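The request transformations described above are mostly address arithmetic. The sketch below is our illustration, not any controller's firmware: it fans a host write out to both members of a mirrored pair, and maps a block of a striped virtual disk onto a member disk and an offset within that disk. The member names and the stripe unit size are assumptions made for the example.

    # Sketch of two aggregating-controller policies: mirrored write
    # fan-out and striped address mapping. Illustration only.
    MIRROR_MEMBERS = ["disk0", "disk1"]                 # two-way mirror
    STRIPE_MEMBERS = ["disk0", "disk1", "disk2", "disk3"]
    STRIPE_UNIT_BLOCKS = 128                            # assumed stripe unit (64 KB of 512-byte blocks)

    def mirrored_write(volume_block: int, data: bytes) -> list[tuple[str, int, bytes]]:
        """Convert one host write into an equivalent write to each mirror member."""
        return [(disk, volume_block, data) for disk in MIRROR_MEMBERS]

    def striped_location(volume_block: int) -> tuple[str, int]:
        """Determine which member disk and disk block hold a striped volume block."""
        stripe_unit, offset = divmod(volume_block, STRIPE_UNIT_BLOCKS)
        stripe_row, member = divmod(stripe_unit, len(STRIPE_MEMBERS))
        return STRIPE_MEMBERS[member], stripe_row * STRIPE_UNIT_BLOCKS + offset

    print(mirrored_write(100, b"x" * 512))   # same block written to disk0 and disk1
    print(striped_location(0))               # ('disk0', 0)
    print(striped_location(128))             # ('disk1', 0)
    print(striped_location(512))             # ('disk0', 128) -- second stripe row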
External and Embedded Array Controllers

From a host computer standpoint, a RAID controller is either external or embedded (mounted) within the host computer's housing.

External RAID controllers function as many-to-many bridges between disks and external I/O buses, such as parallel SCSI or Fibre Channel, to which physical disks can also be attached. External RAID controllers organize the disks connected to them into arrays and make their storage capacity available to host computers by emulating disks on the host I/O buses. Figure 1.16 illustrates a system configuration that includes an external RAID controller.

External RAID controllers are attractive because they emulate disks and, therefore, require little specialized driver work. They are housed in separate packages whose power, cooling, and error-handling capabilities are optimized for disks. They typically accommodate more storage capacity per bus address or host port than the embedded controllers discussed next. Moreover, since they are optimized for larger systems, they tend to include advanced performance-enhancing features such as massive cache, multiple host ports, and specialized hardware engines for performing RAID computations. The main drawbacks of external RAID controllers are their limited downward scaling and relatively high cost.

Embedded, or internal, RAID controllers normally mount within their host computer enclosures and attach to their hosts using internal I/O buses, such as PCI (Peripheral Component Interconnect). Like external RAID controllers, embedded controllers organize disks into arrays and present virtual disks to the host environment. Since there is no accepted standard for a direct disk-to-PCI bus interface, embedded controllers require specialized drivers that are necessarily vendor-unique. Figure 1.17 illustrates a system configuration that includes an embedded RAID controller.

Embedded RAID controllers are particularly attractive for smaller servers because of their low cost and minimal packaging requirements. An embedded RAID controller is typically a single extended PCI module. Some vendors design server enclosures that are prewired for connecting a limited number of disks mounted in the server enclosure itself to an embedded RAID controller. The disadvantages of embedded RAID controllers are their limited scaling and failure tolerance and their requirement for specialty driver software.
Figure 1.16 External RAID controller.
Figure 1.17 Embedded RAID controller.
The virtual disks presented by external RAID controllers are functionally identical to physical disks and are usually controlled by native operating system disk drivers with little or no modification. Embedded RAID controllers, on the other hand, typically require unique bus interface protocols and, therefore, specialized drivers, typically supplied by the controller vendor. Both external and embedded RAID controllers' virtual disks can be managed by host-based volume managers as though they were physical disks.

Both external and embedded RAID controllers require management interfaces to create and manage the virtual disks they present. Embedded RAID controllers typically have in-band management interfaces, meaning that management commands are communicated to the controller over the same PCI interface used for I/O. External controllers typically offer both in-band management interfaces using SCSI or FCP commands and out-of-band interfaces using Ethernet or even serial ports. Out-of-band interfaces enable remote management from network management stations and preconfiguration of disk array subsystems before they are installed.
Host-Based Aggregation The architecture of the aggregating disk controller block diagrammed in Figure 1.15 is very similar to that of a general-purpose computer. In fact, most disk controllers use conventional microprocessors as policy processors, and several use other conventional computer components as well, such as PCI bus controller application-specific integrated circuits (ASICs), such as the single-chip PCI interfaces found on most computer mainboards. As processors became more powerful during the 1990s, processing became an abundant resource, and several software developers implemented the equivalent of
Disk Storage Basics
23
aggregating disk controllers' function in a host computer system software component that has come to be known as a volume manager. Figure 1.18 depicts a system I/O architecture that uses a host-based volume manager to aggregate disks.
Figure 1.18 Host-based disk subsystem with aggregating software.
The figure represents a PCI-based server, such as might run the Windows NT or Windows 2000 operating system. In such servers, disk I/O interfaces are commonly known as host bus adapters, or HBAs, because they adapt the protocol, data format, and timing of the PCI bus to those of an external disk I/O bus, such as SCSI or Fibre Channel. Host bus adapters are typically designed as add-in circuit modules that plug into PCI slots on a server mainboard or, in larger servers, into a PCI-to-memory bus adapter. Small server mainboards often include integrated host bus adapters that are functionally identical to the add-in modules. Host bus adapters are controlled by operating system software components called drivers. Windows operating systems include drivers for the more popular host bus adapters, such as those from Adaptec, QLogic, LSI Logic, and others; in other cases, the vendor of the server or host bus adapter supplies a Windows-compatible HBA driver. Microsoft's Web site contains a hardware compatibility list with information about host bus adapters that have been successfully tested with each of the Windows operating systems. HBA drivers are pass-through software elements, in the sense that they have no awareness of the meaning of the I/O requests made by file systems or other applications. An HBA driver passes each request made to it to the HBA for transmission to and execution by the target disk. For data movement efficiency, HBA drivers manage mapping registers that enable data to move
directly between the HBA and main memory. HBA drivers do not filter I/O requests, nor do they aggregate disks into volumes (although there are some PCI-based RAID controllers that perform disk aggregation). In systems like the one depicted in Figure 1.18, the volume manager aggregates disks into logical volumes that are functionally equivalent to the virtual disks instantiated by aggregating disk controllers. The volume manager is a software layer interposed between the file system and HBA drivers. From the file system's point of view, a volume manager behaves like a disk driver. The volume manager responds to I/O requests to read and write blocks of data and to control the (virtual) device by transforming each of these requests into one or more requests to disks that it makes through one or more HBAs. The volume manager is functionally equivalent to the aggregating disk controller depicted in Figure 1.15. Like aggregating disk controllers, host-based volume managers can:
■■ Concatenate two or more disks into a single large volume.
■■ Stripe data across two or more disks for improved performance.
■■ Mirror data on two or more disks or striped volumes for availability.
■■ Combine several disks into a RAID volume.
Chapters 3 and 4 describe the capacity, performance, and availability characteristics of these popular volume types.
CHAPTER 2
Volumes
The Volume Concept
A volume is an abstract online storage unit instantiated by a system software component called a volume manager. To file systems, database management systems, and applications that do raw I/O, a volume appears to be a disk, in the sense that:
■■ It has a fixed amount of non-volatile storage.1
■■ Its storage capacity is organized as consecutively numbered 512-byte blocks.
■■ A host can read or write any sequence of consecutively numbered blocks with a single request.
■■ The smallest unit of data that can be read or written is one 512-byte block.2
1 Since host-based volumes can typically be expanded while they are in use, this is not strictly true. File systems (like the Windows 2000 NTFS file system) and applications that are volume-aware can deal properly with online volume expansion. Other software deals with volumes as if they were fixed-capacity disks. The Windows NT Version 4 NTFS file system can accommodate volumes that expand, but requires a system reboot to use the additional capacity.
2 If an application writes less than 512 bytes, the bytes written by the application are stored in the lowest numbered bytes of the block. The contents of the remaining bytes of the block (those not written by the application) are unpredictable after the write is complete. As a practical matter, file systems and other applications that do I/O directly to volumes almost always make requests that specify data transfer in multiples of 512 bytes.
Actually, because host-based volumes can typically be expanded while they are in use, they are not strictly fixed in size. File systems, such as the Windows 2000 NTFS file system, and applications that are volume-aware can deal properly with online volume expansion. Other software, such as database management systems, deals with such expandable volumes as if they were fixed-capacity disks. As Figure 2.1 illustrates, the storage capacity represented to applications as a volume may consist of:
■■ Part of a single disk's capacity (Volume V in Figure 2.1)
■■ All of a single disk's capacity (Volume W in Figure 2.1)
■■ Parts of the capacity of multiple disks (Volume X in Figure 2.1)
■■ All of the capacity of multiple disks (Volume Y in Figure 2.1)
In failure-tolerant (mirrored and RAID) volumes, not all of the physical capacity is made available for user data storage. In a two-mirror volume, for example, half of the physical capacity is used to store a second copy of the user data.
NOTE: The nomenclature "n-mirror volume" is used throughout this book to denote a volume with n copies of the same data on separate physical or virtual disks.
Figure 2.1 Volume configurations.
A volume is a representation of disklike behavior made to file systems by a volume manager in the form of responses to application read and write requests. If a volume manager responds to I/O requests as a disk would, then file systems need not be aware that the “disk” on which they are storing data is not “real.” This simple concept has been an important factor in the success of both subsystem-based disk arrays and host-based volumes because no file system or application changes are required to reap the benefits of these virtual storage devices. Any file system, database management system, or other application that stores its data on raw disks can just as easily use a volume without having to be modified.
Virtualization in Volumes
Volumes virtualize storage. The translation from volume block number to location(s) on a physical disk is completely arbitrary. All that matters is that the volume manager be able to determine the disk(s) and block number(s) that correspond to any volume block number, and conversely, the volume block number that corresponds to any given block number on any disk. Translating between volume block numbers and data locations on one or more disks is called mapping. As described in Chapter 1, disk designers use a similar virtualization technique to mask unrecordable (defective) disk blocks from host computers. The host computer addresses its requests to an apparently defect-free set of logical blocks; the disk's firmware refers to its revectoring table and accesses alternate blocks as necessary to bypass unrecordable media areas. Similarly, a volume manager can extend capacity by concatenating the storage of several disks into a single volume block address space, or improve I/O performance by striping volume block addresses across several disks in a regular repeating pattern.
Why Volumes?
System administrators find that there are major advantages to managing their online storage as volumes, as opposed to managing individual disks. The reasons for this lie in the three fundamental values of online storage:
■■ A volume can aggregate the capacity of several disks into a single storage unit so that there are fewer storage units to manage, or so that files larger than the largest available disk can be accommodated.
■■ A volume can aggregate the I/O performance of several disks. This allows large files to be transferred faster than would be possible with the fastest available disk. In some circumstances, it also enables more I/O transactions per second to be executed than would be possible with the fastest available disk.
■■ A volume can improve data availability through mirroring or RAID techniques that tolerate disk failures. Failure-tolerant volumes can remain fully functional when one or more of the disks that comprise them fail.
More complex volumes can be configured to provide combinations of these benefits.
The Anatomy of Windows Disk Volumes
Volume managers for Windows operating systems support all of the aforementioned types of volumes. They use a common architecture to build up volumes of all types. The first step in configuring volumes is the logical subdivision of each disk's capacity into ranges of consecutively numbered blocks called subdisks. Figure 2.2 illustrates the concept of subdisks.
Mapping and Failure Protection: Plexes
The volume manager next organizes sets of subdisks on one or more disks into ordered groups, called plexes, for data block address mapping and failure protection.
Figure 2.2 Subdisks.
Plexes are internal volume manager objects that are invisible to users and applications. Windows system administrators are required to define two plex parameters, stripe unit size and failure tolerance type, when volumes are created. For other administrative purposes plexes can usually be ignored. Figure 2.3 illustrates a plex consisting of three subdisks on separate disks. The properties of this plex are:
■■ RAID failure tolerance type. SubDisk C contains parity that protects against user data loss if either Disk A or Disk B should fail. RAID data protection is discussed in Chapter 4.
■■ Stripe unit size of four blocks. The system administrator can set the stripe unit size when the volume is defined, to bias performance in favor of I/O request-intensive or data transfer-intensive I/O loads.
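Parity of this kind is computed as the bitwise exclusive OR (XOR) of the corresponding data blocks, a point developed in Chapter 4. The sketch below is purely illustrative (it is not Windows volume manager code, and the function names are invented for this example); it shows the principle for a plex like that of Figure 2.3, in which SubDisk C holds the XOR of the corresponding blocks of SubDisks A and B.

```
# Illustrative only: XOR parity for a plex like the one in Figure 2.3.
# SubDisks A and B hold user data; SubDisk C holds parity.

def parity_block(block_a: bytes, block_b: bytes) -> bytes:
    """Parity is the bytewise XOR of the corresponding data blocks."""
    return bytes(a ^ b for a, b in zip(block_a, block_b))

def rebuild_block(surviving: bytes, parity: bytes) -> bytes:
    """XOR of the surviving data block and the parity block regenerates
    the block that was lost when the other data disk failed."""
    return bytes(s ^ p for s, p in zip(surviving, parity))

# Two 512-byte blocks with arbitrary contents.
block_a = bytes(range(256)) * 2
block_b = bytes(reversed(block_a))

block_c = parity_block(block_a, block_b)           # stored on SubDisk C
assert rebuild_block(block_b, block_c) == block_a  # as if Disk A had failed
assert rebuild_block(block_a, block_c) == block_b  # as if Disk B had failed
```

Because XOR is its own inverse, the same operation that generates the parity also regenerates a lost data block from the surviving data and the parity.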
The plex concept enables volume managers to create complex volumes; for example, by mirroring data across two or more striped plexes to create mirrored striped volumes; these are discussed in Chapter 4. Using plexes as building blocks, Windows volume managers can create any of five types of volumes:
Simple. In a simple volume all of the blocks reside on a single disk.
Spanned (concatenated). In a spanned volume the blocks of two or more subdisks are concatenated and presented as a single large volume.
Striped. In a striped volume address sequences of data blocks are distributed across two or more subdisks on separate disks for improved I/O performance.
Figure 2.3 A RAID plex.
Mirrored. In a mirrored volume identical data is written to two or more subdisks or striped plexes for improved availability.
RAID. In a RAID volume several subdisks are organized with data and parity striped across them for single-disk failure protection.
Simple, spanned, and striped volumes are not failure-tolerant; they are used to make storage capacity management more flexible and to improve I/O performance. Mirrored and RAID volumes are failure-tolerant; they are able to survive disk failures and provide continuous data access services. Chapters 3 and 4 describe the capacity, performance, and failure tolerance characteristics of these five types of volumes in more detail.
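The building-block relationships described in this chapter (disks subdivided into subdisks, subdisks grouped into plexes with a stripe unit size and failure tolerance type, and plexes combined into volumes) can be summarized as a small data model. The sketch below is an illustration only; the class and field names are assumptions made for this example and are not the Windows volume managers' actual data structures.

```
# Illustrative data model only; not the actual structures used by
# Windows volume managers.
from dataclasses import dataclass, field
from typing import List

VOLUME_TYPES = ("simple", "spanned", "striped", "mirrored", "raid")

@dataclass
class SubDisk:
    disk: str          # the disk containing this subdisk
    start_block: int   # first disk block of the subdisk
    block_count: int   # number of consecutive blocks it contributes

@dataclass
class Plex:
    subdisks: List[SubDisk]          # the ordered columns of the plex
    stripe_unit_size: int = 1        # blocks mapped consecutively per column
    failure_tolerance: str = "none"  # e.g., "none" or "raid" (parity)

@dataclass
class Volume:
    volume_type: str                 # one of VOLUME_TYPES
    plexes: List[Plex] = field(default_factory=list)  # >1 plex implies mirroring

# A striped volume built from one two-column plex with a four-block stripe unit.
plex = Plex(
    subdisks=[SubDisk("Disk A", 0, 100), SubDisk("Disk B", 0, 100)],
    stripe_unit_size=4,
)
striped_volume = Volume(volume_type="striped", plexes=[plex])
```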
CHAPTER 3
Volumes That Are Not Failure Tolerant
Simple Volumes
A simple volume consists of a single subdisk. The subdisk may consist of some or all of the available blocks on a single disk. Simple volumes are functionally equivalent to the partitions found in legacy Windows operating system environments. Figure 3.1 illustrates a simple volume consisting of a subset of the blocks on a disk. Simple volumes are a more flexible way to manage online storage than direct management of disks. Today, disks designed for server environments can have upwards of 100 gigabytes of capacity. For financial, security, or other administrative reasons, it is often desirable to subdivide this capacity into smaller simple volumes. Subdivision of capacity becomes even more important when the disk being managed is actually a disk array presented by an intelligent storage controller. A RAID array of eight 38-gigabyte disks, for example, would be presented as a 266-gigabyte virtual disk (7 data disks × approximately 38 gigabytes per disk). The volume manager can subdivide this failure-tolerant virtual disk into smaller, more manageable units of online storage. Figure 3.2 illustrates this usage. In this figure, the RAID controller organizes its eight 38-gigabyte disks as a single RAID array of 266 gigabytes (520,224,768 512-byte blocks).1 If this is too large for application convenience, the host-based volume manager can subdivide the virtual disk's block space into smaller subdisks, from which it configures simple volumes for presentation to file systems and applications.
1 Disk and volume blocks are both numbered starting at zero, so the number of blocks in the array (520,224,768 in this example) is greater than the largest block number shown in Figure 3.2 (520,224,767).
Figure 3.1 A simple volume.
In this example, the host-based volume manager subdivides the 266-gigabyte array into three equally sized volumes of approximately 88.7 gigabytes (173,408,256 blocks) each. This example illustrates one form of synergy between host-based volumes and controller-based RAID arrays. The RAID controller provides a large amount of failure-tolerant storage capacity, and the volume manager subdivides it for management convenience. Simple volumes also make online storage management easier because their capacity can easily be expanded if an application requires more storage. Figure 3.3 illustrates a disk containing two simple volumes and some additional unallocated storage capacity. The additional capacity can be used to expand the capacity of either volume, as the business requires. The ability to expand simple volumes allows organizations to defer storage capacity allocation decisions until capacity is actually required.
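The capacity figures in this example follow from simple block arithmetic. The short calculation below reproduces them, using the per-disk capacity given in Figure 3.2 (38,050,725,888 bytes, or 74,317,824 blocks); the variable names are illustrative, not part of any product.

```
# Capacity arithmetic for the example above (illustrative only).
BLOCK_SIZE = 512                  # bytes per block
blocks_per_disk = 74_317_824      # a "38-gigabyte" disk: 38,050,725,888 bytes
data_disks = 7                    # 8-disk RAID array; one disk's worth holds parity

array_blocks = data_disks * blocks_per_disk
print(array_blocks)               # 520,224,768 blocks (about 266 gigabytes)

volumes = 3
blocks_per_volume = array_blocks // volumes
print(blocks_per_volume)          # 173,408,256 blocks per volume
print(blocks_per_volume * BLOCK_SIZE)  # on the order of 88.7 gigabytes each
```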
Figure 3.2 Using a volume manager to subdivide a large virtual disk.
Figure 3.3 Simple volumes enable flexible deployment of storage capacity.
Again, this dynamic expansion capability is especially important with the very large virtual disks presented by enterprise RAID controllers. Windows volume managers increase the capacity of simple volumes by converting them to volumes whose volume block spaces span the original subdisk and the subdisk being added to it. If the original and new subdisks are contiguous block ranges on the same disk, the volume manager will combine them into a single large subdisk. For example, if blocks 200–299 of the disk in Figure 3.3 were added to Volume W, the volume manager would enlarge SubDisk B. If the original and new subdisks are on different disks, or if they are on the same disk but are not contiguous block ranges, the volume manager creates data structures to describe a spanned volume. For example, if blocks 200–299 of the disk in Figure 3.3 were added to Volume V, the result would be a spanned volume with 200 blocks of storage capacity.
Spanned Volumes
Windows volume managers form spanned, or concatenated, volumes by logically concatenating the blocks of two or more subdisks into a single address space, which they present to file systems and applications. Figure 3.4 illustrates a 300-block spanned volume consisting of three subdisks, each with 100 blocks of storage capacity. The volume manager maps the first hundred volume blocks to SubDisk A, the second hundred to SubDisk B, and the third hundred to SubDisk C.
Figure 3.4 A spanned volume.
In the volume block address space, the last block of SubDisk A (block 99) immediately precedes the first block of SubDisk B. Similarly, the last block of SubDisk B precedes the first block of SubDisk C. File systems would perceive this volume as a single 300-block disk. For example, assume that SubDisk C occupies the lowest-numbered blocks on its disk. A file system request to read block 205 would cause the volume manager to determine from its internal metadata that volume block 205 is block 5 on SubDisk C and issue a read request to Disk C' specifying its block 5. The data would be delivered to the file system as though it had come from block 205 of a 300-block disk (a short sketch of this translation appears after the list below). Obviously, creating a spanned volume is a quick way to add storage to a volume, with minimal effect on its characteristics. The spanned volume continues to be accessed and managed as though it were a large single disk. The NTFS file system in Windows 2000 can use additional volume capacity immediately. The Windows NT Version 4 operating system requires that a volume be unmounted and remounted before its NTFS file system can utilize additional volume capacity. FAT file systems cannot be expanded in place to use larger volumes. To expand a volume containing a FAT file system:
■■ The data in the file system is backed up on external media.
■■ The original file system is deleted.
■■ The volume is expanded.
■■ A new, larger FAT file system is formatted on it.
■■ The backed-up data is restored to the new, larger file system.
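Returning to the block-205 example above, address translation for a spanned (concatenated) volume amounts to walking the ordered list of subdisks until the requested volume block falls within one of them. The sketch below is a minimal illustration of that arithmetic, not actual volume manager code; it assumes, as the text does, that each subdisk occupies the lowest-numbered blocks of its disk.

```
# Minimal sketch of spanned (concatenated) volume address translation.
# Assumes each subdisk starts at block 0 of its disk, as in Figure 3.4.

SUBDISKS = [            # (disk name, blocks contributed), in volume-block order
    ("Disk A'", 100),
    ("Disk B'", 100),
    ("Disk C'", 100),
]

def spanned_map(volume_block: int):
    """Return (disk, disk block) for a volume block number."""
    remaining = volume_block
    for disk, length in SUBDISKS:
        if remaining < length:
            return disk, remaining
        remaining -= length
    raise ValueError("volume block beyond end of volume")

print(spanned_map(205))   # ("Disk C'", 5), as in the example above
print(spanned_map(99))    # ("Disk A'", 99): last block of SubDisk A
print(spanned_map(100))   # ("Disk B'", 0): first block of SubDisk B
```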
Spanned Volumes and Failure Tolerance
A storage system's failure tolerance is its resilience, or ability to continue to function if one of its components fails. For example, disks are resilient to media defects and transient data errors, but not to other failures. A spanned volume is less failure-tolerant than the same number of separately presented disks. If a disk containing a simple volume fails, the data on it becomes inaccessible, but accessibility of data on other volumes is unaffected. If a disk containing a subdisk that is part of a spanned volume fails, the entire volume becomes inaccessible. Files or file system metadata might be stored partly on a surviving disk and partly on the failed one, making partial data resiliency impossible to guarantee. For example, if a directory occupies blocks 90–110 of the spanned volume in Figure 3.4, a failure of either Disk A' or Disk B' (the disks containing SubDisks A and B, respectively) would make the directory partly inaccessible. Because the directory is partially inaccessible, it may not be possible to locate any of the files in the directory, even if they are located on a surviving disk. For this reason, volume managers do not generally support partial data recovery when one disk of a spanned volume fails.
Spanned Volumes and I/O Performance
Most files stored on spanned volumes occupy space on only one of the volume's disks. In fact, the maximum number of files that can occupy space on more than one disk is one fewer than the number of disks comprising the volume. Thus, when a file on a spanned volume is accessed, one disk usually executes all of the read and write requests. The expected I/O performance for individual file access on spanned volumes is therefore about the same as for disks. When multiple files on a spanned volume are accessed simultaneously, some parallelism is possible. If the files being accessed reside on different disks, multiple I/O requests can execute at the same time. In Figure 3.5, for example, if files F.DAT, G.DAT, and H.DAT were accessed at the same time, the accesses could all be simultaneous because the files are on different disks. Accesses to file J.DAT, however, would necessarily be interleaved with accesses to file F.DAT, because the disk containing SubDisk A can only perform one read or write at a time.
Figure 3.5 Simultaneous file access on a spanned volume.
For some multifile I/O workloads, spanned volumes might outperform a single disk. Unfortunately, there is no practical way to locate specific files on specific disks of a spanned volume, so simultaneous access performance is somewhat unpredictable.
Applications for Simple and Spanned Volumes
Simple and spanned volumes offer neither the performance advantage of striped volumes, nor the availability advantage of mirrored or RAID volumes. This does not mean that there are no applications for them. For temporary data or other data that is easily reproduced in the event of loss, it may not make economic sense to incur the cost of extra storage and I/O components to achieve failure tolerance. Often, storing such data on simple or spanned volumes is the most cost-effective solution. If an organization's easily reproducible data (data for which highly available storage is not justified) is likely to grow unpredictably, spanned volumes may be the best online storage choice. Like simple volumes, spanned volumes are more convenient to manage than physical disks, principally because they can
be expanded by the addition of storage capacity (additional subdisks, for example). One reasonable system management policy might be to store easily replaceable data objects for which failure tolerance is not cost-justified on simple or spanned volumes (depending on the disks available and the amount of such data to be stored). This would allow expansion as business requirements dictated without significant management changes and without the cost overhead of failure tolerance.
Striped Volumes
One popular volume data mapping is the striping of data across several subdisks, each located on a different disk. Though striped volumes provide no protection against failures, they do improve I/O performance for most I/O-bound applications. In a striped volume, volume blocks are mapped to subdisk blocks in a repeating rectangular pattern. Figure 3.6 illustrates a striped volume in which the volume manager maps:
■■ Volume Blocks 0–3 to SubDisk A.
■■ Volume Blocks 4–7 to SubDisk B.
■■ Volume Blocks 8–11 to SubDisk C.
■■ Volume Blocks 12–15 to SubDisk A.
■■ Volume Blocks 16–19 to SubDisk B.
■■ Volume Blocks 20–23 to SubDisk C.
■■ Volume Blocks 24–27 to SubDisk A, and so forth.
Figure 3.6 A striped volume.
In this example, groups of four volume blocks are assigned to groups of four blocks on successive disks in a repeating pattern. In Figure 3.6, each set of corresponding four-block groups across the subdisks (for example, Volume Blocks 000–011 or Volume Blocks 012–023) is called a stripe or a row. (Analogously, a subdisk that is part of a striped volume is called a column.) The number of consecutive volume blocks mapped to consecutive subdisk blocks (four in this case) is called a stripe unit. Stripe unit size is constant for a striped volume. For simplicity, this example uses an unrealistically small stripe unit size. In practice, typical stripe unit sizes are between 50 and 200 blocks. The stripe unit size multiplied by the number of columns in a striped volume is the volume's stripe size. The stripe size of the volume depicted in Figure 3.6 is 12 blocks (four blocks per subdisk times three subdisks). With this regular mapping, the volume manager can easily translate any volume block number to a physical block location. The first step is to divide the volume block number by the stripe size (12 in the example of Figure 3.6). The quotient of this division is the row number in which the block is located, with zero being the topmost row. The remainder is the block's relative location within the row, with zero representing the first block in the row. The next step is to divide this remainder by the stripe unit size. The quotient of this division represents the subdisk on which the block is located (0 = SubDisk A, 1 = SubDisk B, and 2 = SubDisk C). The remainder is the relative block number within the stripe unit located on the target subdisk. Figure 3.7 uses an application read request for volume block 18 to illustrate this algorithm. The first step in determining the location of volume block 18 is to compute:
Row number = quotient[18 / 12] = 1
Block number within row = remainder[18 / 12] = 6
The next step is to compute:
Subdisk = quotient[6 / 4] = 1 (i.e., SubDisk B)
Block within part of row on SubDisk B = remainder[6 / 4] = 2
Thus, the volume manager must read from block 2 in stripe 1 on SubDisk B to satisfy this request. The relative block number within the subdisk is computed as:
Subdisk block = row number × stripe unit size + block within stripe unit = 1 × 4 + 2 = 6
Figure 3.7 Locating data in a striped volume.
Finally, an I/O request to a disk, such as a SCSI Command Data Block (CDB), requires a logical disk block address. The volume manager must therefore compute:
Logical block address (LBA) = subdisk starting offset + subdisk block = 0 + 6 = 6
assuming that SubDisk B in Figure 3.7 occupies the lowest numbered disk blocks. Thus, an application request to read or write volume block 18 from the volume in this example would translate into a request to block 6 of Disk B'. Any volume block address can be translated to its corresponding disk address using this algorithm.
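The quotient-and-remainder procedure just described can be written down compactly. The sketch below illustrates the algorithm under the assumptions of Figures 3.6 and 3.7 (three columns, a four-block stripe unit, and subdisks that begin at block 0 of their disks); it is an illustration, not actual Windows volume manager code.

```
# Striped volume address translation, following the steps in the text.
STRIPE_UNIT = 4                            # blocks per stripe unit
COLUMNS = ["Disk A'", "Disk B'", "Disk C'"]
STRIPE_SIZE = STRIPE_UNIT * len(COLUMNS)   # 12 blocks per row (stripe)

def striped_map(volume_block: int):
    """Return (disk, disk block) for a volume block on the striped volume."""
    row, block_in_row = divmod(volume_block, STRIPE_SIZE)
    column, block_in_unit = divmod(block_in_row, STRIPE_UNIT)
    subdisk_block = row * STRIPE_UNIT + block_in_unit
    subdisk_start = 0                      # assumes the subdisk begins at disk block 0
    return COLUMNS[column], subdisk_start + subdisk_block

print(striped_map(18))                     # ("Disk B'", 6), as computed above

# The reverse translation, which a volume manager must also be able to perform:
def volume_block_of(column_index: int, subdisk_block: int) -> int:
    row, block_in_unit = divmod(subdisk_block, STRIPE_UNIT)
    return row * STRIPE_SIZE + column_index * STRIPE_UNIT + block_in_unit

assert volume_block_of(1, 6) == 18         # Disk B', block 6 is volume block 18
```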
Striped Volumes and Failure Tolerance
Like a spanned volume, a striped volume is less resilient to disk failures than an equivalent number of separately presented disks. If a disk containing a subdisk that is part of a striped volume fails, the entire volume becomes inaccessible. With a striped volume, there is a high probability that files and file system data structures will be located partly on one or more surviving disks and partly on the failed one. As with a spanned volume, the number of potential partial data loss scenarios makes it impossible to recover data objects on a striped volume when one of the disks containing it fails.
Although striped volumes are often referred to as RAID Level 0 or RAID 0, they are not RAID arrays, because they include no redundant check data to protect against data loss due to disk failure. All of a striped volume’s subdisks’ blocks correspond to volume blocks and are available for storing user data. Striped volumes are also (more properly) called striped arrays, or stripe sets.
Striped Volumes and I/O Performance
Striped volumes are used primarily to enhance I/O performance in two broad classes of I/O-bound applications, that is, applications whose performance is determined primarily by the speed with which their I/O requests complete rather than the speed with which they can process data:
I/O request-intensive. These applications typically perform some type of transaction processing, often using relational databases to manage their data. Their I/O requests tend to specify relatively small amounts of randomly addressed data. These applications become I/O-bound because they consist of many concurrent execution threads, thus most of their I/O requests are made without waiting for previous requests to complete. In general, I/O request-intensive applications have several I/O requests outstanding at any instant.
Data transfer-intensive. These applications move long sequential streams of data between application memory and storage. Scientific, engineering, graphics, and multimedia applications typically have this I/O profile. I/O requests made by these applications typically specify large amounts of data and are sometimes issued without waiting for previous requests to complete to minimize processor and I/O bus idle time. Issuing requests in this way is sometimes called double buffering, reading ahead, or writing behind.
Striped volumes generally improve the performance of I/O request-intensive applications because individual files are highly likely to be spread across disks, thereby increasing the probability of simultaneous disk accesses. Figure 3.8 illustrates simultaneous disk accesses with a striped volume. Of course, for simultaneous disk accesses to occur, the application must have more than one request outstanding at a time. For single-threaded or non-I/O-bound applications that tend to have one or fewer I/O requests outstanding, striped volume performance is about the same as disk performance, because only one of the volume's disks will be in use at a time. Striped volumes can also improve I/O performance in systems where a single volume is shared by several concurrently executing applications. The aggregate I/O load of several single-stream applications tends to be similar to the load imposed by a single I/O request-intensive application.
Figure 3.8 I/O Request-intensive I/O load on a striped volume.
Performance improves due to the load balancing induced by data striping, which tends to maximize the utilization of disk resources. Striped volumes also tend to improve I/O performance for data transfer-intensive applications, provided that the applications generate enough I/O to keep the volume's disk resources busy. Figure 3.9 illustrates this. Ideal performance is achieved when applications request data in stripe-aligned multiples of the stripe size. This allows the volume manager to use all of the volume's disk resources simultaneously. Some volume managers achieve further efficiency by consolidating requests for adjacently located data, avoiding the wait for disk revolutions between one data transfer and the next.
Applications for Striped Volumes
As with spanned volumes, the data loss consequences of disk failure in a striped volume are more severe than with an equivalent number of individual disks. When any disk in a striped volume fails, all data in the volume becomes inaccessible. Striped volumes are nonetheless appropriate for performance-sensitive applications that use temporary data extensively or that operate on data that can be easily reconstructed. Temporary data might include intermediate computational results (compiler or linker temporary files, for example), low-grade read-only or "read-mostly" data (such as catalogs or price lists), or results of easily reproducible experiments.
Figure 3.9 Data transfer-intensive I/O load on a striped volume.
For any such data, it may not make economic sense to incur the cost of extra storage and I/O components to achieve failure tolerance at acceptable performance levels. Nonfailure-tolerant volumes may be the most cost-effective solution. If I/O performance is important, striped volumes are definitely a more appropriate solution than spanned volumes. Like spanned volumes, striped volumes are often more convenient to manage than physical disks. A Windows striped volume can be expanded by adding another subdisk to each of the volume's columns. One reasonable storage management practice, therefore, might be to locate performance-sensitive but easily replaceable data objects on striped volumes. This would allow for expansion by the addition of storage capacity as requirements dictate. When storage capacity is added to a striped volume, a subdisk must be added to each of the volume's columns. This is slightly less flexible than the spanned volume case, where a single subdisk can be added to the volume. The improved performance of data striping comes with a constraint on capacity expansion flexibility. Host-based striped volumes are also useful for aggregating the capacity and performance of failure-tolerant virtual disks presented by RAID controllers. Figure 3.10 illustrates this application of volume manager striping. In some instances, it may be beneficial to aggregate the failure-tolerant virtual disks presented by RAID controllers into larger volumes for application use. The large, high-performance, failure-tolerant volume presented to the file system by the host-based volume manager in Figure 3.10 is a single range of blocks that is easier to manage.
Figure 3.10 Volume manager striping of controller-based virtual disks.
It will often outperform two equivalent failure-tolerant virtual disks. Host-based volume-manager mirroring can be used similarly to augment RAID controller capabilities.
Why Striped Volumes Are Effective
Striping data across several subdisks located on different disks tends to improve I/O performance for both I/O request-intensive and data transfer-intensive applications, but for different reasons. The following sections describe how striped volumes enhance I/O performance for these two classes of I/O-bound applications.
Striped Volumes and I/O Request-Intensive Applications
The performance of I/O request-intensive applications is most often limited by how fast a disk can execute I/O requests. Today, a typical disk takes about 10 milliseconds to seek, rotate, and transfer data for a single small read or write request (around 4 kilobytes). The upper limit on the number of randomly addressed small requests such a disk can execute is therefore about 100 per second:
1000 milliseconds/second ÷ 10 milliseconds/request = 100 requests/second
Many server applications require substantially more than this to reach their performance potential. If one disk cannot satisfy an application's needs, an obvious solution is to use two or more disks.
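As a rough model only (it assumes small, randomly addressed requests that remain evenly balanced across the disks and ignores controller and bus overhead), the aggregate request rate of a multidisk volume scales with the number of disks. The arithmetic below is illustrative, not a benchmark; the function and parameter names are invented for this sketch.

```
# Rough, illustrative model of small-request throughput (not a benchmark).
ms_per_request = 10                                     # seek + rotation + transfer, ~4 KB
requests_per_second_per_disk = 1000 / ms_per_request    # = 100, as in the text

def volume_request_rate(disks: int, balance: float = 1.0) -> float:
    """Upper bound on requests/second for a volume of `disks` disks.
    `balance` (0..1) discounts imperfect load balancing."""
    return disks * requests_per_second_per_disk * balance

print(volume_request_rate(1))   # 100 requests/second
print(volume_request_rate(4))   # 400 requests/second, if the load stays balanced
```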
In principle, it is possible to split an application's data into two or more files and store them on separate disks, thereby doubling the available I/O request execution capacity. In practice, however, this solution has two shortcomings:
■■ It is awkward to implement, inflexible to use, and difficult to maintain.
■■ Application I/O requests are seldom divided equally among files; therefore, I/O load does not split evenly across the disks.
So, while data is sometimes divided into multiple files and distributed across several disks for administrative purposes, the technique is of limited value for balancing I/O load. Using striped volumes, on the other hand, tends to balance application I/O requests evenly across disk resources no matter how the I/O load is distributed in the volume block address space. To an application using a striped volume, the entire volume appears as one large disk. With volumes, there is no need to artificially split data into multiple files that will fit on individual disks. As the volume manager lays files out on the volume's disks, however, each file is naturally distributed across those disks, as Figure 3.11 illustrates. Figure 3.11 shows a 10-record3 file stored on a volume striped across three subdisks with a stripe unit size of four blocks. From the standpoint of applications (and indeed, from the standpoint of all operating system and file system components except the volume manager itself), the file's records are stored in consecutive blocks starting with Volume Block 5.
3 To simplify the diagrams, this and following examples make the assumption that each record of a file fits exactly into one disk block.
Figure 3.11 Effect of data striping on file location.
Physically, however, the file's records are distributed across the volume's three subdisks. If an application were to read and write this file randomly, with a uniform distribution of record numbers, accesses would be distributed across the disks as follows (the short sketch following this list reproduces these percentages):
■■ Approximately 30 percent of the requests would be directed to Disk A'.
■■ Approximately 30 percent of the requests would be directed to Disk B'.
■■ Approximately 40 percent of the requests would be directed to Disk C'.
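These percentages can be reproduced directly from the striped mapping. The sketch below assumes, as Figure 3.11 does, a ten-record file occupying volume blocks 5 through 14 of a three-column volume with a four-block stripe unit; it is illustrative code, not part of any volume manager.

```
# Reproduce the request distribution for the file of Figure 3.11.
from collections import Counter

STRIPE_UNIT = 4
COLUMNS = ["Disk A'", "Disk B'", "Disk C'"]
STRIPE_SIZE = STRIPE_UNIT * len(COLUMNS)

def disk_of(volume_block: int) -> str:
    """Which disk holds a given volume block under the striped mapping."""
    block_in_row = volume_block % STRIPE_SIZE
    return COLUMNS[block_in_row // STRIPE_UNIT]

file_blocks = range(5, 15)            # Records 000-009 at volume blocks 5-14
counts = Counter(disk_of(b) for b in file_blocks)
print(counts)
# 3 of the 10 records land on Disk A', 3 on Disk B', and 4 on Disk C',
# i.e., roughly 30, 30, and 40 percent of uniformly distributed accesses.
```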
If the file were extended, new records might be stored starting at Volume Block 15, then on Disk B' (Volume Blocks 16–19), Disk C' (Volume Blocks 20–23) and in stripe 2 (not shown) starting with Disk A' and so forth. The larger a file is compared to a volume’s stripe unit size, the more evenly its blocks will be distributed across a striped volume’s disks. If a file is large compared to the volume’s subdisk size, uniformly distributed I/O requests tend to be distributed uniformly across the disks, whether the volume uses striped or spanned mapping. The benefit of striping arises when request distribution is not uniform, as for example a batch program processing record updates in alphabetical order, which tends to result in relatively small fragments of the file being accessed frequently. To illustrate this, Figure 3.12 shows a batch of alphabetically ordered requests directed at an alphabetically ordered file on a spanned volume.
Figure 3.12 Sorted I/O request distribution with a spanned volume.
In Figure 3.12, the records of the file are stored in alphabetical order in blocks with increasing volume block numbers. Since the progression of volume block addresses in a spanned volume is from the end of one subdisk to the beginning of the next, this means that records corresponding to the beginning of the alphabet are stored on SubDisk A, and so forth. In the example, most accesses to records for names beginning with the letter F are directed to SubDisk B, since it holds most of that fragment of the file. Disks A' and C' would remain nearly idle while this batch of requests was processed. With a striped volume, however, there is a natural distribution of logically sequential records across all of the volume's disks, as Figure 3.13 illustrates. Thus, even though a batch of application requests creates a "hot spot" in a small area of the file (names beginning with the letter F), striping distributes the stream of accesses to that hot spot across all of the volume's disks. A striped volume tends to distribute most request-intensive I/O loads across all of its disk resources for higher throughput. Since stripes are ordered by volume block number, files tend to be distributed across disks in a way that would make it difficult for an I/O request stream not to be balanced. An important subtlety of striped volumes in I/O request-intensive applications is that data striping does not improve the execution time of any single request. Rather, it improves the average response time of a large number of concurrent requests by increasing disk resource utilization, thereby reducing the average time that requests wait for previous ones to finish executing. Data striping only improves I/O request-intensive performance if requests overlap in time. This differs from the reason that data striping improves I/O performance for data transfer-intensive applications.
Figure 3.13 Sorted I/O request distribution with a striped volume.
Striped Volumes and Data Transfer-Intensive Applications
The goal of data transfer-intensive applications is to move large amounts of data between memory and storage as quickly as possible. Data transfer applications almost always access large sequential files that occupy many consecutive disk blocks. If such an application uses one disk, then I/O performance is limited by how quickly the disk can read or write a large continuous stream of data. Today's disks typically transfer data at an average of 10–20 megabytes per second. Disk interfaces such as ATA and Ultra SCSI are capable of higher speeds, but a disk can deliver or absorb data only as fast as a platter can rotate past a head. If a data transfer-intensive application uses a striped volume, however, multiple disks cooperate to get the data transfer done faster. The data layout for a data transfer-intensive application using a three-disk striped volume is illustrated in Figure 3.14. The data transfer-intensive application illustrated in this figure accesses a large file in consecutive fragments (shown as disk blocks for simplicity; in practice, they would be much larger). Unlike records, the fragments have no meaning by themselves; they simply subdivide the file logically for buffer management purposes. When an application reads or writes this file, the entire file is transferred; and the faster the transfer occurs, the better. In the extreme, the application might make one request to read the entire file. The volume manager would translate such a request into:
Figure 3.14 Data striping and data transfer-intensive applications.
1. A request to Disk A' for Fragments 000 and 001.
2. A request to Disk B' for Fragments 002 through 005.
3. A request to Disk C' for Fragments 006 through 009.
4. A request to Disk A' for Fragments 010 through 013.
5. A request to Disk B' for Fragments 014 through 017.
6. A request to Disk C' for Fragments 018 and 019.
Because the first three requests are addressed to different disks, they can execute simultaneously. This reduces data transfer time compared to transferring all the data from a single disk. The volume manager can make its fourth request as soon as the first completes, the fifth as soon as the second completes, and the sixth as soon as the third completes. Overall data transfer time in this case would be about the time to transfer the eight fragments from Disk B', just a little over a third of the data transfer time to retrieve the entire file from one disk. Thus, with data transfer-intensive applications, the effect of striped volumes is to make individual I/O requests execute faster.
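The six per-disk requests listed above fall out of the striped mapping once consecutive fragments that land on the same disk are coalesced. The sketch below reproduces them under the layout suggested by Figure 3.14 (three columns, a four-fragment stripe unit, and a file whose Fragment 000 occupies volume block 2); it is an illustration, not volume manager code.

```
# Split a large volume read into per-disk contiguous requests (illustrative).
STRIPE_UNIT = 4
COLUMNS = ["Disk A'", "Disk B'", "Disk C'"]
STRIPE_SIZE = STRIPE_UNIT * len(COLUMNS)

def per_disk_requests(first_block: int, block_count: int):
    """Return (disk, first fragment, last fragment) runs in volume-block order."""
    runs = []
    for offset in range(block_count):
        block = first_block + offset
        disk = COLUMNS[(block % STRIPE_SIZE) // STRIPE_UNIT]
        if runs and runs[-1][0] == disk and runs[-1][2] == offset - 1:
            runs[-1][2] = offset                 # extend the current run
        else:
            runs.append([disk, offset, offset])  # start a new run on another disk
    return [(disk, lo, hi) for disk, lo, hi in runs]

# A 20-fragment file whose Fragment 000 occupies volume block 2:
for disk, lo, hi in per_disk_requests(first_block=2, block_count=20):
    print(f"{disk}: Fragments {lo:03d}-{hi:03d}")
# Disk A': Fragments 000-001
# Disk B': Fragments 002-005
# Disk C': Fragments 006-009
# Disk A': Fragments 010-013
# Disk B': Fragments 014-017
# Disk C': Fragments 018-019
```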
Stripe Unit Size and I/O Performance
If the predominant I/O load on a volume is either data transfer-intensive or I/O request-intensive, volume stripe unit size can be adjusted to optimize performance. In the case of data transfer-intensive applications, optimal performance is generally achieved when all of a volume's disks read or write approximately the same number of data blocks in response to each application request. Thus, if a data transfer-intensive application reads and writes data with I/O requests of a constant size, the ideal stripe unit size for its data volumes is the I/O request size divided by the number of disks comprising the volume. For example, if a video editing application reads and writes video clips in fragments of 256 kilobytes, the ideal stripe unit size for an eight-disk striped volume would be 262,144 ÷ 8 = 32,768 bytes, or 64 disk blocks. For I/O request-intensive applications, the opposite is true. Generally, I/O request-intensive applications transfer a relatively small amount of data in each request. Thus, their I/O request execution time is dominated by disk motion (seeking and rotational latency) rather than by data transfer. Therefore, reducing data transfer time for these applications is a secondary goal. On the other hand, these applications frequently have multiple I/O requests outstanding simultaneously. Striping improves performance for these applications by distributing the requests more evenly across a volume's disks for most I/O loads. If a disk is busy satisfying one I/O request, then other simultaneous requests to it must wait. It is therefore preferable for each request in an
I/O request-intensive load to be satisfied completely by one disk, leaving as many other disks as possible free to service other requests. File systems and database management systems allocate blocks in volume address space. The volume manager maps these volume blocks to disk blocks. The overlay of these two mappings makes it difficult to guarantee that requests will never be split across disks. If stripe unit size is sufficiently large, however, the probability of a split I/O will be small. For example, if an application reads and writes data in units of four kilobytes (8 disk blocks), with uniformly distributed starting addresses,4 a stripe unit size of 256 blocks (128 kilobytes) means that each of the volume's disks has 256 possible starting block addresses for I/O requests. Seven of these addresses would lead to the splitting of 4-kilobyte I/O requests across two disks (the 250th through 256th blocks). If I/O request starting addresses are uniformly distributed throughout the volume, then 7/256, or 2.7 percent, of the requests will be split, and 97.3 percent will be serviced by a single disk. Thus, the rule of thumb for I/O request-intensive applications is that volume stripe unit size should be specified so that most requests (90 percent-plus) are satisfied by a single disk. Both of these arguments are predicated on:
■■ Constant application I/O request sizes (256 kilobytes and 4 kilobytes, respectively).
■■ Consistent I/O loads that are either data transfer-intensive or I/O request-intensive, respectively.
For volumes that hold data for a single application, these assumptions may be accurate, at least to a first approximation. If, however, a volume is used by multiple applications with different I/O request sizes and patterns, optimizing for one may lead to suboptimal results for the others. These rules of thumb should therefore be applied only when all of the I/O loads to which a volume will be subject are known to be highly homogeneous in this respect.
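Both rules of thumb can be expressed as small calculations. The sketch below is illustrative only and carries the same simplifying assumptions as the text: constant request sizes and uniformly distributed, block-aligned starting addresses.

```
# Illustrative stripe unit sizing calculations (same assumptions as the text).
BLOCK_SIZE = 512

def transfer_intensive_stripe_unit(request_bytes: int, disks: int) -> int:
    """Ideal stripe unit (in blocks) when every disk should transfer an
    equal share of each large request."""
    return request_bytes // disks // BLOCK_SIZE

def split_fraction(request_blocks: int, stripe_unit_blocks: int) -> float:
    """Fraction of small requests split across two disks when starting
    addresses are uniformly distributed over block boundaries."""
    return (request_blocks - 1) / stripe_unit_blocks

print(transfer_intensive_stripe_unit(262_144, 8))  # 64 blocks (32 KB), as in the text
print(split_fraction(8, 256))                      # ~0.027, i.e., about 2.7 percent split
```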
A Way to Categorize the I/O Performance Effects of Data Striping
Figure 3.15 is a qualitative graphical summary of striped volume performance relative to that of a single disk. The icon at the center of the graph represents I/O performance that might reasonably be expected from a single disk. The other two icons represent striped volume performance relative to it.
4 Admittedly, a slightly unrealistic simplifying assumption, since most file systems and databases manage storage in groups of consecutively numbered disk blocks called allocation units, pages, or clusters. The assumption is conservative, however, in that it leads to a higher percentage of split I/O requests than would be expected if requests were aligned on cluster boundaries.
Figure 3.15 Relative performance of striped volumes.
This figure has two dimensions because there are two important ways to measure I/O performance. One is throughput, or the amount of work done per unit of time. The other is average request execution time (not including time spent queued for execution). As the figure shows, data striping improves throughput relative to a single disk for I/O request-intensive applications, but does not change average request execution time. Throughput improves because striping tends to balance the I/O requests across the disks, keeping them all busy. Request execution time does not change because most requests are executed by one of the volume's disks. Even though individual request execution time does not change, by maximizing disk utilization, data striping tends to reduce the time that requests wait for previous requests to finish before they can begin to execute. Thus, the user perception is often that requests execute faster, because they spend less time queued for execution. Of course, if there is nothing to wait for, then a request to a striped volume completes in about the same time as the same request made to a disk. The busier a striped volume gets, the better it seems to perform, up to the point at which it is saturated (all of its disks are working at full I/O capacity). Large sequential reads and writes also perform better with striped volumes. In this case, both throughput and request execution time are improved relative to single-disk performance. Individual request execution time is lower because data transfer, which accounts for most of it, is done in parallel by some or all of the disks. Throughput is higher because each request takes less time to execute, so a stream of requests executes in a shorter time. In other words, more work gets done per unit time.
An Important Optimization for Striped Volumes: Gather Writing and Scatter Reading
Most host bus adapters can combine requests for data in consecutive stripes into a single request, even though nonconsecutive memory addresses are accessed. Data being written is "gathered" from nonadjacent memory areas, as Figure 3.16 illustrates. Figure 3.16 shows the part of a write directed to Disk A' from Figure 3.14. Mapping registers on the host bus adapter enable file Fragments 000 and 001 and Fragments 010 through 013 to be written to consecutive blocks of Disk A' in a single continuous stream with no disk revolution time between the two groups of fragments. The converse capability, scattering of consecutive disk blocks to nonconsecutive memory locations as they are read from disk, is also supported by most HBAs. This is called scatter reading and is illustrated in Figure 3.17. Here, a continuous stream of data is read from Disk A' and delivered directly to two noncontiguous areas of memory. Scatter reading and gather writing improve I/O performance with striped volumes by:
Application memory
Disk A' from Example of Figure 35
Fragment 000 Fragment 001 Fragment 002 Fragment 003
These fragments are written to Disks B' and C' in separate operations.
Fragment 004
Fragment 000
Fragment 005
Fragment 001
Fragment 006 Fragment 007 Fragment 008 Fragment 009 Fragment 010 Fragment 011 Fragment 012 Fragment 013
etc.
Figure 3.16 Gather writing.
Fragment 010 Fragment 011 Fragment 012 Fragment 013
HBA address mapping registers allows these fragments to be written in one continuous stream with no gap between them
Disk A' SubDisk A Volume Block 000 Volume Block 001 Fragment 000 Fragment 001 Fragment 010 Fragment 011 Fragment 012 Fragment 013
etc.
52
CHAPTER THREE
Application memory
Disk A' from Example of Figure 35
Fragment 000
Disk A'
Fragment 001
SubDisk A Volume Block 000 Volume Block 001 Fragment 000 Fragment 001
Fragment 002 Fragment 003
Data is read from Disk A' in one continuous stream.
Fragment 010 Fragment 011 Fragment 012 Fragment 013
etc.
Fragment 000 Fragment 001 Fragment 010 Fragment 011 Fragment 012 Fragment 013
Host bus adaptor mapping registers allow fragments to be delivered to nonconsecutive memory locations.
Fragment 004 Fragment 005 Fragment 006 Fragment 007
These fragments are read from Disks B' and C' in separate operations.
Fragment 008 Fragment 009 Fragment 010 Fragment 011 Fragment 012 Fragment 013
etc.
Figure 3.17 Scatter reading.
■■
Eliminating at least half of the disk I/O requests that the volume manager would otherwise have to make to satisfy application I/O requests to the volume.
■■
Eliminating many of the waits for disk revolutions that result from “missing” data due to the time required to issue these additional requests (e.g., requests 4–6 in the example of Figure 3.14).
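To make the mechanism concrete, here is a minimal, hypothetical sketch of how a volume manager might coalesce fragments that are destined for consecutive disk blocks but scattered through memory into a single gather-write descriptor. The data structures and names are illustrative only, not the Windows volume manager's or any HBA driver's actual interface, and the sketch assumes one block per memory fragment.

```python
# Hypothetical sketch: coalescing per-fragment requests into gather-write descriptors.
BLOCK_SIZE = 512

def build_gather_list(fragments):
    """fragments: list of (disk_block, memory_buffer) pairs, one block each.

    Fragments whose disk blocks are consecutive are combined into a single
    descriptor whose scatter/gather list names each memory buffer, so the
    disk sees one continuous write with no revolution lost between blocks.
    """
    fragments = sorted(fragments, key=lambda f: f[0])
    descriptors = []
    for block, buf in fragments:
        last = descriptors[-1] if descriptors else None
        if last is not None and block == last["start"] + len(last["sg_list"]):
            last["sg_list"].append(buf)          # gathered from another memory area
        else:
            descriptors.append({"start": block, "sg_list": [buf]})
    return descriptors

# Two noncontiguous memory buffers map to consecutive disk blocks 100 and 101,
# so they become one disk request with a two-element scatter/gather list.
frags = [(100, bytearray(BLOCK_SIZE)), (101, bytearray(BLOCK_SIZE)), (200, bytearray(BLOCK_SIZE))]
for d in build_gather_list(frags):
    print("one disk write of %d blocks starting at block %d, gathered from %d memory areas"
          % (len(d["sg_list"]), d["start"], len(d["sg_list"])))
```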
Scatter-gather capability is sometimes implemented by processor memory management units, but more often by I/O interface Application-Specific Integrated Circuits (ASICs), such as the single-chip SCSI and Fibre Channel interfaces found on some computer mainboards. The capability is therefore available to both host-based volume managers and controller-based disk arrays.
Even though they are not failure tolerant, simple, concatenated, and striped volumes are useful additions to the administrator's toolkit for managing capacity, either by dividing up very large disks into manageable units or by aggregating smaller disks when large units of storage are needed. When data is striped across multiple physical or virtual disks, I/O performance improves for almost all applications. The next chapter discusses how data striping can be combined with mirroring and RAID techniques to improve data availability along with I/O performance.
CHAPTER 4
Failure-Tolerant Volumes: Mirroring and RAID
Mirroring and RAID technology have become commonplace in server computing. Today, virtually all aggregating disk controllers incorporate mirroring and RAID to improve data availability and I/O performance. Host-based volume managers, including Windows volume managers, also provide mirroring and RAID functionality. System administrators and application designers responsible for implementing online storage for valuable enterprise data must choose among:
■■ Mirroring, with its higher associated cost
■■ RAID, with its limited failure protection
■■ Hardware (disk controller-based) and software (host volume manager-based) implementations of mirroring or RAID
Additionally, as illustrated in Figure 4.1, it is generally possible to create combinations of these options, further complicating storage strategy choices.
RAID: The Technology
RAID, as noted in Chapter 1, is an acronym for the phrase Redundant Array of Independent Disks. Each word in the phrase makes a specific contribution to the meaning.
■■ Redundant means that part of the disks' storage capacity is used to store check data derived from user data that can be used to recover user data if a disk on which it is stored should fail.
Figure 4.1 RAID and mirroring configuration options. (Figure: three configurations are shown: a host-based array (volume) with control software in the host's volume manager; a controller-based array (virtual disk) with control software running in a RAID controller; and a hybrid volume of virtual disks that layers a host-based volume manager over RAID controllers.)
■■ Array refers to a collection of disks managed by control software that presents their capacity as a set of coordinated virtual disks or volumes. The control software for host-based arrays is a volume manager, and the arrays are called volumes. In controller-based arrays, the control software runs in a disk controller, and the arrays are represented on host I/O buses as virtual disks.
■■ Independent means that the disks themselves are ordinary disks that can function independently of each other in the absence of control software.
■■ Disks, the storage devices comprising the array or volume, are online random access storage devices. In particular, each read or write operation specifies explicitly which blocks are to be read or written. This allows read and write operations to be repeated if they fail. (This differs from tapes, in which data addressing is implicit.)
The term RAID was coined by researchers at the University of California at Berkeley during the 1980s to refer to a collection of software techniques for enabling arrays of disks to tolerate one or more disk failures without loss of function. Table 4.1 summarizes the characteristics of the six forms of RAID identified by the Berkeley researchers. The table also includes striped arrays, even though they are not failure-tolerant and were discussed only in passing by the Berkeley researchers, because the nomenclature "RAID Level 0," or "RAID 0," used to denote them has passed into widespread use among both users and members of the storage industry.
RAID Today
Over the years since the RAID levels were first described, other disk array control techniques have mitigated the differences among them. Although exceptions do exist, to a large extent, today's RAID implementations are typically either:
■■ Mirroring, which the Berkeley researchers called RAID Level 1, or RAID 1, of two or more disks or sets of disks (plexes).
■■ RAID, using unsynchronized disks with exclusive OR parity distributed across them. RAID most closely resembles Berkeley RAID Level 5.
In today's usage, the unqualified term RAID most often refers to data protection using parity. This book follows that usage. This book also generally avoids the term RAID Level 1, instead using these terms:
Mirrors, mirrored disks, or mirror sets. To refer to volumes or arrays that hold two or more copies of identical data on separate disks.
Striped mirrors or striped mirrored disks. To refer to volumes or arrays in which data are striped across two or more mirrored plexes or arrays.
Mirrored stripes or mirrored striped disks. To refer to volumes or arrays in which data are mirrored across two or more striped plexes or arrays.
Mirrored Volumes
A mirrored volume consists of at least two subdisks of identical size located on separate disks. Each time a file system or application writes to the volume, the volume manager transparently writes the same data to each of the volume's subdisks. When an application reads data from a mirrored volume, the volume manager chooses a subdisk to satisfy the read, based on a scheduling algorithm such as:
Round-robin. The volume manager directs read requests to the disks comprising the mirrored volume in sequence. This algorithm is simple to implement; and, to a first approximation, it tends to balance I/O load across the volume's disks. The round-robin algorithm also has the advantage of exercising all of the volume's disks equally, thereby taking the opportunity to probe for failures as often as the application I/O load allows.
Preferred. The volume manager directs read requests to a disk designated by the system administrator. Preferred read scheduling can be useful when one of a mirrored volume's disks is markedly faster than the other(s). If a solid-state disk is mirrored with a rotating magnetic disk, for example, the solid-state disk might be preferred because of its higher speed. Similarly, if a local disk is mirrored with one accessed over a storage network, the local disk would ordinarily be preferred for read accesses.
Least busy. The volume manager chooses the disk with the shortest queue of outstanding I/O requests. If all disk queue lengths are equal, or if no I/O is queued to any of the volume's disks, a round-robin choice is typically made.
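The three policies can be summarized in a few lines of code. The sketch below is purely illustrative; the queue-depth table and the preferred-disk designation are hypothetical inputs, not any particular volume manager's data structures.

```python
import itertools

# Illustrative read-scheduling policies for a mirrored volume.
# 'queues' maps each mirror disk to its current outstanding-request count.

def round_robin(disks, _queues, _preferred, counter=itertools.count()):
    # Direct reads to the mirrors in rotation, exercising every disk equally.
    return disks[next(counter) % len(disks)]

def preferred(disks, _queues, preferred_disk):
    # Always read from the administrator-designated (e.g., faster or local) disk.
    return preferred_disk if preferred_disk in disks else disks[0]

def least_busy(disks, queues, _preferred):
    # Choose the disk with the shortest I/O queue; this sketch breaks ties by
    # taking the first disk (a real implementation would fall back to round-robin).
    return min(disks, key=lambda d: queues[d])

disks = ["DiskA", "DiskB"]
queues = {"DiskA": 3, "DiskB": 1}
print(least_busy(disks, queues, None))    # -> DiskB
print(preferred(disks, queues, "DiskA"))  # -> DiskA
```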
Other scheduling algorithms are possible. Windows volume managers, for example, use an "almost round-robin" algorithm. Disk selection is round-robin, except when consecutive requests to the volume specify proximate data. When this occurs, the requests are sent to the same disk because positioning time for the second request will typically be minimal (because positioning for the first request leaves the heads of the selected disk near the data specified in the second).
Table 4.1 Summary Comparison of Common Forms of RAID
RAID Level 0 (Data Striping). Description: user data distributed across the disks in the array; no check data. Cost for N data disks: N. Relative data availability: lower than single disk. Large read data transfer speed*: very high. Large write data transfer speed: very high. Random read request rate: very high. Random write request rate: very high.
RAID Level 1 (Mirroring). Description: user data duplicated on M separate disks (M is usually 2 or 3); check data is second through Mth copies. Cost for N data disks: M × N. Relative data availability: higher than RAID Levels 3, 4, 5; lower than RAID Levels 2, 6. Large read data transfer speed: higher than single disk (up to 2×). Large write data transfer speed: slightly lower than single disk. Random read request rate: up to N× single disk. Random write request rate: similar to single disk.
RAID Level 0+1, 1+0 (Striped Mirrors). Description: user data striped across N separate sets of M mirrored disks; check data is second through Mth copies. Cost for N data disks: M × N. Relative data availability: higher than RAID Levels 3, 4, 5; lower than RAID Levels 2, 6. Large read data transfer speed: much higher than single disk. Large write data transfer speed: higher than single disk. Random read request rate: much higher than single disk. Random write request rate: higher than single disk.
RAID Level 2 (no common name). Description: user data striped across N disks; check data distributed across m disks (m is determined by N). Cost for N data disks: N + m. Relative data availability: higher than RAID Levels 3, 4, 5. Large read data transfer speed: highest of all listed types. Large write data transfer speed: highest of all listed types. Random read request rate: approximately 2× single disk. Random write request rate: approximately 2× single disk.
RAID Level 3 (RAID 3, parallel transfer disks with parity). Description: synchronized disks; each user data block distributed across all data disks; parity check data stored on one disk. Cost for N data disks: N + 1. Relative data availability: much higher than single disk; comparable to RAID Levels 2, 4, 5. Large read data transfer speed: highest of all listed types. Large write data transfer speed: highest of all listed types. Random read request rate: approximately 2× single disk. Random write request rate: approximately 2× single disk.
RAID Level 4 (RAID 4). Description: independent disks; user data distributed as with striping; parity check data stored on one disk. Cost for N data disks: N + 1. Relative data availability: much higher than single disk; comparable to RAID Levels 2, 3, 5. Large read data transfer speed: similar to disk striping. Large write data transfer speed: slightly lower than disk striping. Random read request rate: similar to disk striping. Random write request rate: significantly lower than single disk.
RAID Level 5 (RAID 5, "RAID"). Description: independent disks; user data distributed as with striping; parity check data distributed across disks. Cost for N data disks: N + 1. Relative data availability: much higher than single disk; comparable to RAID Levels 2, 3, 4. Large read data transfer speed: slightly higher than disk striping. Large write data transfer speed: slightly lower than disk striping. Random read request rate: slightly higher than disk striping. Random write request rate: significantly lower than single disk; higher than RAID Level 4.
RAID Level 6 (RAID 6). Description: as RAID 5, but with a second set of independently computed distributed check data. Cost for N data disks: N + 2. Relative data availability: highest of all listed types. Large read data transfer speed: slightly higher than RAID Level 5. Large write data transfer speed: lower than RAID Level 5. Random read request rate: slightly higher than RAID Level 5. Random write request rate: lower than RAID Level 5.
*The Data Transfer Capacity and I/O Request Rate columns reflect only I/O performance inherent to the RAID model and do not include the effect of other features such as cache.
Mirrored Volumes and I/O Performance
While their primary purpose is to enhance data availability through disk failure tolerance, mirrored volumes can also improve I/O performance for most I/O-intensive applications. Most I/O-intensive applications make substantially more read requests than writes. For transaction-processing applications, this is intuitively easy to accept. Transactions are fundamentally either inquiries or updates to a database. Inquiries read data from the database. Update transactions read data, modify it, and rewrite it. Thus, at least half of transaction-processing applications' I/O requests are likely to be reads.
When a read request for an idle disk or simple volume arrives, it begins to execute. If a second request arrives while the first is executing, it must wait until the first request completes before it can execute. If a third request arrives, it must wait for the second, and so forth. Likewise, when a read request arrives for an idle mirrored volume, it begins to execute immediately on one of the volume's disks. If a second read request arrives before the first is complete, it can begin to execute immediately using the second disk. If a third request arrives before the first and second are complete, a third mirrored disk (if there is one) could begin to execute it immediately. If the mirrored volume includes only two disks, a third request must wait, but only until either the first or second completes. Figure 4.2 illustrates reading in a mirrored volume. Thus, to a first approximation, a mirrored volume can deliver as many times the read request rate of a disk as it has disks containing copies of user data.
Figure 4.2 Read requests to a mirrored volume. (Figure: the volume manager sends each read request to the subdisk with the shortest I/O queue; every volume block is present on SubDisks A, B, and C.)
Writing data to mirrored volumes incurs a minor performance penalty compared to writing to a simple volume. The volume manager must convert each write request to the volume into identical write requests for each of the volume's disks. All disk writes must complete before the volume manager can declare the volume request complete. Thus, from the file system or application point of view, write request execution time is at least as long as the longest disk write time. All of the volume manager's disk write requests are made to separate disks, however, and can therefore execute concurrently, so elapsed time for mirrored volume writes is not much greater than for simple volume writes. Nevertheless, writing to mirrored volumes still impacts application I/O performance, in two ways:
I/O subsystem loading. Write requests reduce the number of read requests that a mirrored volume can execute per unit time. For example, if each disk of a two-disk mirrored volume can execute 100 I/O requests per second, a volume I/O load of 100 read requests and 50 write requests will completely saturate both of the volume's disks.
Longer write request service times. The disk write requests required to execute a mirrored volume write request can execute concurrently, but the service time for the volume request is the longest of all the individual disk requests. Even with no I/O requests outstanding, simultaneous write requests to two or more unsynchronized disks have a longer total service time than that of a request to a single disk because the random rotational positions of the disks result in a variation of individual disk request service times.
Both of these effects are typically minor, becoming visible only in the rare instances when an application or file system saturates a mirrored volume with write requests. For most applications, the beneficial effect of having multiple disks to service read requests far outweighs the slight performance penalty of multiple disk writes.
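The I/O subsystem loading effect is easy to quantify, because every volume write becomes one disk write per mirror copy. The following back-of-the-envelope sketch assumes, as in the example above, disks that each execute about 100 requests per second.

```python
# Back-of-the-envelope load model for an N-way mirrored volume.
# Each volume read costs one disk I/O; each volume write costs one I/O per copy.

def disk_load(reads_per_sec, writes_per_sec, copies, per_disk_iops=100):
    total_disk_io = reads_per_sec + writes_per_sec * copies
    capacity = per_disk_iops * copies
    return total_disk_io / capacity   # fraction of the mirror set's I/O capacity used

# 100 reads/s plus 50 writes/s saturate a two-disk mirror of 100-IOPS disks.
print(disk_load(100, 50, copies=2))   # -> 1.0 (fully saturated)
```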
Combining Striping with Mirroring
As Figure 4.3 illustrates, two or more striped plexes can be combined by mirroring data between them. SubDisks A, B, and C comprise a striped plex, as do SubDisks D, E, and F. These striped plexes are internal to the volume manager and are not presented directly to file systems or applications. When mirroring striped plexes, the volume manager treats each plex as if it were a disk and issues all writes to all of them. Application write requests specify block numbers in the volume's block address space. Each volume block number is represented in each of the striped plexes. The volume manager uses some algorithm, such as round-robin or preferred, to choose a plex for executing each volume read request, just as it would with a mirrored volume consisting of disks.
Figure 4.3 Data striping combined with data mirroring. (Figure: the volume manager sends each write to both Plex 0 (Disks A', B', and C') and Plex 1 (Disks D', E', and F') and picks a plex to execute each read; volume blocks are striped across the subdisks within each plex.)
Combining data striping with mirroring in a single volume has several advantages:
■■ Very large failure-tolerant volumes can hold large databases or serve applications that require large numbers of files.
■■ Failure tolerance is excellent. Striped volumes of mirrored plexes can survive failure of up to half of their disks. (However, a mirrored volume of striped plexes can only survive failure of one disk per plex in all but one of its plexes. For example, a mirrored volume consisting of three striped plexes can survive failure of one disk in each of two of its plexes.)
■■ Read performance is very high. The volume manager chooses a plex to execute read requests. Within each plex, striping balances the load further, as described in Chapter 3 (Why Striped Volumes Are Effective).
■■ The write penalty (the multiple writes that the volume manager must perform to keep all plexes' contents synchronized) is mitigated by the striping of data across multiple disks.
■■ Because it is not computationally intensive, the combination of striping and mirroring is very well suited to host-based software implementations.
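As a rough illustration of the write fan-out just described, the sketch below maps a volume block to a subdisk and offset within a striped plex and issues the same write to every plex. The three-subdisk plexes and four-block stripe unit are assumptions taken from the Figure 4.3 example, and the dictionaries stand in for real storage.

```python
STRIPE_UNIT = 4          # blocks per stripe unit (assumed for the example)
DISKS_PER_PLEX = 3       # e.g., SubDisks A, B, C in one plex; D, E, F in the other

def map_block(volume_block):
    """Map a volume block to (subdisk index, block offset) within a striped plex."""
    stripe, within = divmod(volume_block, STRIPE_UNIT * DISKS_PER_PLEX)
    subdisk, block_in_unit = divmod(within, STRIPE_UNIT)
    return subdisk, stripe * STRIPE_UNIT + block_in_unit

def write_volume_block(plexes, volume_block, data):
    """Mirrored-striped write: the same mapped write goes to every plex."""
    subdisk, offset = map_block(volume_block)
    for plex in plexes:                      # the volume manager writes all plexes
        plex[subdisk][offset] = data

plex0 = [dict() for _ in range(DISKS_PER_PLEX)]
plex1 = [dict() for _ in range(DISKS_PER_PLEX)]
write_volume_block([plex0, plex1], 13, b"x" * 512)
print(map_block(13))   # -> (0, 5): first subdisk of each plex, block offset 5
```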
About the only drawback to mirrored striped volumes is hardware cost—a factor that motivated the development of parity-based RAID protection. The user must purchase, house, power, and operate twice as much raw storage capacity as his or her data requires. In the days of $10-per-megabyte disk storage, this was a major factor. Today, in contrast, with server disk prices in the neighborhood of 10 cents per megabyte (including housing, power, and cooling), and controller-based RAID subsystem prices of less than a dollar per megabyte, large mirrored striped volumes and disk arrays are becoming increasingly popular for mission-critical data.
Split Mirrors: A Major Benefit of Mirrored Volumes
Mirrored volumes provide still more data storage management flexibility. A mirrored volume can be separated into its component plexes at carefully chosen times, and each plex can be mounted as a volume and used independently of the other. This is especially useful for applications that require frozen image backup but that cannot be halted for the time it takes a backup to execute. For example, an application using the volume depicted in Figure 4.3 might be paused for an instant (to allow transactions in progress to complete and cached data to be written to disk) to allow a system administrator to split the volume into two independent plexes. Each of the plexes would then be mounted as a volume and used simultaneously for different purposes:
■■ A volume based on Plex 0 (Disks A', B', and C') might be used by the (restarted) application for continued transaction processing.
■■ A volume based on Plex 1 (Disks D', E', and F') might be used to create a backup copy of the application's data as it stood at the point in time at which the volume was split.
Later, after the backup was complete, the volume based on Plex 1 (Disks D', E', and F') would be unmounted. The plex could then be reattached to the original volume and its contents updated by the volume manager to reflect any application changes made to Plex 0 (Disks A', B', and C') while the backup was executing. When the update was complete, the striped, mirrored volume would again be failure-tolerant.
Figure 4.4 illustrates splitting a mirrored striped volume into two striped volumes, using the striped volumes for different purposes for a time, and rejoining them into a single mirrored striped volume afterward.¹ Some applications use three-mirror volumes (all application data reflected on three separate disks) for mission-critical data, so that data are protected against disk failure while a split mirror copy is being backed up. In this scenario, one of three mirrored plexes would be split from the volume and used for backup, as illustrated in Figure 4.4. The two remaining copies would continue to be used by the application during the backup. Windows volume managers support this capability.
1 A similar process would be used with a striped mirrored volume. In that case, each of the mirrored plexes would be split and three subdisks (one from each split plex) would be made into a non-failure-tolerant striped volume.
Figure 4.4 Using a split mirror for backup while an application executes. (Figure: a timeline: the application uses the mirrored volume; it pauses while the volume is separated and mounted as two volumes; the application resumes using the nonmirrored volume while backup uses the "breakaway" volume; the disks rejoin the volume and are resynchronized when the backup is complete, after which the application again uses the mirrored volume.)
RAID Volumes
The most common form of RAID uses bit-by-bit parity to protect data on an arbitrary number of disks against loss due to a failure in any one of them. RAID, which became popular at a time when disk storage was two orders of magnitude more expensive than it is today, reduces the hardware cost of disk failure tolerance (compared to mirroring) by requiring less disk space for redundant data. It protects against fewer failure modes, however, and has a much higher write performance overhead.
RAID Overview
Instead of one or more complete copies of every block of user data, RAID uses one block of check data to protect an entire row of blocks of user data (Figure 4.5). Using parity as check data allows the contents of any block in a row to be computed (regenerated), given the contents of the rest of the blocks in the row. Figure 4.5 illustrates this fundamental principle of RAID, as well as its performance and data availability consequences. In Figure 4.5, SubDisks A and B hold user data. Each block of SubDisk C holds check data computed from the corresponding blocks (the blocks in the same row) from SubDisks A and B.
■■ If Disk A' fails, any block of data from it can be regenerated by performing a computation that uses the corresponding blocks from SubDisks B and C.
■■ If Disk B' fails, any block of data from it can be regenerated by performing a computation that uses the corresponding blocks from SubDisks A and C.
■■ If Disk C' fails, only failure tolerance is lost, not user data.
SubDisks A, B and C in Figure 4.5 comprise a volume, much like the spanned volume illustrated in Figure 3.4 in Chapter 3. As with a spanned volume, two read requests that specify data on SubDisks A and B can execute concurrently (there is no user data on SubDisk C). But with RAID volumes, each application write request requires that both the user data and the check data protecting it be updated. Application writes to RAID volumes are inherently high-overhead operations.
Figure 4.5 I/O performance and failure-tolerance properties of RAID. (Figure: SubDisks A and B hold user data; each block of SubDisk C holds check data for the corresponding row. Multiple application read requests can be satisfied simultaneously as long as they specify data on different disks; each time a block is written, the corresponding check data block must be recomputed and rewritten.)
RAID Check Data
RAID check data is computed as the bit-by-bit exclusive OR of the contents of all user data blocks in a row of subdisks and written in the corresponding block of the check data disk. The exclusive OR of the check data block contents with other user data blocks in its row can be used to regenerate the contents of any single user block in the row, if necessary. Using the exclusive OR function to compute check data has two advantages:
■■ The exclusive OR function is simple to compute. The simplicity lends itself to either hardware or software implementations, which reduce the overhead of the computation in some RAID controllers, and to robust host-based software implementations.
■■ The check data computation algorithm is the same as the user data regeneration algorithm. Whether user data is written (requiring that new check data be computed), or user data must be regenerated (because a disk failed or a block became unreadable), the same computational logic is used, again leading to simpler and therefore more robust implementations.
Figure 4.6 illustrates the computation of parity check data for one row of a three-disk RAID volume—the nth block on each of the volume's three subdisks. The nth blocks on SubDisks A and B contain user data. Block n on SubDisk C contains the bit-by-bit exclusive OR of the corresponding user data from SubDisks A and B. Check data is recomputed and written to SubDisk C every time user data on SubDisk A or SubDisk B is written, so the contents of SubDisk C can help regenerate user data from either SubDisk A or SubDisk B, should either of them fail.
Figure 4.6 Exclusive OR parity in a three-disk RAID volume. (Figure: for each bit position, the parity check data in Block n of SubDisk C is the exclusive OR of the user data bits in Block n of SubDisks A and B: 0 ⊕ 0 = 0, 0 ⊕ 1 = 1, 1 ⊕ 0 = 1, 1 ⊕ 1 = 0.)
The user data regeneration computation makes use of the fact that the exclusive OR of a binary number with itself is always 0. Referring to Figure 4.6:
Check data(Block n) = User data(SubDisk A, Block n) ⊕ User data(SubDisk B, Block n)
Suppose that SubDisk B is unavailable. It is still possible to compute:
User data(SubDisk A, Block n) ⊕ Check data(Block n) = User data(SubDisk A, Block n) ⊕ User data(SubDisk A, Block n) ⊕ User data(SubDisk B, Block n)
but,
User data(SubDisk A, Block n) ⊕ User data(SubDisk A, Block n) = 0,
so,
Check data(Block n) ⊕ User data(SubDisk A, Block n) = User data(SubDisk B, Block n)!
Thus the user data from Block n of SubDisk B has been regenerated by computing the exclusive OR of the check data with the user data from the surviving subdisk.
Figure 4.7 illustrates a data regeneration computation. This figure represents a situation in which Disk B', containing SubDisk B, has failed. When an application read request for data maps to SubDisk B, the volume manager:
■■ Reads the contents of the corresponding blocks from SubDisks A and C into its buffers.
■■ Computes the bit-by-bit exclusive OR of the two.
■■ Returns the result of the computation to the requesting application.
Figure 4.7 Using exclusive OR parity to regenerate user data. (Figure: with Disk B' failed, the block requested by the application is regenerated bit by bit as the exclusive OR of the corresponding blocks on SubDisks A and C and delivered to the requesting application.)
Delivery of data to the application takes slightly longer than if SubDisk B had not failed, but otherwise the failure of SubDisk B is transparent. The exclusive OR computation can be thought of as binary addition of equal-length bit strings with carries ignored. Bit b of the result is 0 if the number of corresponding 1-bits is even, and 1 if that number is odd. Furthermore, the exclusive OR function in Figure 4.7 has the additional useful property that it can be extended to any number of user data blocks. Figure 4.8 gives the exclusive OR check data computation for a four-disk volume, and demonstrates its use to regenerate user data after a disk failure.
Figure 4.8 Exclusive OR parity in a four-disk RAID volume. (Figure: the parity check data on SubDisk D is the exclusive OR of the three corresponding user data bits on SubDisks A, B, and C; with Disk B' failed, the missing user data is regenerated as the exclusive OR of the surviving user data and the parity, and delivered to the requesting application.)
A key point to be inferred from Figure 4.8 is that the RAID data regeneration computation requires data from all of the volume's disks except the failed one. Thus, while RAID volumes of any size can be created, they only afford protection against loss of data due to a single disk failure. This contrasts with mirroring, which can survive the failure of several disks in certain circumstances. There are additional factors to consider when specifying the number of disks to include in a RAID volume:
■■ Adding more disks to a RAID volume increases both the probability of volume failure and the amount of data lost as a consequence of it.
■■ Adding more disks to a RAID volume adversely affects the performance of some application writes.
■■ Adding more disks to a RAID volume increases recovery time after a failed disk is replaced, thereby further increasing the risk of volume failure.
The system administrator must balance these factors against the lower disk cost when choosing between RAID and mirrored volumes.
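To tie the preceding derivations together, here is a minimal sketch of both operations: computing a parity block from a row of user data blocks, and regenerating the block of a failed disk from the survivors. It is a conceptual illustration, not any particular volume manager's code.

```python
def xor_blocks(blocks):
    """Bit-by-bit exclusive OR of equal-length blocks (the RAID check data function)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# A row of a four-disk RAID volume: three user data blocks plus their parity.
row = [b"\x00\x0f\xff\x55", b"\x0f\x0f\x00\xaa", b"\xf0\x00\xff\x00"]
parity = xor_blocks(row)

# Regeneration: the missing block is the XOR of the surviving blocks and the parity.
surviving = [row[0], row[2], parity]          # suppose the disk holding row[1] failed
regenerated = xor_blocks(surviving)
assert regenerated == row[1]
print(parity.hex(), regenerated.hex())
```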
The Hardware Cost of RAID
The incremental hardware cost of failure tolerance in the RAID volume illustrated in Figure 4.5 is lower than that of a mirrored volume—one "extra" disk block for every two blocks of user data, compared to a mirrored volume's extra disk block for each block of user data. RAID supports the configuration of volumes with any number of data disks at a hardware cost of only one disk for parity. The volume illustrated in Figure 4.5 has an overhead cost of 50 percent (three disk blocks required for every two disk blocks of user data storage²). The volume shown in Figure 4.8 has an overhead cost of 33 percent. Larger RAID volumes with much lower hardware overhead cost can also be configured. For example, an eleven-disk RAID volume would hold as much user data as 10 disks, at an overhead cost of 10 percent. Figure 4.9 shows some of the more popular RAID array sizes encountered in practice.
Figure 4.9 Frequently encountered RAID array sizes. (Figure: a "4 plus 1" array has an overhead cost of 25 percent, a "5 plus 1" array 20 percent, and a "10 plus 1" array 10 percent; each array consists of user data disks plus one check data disk.)
If a single check data block can protect an arbitrarily large number of user data blocks, the configuration strategy for RAID arrays and volumes would seem to be simple: designate one check data disk and add user data disks incrementally as more storage capacity is required. But RAID volumes with large numbers of disks have three major disadvantages:
■■ Parity check data protects all of a volume's disks from the failure of any one of them. If a second disk in a RAID volume fails, data is lost. A three-disk RAID volume with one failed disk fails (results in data loss) if one of the two surviving disks fails. A six-disk RAID volume with one failed disk loses data if any one of five disks fails. The more disks in a volume, the more likely it is that two disk failures will overlap in time, resulting in volume failure. Moreover, when a volume fails, all the data stored in it becomes inaccessible, not just data on the failed disks. Thus, the larger the volume, the more data lost if it fails. Smaller volumes both reduce the probability of volume failure and mitigate data loss if volume failure does occur.
■■ Large RAID volumes have poor write performance. The steps required to write data to a RAID volume are described in the following section. For purposes of this discussion, it is sufficient to note that each application write to a RAID volume requires a partially serialized sequence of reads, computations, and writes involving at least two of the volume's disks. In addition, the volume manager should maintain some type of persistent³ recovery log in case of system failure during an update sequence. Writing data to a RAID volume is thus a highly serialized operation, and the more data in the volume, the greater the impact of serialization.
■■ When a failed disk is replaced, the new disk's contents must be synchronized with the rest of the volume so that all user data is regenerated and all check data is consistent with the user data it protects. Synchronization requires reading all blocks on all disks and computing user data and check data for the replacement disk. A volume with more disks takes longer to synchronize after a failure, increasing the interval during which the volume is susceptible to failure due to loss of a second disk.
Economics and experience have led most hardware RAID subsystem designers to optimize their subsystems for arrays containing four to six disks. Some designs allow administrators to specify the number of disks in each array; the designers of these products also tend to recommend arrays of four to six disks. This rule of thumb is also valid for host-based RAID volumes.
2 The overhead is not only the disks required but also the packaging, I/O buses, and host bus adapters or controllers to support them.
3 In data storage and I/O literature, the term persistent is used to describe objects such as logs that retain their state when power is turned off. In practical terms, RAID array logs are usually kept on disks or held in nonvolatile memory.
Data Striping with RAID
RAID data protection is almost always combined with data striping to balance I/O load, as described in Chapter 3. Figure 4.10 illustrates a RAID volume in which user data is striped across three of the four disks. Here, SubDisk D contains no user data. All of its blocks are used to store the exclusive OR parity of corresponding blocks on the volume's other three subdisks. Thus, Block 000 of SubDisk D contains the bit-by-bit exclusive OR of the user data in Volume Blocks 000, 004, and 008. The volume depicted in this figure offers the data protection of RAID and the performance benefits of striping . . . almost.
Figure 4.10 Data striping with RAID. (Figure: user data is striped across SubDisks A, B, and C in four-block stripe units; Stripe 0 holds Volume Blocks 000–011 and Stripe 1 holds Volume Blocks 012–023, while each block of SubDisk D holds the parity of the corresponding blocks on the other three subdisks.)
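The address arithmetic implied by Figure 4.10 can be written down directly. The sketch below assumes the figure's geometry, three data subdisks plus a dedicated parity subdisk with four-block stripe units, and is illustrative only.

```python
STRIPE_UNIT = 4     # blocks per subdisk per stripe, as in Figure 4.10
DATA_DISKS = 3      # SubDisks A, B, C hold user data; SubDisk D holds parity

def map_volume_block(volume_block):
    """Return (data subdisk index, block offset) for a volume block."""
    stripe, within = divmod(volume_block, STRIPE_UNIT * DATA_DISKS)
    subdisk, block_in_unit = divmod(within, STRIPE_UNIT)
    return subdisk, stripe * STRIPE_UNIT + block_in_unit

def parity_partners(volume_block):
    """Volume blocks protected by the same parity block (same offset on each data subdisk)."""
    _subdisk, offset = map_volume_block(volume_block)
    stripe, block_in_unit = divmod(offset, STRIPE_UNIT)
    base = stripe * STRIPE_UNIT * DATA_DISKS + block_in_unit
    return [base + d * STRIPE_UNIT for d in range(DATA_DISKS)]

print(map_volume_block(12))     # -> (0, 4): SubDisk A, block offset 4
print(parity_partners(0))       # -> [0, 4, 8]: protected by parity Block 000 on SubDisk D
```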
Writing Data to a RAID Volume
RAID works because each check data block contains the exclusive OR of the contents of all corresponding user data blocks, enabling the regeneration of user data illustrated in Figures 4.7 and 4.8. But each time an application writes data to a volume block, the corresponding parity block must be updated. Assume, for example, that an application writes Volume Block 012 of the volume shown in Figure 4.10. The corresponding parity block on SubDisk D must be changed from:
(old) Volume Block 012 contents ⊕ Volume Block 016 contents ⊕ Volume Block 020 contents
to
(new) Volume Block 012 contents ⊕ Volume Block 016 contents ⊕ Volume Block 020 contents
In other words, when an application writes Volume Block 012, the volume manager must:
■■ Read the contents of Volume Block 016 into an internal buffer.
■■ Read the contents of Volume Block 020 into an internal buffer.
■■ Compute the exclusive OR of Volume Blocks 016 and 020.
■■ Compute the exclusive OR of the preceding result with the (new) contents for Volume Block 012 supplied by the application.
■■ Make a log entry, either in nonvolatile memory or on a disk, indicating that data is being updated.
■■ Write the new contents of Volume Block 012 to Disk A'.
■■ Write the new parity check data to Disk D'.
■■ Delete the log entry or make another log entry indicating that the update is complete.
Figure 4.11 illustrates the disk reads and writes that are implied by a single application write request (log entries are not shown).
Figure 4.11 A single-block write algorithm for a RAID volume. (Figure: to write Volume Block 012, the volume manager reads Volume Blocks 016 and 020, computes the new parity from them and the new data supplied by the application, then writes the new Volume Block 012 to Disk A' and the new parity to Disk D'.)
An Important Optimization for Small Writes to Large Volumes
A useful property of the exclusive OR function is that adding the same number to an exclusive OR sum twice is equivalent to subtracting it, or not adding it at all. This can easily be seen by observing that the exclusive OR of a binary number with itself is 0 and that the exclusive OR of any number with 0 is the number itself. These facts can be used to simplify RAID parity computations as follows:
(old) Volume Block 012 ⊕ (old) parity
is the same as
(old) Volume Block 012 ⊕ [(old) Volume Block 012 ⊕ Volume Block 016 ⊕ Volume Block 020]
or
[(old) Volume Block 012 ⊕ (old) Volume Block 012] ⊕ Volume Block 016 ⊕ Volume Block 020
which is equal to Volume Block 016 ⊕ Volume Block 020.
In other words, the exclusive OR of the user data to be replaced with its corresponding parity is equal to the exclusive OR of the remaining user data blocks in the row. This is true no matter how many subdisks comprise a volume. This property suggests an alternate algorithm for updating parity:
1. Read the data block to be replaced.
2. Read the corresponding parity block.
3. Compute the exclusive OR of the two.
These steps eliminate the "old" data's contribution to the parity, leaving a "partial sum" consisting of the contributions of all other blocks in the row. Computing the exclusive OR of this partial sum with the "new" data supplied by the application gives the correct parity for the newly written data. Using this algorithm, it is never necessary to access more than two disks (the disk to which application data is to be written and the disk containing the row's parity) to execute a single-block application write to the volume. The sequence of steps for updating Volume Block 012, illustrated in Figure 4.12, would be:
1. Read the contents of Volume Block 012 into an internal buffer.
2. Read the row's parity block from Disk D' (SubDisk D) into an internal buffer.
3. Compute the exclusive OR of the two blocks read.
4. Compute the exclusive OR of the preceding result with the (new) contents for Volume Block 012 supplied by the application.
5. Make a log entry, either in nonvolatile memory or on a disk, indicating that data is being updated.
6. Write the new contents of Volume Block 012 to Disk A'.
7. Write the new parity to Disk D'.
8. Make a log entry indicating that the update is complete.
Figure 4.12 Optimized write algorithm for a RAID volume. (Figure: to write Volume Block 012, the volume manager reads the old Volume Block 012 and the row's parity block, computes the new parity from the old data, the old parity, and the new data supplied by the application, then writes the new Volume Block 012 to Disk A' and the new parity to Disk D'.)
The number of reads, writes, and computations using this algorithm is identical to that in the preceding example. For volumes with five or more disks, however, this algorithm is preferable, because it never requires accessing more than two of the volume's disks for a single-block update, whereas the earlier algorithm requires reading all user data blocks in the row that are not modified by the application's write. This optimized algorithm is universally implemented by RAID volume managers.
NOTE: These examples deal with the case of an application write of a single block, which always maps to one disk in a RAID array. The same algorithms are valid for application writes of any number of consecutive volume blocks that map to a single subdisk. More complicated variations arise with application writes that map to blocks on two or more subdisks.
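Expressed as code, the optimized single-block update is just the two reads, the exclusive OR chain, and the two writes described above. The sketch uses in-memory dictionaries as stand-ins for disks and omits the recovery log; it is a conceptual illustration, not a volume manager implementation.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk, parity_disk, offset, new_data):
    """Optimized RAID small write: touch only the data disk and the parity disk."""
    old_data = data_disk[offset]            # 1. read the data block to be replaced
    old_parity = parity_disk[offset]        # 2. read the corresponding parity block
    partial = xor(old_data, old_parity)     # 3. remove the old data's contribution
    new_parity = xor(partial, new_data)     # 4. add the new data's contribution
    # (a real volume manager would log the update here for crash recovery)
    data_disk[offset] = new_data            # 5. write the new user data
    parity_disk[offset] = new_parity        # 6. write the new parity

# Three data "disks" and a parity "disk", one 4-byte block each, parity consistent.
disks = [{0: b"\x01\x00\x00\x00"}, {0: b"\x02\x00\x00\x00"}, {0: b"\x04\x00\x00\x00"}]
parity = {0: b"\x07\x00\x00\x00"}
small_write(disks[0], parity, 0, b"\x08\x00\x00\x00")
print(parity[0])   # -> b'\x0e\x00\x00\x00', the XOR of 0x08, 0x02, and 0x04
```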
An Important Optimization for Large Writes
The preceding discussion deals with application write requests that modify data on a single subdisk of a RAID volume. If an application makes a large write request, all of the user data in a stripe may be overwritten. In the volume in Figure 4.10, for example, an application request to write Volume Blocks 12–23 would overwrite all the user data in Stripe 1. When an application writes a full stripe to a RAID volume, new parity can be computed entirely from data supplied by the application. There is no need for the volume manager to perform overhead disk reads. Once new parity has been computed from the application data stream, both user data and parity can be written concurrently. Concurrent writing improves data transfer performance compared to a single disk, because the long data transfer is executed in concurrent parts, as with the striped volume in the example of Figure 3.14.
Even with these optimizations, the overhead computation and I/O implied by RAID write algorithms take time and consume resources. Many users deemed early RAID subsystems unusable because of this so-called write penalty. Today, the RAID write penalty has been essentially hidden from applications through the use of nonvolatile cache memory, at least in controller-based RAID implementations. But nonvolatile memory is not usually available to host-based RAID volume managers, so it is more difficult for these to mask the write penalty. For this reason, host-based mirroring and striped mirroring are generally preferable to host-based RAID. In either case, for applications whose I/O loads consist predominantly of updates, striped-mirrored or mirrored-striped volumes are generally preferable to RAID volumes from both performance and disk failure-tolerance standpoints.
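In code, the full-stripe case described above collapses to a single exclusive OR pass over the application's own buffers, with no disk reads at all, before data and parity are written concurrently. A minimal sketch, again with in-memory stand-ins for the disks and the Figure 4.10 geometry assumed:

```python
def xor(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def full_stripe_write(data_disks, parity_disk, stripe, new_units):
    """new_units: one stripe unit of new data per data disk, supplied by the application.
    Parity is computed from the application data alone; no old data or parity is read."""
    parity_unit = [xor([unit[i] for unit in new_units]) for i in range(len(new_units[0]))]
    for disk, unit in zip(data_disks, new_units):
        disk[stripe] = unit                 # these writes can all proceed concurrently
    parity_disk[stripe] = parity_unit

data_disks = [dict(), dict(), dict()]
parity_disk = dict()
stripe_units = [[b"\x01" * 4] * 4, [b"\x02" * 4] * 4, [b"\x04" * 4] * 4]
full_stripe_write(data_disks, parity_disk, stripe=1, new_units=stripe_units)
print(parity_disk[1][0])   # -> b'\x07\x07\x07\x07'
```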
The Parity Disk Bottleneck
Even with a perfectly balanced I/O load of overlapping application write requests to SubDisks A, B, and C in the volume depicted in Figure 4.10 (the goal of data striping), there is an I/O bottleneck inherent in this RAID implementation. For each application write request, the volume manager executes the steps listed earlier in the section "An Important Optimization for Small Writes to Large Volumes." While striping tends to distribute user data I/O evenly across SubDisks A, B, and C, every write request to the volume requires that the volume manager read and write some block(s) on SubDisk D. The parity disk is the volume's write performance limitation. The maximum rate at which volume writes can be executed is about half the parity disk's I/O request execution speed.
This bottleneck was discerned by early RAID researchers, who devised a simple but effective means of balancing the overhead I/O load across all of a volume's disks. Interleaving parity with striped data distributes parity as well as data across all disks. Figure 4.13 illustrates a RAID volume with striped user data and interleaved parity. The concept of parity interleaving is simple. The parity blocks for Row 0 in this figure are located on the "rightmost" disk of the volume. The parity blocks for Row 1 are stored on the disk to its left, and so forth. In this example, the parity blocks for Row 4 would be stored on SubDisk D, those for Row 5 on SubDisk C, and so forth.
Figure 4.13 RAID volume with data striping and interleaved parity. (Figure: a four-disk volume with four-block stripe units; the parity blocks for Stripe 0 are stored on SubDisk D, those for Stripe 1 on SubDisk C, those for Stripe 2 on SubDisk B, and those for Stripe 3 on SubDisk A, with user data striped across the remaining subdisks of each stripe.)
Distributing parity across an entire RAID volume balances the overhead I/O required by parity updates. All of the reads and writes listed earlier must still occur, but with interleaved parity, no single disk is a "hot spot." Under the best of circumstances (user data and parity all mapping to separate disks), a volume with interleaved parity can execute application write requests at about one-fourth the combined speed of all the disks in the volume. For a RAID volume with four disks, performance is equivalent whether parity is distributed or not. For RAID volumes with five or more disks, distributed parity generally leads to better performance.
With or without a cache to mask the write penalty, interleaved parity improves RAID volume performance, so the technique is in ubiquitous use today. The combination of interleaved parity and data striping is often called RAID Level 5, after the nomenclature of the Berkeley researchers. The number 5 does not refer to the number of subdisks in the volume, as is sometimes thought. RAID Level 5 volumes with as few as 3 and as many as 20 disks have been implemented. Typically, however, RAID Level 5 volumes contain between four and ten disks, as illustrated in Figure 4.9.
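One way to write down the rotation just described (parity on the rightmost subdisk for Stripe 0, moving one subdisk to the left for each following stripe) is sketched below. The four-block stripe unit comes from Figure 4.13; the left-to-right fill of user data within each stripe is an assumption of this sketch, and real implementations vary in that detail.

```python
NDISKS = 4          # subdisks in the volume
STRIPE_UNIT = 4     # blocks per subdisk per stripe, as in Figure 4.13

def parity_disk(stripe):
    """Parity rotates leftward: stripe 0 -> last subdisk, stripe 1 -> next to last, ..."""
    return (NDISKS - 1) - (stripe % NDISKS)

def map_volume_block(volume_block):
    """Map a volume block to (subdisk index, block offset) with interleaved parity."""
    data_disks = NDISKS - 1
    stripe, within = divmod(volume_block, STRIPE_UNIT * data_disks)
    nth_data_disk, block_in_unit = divmod(within, STRIPE_UNIT)
    p = parity_disk(stripe)
    # User data fills the subdisks left to right, skipping the stripe's parity disk.
    subdisk = nth_data_disk if nth_data_disk < p else nth_data_disk + 1
    return subdisk, stripe * STRIPE_UNIT + block_in_unit

print(parity_disk(0), parity_disk(1), parity_disk(4))   # -> 3 2 3 (SubDisk D, C, D again)
print(map_volume_block(0))    # -> (0, 0): Volume Block 000 on SubDisk A
print(map_volume_block(12))   # -> (0, 4): Stripe 1 user data starts on SubDisk A
```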
A Summary of RAID Volume Performance
Figure 4.14 summarizes the performance of a RAID volume relative to that of a single disk. This summary reflects only disk and software effects; it does not take write-back cache or other performance optimizations into account. As this figure suggests, a RAID volume typically executes both large sequential and small random read requests at a higher rate than a single disk. This is due to the I/O load balancing that results from striping user data across disks, as described in the example of Figure 3.14. But when writing, there is simply more overhead I/O to be done, as described in the section "An Important Optimization for Small Writes to Large Volumes." A RAID volume therefore executes write requests much more slowly than a single disk. For writes that modify entire stripes of data, this penalty can be mitigated by precomputing parity, as described in the section "An Important Optimization for Large Writes." For small writes, though, all the operations listed in "Writing Data to a RAID Volume" must be performed.
Figure 4.14 Relative I/O performance of RAID volumes. (Figure: throughput in I/O requests per second plotted against request execution time for large reads, small reads, large writes, and small writes on a RAID volume, with single-disk performance shown for comparison.)
Failure-Tolerant Volumes and Data Availability
Figure 3.15 in the previous chapter and Figure 4.14 here summarize the performance characteristics of data striping and RAID. Understanding what data protection mirroring and RAID provide (and do not provide) is equally important for application designers and system administrators who may be responsible for keeping data stored on hundreds or even thousands of disks available to applications.
Failure-tolerant volumes basically protect against loss of data due to disk failure and loss of data accessibility due to disk inaccessibility (as, for example, when a host bus adapter fails). Use of either mirrored or RAID failure-tolerant volumes means that disk failure does not stop an application from functioning normally (although it may function at reduced performance). When a disk in a mirrored or RAID volume fails, the volume is said to be reduced or degraded. For most implementations, the failure of a second disk before the first is repaired and resynchronized results in data loss. Thus, mirroring and RAID are not data loss panaceas; they just improve the odds. System administrators should ask how good mirroring and RAID are in protecting against data loss due to disk failures, as well as how significant disk failures are in the overall scheme of data processing.
Disk reliability is often expressed in mean time between failures (MTBF), measured in device operating hours. MTBF does not refer to the average time between two successive failures of a single device, but rather the expected number of device operating hours between two failures in a large population of identical devices. As an example, typical disk MTBF values today are in the range of 500,000 hours. This doesn't mean that a single device is only expected to fail after 57 years; it means that in a population of, say, 1,000 operating devices, a failure is to be expected every 500 hours (about every three weeks). In a population of 100 operating devices, a failure can be expected every 5,000 hours (about every 7 months).
NOTE: This analysis and the ones that follow assume that the disks in the population are operating under proper environmental conditions. If a small group of disks that are part of a larger population are in a high-temperature environment, for example, then failures may be concentrated in this group rather than being uniformly distributed.
Mirroring and Availability

Suppose that 100 identical disks with an MTBF of 500,000 hours are arranged in 50 mirrored pairs. About every seven months, one member of one mirrored pair can be expected to fail. Data is still available from the surviving member. The question is: What event could make data unavailable, and how likely is that event? Qualitatively, the answer is simple: If the surviving member of the mirrored pair fails before the first failed disk is replaced and resynchronized, no copy of the data will be available. Figure 4.15 presents a rough analysis of this situation, which, while not mathematically rigorous, indicates why mirroring is valued so highly by professional system administrators.

Figure 4.15 Mirroring and failure rates. (Population: 100 disks arranged as 50 mirrored pairs; expected failures in 5,000 hours = 1 disk. The event that can cause data loss is failure of the failed disk's mirror, a population of 1 disk: 1/100 disks × 50/5,000 hours = 1/10,000.)

From a pool of 100 identical disks arranged as 50 mirrored pairs, in any 5,000-hour window, one failure is to be expected. Until the failed disk has been replaced and resynchronized, failure of its mirror could result in data loss. The failed disk's mirror is one specific disk, representing a population of one for the next stage of the analysis. Assume further that replacement and resynchronization of the failed disk takes about two days (∼50 hours). The question then becomes: In a population
of one disk, what is the expectation that a failure will occur in a 50-hour period? The answer is that with 1⁄100th of the population and 1⁄100th of the time period in which one failure is expected, the chance of the second failure is 1 in 10,000. Moreover, a two-day repair time is somewhat lax by today's data center standards. Were the calculation to assume a more rigorous five-hour repair and resynchronization time, the chance of a second failure would be 1 in 100,000. Two conclusions can be drawn from this analysis:

■■ Mirroring does not eliminate the possibility of data loss due to disk failure, but it does improve the odds greatly.

■■ Mirroring is not a substitute for proper data center management procedures. The order-of-magnitude improvement in the probability of data loss achieved by reducing repair and resynchronization time from 50 hours to 5 hours demonstrates this.
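The arithmetic behind these odds is easy to verify directly. The short Python sketch below is purely illustrative; the MTBF, population size, and repair windows are the assumptions used in the text, not measured values. It reproduces the 1-in-10,000 and 1-in-100,000 figures:

    # Rough second-failure odds for mirrored pairs, using the text's assumptions.
    MTBF_HOURS = 500_000      # assumed per-disk mean time between failures
    POPULATION = 100          # total disks (50 mirrored pairs)

    def second_failure_odds(repair_hours):
        # Expected interval between failures somewhere in the whole population.
        hours_per_population_failure = MTBF_HOURS / POPULATION   # 5,000 hours
        # After one failure, only the failed disk's mirror matters (population of 1),
        # and only during the repair/resynchronization window.
        return (1 / POPULATION) * (repair_hours / hours_per_population_failure)

    print(second_failure_odds(50))   # 0.0001  -> 1 in 10,000 (two-day repair)
    print(second_failure_odds(5))    # 0.00001 -> 1 in 100,000 (five-hour repair)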
RAID and Availability

Suppose that the same 100 disks are arranged in 20 five-disk RAID volumes. In every seven-month window, one member of one volume is expected to fail sometime. Data is still available by regeneration from the surviving four disks, using a procedure similar to that illustrated in Figure 4.16. What events could make data unavailable in this case, and how likely are they to occur? Again, the answer is simple: If any surviving member of the degraded volume fails before the failed disk has been replaced and resynchronized, data will be lost. Figure 4.16 illustrates this scenario.

Figure 4.16 RAID volumes and failure rates. (Population: 100 disks arranged as 20 five-disk RAID arrays; expected failures in 5,000 hours = 1 disk. The event that can cause data loss is failure of any other disk in the degraded array, a population of 4 disks: 4/100 disks × 50/5,000 hours = 1/2,500.)

Once a disk has failed and its volume has become degraded, failure of any of the volume's four remaining disks results in data loss. The question is therefore: What is the expectation that in a population of four disks, a failure will occur in a 50-hour period? The answer is
that with 1⁄25 of the full population and 1⁄100 of the time period in which one failure is expected, the chance of the second failure is 1 in 2,500. For many applications, this level of protection may be adequate. For others, particularly mission-critical ones, the importance of continuous data availability may eliminate cost as a factor. For these, two- or three-copy mirroring is generally preferred. There are two additional factors to be considered when evaluating mirroring and RAID as storage alternatives:

■■ Once a failed disk is replaced, resynchronization with the rest of the volume takes significantly longer than resynchronization of a mirrored disk. This is intuitively plausible. To synchronize a disk newly added to a mirrored volume, all of the data must be read from one surviving disk and written to the new disk. To synchronize a disk newly added to a degraded RAID volume, all data must be read from all of the volume's disks, the exclusive OR computation performed on corresponding blocks, and the results written to the new disk (see the sketch following this list). Compared to mirrored volumes, RAID volumes take longer to resynchronize after a disk failure and replacement.

■■ As described on page 61, it is possible to split a mirrored or striped mirrored volume into two plexes containing identical data. Each of these plexes can be mounted as a volume and used for separate applications, such as point-in-time backup while online applications are operational. Since a RAID volume does not contain a complete second copy of user data, this functionality is not possible with RAID volumes.
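To make the exclusive OR computation mentioned in the first point above concrete, the following Python sketch rebuilds one block of a replacement disk from the corresponding blocks of the surviving disks. It is illustrative only; the block contents are arbitrary byte strings and the code is not any vendor's implementation:

    # Rebuild one block of a replaced disk by XORing the corresponding blocks
    # of all surviving disks (data blocks and the parity block behave identically).
    def rebuild_block(surviving_blocks):
        result = bytearray(len(surviving_blocks[0]))
        for block in surviving_blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    # Illustrative 4-byte "blocks" from the four surviving disks of a five-disk volume.
    survivors = [b"\x0f\x00\xaa\x01", b"\xf0\x0f\x55\x02",
                 b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
    replacement_block = rebuild_block(survivors)
    # Writing replacement_block to the new disk restores the failed member's contents.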
Thus, while RAID costs less than mirroring in terms of number of disks required, low cost comes with a higher probability of an incapacitating failure, a longer window of risk, and lesser functionality. The system administrator, application designer, or organizational data administrator must balance the cost of data protection against the cost of downtime and data loss and make a judgment accordingly.
What Failure-Tolerant Volumes Don't Do

While they are not perfect, RAID and mirroring do significantly reduce the probability of online data loss. Odds of 1 in 2,500 aren't bad. But the foregoing discussion has focused entirely on disk failures. There are several other storage and I/O system components whose failure can result in data loss, even if mirroring or RAID is in use. These include:

Disk buses, host bus adapters, and interface ASICs. If any of these components fails, it becomes impossible to communicate with any of the disks on the bus. If the I/O system is configured so that only one disk from any
volume is attached to a given bus, these failures are survivable. If not, bus failure can result in unavailable data (but probably not permanently lost data if there are no disk failures and if the volume manager handles the bus failure properly).

Power and cooling subsystems. A failed power supply makes all the disks it serves inaccessible. A failed cooling fan eventually results in destruction of all the disks it cools. In most instances it is prohibitively expensive to equip each disk with its own power supply and cooling device (although this has been done). More commonly, power supplies and fans are configured in redundant pairs, with two units serving each set of eight to ten disks. The capacities of the power and cooling units are such that one can adequately power or cool all of the disks if the other should fail.

External controllers. Failure of an external (outside the server cabinet) disk controller makes all the disks and arrays connected to it inaccessible. Most external controllers are therefore designed so that they can be configured in pairs connected to the same disks and host computers. When all components are functioning, the controllers typically share the I/O load, with some disks assigned to one and the remainder to the other. When a controller fails, its partner takes control of all disks and services all host I/O requests.

Embedded controllers. Embedded controllers are not designed to fail over to partner controllers as external controllers are. From an I/O standpoint, failure of an embedded controller is equivalent to failure of the computer in which it is embedded. Some embedded controller vendors have devised solutions that cooperate with high-availability cluster operating systems to transfer control of disks and arrays from a failed embedded controller to a second unit embedded in a separate computer. In these configurations, applications must also fail over to the second host whose embedded controller controls the disks and arrays.

Host computers. Except with regard to host-based volume managers, a host computer failure is not precisely a failure of the I/O subsystem. Increasingly, however, it is a business requirement that applications resume immediately after a host computer failure. This need has given rise to clusters of computers that are connected to the same disks and clients and that are capable of failing over to each other, with a designated survivor restarting a failed computer's applications. This impacts different types of RAID subsystems differently. Host-based volume managers and embedded RAID controllers must be able to take control of a failed computer's disks, verify volume consistency and make any necessary repairs, and present the volumes or arrays to the alternate computer. External RAID controllers must be able to present virtual disks to alternate host computers either automatically on failure detection or when directed to by a system administrator (host failure does not usually raise a data consistency issue with external RAID controllers).
Humans and applications. Mirrored and RAID volumes store and retrieve data reliably, regardless of the data's content. It is sometimes observed that RAID volumes and arrays write wrong data just as reliably as they write correct data. RAID does not protect against corruption of data due to human errors or application faults. A combination of high-integrity data managers (e.g., journaling file systems or databases) and a well-designed program of regular backups of critical data is the only protection against these sources of data loss.

This short list of other possible disk subsystem failure modes points out why RAID by itself should not be regarded as a complete high-availability data management solution. As users have become more sophisticated, they have learned that protection against disk failure is necessary, but not sufficient, for nonstop data center operation. The entire I/O subsystem, as well as host computers and applications, must be protected against equipment failures. RAID is only one building block in a highly available data processing solution.
I/O Subsystem Cache

One of the most important I/O performance enhancements of the last decade has been the exploitation of solid-state memory as cache in I/O subsystems. I/O subsystem cache is commonly found in disks, RAID controllers, and at multiple levels in the host computer, as Figure 4.17 illustrates. Cache is effective when reading because it shortens the time that an application must wait for data to be delivered to it. Cache is also effective when writing because it nearly eliminates the time that an application must wait for data to be delivered to the disk on which it will be stored.

Figure 4.17 Read cache effectiveness. (The figure plots cache effectiveness against "distance" from the application for database cache (pages), file system cache (metadata), host OS cache (blocks), controller cache (array blocks), and disk cache (disk blocks).)
Read cache is solid-state memory in which data is stored in anticipation that an application will require it. The determination of which data to hold in read cache is essentially an informed guess. An important determinant of the effectiveness of read cache is the accuracy of this guess; the guessing strategy is called a caching policy. Accurate guessing (an appropriate policy) results in a high percentage of application requests being satisfied by delivering data from cache. Less accurate guessing (an inappropriate policy) results in cache being occupied by data that is never requested by applications, and so is effectively wasted. There are two basic forms of read cache policy:

Anticipatory or read-ahead. The cache manager either notices or is informed of an application data access pattern and reads data from disk according to that pattern even before the application requests it. Sequential access of large files or database tables usually benefits from a read-ahead policy.

Retentive. The cache manager holds data written by applications in cache in anticipation that it will be reread within a short time. RAID parity is an example of data that typically benefits from this type of policy.

Whatever the specific policy, it is intuitively obvious that accuracy will benefit from knowledge of the nature of the data. For example, if it is known that data is being read sequentially, then a read-ahead policy is beneficial, and retention is of little or no value because data read sequentially is not reread. Following the cache elements represented in Figure 4.17 from right to left, each has more information about the nature of the data it processes than the one to its right.
Disk Cache

As explained in Chapter 1, disks present a uniform address space of fixed-size blocks in which data can be stored. The logic within the disk does not ascribe any meaning to the data stored in its blocks. A disk cannot distinguish a file system's access to metadata from an application access to a customer record. It therefore has very little contextual information on which to base cache policy. A disk can, however, discern a sequential data access pattern. Disk read cache policies are therefore uniformly read-ahead policies. In the absence of other I/O requests, a modern disk will typically read some amount of data beyond what is requested by an application and hold it in cache, anticipating that it may be required in the near future. This policy optimizes I/O performance if application reads are actually sequential. Since read-ahead is terminated immediately if another I/O request is made to the disk, it has no ill effect on performance.
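The read-ahead policy just described can be sketched in a few lines of Python. The logic below is illustrative only (real disk firmware is considerably more elaborate, and the prefetch depth is an arbitrary assumption); it prefetches while requests arrive sequentially and stops anticipating as soon as a nonsequential request appears:

    # Illustrative read-ahead policy: prefetch only while requests arrive sequentially.
    class ReadAheadCache:
        def __init__(self, prefetch_blocks=64):
            self.prefetch_blocks = prefetch_blocks
            self.cache = set()          # block numbers currently held in cache
            self.next_expected = None   # next block of a detected sequential stream

        def read(self, block):
            hit = block in self.cache
            if block == self.next_expected:
                # Sequential pattern continues: read ahead beyond the request.
                self.cache.update(range(block, block + self.prefetch_blocks))
            else:
                # Nonsequential request: stop anticipating the old stream.
                self.cache.add(block)
            self.next_expected = block + 1
            return hit

    cache = ReadAheadCache()
    print([cache.read(b) for b in (100, 101, 102, 5000)])  # [False, False, True, False]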
Disk write cache can eliminate application waits for seeking and rotational latency. As soon as a disk has the application’s data in its cache, it can signal to the application that the write is complete. The application can continue, while the disk seeks and rotates to the correct location and writes data in parallel with it. While disk write cache can improve the performance of applications that update data frequently, it is often disabled by RAID controllers, volume managers, file systems, and database management systems. The reason for this is that disk cache is volatile. If electrical power is lost, so are cache contents. Disks are generally treated as persistent data stores, whose contents survive power outages. Write cache, which can allow the data from an apparently successful write to disappear, changes disk semantics in a way that is not easy for controlling software, such as volume managers, to overcome. These data managers therefore frequently disable disk write cache.
RAID Controller Cache

Since RAID controllers present disklike flat block address spaces to their host computers, they too have little information about the nature of the data they are reading and writing. But they do have some advantages over disks that allow them to use cache more effectively:

■■ They control multiple disks, so they are able to adjust cache utilization to favor heavily accessed disks.

■■ They coordinate the activities of multiple disks; therefore, they can implement cache policies that, for example, favor retaining RAID parity blocks over user data blocks.

■■ They are generally less cost-constrained than disks and therefore tend to have larger cache.

■■ Recognizing the problem with lost data, most RAID controller vendors have made their cache nonvolatile, typically through battery backup. Thus, if a system crash or power outage results in loss of external power to a RAID controller, cache contents are retained until power is restored and written at that time.
Another RAID problem introduced by controller cache is failure of the controller itself. Many RAID controllers, particularly external ones, can be configured in redundant pairs, with either capable of taking over the entire disk I/O load in the event that the other fails. If a RAID controller fails with unwritten data in its write cache, that data must be written by the other controller before it can make the failed controller’s virtual disks available to applications. RAID controller vendors have developed several techniques for dealing
with this. All such techniques ultimately reduce to writing all cached data to at least two independent cache memories controlled by different controllers. Thus, RAID controller vendors have dealt with the problem of write cache volatility essentially by making their caches nonvolatile and failure-tolerant. This, combined with size and allocation flexibility, makes RAID controller cache much more attractive to system designers. RAID controller caches are generally used where they are available.
Operating System Cache

Cache is also implemented by most operating systems, both as an I/O performance enhancement (e.g., UNIX buffer cache) and for storing data required by the operating system itself. With very few exceptions, dynamic random access memory (DRAM) is used to implement operating system cache (and other types of host-based cache). DRAM is volatile; that is, if power is lost or switched off, DRAM contents are lost. This is not a problem for read cache, since any cached data can be read from backing store (disk) after a restart. On the other hand, deriving the fundamental benefit of write cache (reduction in the I/O service time apparent to applications) inherently creates a risk: If (a) an application makes a write request, (b) data is moved to the operating system cache, (c) the application is informed that its write is complete, and (d) the system crashes with loss of cache contents before the cached data can be written to disk, there is a strong possibility of data corruption or incorrect application behavior after a system restart.

An automatic teller machine is a simple example of how a system failure with data in write cache can lead to incorrect behavior. Imagine that a bank account holder makes a request to withdraw cash from an ATM. After the request is validated, a debit record against the customer's account is written and a message is sent to the ATM ordering it to disburse funds. If the debit record is cached, and a system failure occurs before it can be written, the bank will have no record that funds were disbursed to the customer, even though in fact the customer walked away with cash in hand.

With the exception of mainframe operating systems such as IBM's MVS and Compaq's OpenVMS, loss of data in write cache has typically been treated as an acceptable risk. Operating systems provide services by which a data manager or application can flush all or part of a cache; but, traditionally, it has been the data manager's responsibility to protect the integrity of its data. The two types of cache provided by operating systems reflect different levels of awareness of the data being cached. With a few exceptions, operating system cache manager services provided to data managers have no information about the nature of the data they are caching.
However, special-purpose caches used by the operating system itself do reflect a high degree of data awareness. For example, network routing tables or common code libraries may be cached by an operating system. In both cases, only a specific type of data is cached; and in both cases the operating system is intimately aware of the data's usage characteristics (e.g., code libraries typically have usage counts that indicate whether they can be unloaded to free memory space).
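The flush services mentioned above differ in detail from one operating system to another, but the principle is the same everywhere: the data manager, not the operating system, decides when data must reach stable storage. The following Python fragment is a hedged illustration (the file name and record contents are hypothetical, echoing the ATM example); flush() moves the data out of the language runtime's buffers, and os.fsync() asks the operating system to write its cached data for the file to the device:

    import os

    # Force a critical record through the operating system's write cache to the device
    # before telling anyone the transaction is complete.
    with open("debit.log", "ab") as log:
        log.write(b"debit,account=12345,amount=200\n")
        log.flush()              # runtime buffers -> operating system cache
        os.fsync(log.fileno())   # operating system cache -> disk (or volume)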
File System Metadata Cache

Protection against the consequences of loss of cached data within a file or database is conventionally regarded as the responsibility of the data manager or application. File systems, in contrast, have a global view of the entire structure of a disk or volume and are, therefore, responsible for preserving the integrity of that structure. File systems must also offer good performance in multiapplication environments, so caching of file system structural data is a practical necessity. Again, read cache is not problematic, because unmodified metadata cached for reading can be reread from backing store if lost due to a system crash. Write-back caching of file system metadata is also crucial to overall system performance. File system metadata that has been modified in cache but not yet written to disk must be preserved if disks and volumes are to have continued structural integrity after a system crash. If data in a file is lost because a system crashes while it is still in write cache, the application that wrote the data suffers. If, on the other hand, cached file system metadata is lost, potentially all of the applications that use data in the file system could be adversely affected.

This has been recognized since the early days of operating systems; therefore, operating systems provide utility programs that verify file system structure after system crashes. Perhaps the best known of these is the file system check program (fsck), available in some form in all UNIX systems. The equivalent program for the Windows NT and Windows 2000 operating systems is called CHKDSK. File system check programs verify that disk or volume space is used consistently (e.g., neither allocated to two or more files nor unaccounted for entirely). To do this, they must scan the file system's complete metadata structure. Moreover, the scan must be complete before the file system can be used by applications. As disk and volume capacities rise, file systems routinely accommodate millions of files. Checking such file systems after a crash can mean that applications cannot restart for hours after a system crash. This is obviously unacceptable and has led to the concept of journaled file systems, of which NTFS is an example.
A journaled file system has its roots in the fact that file system metadata updates are generally done in batches. When a file is created, for example:

■■ Space must be allocated for the data structures that describe the file.

■■ Space must be allocated for the user data the file will contain.

■■ The data structures that describe the file (e.g., its name, ownership, protection attributes, and so forth) must be written.

■■ The space allocated must be removed from the file system's free space pools.
All of these actions require reading and writing file system metadata structures that are not contiguous and that are likely to be in use by other file operations. Journaled file systems create a single record that describes all the metadata updates required to accomplish an entire file system transaction (e.g., creation, extension, truncation, or deletion of a file or directory) and write that record to a log or journal. The I/O operations that update the actual metadata can then be done "lazily," when the file system has opportunity; they need not block the continued execution of the application that requested the file system update. If a system crashes with cached file system metadata updates unwritten, the file system recovery process need only read the journal and make sure that all the metadata updates indicated in it have been applied to actual file system metadata on disk. Since the number of metadata updates at risk is typically very small compared to the totality of the file system's metadata, replaying a journal is much faster than scanning all of the file system's metadata with a file system check program. The advantages of a file system journal over a complete file system check are obvious:

■■ Instead of checking all file system metadata, only the metadata that is known to be at risk need be checked.

■■ Since the journal is time-ordered, the recovery process is able to bring the file system to the most up-to-date internally consistent state possible. With file system checking, this is not always possible.
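The following Python sketch illustrates the journaling idea in miniature. It is not NTFS's log format (which is not described here); it simply shows that a transaction's metadata updates are described in a single journal record written first, and that recovery consists of replaying those records:

    # Illustrative metadata journal: one record per file system transaction.
    journal = []          # stands in for the on-disk log, oldest first
    metadata = {}         # stands in for on-disk metadata structures

    def create_file(name, blocks):
        # Describe every metadata update of the transaction in a single record...
        record = {"op": "create", "name": name, "blocks": blocks}
        journal.append(record)      # ...and write it to the journal first.
        # The actual metadata updates may be applied lazily, after the journal write.

    def replay(journal, metadata):
        # Crash recovery: make sure every journaled update reached the metadata.
        for record in journal:
            if record["op"] == "create":
                metadata[record["name"]] = record["blocks"]

    create_file("report.doc", [1041, 1042, 1043])
    # ...system crashes before the lazy metadata writes complete...
    replay(journal, metadata)
    print(metadata)   # {'report.doc': [1041, 1042, 1043]}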
File system metadata written to a disk journal is effectively also held in a write cache until it is actually written to on-disk data structures within the file system. This write cache is highly application-specific (only metadata is held in it). The file system has intimate knowledge of the cached metadata and can therefore make highly optimal decisions about what should or must be cached and the order in which metadata should be written. Journal entries are the file system’s way of protecting against loss of file system integrity if a system crash occurs with unwritten metadata in cache.
Database Management System and Other Application Cache

Database management systems and other applications that manage data also rely heavily on DRAM cache. Like file systems, they have very detailed knowledge of the nature of the data they manage and are therefore able to make optimal decisions about how to use cache. Also like file systems, they must protect against the consequences of system crashes with unwritten data in their write caches. They do this with logs or journals that record related updates ("transactions") that can be applied after a system crash.
PART TWO

Volume Management for Windows Servers
Part I of this book described general architectural principles of disks, RAID subsystems, and volume management. This next chapter explains how these principles are manifested in Windows NT technology operating system1 environments; and the following chapters detail the specific capabilities of Windows volume managers and the Disk Management snap-in for the Microsoft Management Console (MMC), which replaces the Windows NT Disk Administrator for managing these volumes.
1 This book uses the phrase "Windows NT operating systems" as a collective term to denote both the Windows NT Version 4.0 operating system and the Windows 2000 operating system, which is described by Microsoft as "with Windows NT technology."
CHAPTER 5

Disks and Volumes in Windows 2000
The Windows Operating Systems View of Disks

Windows NT technology operating systems use a common disk media structure that supports backward compatibility with personal computers running MS-DOS and earlier Windows operating systems, and coexistence with other operating systems such as UNIX and OS/2. The storage capacity of disks used with Windows NT operating systems is organized into partitions that are functionally similar to the subdisks described in Chapter 2.
Starting the Computer

Regardless of how a disk is organized, the computer needs a place to begin its search for the structural information that describes the location and layout of operating systems, programs, and data on the disk. For the Intel Architecture (IA) computers on which Windows NT operating systems run, system startup is done by Basic Input Output System (BIOS) software built into the computer's mainboard. A computer's BIOS runs when the computer is powered on; it conducts power-on self-tests (POSTs), offers the user the option of changing certain hardware parameters, and starts the process of loading an operating system (OS) from a diskette or hard disk.

Built-in BIOS can be extended. For example, most SCSI host bus adapters include their own BIOS code for controlling SCSI disks and accessing data on them. When a computer starts up, its built-in BIOS probes a memory address
region reserved for host bus adapters. Each adapter found is permitted to overlay the “jump to” address for executing functions relevant to it. For example, IA BIOS hard-disk I/O functions are invoked using the INT 13 instruction, which transfers program control to an address stored at a designated location in low memory. The computer’s built-in BIOS supplies this address when it starts up. A SCSI host bus adapter with its own BIOS for disk functions would overlay the transfer address in this location with the starting address of its BIOS code. Thus, an INT 13 disk read request made by the built-in BIOS would transfer control to the extended BIOS on the SCSI host bus adapter, which would execute the request. Figure 5.1 illustrates the operation of the BIOS in starting up an Intel Architecture computer.
Locating and Loading the Operating System Loader

In Intel Architecture computers, the root of disk structural information is called the master boot record (MBR). The master boot record is the first 512-byte block1 on a disk (the block with the numerically lowest address) and must be in a predefined fixed format because it is interpreted by the BIOS. C-H-S disk addressing (described in Chapter 1) predated logical block addressing and is therefore more generally applicable on older computers; consequently, IA BIOS implementations use C-H-S addressing by default to locate and load an operating system into the computer. The master boot record is located in the first sector (sector 1) on cylinder 0, head 0.

1 Intel Architecture computer BIOS only supports 512-byte disk blocks or sectors. Therefore, in Windows operating system contexts, the terms block and sector are used interchangeably.
Figure 5.1 BIOS extensions: an example. (1. The built-in BIOS looks for adapter BIOS extensions. 2. A BIOS extension overlays the INT 13 "jump to" address. 3. The built-in BIOS uses INT 13 to request a disk read. 4. The read request is rerouted to the extended BIOS code for the SCSI disk. 5. The SCSI disk executes the read request. 6. The SCSI disk delivers data to the host.)
Figure 5.2 illustrates the contents of an IA BIOS-compatible master boot record. The master boot record contains both a small program and a partition table containing disk partition descriptors. The last startup action of an IA computer's built-in BIOS is to read the master boot record and begin execution of the small program in it. The master boot record must be contained within a single disk block, hence its fixed size of 512 bytes. The more partitions described in a master boot record, the less space is available for the startup program. By convention, a fixed-size partition table capable of describing four partitions is always used. Four is therefore the maximum number of partitions that a disk formatted for use with IA BIOS implementations can support.

Each entry in the partition table contains four elements: the partition's start and end addresses expressed in C-H-S format, the partition's starting disk block number in Logical Block Address format (described in Chapter 1), and a size expressed in disk blocks. In addition to the location of the partition on the disk and its size, the partition table entry contains two other important pieces of information:

Boot flag. Indicates whether the partition is the one from which an operating system should be loaded at system startup time.

Partition type. Denotes the type of file system for which the partition is structured and whether the partition is used by FTDISK. Windows operating systems support FAT16, FAT32, and NTFS file systems.

Figure 5.2 Master boot record format. (The master boot record occupies cylinder 0, head 0, sector 1. It contains a small program to check partition table validity and load the boot sector, a maximum of 446 bytes; a disk partition table with one 16-byte entry for each of four partitions; and the master boot record signature, the constant X'aa55'. Each partition entry holds a boot flag, the partition start address, the partition type (file system code), the partition end address, the partition starting block, and the partition size.)

The program that resides in the master boot record is typically supplied by the OS vendor, and is written when the operating system is installed, or by the
vendor of a partition management software package. The role of this program is threefold:

■■ To validate and interpret the partition table

■■ To determine which partition to boot from (by reading the boot flags in the partition table entries) and to read the boot program from the boot partition

■■ To transfer control to it
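Because the partition table's layout is fixed, it is straightforward to decode. The Python sketch below is a simplified reading of the classic layout described above (it ignores the C-H-S fields and checks only the signature); the disk image file name is hypothetical:

    import struct

    def parse_mbr(sector):
        # sector: the 512-byte master boot record read from block 0 of the disk.
        if sector[510:512] != b"\x55\xaa":
            raise ValueError("missing master boot record signature")
        entries = []
        for i in range(4):
            raw = sector[446 + 16 * i : 446 + 16 * (i + 1)]
            boot_flag, ptype, start_lba, size = struct.unpack("<B3xB3xII", raw)
            entries.append({
                "bootable": boot_flag == 0x80,   # boot flag
                "type": ptype,                   # partition type (file system code)
                "start_lba": start_lba,          # starting block (LBA)
                "size_blocks": size,             # size in 512-byte blocks
            })
        return entries

    # Example: read the MBR from a disk image file and list its partitions.
    # with open("disk.img", "rb") as f:
    #     for entry in parse_mbr(f.read(512)):
    #         print(entry)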
Extended Partitions and Logical Disks

It is sometimes administratively convenient to divide a large disk into more than four partitions. But a larger partition table would reduce the master boot record space available for system startup code. In IA architecture systems, this limitation is mitigated through the use of extended partitions that contain logical disks. Instead of being formatted by a file system, an IA BIOS-compatible disk partition may be designated as an extended partition that contains one or more logical disks. Each logical disk may be formatted by a file system. Figure 5.3 shows the relationship of an extended partition containing three logical disks to its partition table entry in the master boot record.
Figure 5.3 An extended partition and logical disks. (The figure shows a master boot record whose partition table entry, of type Extended Partition and with its boot flag off, points to an extended partition. The extended partition contains a chain of extended boot records; each describes a logical disk, with its own boot sector and data, and points to the next extended boot record in the chain.)
The logical disks in an extended partition are linked as a list, with an extended boot record in each logical disk containing a pointer to the starting block address of the next logical disk in the extended partition. The boot flag of an extended partition is always off, because the extended partition itself is not formatted by a file system and does not contain an operating system. It is possible, however, to boot a computer from one of the logical disks in an extended partition. Therefore, in addition to interpreting the partition table, the small program in the master boot record must also be able to navigate through an extended partition, when one exists, and boot from a bootable logical disk within it, if one exists.
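Under the same simplifying assumptions as the master boot record sketch earlier in this chapter (whose parse_mbr helper it reuses), the chain of logical disks could be followed roughly as shown below. Real extended boot records also carry C-H-S fields, and the relative-addressing convention used here is a common one rather than a universal rule:

    BLOCK = 512

    def logical_disks(image, extended_start_lba):
        # Walk the linked list of extended boot records inside an extended partition.
        # Simplified: the link entry's start_lba is treated as relative to the extended
        # partition's first block, and the logical disk's start_lba as relative to its
        # own extended boot record.
        ebr_lba = extended_start_lba
        while True:
            image.seek(ebr_lba * BLOCK)
            entries = parse_mbr(image.read(BLOCK))   # EBRs reuse the MBR entry layout
            logical, link = entries[0], entries[1]
            if logical["size_blocks"]:
                yield {"start_lba": ebr_lba + logical["start_lba"],
                       "size_blocks": logical["size_blocks"]}
            if link["size_blocks"] == 0:             # a zero link entry ends the chain
                break
            ebr_lba = extended_start_lba + link["start_lba"]

    # with open("disk.img", "rb") as f:
    #     for ld in logical_disks(f, extended_start_lba=2048):
    #         print(ld)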
Loading the Operating System

The computer system BIOS loads the master boot record into memory and transfers control to the executable part of it. The small program in the master boot record validates the partition table. Validating the partition table includes verifying that:

■■ Exactly one partition or logical disk has its boot flag set.

■■ The boot partition has a signature (a known bit pattern in the partition table entry), indicating that it has been properly formatted at some past time.

■■ All partition starting and ending addresses and lengths in the partition table and extended boot records are valid.
Next, the master boot record program reads the contents of the boot sector (the first sector) from the active partition or logical disk into memory and transfers control to the executable code in it. From the point at which the boot sector of a partition is loaded into memory and the code in it begins to execute, booting and execution are OS-specific. With Windows NT operating systems, it is possible for more than one operating system to reside in the same partition or logical disk. The program in the first sector of the boot partition reads a hidden file called boot.ini, which contains the human-readable names of all bootable operating systems in the partition and the files in which their initial loaders are located. Much of the structure described in the foregoing paragraphs is specific to the booting process of the IA BIOS. If a disk were never to be used to load an operating system, there would be no need for this structure. For maximum interchange flexibility, however, disks used with IA computers are all partitioned and formatted in this way, whether they are used to load operating systems or not. Operating systems, disk drivers, and file systems often contain code that is based on the assumption that disks are formatted as described in the foregoing paragraphs.
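A representative boot.ini file of the kind just described is shown below. The menu text and paths are illustrative, not taken from any particular system; the multi(0)disk(0)rdisk(0)partition(1) syntax identifies the adapter, disk, and partition from which each operating system's loader reads its files:

    [boot loader]
    timeout=30
    default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Microsoft Windows 2000 Server" /fastdetect
    multi(0)disk(0)rdisk(1)partition(1)\WINNT="Microsoft Windows NT Server 4.0"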
96
CHAPTER FIVE
Dynamic Disks: Eliminating the Shortcomings of the Partition Structure

Conceived in an era of small individual disk capacities (typically 5–40 megabytes), the IA BIOS-compatible partitioning scheme just described is somewhat limiting for today's 10–50 gigabyte disks, especially in server environments. The key limitations are:

■■ The number of partitions supported is too small.

■■ The amount of metadata about a partition that can be stored is very limited.

■■ All the disk blocks comprising a partition must be logically contiguous (consecutively numbered).

■■ The size of a partition cannot be changed except by deleting and recreating the partition (thus destroying all data in it).
To overcome these limitations, Windows 2000 incorporates a new disk-structuring concept, the dynamic disk. Dynamic disk technology is available both in the Windows 2000 Operating System Disk Management component delivered by Microsoft and in Volume Managers and Array Managers for Windows sold by VERITAS Software Corporation. Dynamic disks can be created from new disks with no data on them. Alternatively, disks that already contain a conventional partition structure (called basic disks in Windows 2000 contexts) can be upgraded to dynamic disks with no effect on data stored on them.
NOTE: When the term Volume Manager is written with initial capitals, it denotes the VERITAS Software Corporation product; when the term volume manager appears lowercase, it denotes any of the volume management capabilities available for the Windows 2000 operating system described later.
A dynamic disk can be subdivided into subdisks that fulfill a role similar to that of conventional partitions. Simple volumes can be built directly on them, or they can be aggregated into spanned, striped, mirrored, or RAID volumes. The media format of dynamic disks includes a master boot record that can be correctly interpreted by IA BIOS. OS loaders can therefore be booted from dynamic disks. Whether a particular OS can be booted from a dynamic disk depends on what the operating system requires from the disk and partition layout. For example, under certain circumstances, Windows operating systems cannot be booted from dynamic volumes formed from upgraded partitions that have been extended since the upgrade.
Figure 5.4 illustrates the media format of a dynamic disk that was formatted from scratch. Partition table entries on a dynamic disk indicate a partition type that is unrecognizable by all but Windows NT operating systems running the Windows 2000 Disk Management component or an add-on Volume Manager. In addition, on a dynamic disk, approximately 1 megabyte of the highest addressed block space is reserved as a private area for volume management metadata. This allows for much more flexible volume organizations than does the conventional partition table structure, as the following section explains.

Figure 5.4 Dynamic disk structure (new disk). (The master boot record contains the usual small program to check partition table validity and load a boot sector, plus a partition table whose entry of type Dynamic describes the full capacity of the disk. Nearly all of the disk is storage space available for subdisks; approximately 1 megabyte at the highest addresses is a private region holding the Volume Manager's subdisk, plex, and volume descriptors.)

On a basic disk, partitions and logical disks are the entities in which file systems are formatted. On a dynamic disk, the equivalent structure is the subdisk. When a subdisk is formatted with a file system, it becomes a simple volume. Most of the storage objects in the conventional (basic) IA disk organization have equivalent dynamic disk objects. These equivalencies are summarized in Table 5.1.
Table 5.1 Basic and Dynamic Disk Storage Objects

Basic Disk Organization: Partition
Dynamic Disk Organization: Simple volume
Comments: Windows operating systems can be booted from simple dynamic volumes as long as the volumes were originally basic partitions that were present when a basic disk was upgraded to be a dynamic disk.

Basic Disk Organization: Primary, system, active, and boot partitions
Dynamic Disk Organization: System, boot, and active volumes
Comments: Both simple and mirrored volumes can serve as system, boot, and active volumes.

Basic Disk Organization: Extended partition
Dynamic Disk Organization: No equivalent
Comments: (Shared with the following row.) In dynamic disks, the private region metadata space is sufficient to describe unlimited numbers of subdisks and volumes; therefore, the extended partition hierarchy is unnecessary.

Basic Disk Organization: Logical disk
Dynamic Disk Organization: Simple volume
Comments: (See the preceding row.)

Basic Disk Organization: Concatenated volume
Dynamic Disk Organization: Spanned volume
Comments: The terms concatenated volume and spanned volume are used interchangeably. Windows operating systems cannot be booted from spanned volumes.

Basic Disk Organization: Striped volume
Dynamic Disk Organization: Striped volume
Comments: Windows operating systems cannot be booted from striped volumes.

Basic Disk Organization: Mirrored volume
Dynamic Disk Organization: Mirrored volume
Comments: Both simple and mirrored volumes can serve as system, boot, and active volumes.

Basic Disk Organization: Striped volume with parity
Dynamic Disk Organization: RAID 5 volume
Comments: Windows operating systems cannot be booted from RAID volumes.
Volumes on dynamic disks have the full read and write functionality of basic disk partitions. As Table 5.1 indicates, some volumes on dynamic disks may be made bootable, that is, set up so that Windows operating systems can be loaded from them. In order for a dynamic volume to be bootable, it must have been a bootable partition on a disk that was upgraded from basic to dynamic, or a mirror of such a partition. Because of the differences in metadata format between basic and dynamic disks, administrators must be cautious when managing bootable volumes. In particular, alternate bootable volumes should not be extended. (An alternate bootable volume is not the volume from which the running operating system was actually booted.) Extending a volume modifies the metadata in the volume manager's private region, but does not alter the basic disk metadata used by the BIOS and Windows OS loader. Extending an alternate bootable volume would result in the OS in the alternate bootable partition (volume) having incorrect information about the size and structure of its boot partition.
Dynamic Volume Functionality

Volumes on dynamic disks support additional functions beyond those of basic volumes of equivalent type. For example, it is possible to expand a dynamic volume by adding rows to it without adversely affecting the data on it. If a file system supports capacity expansion (as NTFS does), adding rows to its volume enables it to expand without having to reorganize the data in it.

One reason that dynamic disks can support greater functionality than basic ones is that they can store more internal metadata. Approximately 1 megabyte of storage at the high end of the address range of a dynamic disk is reserved as a private region for storing metadata. This obviously enables much greater flexibility than the basic disk partition structure. In particular, dynamic disks have sufficient metadata capacity so that volumes constructed from them can be completely self-describing. In other words, a volume manager can completely discover a volume configuration and retrieve all the information required to manage it from the disks comprising the volume. There is no need to access other databases such as the Windows Registry for this information. Self-describing disks can be moved between Windows systems with volume structure and user data preserved, and again, with no requirement for data reorganization.

Dynamic disks containing dynamic volumes and basic partitions containing basic volumes can coexist on a Windows NT system. Both can be managed by Microsoft Management Console snap-ins. By default, the Windows 2000 OS creates basic disks and partitions. These can be upgraded to dynamic volumes (with certain functionality restrictions) without affecting the data on them.
Volumes in Windows NT Operating Systems

The volume concept has been a part of Windows NT server operating systems for some time. In Windows NT Version 4, the FTDISK (Failure-Tolerant Disk) operating system component supports the creation of spanned, striped, mirrored, and RAID2 volumes using partitions as building blocks. FTDISK plays the role of a volume manager, as described in Part I. The system administrator's view of FTDISK volumes is through the Disk Administrator, a Windows administrative tool. Figure 5.5 shows Windows NT Disk Administrator's view of a system with four disks:

2 In Windows NT documentation, RAID is referred to as striping with parity.

■■ A 6.675-gigabyte system disk containing two partitions (C: and D:).

■■ A 325-megabyte disk with a single partition (S:).

■■ Two 39-gigabyte disks, each containing a single partition, organized as a mirrored volume (X:). Since Volume X: is realized on two disks (Disk 2 and Disk 3 in Figure 5.5), there are two rows representing it in the Disk Administrator window.

Figure 5.5 Windows NT Disk Administrator disk view.

All of the disks represented in Figure 5.5 are formatted to use the NTFS file system. This view of disks would typically be of interest to a system administrator responsible for organizing and managing physical storage capacity. In contrast, Figure 5.6 presents the Disk Administrator's volume view of the same system. This is the view that would normally be of interest to users, as it
emphasizes how the disks are addressed (drive letter), the total and available capacities, file system types, and whether or not the volume is failure-tolerant. Although in this example failure-tolerant mirrored volume X: is made up of two partitions that occupy the entire disk capacity, volumes can also be constructed from partitions that occupy only part of the capacity of two or more disks.

Figure 5.6 A failure-tolerant volume, shown by Disk Administrator.4

4 Volumes F:, G:, H:, and I: listed in Figure 5.6 are four CD-ROM drives.

FTDISK is a very useful program, as it enables Windows NT systems to offer failure-tolerant disk storage at very low hardware cost.3 However, Windows NT systems are increasingly being used in mission-critical applications, and for these environments, FTDISK has four key shortcomings:

3 The system whose disk configuration is shown in Figures 5.5 and 5.6 uses only low-cost IDE disks.

■■ Only two-copy mirrored volumes are supported. This makes it impossible to use split mirrors to perform a backup while applications continue to operate on protected online data.

■■ Online addition and removal of disks is not supported. This prevents replacement of a failed disk while a system is running, even if the packaging and host bus adapter support hot swapping.

■■ Reconstruction of a failed disk's contents onto a replacement disk can be started only at system startup time. This effectively requires a system reboot in order to replace a failed disk, even though resynchronization can occur while the volume is in use.

■■ The Microsoft Cluster Server two-server availability cluster option of the Windows NT Enterprise Edition operating system is not supported.
Clearly, for Windows 2000 to become a mission-critical application platform, more comprehensive volume management capability is required. To that end, Microsoft Corporation has engaged with VERITAS Software Corporation to create both a family of separately marketed volume management products and a set of capabilities embedded in the Windows 2000 operating system.
Recovering Volumes from System Crashes

Other volume managers have long relied on the logging of updates to guarantee consistency of the data on failure-tolerant mirrored and RAID volumes. For every update to a failure-tolerant volume, a log entry is written before the actual update. The log entry contains sufficient information for the volume manager to make volume contents consistent during restart after a dirty shutdown. From the point of view of clusters, a failure of one of the computers in a cluster requires the same volume recovery procedures as would the failure of a single system.
Update Logging for Mirrored Volumes

To preserve data on mirrored volumes against inconsistency due to system crashes, the volume manager uses a protection technique called dirty region logging. Each time a mirrored volume is written, a log entry written prior to the data update indicates which volume blocks (dirtied regions) will be updated. As updates to mirrored volumes complete (i.e., when identical data has been written to all corresponding blocks of all plexes), cached dirty region log entries are cleared to indicate that the regions are no longer dirty. To minimize the performance impact of logging, dirty region indicators are cleared opportunistically when other regions are dirtied by application writes. Thus, dirtying a region by writing to a volume incurs I/O overhead to write a log entry, whereas clearing a dirty region does not, because the region's log entry is not cleared until an application write requires that a new dirty region log entry be written.
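A greatly simplified sketch of dirty region logging follows. It is illustrative only; the region size is an arbitrary assumption and the Python sets stand in for the Volume Manager's on-disk log structures. The point to notice is that dirtying a previously clean region costs a log write, while clearing completed regions is deferred and piggybacked on a later log write:

    REGION_BLOCKS = 1024   # illustrative region size: 1,024 volume blocks per region

    class DirtyRegionLog:
        def __init__(self):
            self.dirty = set()      # on-disk map of dirty regions (simplified)
            self.in_flight = set()  # regions with writes not yet complete on all plexes

        def before_write(self, block):
            region = block // REGION_BLOCKS
            self.in_flight.add(region)
            if region not in self.dirty:
                # Only dirtying a new region costs a log write; completed regions
                # are dropped as part of this same update, so clearing is free.
                self.dirty = set(self.in_flight)   # one log I/O writes the new map
            return region

        def after_write(self, block):
            # Called when identical data is on all plexes; clearing is deferred.
            self.in_flight.discard(block // REGION_BLOCKS)

    log = DirtyRegionLog()
    log.before_write(5000)     # dirties region 4 (one log write)
    log.after_write(5000)      # region 4 stays marked until a later log write
    log.before_write(9999999)  # dirties region 9765 and lazily clears region 4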
Update Logging for RAID Volumes

Because updates to RAID volumes are more complex than mirrored-volume updates, the volume manager uses a different mechanism to preserve RAID volume data consistency against system failures. Before each update to a RAID volume, the volume manager writes a log entry indicating which blocks are to be updated, along with the new data to be written in them. Unlike the dirty region logs used with mirrored volumes, which are updated in place, RAID volumes use circular logs, fixed disk-block regions with pointers to indicate the oldest and newest entries. RAID volume update logs are self-clearing, because obsolete entries are eventually overwritten by new ones.5

5 The Volume Manager stalls application writes in the rare event a RAID volume's log fills.
Crash Recovery of Failure-Tolerant Volumes

When a mounted failure-tolerant volume is first written to by an application, the volume manager sets an "in-use" indicator in its on-disk private region. The indicator remains set as long as the volume is mounted; it is cleared only when the volume is unmounted. If a system crashes, the volume manager has no chance to unmount its volumes. When the volume manager starts up, it uses in-use indicators to determine which volumes were not shut down cleanly. The logs for these volumes must be processed to guarantee failure-tolerant volume data consistency.

To recover a mirrored volume after a system crash, the volume manager uses the precrash dirty region log to determine which regions must be made consistent across the volume's plexes. The volume manager reads each dirty region's data from one of the volume's plexes and writes it to all the other plexes. Precrash dirty region logs are merged with active dirty region logs so that mirrored volumes can be mounted and used by applications during recovery. When recovery is complete, the contents of all plexes of a mirrored volume will be identical, but there is no guarantee that the data most recently written by applications prior to the crash will have been preserved.

To recover a RAID volume after a system crash, the Volume Manager points to the log's oldest entry and "replays" the log in sequence, reapplying the volume updates indicated by log entries up to the newest one. During recovery, logged updates must be written to the volume in the same order as they were made by applications prior to the crash. Applications therefore cannot mount and use RAID volumes until recovery is complete.
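The two recovery procedures can be sketched side by side. The sketch is illustrative only; Python lists stand in for plexes, volumes, and logs, and the data values are arbitrary:

    # Mirrored volume recovery: copy each dirty region from one plex to the others.
    def recover_mirrored(plexes, dirty_regions, region_blocks):
        source = plexes[0]                       # any plex may serve as the source
        for region in dirty_regions:
            start = region * region_blocks
            blocks = source[start:start + region_blocks]
            for plex in plexes[1:]:
                plex[start:start + region_blocks] = blocks

    # RAID volume recovery: replay logged updates oldest-first, in original order.
    def recover_raid(volume, log_entries):
        for entry in log_entries:                # log_entries ordered oldest to newest
            for offset, value in zip(entry["blocks"], entry["data"]):
                volume[offset] = value

    plexes = [[1, 2, 3, 4], [1, 2, 9, 9]]        # plex 1 missed an update before the crash
    recover_mirrored(plexes, dirty_regions=[1], region_blocks=2)
    print(plexes)                                 # both plexes now read [1, 2, 3, 4]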
Where Volume Managers Fit: The Windows OS I/O Stack

The I/O subsystem design of the Windows operating systems is a natural environment for host-based volume management. Windows uses a layered, or stacked, driver concept, in which drivers can call on other drivers for services, until a hardware layer is reached.

Figure 5.7 Simplified Windows NT V4.0 I/O stack. (From an application in user mode, file operations pass to the kernel-mode Windows I/O Manager and Windows Cache Manager, and then to the file system (NTFS or other), which issues block operations on volumes or disks. The Volume Manager (vxio) translates these into block operations on disks for the Disk Class Driver and Miniport Driver, which pass requests through the host bus adapter hardware to physical disks or RAID controller virtual disks.)

Figure 5.7 presents a simplified model of the
Windows operating system I/O subsystem, illustrating how a volume manager fits naturally into the I/O hierarchy. In the Windows operating environment, all application I/O requests are initially handled by the I/O manager. The I/O manager is a kernel function whose responsibility is to validate each application request (e.g., verify that a known function is being requested and that the application actually has the right to access any memory addresses specified for data transfer). For each valid request, the I/O manager creates an internal representation called an I/O request packet (IRP) and routes it to the appropriate driver for execution.

Nearly all I/O requests that result in volume or disk access are made through the volume's file system. In the Windows architecture, the file system occupies the position of a driver in the I/O stack. The I/O manager passes valid file system I/O requests to the file system (for example, NTFS in Figure 5.7). The file system performs further request validation in the context in which it runs. For example, the file system verifies that the application's credentials give it the right to make the requested control or data access and that the file address range specified for any data transfer is actually part of the file's data area. The file system performs these validations using data structures that may be in cache or read from disk. Having ascertained that a read or write request is valid, the file system converts it from file offsets (relative byte numbers within a file) to volume block or disk block addresses. Again the file system uses mapping data structures that may be in cache or may have to be read from a volume or disk.

Windows file systems, especially NTFS, use cache extensively for user data. Once a request is known to be valid, the file system calls the Windows Cache Manager, which searches the system I/O cache to determine whether the requested data is already available. Cached data is delivered to applications immediately. If the requested data is not in the system I/O cache, the file system makes a request to a device driver.

This is the point at which volume managers fit into the I/O stack. If the file system's block request is made to a volume, the volume manager fields it. If the file system's request is made to a disk, the request is sent directly to the operating system's disk class driver. The volume manager translates the file system's request for volume blocks into requests for disk blocks and makes one or more requests to the system's disk class driver. When a file system request for a string of volume blocks translates into disk blocks on two or more disks, the volume manager must make two or more disk requests. It is the volume manager's responsibility to coordinate the execution of multiple disk requests and to signal completion of the file system's request when all the underlying disk I/O requests are complete.
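For a striped volume, the translation from volume block addresses to disk block addresses is a simple calculation, which is why a single file system request can fan out into several disk requests. The Python sketch below is illustrative only; the stripe unit size and subdisk layout are assumptions, not the Volume Manager's internal representation:

    STRIPE_UNIT = 128   # assumed stripe unit, in blocks

    def map_volume_extent(volume_block, count, disks):
        # Translate a run of volume blocks into (disk, disk_block, count) pieces.
        pieces = []
        while count > 0:
            stripe_index, offset = divmod(volume_block, STRIPE_UNIT)
            disk = stripe_index % disks                    # which column of the stripe
            disk_block = (stripe_index // disks) * STRIPE_UNIT + offset
            run = min(count, STRIPE_UNIT - offset)         # stay within one stripe unit
            pieces.append((disk, disk_block, run))
            volume_block += run
            count -= run
        return pieces

    # A 300-block request starting at volume block 100 on a three-disk striped volume
    # fans out into requests to all three disks:
    print(map_volume_extent(100, 300, disks=3))
    # [(0, 100, 28), (1, 0, 128), (2, 0, 128), (0, 128, 16)]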
The Windows I/O subsystem architecture breaks disk I/O into two layers called the class and port layers. The disk class layer, implemented by a disk class driver, provides a common block address space model for all types of disks, including ATA, SCSI, and Fibre Channel. The disk class driver creates a generic disk-specific data structure for each I/O request and passes it to a miniport. The miniport formats the request as required by the bus and communicates it to a host bus adapter or to an interface built into a support chip set, which sends it to the target device on the ATA, SCSI, or Fibre Channel bus.
Windows Disk and Volume Naming Schemes

Windows system administrators must concern themselves with three storage-device naming schemes:

■■ Windows drive letters

■■ Universal Naming Convention (UNC) names

■■ Volume manager internal names
Drive letters, despite their name, are actually the means by which applications and user utilities refer to volumes or partitions. They are, by far, the most common way of referring to online storage in the Windows environment. By convention, drive letters A: and B: are reserved for referring to floppy disks. Drive letter C: is the “disk” from which an IA computer’s BIOS usually reads the master boot record. Most often, the partition or volume referred to by drive letter C: is also the location from which the operating system is booted.

The Universal Naming Convention (UNC) is part of a Windows architecture for identifying all operating system components and facilities in a consistent, hierarchical manner. A UNC name is similar in form to a file system path name. UNC names are created when a Windows system starts to operate. In the case of storage hardware, devices are given UNC names as they are discovered during the boot process. The grammar of a UNC device name identifies the device’s position in the system hardware hierarchy. For example, \Devices\Harddisk2 would refer to the second hard disk on the first interface discovered during the boot process. If a disk or disk interface is added to or removed from a system, the UNC names of storage devices may change to reflect their new positions discovered during the boot process. UNC names are primarily used internally by Windows operating systems, although they can be used by applications. Disks, partitions, and volumes all have UNC names. The volume manager uses UNC names internally, but generally hides them from users and administrators. Figure 10.3 (page 246) gives an example of an instance in which the volume manager exposes UNC names to the user.

The third type of storage device name is one given to a disk or RAID subsystem logical unit (LUN) by the volume manager during an upgrade to dynamic disk. This name is written in the volume’s metadata and remains invariant throughout the dynamic disk’s life. Volume manager device names are used in internal data structures such as the tables that describe plex membership. Dynamic disks are self-describing through metadata contained in their private regions. There is sufficient information in this metadata to identify subdisks and bind those subdisks to the plexes and volumes of which they are a part.
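As a purely hypothetical illustration (none of these values comes from a real system), the same storage might be known by all three kinds of name at once:

# Hypothetical example of the three naming schemes for one dynamic disk.
names = {
    "drive_letter": "E:",                      # how applications and users refer to the volume
    "unc_device_name": r"\Devices\Harddisk2",  # assigned at boot; can change if hardware changes
    "volume_manager_name": "Harddisk2-01",     # invented internal name; written in the disk's
                                               # metadata and invariant for the disk's life
}
for kind, value in names.items():
    print(kind, "=", value)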
Volume Manager Implementations

Volume management technology for Windows 2000 operating systems is available in several forms. Windows 2000 volume managers provide flexible management of disks and LUNs connected to host bus adapters, as well as of certain embedded RAID controllers. Volume managers are available as:
■■
A built-in set of components in Windows 2000 server operating systems that provide basic volume management functions. These components are collectively called Disk Management, or more often, the Logical Disk Manager, both in Microsoft documentation and in this book.
■■
Separately licensed software packages that extend the capabilities of the Windows 2000 built-in Logical Disk Manager. In this book, these are called Volume Managers. Add-on Volume Managers are available for the Windows 2000 Professional,6 Advanced Server and Data Center Edition operating systems. One such package is the Volume Manager for Windows 2000 available from VERITAS Software Corporation, which is used extensively in examples in this book.
■■
Software components called Array Managers shipped with Windows 2000 servers. These combine Logical Disk Manager functionality with management for various types of RAID controllers. Array Managers are available from server system vendors such as Dell Computer. (Dell’s OpenManage Array Manager is one example.)

Table 5.2 summarizes the distinctions among these terms.

6 The VERITAS Volume Manager for Windows 2000 Professional does not support mirrored or RAID volumes.
Table 5.2 Terminology Summary

Disk Management (component): A component of the Windows 2000 operating system providing basic volume management.
Logical Disk Manager: A synonym for the Disk Management component.
Volume Manager for Windows 2000: An add-on for Windows 2000 that provides advanced volume management functions.
Volume Manager for Windows NT: An add-on for Windows NT Version 4 that provides advanced volume management functions.
Volume Manager: (Initial capitalized) A shorthand synonym for Volume Manager for Windows 2000.
Array Manager: Facilities for managing certain embedded RAID controllers whose user interface is integrated with that of the Logical Disk Manager. Array Managers are supplied by the original equipment manufacturers (OEMs) that deliver the RAID controllers they manage.
volume manager: (Lowercase) A collective term for the Volume Manager, the Logical Disk Manager, and Array Managers.
Common Features of All Volume Managers

All Windows volume managers provide basic online storage management functions, which include creating basic disk partitions, assigning drive letters, and creating and formatting Windows 2000 file systems. All Windows 2000 volume managers can:
■■
Create and manage simple (single-disk) volumes; spanned, striped, and RAID volumes consisting of up to 32 subdisks; and two-mirror volumes.
■■
Be controlled through a Microsoft Management Console (MMC) graphical interface, which provides multiple views of disk and volume configurations, context-dependent menus, and wizards to lead the administrator through the specification of common online storage management tasks.
■■
Perform online storage management operations, such as detection and removal of a failed disk from a failure-tolerant volume, and resynchronization of a replacement disk while a volume is mounted for use by applications and others.

In addition to these basic capabilities, Windows NT and Windows 2000 add-on Volume Managers offer advanced capabilities enumerated in the next section. All Windows 2000 volume managers use the same on-media metadata formats, meaning that any volume manager can mount and use volumes created by any other. Advanced features are available only with Volume Managers that support them, however.
Volume Manager for Windows NT Version 4

The Volume Manager for Windows NT Version 4, available from VERITAS Software Corporation, supplants the function of the Windows NT operating system’s built-in Disk Administrator and FTDISK (Fault-Tolerant Disk) components. The Volume Manager for Windows NT offers many of the capabilities of the Volume Manager for Windows 2000. On-media metadata formats are identical between the two, so upgrading from the Windows NT Version 4 operating system to Windows 2000 will not impact online storage in systems that use the Volume Manager for Windows NT Version 4.
Volume Managers for Windows 2000

In the Windows 2000 operating system, a Disk Management component, or Logical Disk Manager, replaces the familiar Windows NT Version 4 Disk Administrator and FTDISK components. Windows 2000 Disk Management supports:
■■
Any volumes created by the Windows NT Version 4 Disk Administrator and FTDISK components.
■■
Spanned and striped volumes consisting of up to 32 subdisks.
■■
Failure-tolerant two-mirror volumes.
■■
RAID volumes consisting of up to 32 subdisks.
■■
Online capacity expansion for spanned and striped volumes.
■■
Dynamic replacement of failed disks in failure-tolerant volumes and restoration of failure tolerance (resynchronization or rebuilding) while the volume is in use.
■■
A Microsoft Management Console snap-in graphical console (user interface) for configuring and monitoring volume state and activity.
■■
Monitoring and control of certain volumes located on remote systems.
Windows 2000 Volume Manager Capabilities

The add-on Volume Manager for Windows 2000 (the one used in many of the examples in this book) replaces the built-in Logical Disk Manager during installation. The Volume Manager for Windows 2000 provides all of the capabilities of the Logical Disk Manager, plus:
■■
Support for dynamic mirrored volumes with up to 32 identical copies of data. Three-mirror volumes enable the split-mirror functionality described in Chapter 4, which in turn supports point-in-time database and file system backup while applications continue to use a failure-tolerant copy of the database or file system. Mirror splitting is invoked through the Volume Manager console (see page 163 in Chapter 8). Four-mirror (and more) volumes may be used in a similar fashion to prepare data for publication or replication. With optical Fibre Channel, four-mirror volumes can provide failure tolerance across a distance of 10 kilometers or more, with local failure tolerance at each end of the connection.
■■
The ability to create mirrored volumes of striped plexes, as described on page 59. Mirrored striped volumes provide easily manageable, highperforming, failure-tolerant storage for databases and file systems that are larger than the largest single disk available.
■■
The ability to specify stripe unit size at the time of volume creation for striped, mirrored-striped, and RAID volumes. Stripe unit size can be used to influence the I/O performance of volumes for which the predominant I/O load characteristics are known at the time of volume creation, as described on page 48.
■■
The option to designate a preferred plex to force application reads to be directed to one plex of a mirrored volume for execution, as described on page 55. Preferred plexes can be used to direct I/O to a higher-performing disk that is mirrored with one or more lower-performing ones. For example, a subdisk located on a solid-state disk that is mirrored with one or more subdisks on rotating magnetic disks would make an ideal preferred plex.
■■
Dynamic-striped and RAID volumes with up to 256 subdisks, or columns. Volumes with a large number of columns can be useful in distributing load for applications that require very high data transfer or I/O request rates.
■■
Online expansion of NTFS volume capacity by the addition of rows (effectively, subdisks of the same size added to each subdisk of the original volume). (FAT file systems cannot be resized after they are created.)
■■
Online movement of subdisks from one disk to another. This can be useful when early symptoms of failure are discovered in a disk containing a subdisk of a failure-tolerant volume; for example, by analyzing SMART messages from the disk. This capability is sometimes called predictive failure response. The subdisk can be moved to an alternate disk while the volume continues to be accessed by applications. A subdisk can also be moved to balance I/O load; for example, when subdisks belonging to two chronically overloaded volumes reside on the same disk.
■■
Ability to display I/O activity levels of the volume and the disks comprising it through the Volume Manager console. This is useful for detecting hot spots of intense I/O activity before they become application-hobbling I/O bottlenecks.
■■
Support for Microsoft Cluster Server clusters of two or four servers (Microsoft Cluster Server clusters and their impact on volumes are discussed starting on page 253).7
■■
Support for multiple disk groups. Disk groups are the unit in which disk ownership passes (that is “fails over”) from server to server in both Microsoft Cluster Server (MSCS) and VERITAS Cluster Server (VCS) clusters.
■■
Ability to upgrade basic partitions (including boot and system partitions) to dynamic volumes by encapsulating the basic partitions within the dynamic volume structure. Encapsulation preserves all data on the partitions.
■■
Ability to display information about, and delete, failure-tolerant basic volumes created by Windows NT Version 4 Disk Administrator.
■■
Online capacity expansion for all types of volumes by the addition of rows. The Windows 2000 NTFS file system recognizes and uses increased volume capacity immediately; volume unmount and remount are not required.
■■
Ability to repair damaged failure-tolerant basic volumes created by Windows NT Version 4 Disk Administrator. A failure-tolerant basic volume may be damaged if a system crashes while the volume is in use.
7 Volumes managed by the VERITAS Volume Manager for Windows 2000 may not be used as quorum devices in four-node Microsoft Cluster Server clusters at the time of publication.

Array Managers

The term Array Manager is used to refer to a software tool that enables system administrators to manage supported hardware RAID subsystems using the Logical Disk Manager console. All Logical Disk Manager functionality is also available through the console, so that both RAID subsystems and Logical Disk Manager-based volumes can be used concurrently on the same system. Management of supported RAID controllers and management of host-based volumes are nearly identical.

A RAID subsystem consists of a RAID controller and some disk drives. The RAID controller forms a bridge between intelligent disks and computers.
RAID controllers run internal software that performs a function similar to that of host-based volume managers. Typically, RAID controllers enable the creation of striped, mirrored, striped-mirrored, and RAID volumes just as volume managers do. To Windows operating system drivers and file systems, these volumes are represented as virtual disks that are functionally identical to physical disks (i.e., they respond to the same I/O commands with the same type of behavior). Like physical disk partitions and host-based volumes, array controller virtual disks can be formatted with file systems. In most cases, a host-based volume manager can combine virtual disks presented by array controllers into higher-capacity, higher-performing, or more failure-tolerant volumes. This is described in the subsequent paragraphs.

Array Managers perform two important functions for hardware RAID subsystems:
■■
They provide a common system administrator interface through which RAID controller-based disk arrays can be created, deleted, and otherwise managed. This interface is identical in appearance to the interface through which host-based volumes consisting of physical and virtual disks are managed in Windows OS environments.
■■
Using logical disk management capabilities, Array Managers can aggregate the virtual disks presented by RAID controllers to provide greater capacity and performance or to enhance data availability beyond that of the RAID controllers themselves.
Volumes Made from Disk Arrays

Arrays of disks managed by hardware RAID controllers are often used to complement host-based volume manager capabilities. RAID controllers provide:
■■
Incremental processing power; for example, for converting application I/O requests made to virtual disks into operations on physical disks.
■■
Specialized hardware, such as battery-backed mirrored write cache and exclusive OR engines for computing RAID parity.
■■
Disk connectivity fanout. Fanout enables the connection of significantly more disk capacity per port or bus address to a server.
Because RAID controllers present virtual disks to the host environment through the driver interface, volume managers can aggregate storage capacity managed and presented by embedded or external RAID controllers. Figure 5.8 illustrates external RAID controllers used in conjunction with host-based volume management.
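The following sketch (illustrative Python, not production code; the two LUN objects are hypothetical stand-ins for virtual disks presented by different controllers) captures the essential behavior of host-based mirroring over hardware arrays: every write is applied to both virtual disks, and a read can be satisfied by whichever copy remains reachable if a controller, cable, or host bus adapter path fails.

# Illustrative sketch of host-based mirroring across two controller LUNs.
class MirroredVolume:
    def __init__(self, lun_a, lun_b):
        self.plexes = [lun_a, lun_b]          # one plex per controller-presented LUN

    def write(self, block, data):
        # Writes go to every surviving plex so the copies stay identical.
        for plex in self.plexes:
            if not plex["failed"]:
                plex["blocks"][block] = data

    def read(self, block):
        # Reads may be serviced by any surviving plex.
        for plex in self.plexes:
            if not plex["failed"]:
                return plex["blocks"][block]
        raise IOError("no surviving plex")

lun_a = {"blocks": {}, "failed": False}       # virtual disk from controller A (hypothetical)
lun_b = {"blocks": {}, "failed": False}       # virtual disk from controller B (hypothetical)
volume = MirroredVolume(lun_a, lun_b)
volume.write(0, b"payroll record")
lun_a["failed"] = True                        # simulate losing one controller path
print(volume.read(0))                         # data still accessible via controller B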
Figure 5.8 Combining external hardware arrays and software volumes. (The figure shows a host computer with applications, a file system, the Volume Manager, a disk driver, and two PCI host bus adapters connected to external RAID controllers in separate enclosures. Callouts note that separate enclosures provide power and cooling redundancy, that the configuration is completely redundant, that multiple resources increase system performance potential, and that the Volume Manager aggregates virtual disks into a single volume, e.g., by mirroring or striping.)
A volume manager can aggregate RAID controller virtual disks in any of the ways it can aggregate physical disks. Two types of virtual disk aggregation are particularly useful:
■■
Virtual disks of identical capacity can be organized as mirrored volumes. With optical Fibre Channel, one external RAID controller can be placed as much as 10 kilometers from the main data center, providing protection of data against a range of site disasters. In this scenario, each virtual disk could be a RAID or mirrored volume to minimize the system impact of single disk failures.
■■
A volume manager can stripe data across two or more virtual disks presented by different controllers, aggregating the data transfer or I/O request processing capacity of two or more RAID controllers for greater overall system I/O performance. As in the preceding example, the virtual disks can be made failure-tolerant through the use of controller-based RAID or mirroring.
Host-based volume managers can aggregate the virtual disks presented by both embedded and external RAID controllers. Figure 5.8 illustrates the use of a host-based volume manager to aggregate the capabilities of two external RAID controllers. Although they are generally higher in cost than embedded RAID controllers, external controllers provide additional performance and availability advantages, as noted in the figure. In the example shown in Figure 5.8, the RAID controllers, external host I/O buses, host bus adapters, and disk enclosure systems are all duplicated. When the volume manager mirrors virtual disks presented by the two RAID controllers, data availability is protected against failures in any of these I/O subsystem hardware components.
If greater storage capacity or increased performance were required, the volume manager could create striped volumes consisting of pairs of virtual disks presented by the two RAID controllers. This configuration would aggregate the performance potential of the two RAID controllers. Although striping data across two or more RAID controllers’ virtual disks is a performance enhancement, failure tolerance need not be completely sacrificed. For example, each of the RAID controllers in Figure 5.8 could present mirrored or RAID arrays to protect against disk failures. That said, with data striped across the two controllers, there would be no protection against I/O bus or host bus adapter failure.

The Volume Manager can also aggregate embedded RAID controllers’ virtual disks into volumes. Figure 5.9 illustrates this configuration, where the volume manager aggregates the virtual disks presented by the two embedded RAID controllers. As in the example in Figure 5.8, the virtual disks can either be mirrored to improve availability or striped to improve performance. Similarly, the RAID controllers can provide disk failure tolerance through the use of controller-based mirroring or RAID.

The principal differences between embedded aggregates and aggregates of external arrays, such as that illustrated in Figure 5.8, are:
■■
I/O latency is generally lower in embedded configurations because there is one less stage at which commands and data must be transformed and queued.
■■
The configuration shown in Figure 5.9 provides less failure protection because the controllers are housed, powered, and cooled within the server enclosure, instead of separately as in external configurations. A server failure makes data inaccessible. The capability of data to survive server failure becomes especially important in multihost cluster configurations.
Figure 5.9 Combining embedded RAID arrays and software volumes. (The figure shows a host computer with applications, a file system, the Volume Manager, and a RAID driver on the PCI bus; the callout notes that the Volume Manager aggregates the embedded controllers’ virtual disks into a single volume.)
Table 5.3 Windows Volume Manager Family Features (Part I)

Table 5.3 compares feature support across the Windows NT Version 4 products (Volume Manager V2.4, Volume Manager V2.5, and Array Manager) and the Windows 2000 products (Logical Disk Manager, Array Manager, Volume Manager V2.5, and Volume Manager V2.6) for the following features: spanning, striping, two-mirror volumes, and RAID; 32-column (subdisk) dynamic volumes; online expansion of simple and spanned volumes; mirrored volumes of striped plexes; preferred plex for reading; adjustable stripe unit size; launching LDM component clients; encapsulation of basic disk partitions when upgrading; disk view in the management console; dynamic volumes with up to 256 columns; multiple disk groups; two-server Microsoft Cluster Server support; four-server Microsoft Cluster Server support; VERITAS Cluster Server support; and dynamic multipath support for storage devices.

*TBD = To be delivered. As this book is written, VERITAS Software Corporation has discussed these capabilities in a general sense, but has not made them available.

Table 5.4 Windows Volume Manager Family Features (Part II)

Table 5.4 compares the same products’ support for: online expansion of striped, mirrored, and RAID volumes; mirrored volumes with up to 32 copies; online movement of subdisk contents to an alternate disk; splitting a mirror from a mirrored volume or mirrored-striped volume; I/O statistics view in the management console; dirty region logging; RAID volume logging; and hot spare/relocation/swap.
■■
External controller-based configurations, such as that illustrated in Figure 5.8, are generally more expandable than embedded controller configurations. Each host bus adapter or embedded RAID controller occupies a single PCI bus slot. The amount of storage that can be connected via a single PCI bus slot is the number of I/O bus attachments possible (15 for parallel SCSI; 126 for Fibre Channel Arbitrated Loop; practically unlimited for switched Fibre Channel) multiplied by the number of disks supported by a controller. With embedded controllers, this number is the number of disks supported by a single controller. Moreover, external RAID controllers generally support greater disk connectivity than embedded ones.
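To put rough numbers on the expandability comparison in the preceding bullet, here is a back-of-the-envelope calculation; the disks-per-controller figures are assumptions chosen purely for illustration, while the bus attachment counts come from the text above.

# Rough connectivity arithmetic for a single PCI slot (illustrative assumptions).
bus_attachments = {"parallel SCSI": 15, "Fibre Channel Arbitrated Loop": 126}
disks_per_external_controller = 60      # assumed fanout of one external RAID controller
disks_per_embedded_controller = 15      # assumed fanout of one embedded RAID controller

for bus, addresses in bus_attachments.items():
    # With a host bus adapter in the slot, each bus address can be an external
    # controller that itself presents many disks.
    print(bus, "HBA:", addresses * disks_per_external_controller, "disks (assumed fanout)")

# With an embedded RAID controller in the slot, the limit is simply the number
# of disks that one controller supports.
print("embedded controller:", disks_per_embedded_controller, "disks")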
Choosing between external and embedded RAID subsystems can be a complex process, involving factors such as system cost, expected I/O load characteristics, system and data failure-tolerance requirements, data growth potential, and other systems in the environment. Whichever choice is made, host-based volume management can be used in conjunction with RAID controllers to enhance storage capacity, I/O performance, and data availability.
Summary of Volume Manager Capabilities

Built-in and separately marketed volume managers are available for both Windows NT Version 4 and Windows 2000. Administrators may encounter several versions of the Volume Manager developed by VERITAS Software Corporation, with advanced volume management functionality. Each of these Volume Managers is intended to replace its predecessor. Tables 5.3 and 5.4 summarize the capabilities of the Windows 2000 Logical Disk Manager, the Volume Manager versions that are available as this book is written, and the Array Managers (only the host-based volume capabilities of the Array Managers are listed). Controller-based array management capabilities are specific to the type of controller being managed.
CHAPTER 6

Host-Based Volumes in Windows Servers
This chapter and those that follow illustrate the mechanics of managing online storage for Windows 2000 servers. The examples use several small servers, each configured with enough locally attached disks to make the point of the example. Specifically, they use the Windows 2000 built-in Logical Disk Manager, the Dell OpenManage Array Manager, and several versions of the VERITAS Volume Manager.
NOTE: In addition to the disks used in the examples, the servers used have locally attached disks that hold operating systems, software, and permanent user data. These are not actively used in the examples, but do appear in some of the screen images. Shared disks on other computers (e.g., "hdd on mrcluster") also appear in some of the screen captures. These are an artifact of the system and network environments, and are not relevant to the examples. For example, the Logical Disk Manager console view at the bottom of Figure 6.1 shows the volume configuration of two of the disks in one of the systems used in these chapters to illustrate Windows 2000 volume and virtual disk management. (The top of the figure shows the system’s complete volume complement.)
Figure 6.1 shows a Logical Disk Manager console view of a Windows 2000 server used in a development laboratory. The server has been configured to boot from the seventh disk discovered by Windows 2000 disk services (Disk 7). The capacity of Disk 7 is 8.47 gigabytes; it has been formatted as a basic disk with two FAT32 partitions: C: and BACKUP. The unlabeled partition (C:) contains the Windows 2000 operating system. The laboratory uses the partition labeled BACKUP to restore the system quickly to a known state after experiments. (The examples in this book do not use the BACKUP partition.)

Figure 6.1 Disks connected to a system used for the examples in this chapter.

The remaining disks connected to this system, Disks 0–6, are represented in Figure 6.2. Disk 0 contains permanent data and is not used in the examples of this chapter; Disks 1–6 are used. At the moment of the image capture, Disks 1–6 had been upgraded to dynamic format, but no volumes had been allocated on them. Between examples, Disks 1–6 are returned to a state in which no partitions or volumes are allocated on them. Although each of Disks 1–6 shown in Figure 6.2 is listed in the disk view as having 8.37 gigabytes of capacity, neither the disks’ capacities nor their types are identical. As the examples in this chapter and those that follow illustrate, Windows 2000 volume management does not require that the volumes it manages be built from identical disks, or even that the disks have identical physical storage capacities.
Figure 6.2 Logical Disk Manager view of laboratory system.

Starting the Logical Disk Manager Console

Windows 2000 disks can be managed using the Logical Disk Manager’s Microsoft Management Console (MMC) snap-in. The MMC is found in the Programs\Administrative Tools folder under the name Computer Management, and can be started from the Windows 2000 desktop Start menu, as shown in Figure 6.3. By default, the operating system services that manage Windows 2000 disks and volumes are configured at installation time to launch automatically when the operating system starts up. Administrators can configure partitions and volumes, and control the actions of these services, using the Logical Disk Manager snap-in.

Figure 6.3 Starting the Windows 2000 Computer Management application.

Figure 6.3 illustrates the Start menu sequence used to invoke the Microsoft Management Console on a system called NODE-3-1 (different from the system whose disk configuration is shown in Figure 6.2). To begin managing disks using Windows 2000 built-in tools, a Windows 2000 administrator (i.e., a user with administrator privilege) invokes a Start menu command sequence similar to that shown in Figure 6.3. When the console window is displayed, the administrator selects the Disk Management snap-in under the Storage heading in the console’s Tree tab, as shown in Figure 6.4.

In Figure 6.4, upper and lower disk management panels are displayed on the right side of the console window. These views can be customized for administrative convenience. This capability is shown in Figure 6.5, which also illustrates that both top and bottom panels can be set to display either:
■■
A list of physical disks and RAID controller LUNs connected to the Windows 2000 system (as Figure 6.5 illustrates).
■■
A list of the volumes configured on those disks (as the upper panel in Figure 6.4 illustrates).
■■
A graphical view that shows both disks and the volumes allocated on them (as the bottom panels of Figures 6.2 and 6.4 illustrate).
All Logical Disk Manager functions are available from the Disk Management console. Most of these functions are invoked by right-clicking the mouse while the cursor’s hot spot is on an icon that represents the object on which the function will be performed, and selecting a command from the context-sensitive menu displayed. (Throughout this book, this procedure is called invoking the function or command.)

Figure 6.4 Disk Management MMC snap-in.

Figure 6.5 Customizing the Disk Management application view.

For example, Figure 6.6 shows the Properties page for Disk 4, which was invoked by right-clicking the disk’s icon in the graphical view and selecting the Properties… command from the resulting menu. In the figure, Disk 4’s icon is highlighted, indicating that it is the target for management operations.

Figure 6.6 A Windows 2000 Logical Disk Manager Properties page.

The Properties page contains useful information about Disk 4 and its logical relationship to the system. As physical properties, the panel lists the port (host bus adapter) to which the disk is connected, as well as the disk type, vendor, SCSI target ID, and logical unit number. For logical properties, the inset panel lists partitions (for basic disks) or volumes (for dynamic disks) for which capacity is allocated on the disk. In this figure, Disk 4 is contributing storage capacity to New Volume A and New Volume B. As shown in the background disk view, New Volume A consists of at least three mirrors on separate disks (Disks 4, 5, and 6),1 and New Volume B consists of at least two mirrors (on Disks 4 and 5). The Properties page for Disk 4 lists only volumes to which the disk is contributing capacity; it does not present the entire structure of those volumes.

1 The Windows 2000 Logical Disk Manager supports mirrored volumes with two mirrors. The three-mirror volume illustrated in this example was created by a Volume Manager (discussed in later chapters), and is visible to the Logical Disk Manager console.

The Disk Management console will probably be familiar to Windows 2000 system administrators. The Tree panel on the left of the console window provides access to all of a system’s user-visible managed objects. The panels on the right represent objects related to the selected management object or task. The goal of the Microsoft Management Console is to simplify Windows server management by:
■■
Consolidating the access points for as many system management tasks as possible at a single location (the Computer Management console).
■■
Structuring all management tasks similarly, for example by using MMC windows to display object states and relationships, and by using Properties pages to display detailed information about objects.
In Figure 6.7, the upper panel summarizes basic information about system NODE-3-1’s disks. The lower panel (only partially visible in the figure) presents essentially the same information but adds a graphical view of each disk’s capacity allocation. These two views serve different purposes: the list view makes it easy to determine at a glance where disk capacity is available for allocating additional volumes; the graphical view provides an easy-to-assimilate picture of both free and allocated storage capacity.

Windows 2000 accesses disks through disk class drivers that control the actions of IDE, SCSI, or Fibre Channel host bus adapters. Windows 2000 cannot distinguish between physical disks and the virtual disks (LUNs) presented by embedded or external RAID controllers. Thus, combinations of RAID controller-based virtual disks and host-based volumes can be created simply by managing a RAID controller’s virtual disks as though they were physical disks, and taking into account their internal performance and failure-tolerance characteristics when creating volumes that use them. Some vendors integrate RAID controller management with the Logical Disk Manager console by providing Array Manager MMC snap-ins, as described in Chapter 12, “Managing Hardware Disk Arrays.” Array Managers allow controller-based arrays, host-based volumes, and combinations of the two to be managed through the same administrative interface.
Figure 6.7 Windows 2000 Disk Management disk view.
Disk Management Simplified

The Logical Disk Manager’s console provides useful information about a system’s disks and volumes. This information includes:
■■
Disk format (basic or dynamic) and file system (NTFS, FAT, or FAT32).
■■
Operational status (designated as Online for all disks shown in Figure 6.8).
■■
Operational status of volumes allocated on the disks (designated as Healthy for all volumes shown in Figure 6.8).
The significance of disk and volume status is made clear in the examples that follow. The Unallocated and Free Space designations indicate storage capacity that is not part of any volume or partition and is available for use.
Creating and Reconfiguring Partitions and Volumes

Creating and reconfiguring partitions and volumes could be a painstaking and time-consuming undertaking, even in relatively simple systems such as those used in these examples. Therefore, a primary goal of all Windows 2000 online storage management is to simplify as much as possible the configuration and management of disks, partitions, and volumes. This is achieved in part by:
■■
Providing wizards that lead system administrators through necessary but seldom-performed (and therefore possibly unfamiliar) disk management tasks. Wizards ensure that only appropriate parameters with valid values are specified, that all information required to perform a management task is known, and that all required resources are available before the task’s execution is begun.
■■
Heading off potential procedural errors by enabling only valid choices at all stages of interaction with management interfaces. In the examples that follow, many menu choices are shown in gray, indicating that they are not available. The Logical Disk Manager (and Volume Manager) adjusts available menu choices whenever menus are expanded so that administrators are presented only with those actions that are valid in the current operational context.
■■
Automating administrative functions to the extent possible and requiring system administrators to specify only those parameters, and make only those decisions, that cannot be inferred automatically. Default recommendations are supplied wherever possible in cases where input is required. Expert system administrators can override default parameter values supplied by wizards with valid custom choices. A little study of the examples that follow should make it obvious that this principle has been applied throughout Windows 2000 online storage management user interface design.
Figure 6.8 Logical Disk Manager view of NODE-3-1’s disks.
Figure 6.9 Context-sensitive menu for basic disk with no partitions.
Figure 6.9 and those that follow demonstrate the automation of administrative functions by the Windows 2000 Logical Disk Manager. In Figure 6.9, Disk 2 has been selected (as indicated by reverse border highlighting) when the Action menu is pulled down. Because physical disk Disk 2 is a basic disk with no partitions allocated, the only valid management actions are to upgrade it to a dynamic disk and to view its properties page. Therefore, these are the only action commands displayed when the All Tasks list is selected from the Action menu.

Similarly, when unallocated storage space on a basic disk is selected (indicated by diagonal striping of the rectangle representing the space on Disk 2 in Figure 6.10), the only valid actions are to create a basic partition and view the disk’s properties. Therefore, the Logical Disk Manager displays only those commands on the Action menu when the storage space on basic Disk 2 is selected.
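Conceptually, this behavior is a lookup from the selected object's state to the set of valid actions. The sketch below is a hypothetical simplification in Python (not the Logical Disk Manager's actual implementation) covering just the two cases shown in Figures 6.9 and 6.10.

# Hypothetical sketch of context-sensitive command filtering.
VALID_ACTIONS = {
    ("basic disk", "no partitions"): ["Upgrade to Dynamic Disk...", "Properties..."],
    ("unallocated space", "on basic disk"): ["Create Partition...", "Properties..."],
}

def action_menu(object_kind, state):
    # Commands not valid for this object and state would be grayed out or omitted.
    return VALID_ACTIONS.get((object_kind, state), ["Properties..."])

print(action_menu("basic disk", "no partitions"))
# ['Upgrade to Dynamic Disk...', 'Properties...']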
Invoking Logical Disk Manager Wizards

The menu commands illustrated in Figure 6.9 and Figure 6.10 both result in the invocation of wizards that lead an administrator through the indicated volume management tasks. The Logical Disk Manager includes wizards to guide the system administrator through these common tasks associated with managing online storage:
Figure 6.10 Context-sensitive menu for space on a basic disk.
■■
Creating and deleting basic partitions or dynamic volumes of any type supported by the built-in Logical Disk Manager.
■■
Upgrading disks from basic to dynamic and, conversely, reverting disks from dynamic to basic format.
What a Logical Disk Manager Wizard Does

The Disk Management Create Volume wizard leads an administrator through the process of volume creation, requiring only that the administrator supply:
■■
Desired usable capacity for the volume
■■
Type of volume (i.e., simple, spanned, striped, mirrored, or RAID)
■■
The “width,” or number of subdisks to comprise the volume (for spanned, striped, and RAID volumes)
Using this information, the wizard examines the available dynamic disks and displays a list of those eligible to contribute to the volume. The wizard ensures that:
■■
Only dynamic disks with sufficient unused capacity are made available for subdisk allocation.
■■
The subdisks comprising a volume are located on different disks as required by the volume type. For example, no two subdisks belonging to different plexes of a mirrored volume may be located on the same disk. Similarly, for a RAID volume, no two subdisks that are part of different columns of an array may be located on the same disk.

A system administrator may accept or override any of the wizard’s recommendations during input specification.
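A minimal sketch of these placement rules follows; it is hypothetical Python with invented capacities, and the real wizard's allocation logic is certainly more involved.

# Hypothetical sketch of the wizard's placement rules: only dynamic disks with
# enough unused capacity are eligible, and each subdisk of a mirrored or RAID
# volume must land on a different disk. Capacities are invented.
disks = [
    {"name": "Disk 1", "dynamic": True,  "free_mb": 8373},
    {"name": "Disk 2", "dynamic": True,  "free_mb": 4338},
    {"name": "Disk 3", "dynamic": False, "free_mb": 8373},   # basic disk: not eligible
]

def eligible_disks(required_mb):
    return [d for d in disks if d["dynamic"] and d["free_mb"] >= required_mb]

def place_subdisks(required_mb, width):
    """Choose 'width' subdisks, each on a different eligible disk."""
    candidates = eligible_disks(required_mb)
    if len(candidates) < width:
        raise ValueError("not enough eligible disks for this volume type")
    return [d["name"] for d in candidates[:width]]

print(place_subdisks(4000, 2))   # e.g. two plexes of a mirror on two different disks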
Upgrading Disks to Dynamic Format

Windows 2000 volumes can be created only on dynamic disks. Basic disks can accommodate only partitions of the sort described in Chapter 5, “Disks and Volumes in Windows 2000.” Windows 2000 basic partitions cannot be resized or combined into volumes; they are supported primarily to enable legacy storage devices with data on them to be read and written by applications on Windows 2000 systems. Unless legacy data is involved, dynamic disks and volumes are usually preferred for their versatility. One of a system administrator’s first actions upon deciding to use Windows 2000 volume manager technology should therefore be to upgrade physical disks to dynamic disk format.

Using the Upgrade to Dynamic Disk… command illustrated in Figure 6.9, either basic (previously used with Windows NT or other operating systems) or unformatted (newly installed or initialized) disks can be upgraded to dynamic disk format. The Logical Disk Manager upgrades a basic or unformatted disk to dynamic format by writing format information on the disk, as described in Chapter 5. When the Logical Disk Manager’s Upgrade to Dynamic Disk… command is invoked, it first displays a dialog inviting the administrator to specify the disk or disks to be upgraded (see Figure 6.11).
Figure 6.11 Upgrade to Dynamic Disk Disk Specification dialog.
Figure 6.12 Disk 2 after upgrading to dynamic format.
The dialog in Figure 6.11 displays the list of disks that are eligible for upgrading. These are the disks listed as Basic in Figure 6.8. Only basic and unformatted disks are included in this listing. The administrator specifies a disk by checking the box to the left of its name. Any number of disks can be upgraded in a single invocation of the wizard.

Figure 6.12 gives the Logical Disk Manager console’s General view of the system’s disks after Disk 2 has been upgraded to dynamic format. Disk 2’s Type has changed to Dynamic. The available capacity of Disk 2 has been reduced slightly because the Logical Disk Manager has allocated a small amount of the disk’s capacity for metadata that will ultimately describe the disk’s contents. The rounding of disk capacity values for display purposes, however, has hidden that diminution of capacity. Furthermore, because no volumes have been created as yet, Disk 2’s unallocated capacity is the entire disk capacity, as in the console display shown in Figure 6.8. With disks in this state (upgraded to dynamic format), a system administrator can create dynamic volumes of any of the supported types.
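The arithmetic behind that observation is simple. The sketch below assumes a metadata (private) region of roughly 1 megabyte, which is typical for Windows 2000 dynamic disks, and shows why a three-significant-digit capacity display does not change.

# Why the metadata overhead is invisible in the console display (illustrative).
disk_gb = 8.37                       # capacity as displayed for the example disks
private_region_gb = 1 / 1024.0       # assumed ~1 MB dynamic disk metadata region

remaining_gb = disk_gb - private_region_gb
print(round(disk_gb, 2), round(remaining_gb, 2))   # 8.37 8.37 -- display unchanged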
CHAPTER 7

Basic Volumes
Creating a Simple Volume

Creation of a Windows 2000 Logical Disk Manager volume begins when an administrator invokes the Create Volume… command from the Logical Disk Manager’s Action menu, as shown in Figure 7.1. Because the Create Volume… command is a valid action for dynamic disks, it is displayed when the Action menu is pulled down while a dynamic disk is selected (Disk 5 in the figure). Invoking the Create Volume… command launches the Create Volume wizard, whose introductory panel is shown in Figure 7.2.

All Windows 2000 Logical Disk Manager wizards begin with a similar informational panel, such as that illustrated in Figure 7.2. Volume Manager wizards likewise begin with informational panels that describe their operation. For example, the Upgrade to Dynamic Disk wizard introductory panel describes:
■■
What the wizard does (this is common to all wizards).
■■
The nature and use of volumes in brief (this is unique to wizard type).
Clicking the Next button at the bottom of the introductory panel results in the display of the Select Volume Type panel shown in Figure 7.3. In this panel, which is common to all volume creation, the administrator specifies which of the supported volume types is to be created.

Figure 7.1 Disk Management Create Volume command.

The Logical Disk Manager can create either basic partitions or dynamic volumes. Basic partitions are created on basic disks using the Create Partition… context-sensitive menu command; dynamic volumes are created on dynamic disks using the Create Volume… command. The Logical Disk Manager displays the appropriate context-sensitive menu for the selected disk. For example, the Create Volume… wizard panel shown in Figure 7.2 will create a dynamic volume because the command was executed on a dynamic disk (Figure 7.1).

Figure 7.2 Create Volume wizard introductory panel.

The Windows 2000 Logical Disk Manager supports only one dynamic disk group, called DynamicGroup; thus, it is unnecessary to specify the disk group in which the volume should be created. In contrast, Windows 2000 Volume Managers support multiple disk groups; therefore, an intermediate panel for dynamic disk group specification is required. In Windows 2000 cluster systems, Volume Manager disk groups can be used to control the units in which disks may fail over between servers. With the Windows NT Version 4 Volume Manager, Windows 2000 Logical Disk Manager, and Array Managers, only one dynamic disk group is supported; the panel is omitted because there is no choice to make.

In Figure 7.3, a simple volume has been specified. The Logical Disk Manager can survey the available capacity on its dynamic disks (listed in the Unallocated Space column shown in Chapter 6, “Host-Based Volumes in Windows Servers,” Figure 6.12) and propose a location and capacity for the new volume. Figure 7.4 shows the Logical Disk Manager’s proposal to locate a 4,338-megabyte simple volume on Disk 2. As this figure suggests, the volume to be created will consist of a single plex with a single column, containing a single subdisk. The Logical Disk Manager proposes to create the subdisk and plex on Disk 2. The Logical Disk Manager has algorithms for selecting default locations for each type of volume it supports. In the case of a simple volume, the default is the first disk in the disk group with sufficient capacity available.

As the next step in volume creation, the administrator must either ratify (by clicking the Next button) or modify the Logical Disk Manager’s choice of volume location. Generically, this step is required for all volume types, but the details differ for each type of volume. By clicking the Add and Remove buttons on this panel, the administrator can specify the disk on which to locate a simple volume. Each time a different disk is specified, the Logical Disk Manager displays the maximum possible volume capacity on that disk in the Size box. The administrator can override this by specifying a smaller capacity than that proposed by the Logical Disk Manager.
Figure 7.3 Specifying Create Volume wizard volume type.
Once the administrator has specified the disk on which to allocate the simple volume, the next step is to assign a drive letter or drive path to the volume for application access. The Assign Drive Letter or Path panel is shown in Figure 7.6.1 Drive letter or access path assignment can be deferred if there is any reason to do so.

The next step is to specify the type of file system with which the volume is to be formatted. This is done in the Format Volume panel (see Figure 7.7). In the Windows 2000 operating system, the available file systems are FAT, FAT32, and NTFS. In the figure, the file system has been chosen (NTFS) and a volume label has been specified (1GB Simple). In this panel the administrator can also specify the allocation unit size for the file system from a list of candidates. A smaller allocation unit size optimizes the file system for many small files; a larger allocation unit size is more efficient for smaller numbers of larger files.

1 The Volume Manager displays only drive letters that are available for use in the drop-down list box for the Assign Drive Letter panel.
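To see why allocation unit size matters, consider the space lost in the last, partially filled allocation unit of each file; the file sizes below are invented purely for illustration.

# Illustrative only: internal fragmentation for different allocation unit sizes.
file_sizes = [700, 1_500, 3_000, 40_000, 2_000_000]      # bytes (invented)

def wasted_bytes(unit_size):
    waste = 0
    for size in file_sizes:
        remainder = size % unit_size
        if remainder:
            waste += unit_size - remainder      # unused tail of the file's last unit
    return waste

for unit in (512, 4096, 65536):
    print(unit, "byte units waste", wasted_bytes(unit), "bytes for these files")

Smaller units waste less space when files are small; larger units leave the file system with fewer allocation units to track when files are large.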
Figure 7.4 Disk specification for a simple volume.
If an NTFS file system is specified, the administrator can enable file and folder compression by checking a box in the Format Volume panel. Compressing files conserves space on storage media, but at the expense of processing time when files are opened. In addition, by not checking the Perform a Quick Format check box, the administrator can implicitly request a scan of all the blocks in the file system for readability. This (possibly time-consuming) scan can be bypassed if the state of the blocks on the volume’s disks is known to be good, for example, from previous usage of the disks. At this point in the volume creation process, the Logical Disk Manager has all the information required to create a simple volume. The Create Volume wizard displays a summary of the action it is about to perform, as shown in Figure 7.8. Not until the administrator clicks the Finish button on this panel does the Logical Disk Manager take any action to modify the system’s disks and volumes. And as long as the panel in Figure 7.8 is displayed, the administrator has the option of going back through the wizard and changing any of the parameters (by clicking Back), or of canceling the wizard entirely (by clicking Cancel).
Figure 7.5 Modifying the Create Volume wizard’s allocation proposal.
Creation of a volume results in space being allocated within the dynamic disk group. Figure 7.9 shows two Logical Disk Manager views of the system’s disks. On Disk 3, 1 gigabyte of capacity has been allocated to a volume that will be mounted as drive letter J:. At the point at which this figure was captured, the volume is still formatting, and will not be usable until the process is completed. In the volume view in the upper panel, therefore, no file system is indicated for drive letter J: because file system metadata is written at the end of the formatting process.

Figure 7.10 shows the Logical Disk Manager console’s volume objects. The newly created volume, labeled 1GB Simple, has a status of Healthy (all components functional) and a free capacity of 1.00 gigabytes.2 Applications address the volume as drive letter J:.

Figure 7.6 Create Volume wizard drive letter assignment.

Figures 7.9 and 7.10 show the state of the newly created simple volume before and after formatting is complete. If the Perform a Quick Format check box had been checked, the Healthy state would have been exhibited in seconds. In this case, because a full format was specified (by not checking the box), the Logical Disk Manager verifies the readability of all of the volume’s blocks by reading them.3 This can be time-consuming for large volumes, as volumes are not available for application use until formatting is complete. While a volume is formatting, an indicator in the Progress column of the general view tracks the progress.

When formatting is complete, the newly created simple volume can be used by applications. It is not necessary to reboot either a Windows NT Version 4 or a Windows 2000 system before using a new dynamic volume. Figure 7.11, for example, shows a Windows 2000 Explorer view of the newly created volume.

2 About five of the volume’s 1,024 megabytes are consumed by file system metadata. The Logical Disk Manager rounds this to 1.00 GB for the 3-digit display.

3 If any unreadable blocks are encountered during full formatting, the Volume Manager’s driver component (vxio) attempts to replace them using disk services requested over the I/O bus. If bad block replacement fails, the file system format operation fails and the area of the disk allocated to the volume is not usable.
Figure 7.7 Create Volume wizard file system choices.
Management Simplicity

The foregoing example illustrates how simple it is to use Logical Disk Manager wizards to manage online storage devices. In this example, the administrator had to specify only the type of volume required and the label text. The Logical Disk Manager supplied default values for all other choices, some of which were used in the example (e.g., file system type and allocation unit size) and others of which were overridden (e.g., capacity and drive letter). At each stage of specification, the administrator had the option of overriding the Logical Disk Manager’s choices:

Volume type. Striped, mirrored, or RAID, rather than simple or spanned.
Volume location. All dynamic disks with sufficient capacity were offered.
Drive letter. All unused drive letters were offered.
File system and file system parameters. FAT, FAT32, or NTFS.
File system compression. Compressed or not.
Formatting option. Quick or complete block scan.

Figure 7.8 Create Volume wizard specification summary.

In all cases, only valid options were presented, to guide the administrator through the entire process of volume creation. No knowledge of configuration details external to the wizard was required; all valid choices were exposed at each stage. This design allows the administrator to focus on storage management policy issues rather than on implementation details.
NOTE: Regardless of the type of volume being created, the sequence of steps described in the preceding section is basically the same. The subsequent sections of this chapter describe the creation and use of more complex types of volumes. In all cases, the Create Volume wizard is used. For the sake of brevity, only those panels that illustrate features unique to a particular volume type are reproduced in each of the following examples.
Figure 7.9 Disk object general view of newly created volume.
Creating a Spanned Volume

This example illustrates the creation of a spanned volume. Spanned volumes can be used to present, as a single volume, more storage capacity than is available on the largest single disk in a system. They can also be used to aggregate unallocated space on disks that are hosting other volumes for presentation as a single, large non-failure-tolerant volume.
Figure 7.10 Volume object general view of newly created volume.
Figure 7.11 Windows 2000 Explorer view of newly created simple volume.
NOTE: Prior to the start of this example, the simple volume created in the preceding example has been deleted, so the disks are in the state illustrated in Figure 6.12: already upgraded to dynamic format, but with no volumes allocated on them.
The Create Volume wizard begins with the two introductory panels shown in Figures 7.2 and 7.3. In the second (Select Volume Type), the administrator specifies the spanned volume type by clicking its radio button. In the third panel, shown in Figure 7.12, the administrator specifies the volume’s usable capacity and the disks on which it should reside.

Figure 7.12 Logical Disk Manager allocation for a spanned volume.

The Create Volume wizard behavior is similar for simple and spanned volumes. Each time a disk is selected, its available capacity is added to the Total volume size, which represents the largest spanned volume that can be created using the specified disks. The All available dynamic disks list box shown in Figure 7.12 contains an entry for each disk that is eligible to be part of the spanned volume (any dynamic disk with unallocated capacity is eligible). An administrator adds a disk to the spanned volume by selecting the disk in this list box and clicking the Add button. When all desired disks have been specified, the administrator specifies the desired capacity for the spanned volume and clicks the Next button.

From this point, spanned volume creation is identical to simple volume creation, and therefore is not illustrated here. The Create Volume wizard advances to the Assign Drive Letter or Path and Format Volume panels, shown in Figures 7.6 and 7.7, respectively. These panels behave the same for spanned volumes as for simple ones. A summary panel similar to that shown in Figure 7.8 summarizes the input specifications and gives the administrator the options of actually creating the spanned volume, retracing the steps to change input specifications, or canceling volume creation altogether.

When volume creation is complete, the Logical Disk Manager console’s disk view, shown in Figure 7.13, indicates that capacity from Disks 5 and 6 has been allocated to the spanned volume. At the point of the figure capture, the volume is being formatted, so no file system has been initialized on it and no File System is listed in the upper console panel. Upon completion, the volume will be addressed by applications as drive K:.

Figure 7.13 Logical Disk Manager view of spanned volume during formatting.

Figure 7.14 shows the Logical Disk Manager’s console view of the spanned volume created in this example after formatting is complete and the volume is ready for use. The upper panel conveys basic information about all of the system’s volumes at a glance. Drive letter K:, labeled 3.45 GB Spanned,4 is shown as a dynamic volume with spanned layout and an NTFS file system; the volume is listed as Healthy. The lower panel indicates that Disks 5 and 6 each contribute capacity to the 3.45 GB spanned volume (as well as to other volumes).

Figure 7.14 View of spanned volume ready for use.

4 While the labels of volumes used as examples in this book are often created with the intention of signifying the volume’s function, the reader should clearly understand that there is no inherent relationship between a volume’s label and its type. Thus, the label 3.45 GB Spanned could equally well be used with a simple, striped, mirrored, or RAID volume.
Creating a Striped Volume

The example in this section demonstrates the creation of a striped volume on the system used for the preceding examples. The striped volume is to be two columns wide (perhaps for performance reasons), so it will occupy storage capacity from two subdisks located on different disks.

As in the preceding example, the Create Volume wizard begins with the two introductory panels shown in Figures 7.2 and 7.3. In the second (Select Volume Type) panel, the administrator specifies the striped volume type. In the third (Select Disks), the administrator specifies the volume’s usable capacity and the disks on which its subdisks will be allocated. The number of disks across which the blocks of a striped volume are spread affects the performance potential of the volume. Wider volumes containing more disks have the potential for servicing more concurrent I/O requests or, for very large I/O requests, the potential that a single request will be split across two or more disks for parallel execution.

In the Select Disks panel, the wizard’s behavior is similar to that for simple and spanned volume creation. Each time a disk is specified, its available capacity is added to the Total volume size, which represents the largest striped volume that can be created using the specified disks. All the subdisks that comprise a striped volume must be identical in size; therefore, the Total volume size reported for a given disk configuration during striped volume creation may be lower than that for spanned volume creation on the same disk configuration, because a spanned volume can use all available subdisks, no matter what their sizes. The Select Disks panel is also used to specify the total usable capacity of the striped volume, by entering the desired capacity in the text box labeled For all selected disks. Figure 7.15 indicates that the two disks specified (Disks 2 and 3) have a total capacity of 8,676 megabytes, of which 4,338 megabytes are available for creation of a striped volume. If the administrator inadvertently specifies a larger capacity volume than can be created using the specified disks, the wizard blocks further progress by disabling the Select Disks panel’s Next button (Figure 7.16).

Figure 7.15 Create Volume wizard disk specification and resulting maximum.

The Volume Manager allows an administrator to specify the stripe unit size for striped, striped-mirrored, and RAID volumes, but the Logical Disk Manager supports a fixed stripe unit size of 64 kilobytes. Therefore, if the I/O loads that will be placed on volumes are known to be relatively homogeneous (i.e., predominantly I/O request-intensive, or predominantly data transfer-intensive), substituting the Volume Manager for the built-in Logical Disk Manager may be advisable so that stripe unit size may be adjusted to optimize performance, as described on page 48. Administrators are advised to use caution in altering the default stripe unit size for striped volumes. Too small a stripe unit size can result in an excessive number of small I/O requests being split across two disks, increasing average I/O latency. Too large a stripe unit size can reduce parallelism in the execution of very large I/O requests and, by mapping more consecutively numbered volume blocks to a single disk, can reduce the number of small I/O requests that can execute concurrently for applications with high locality of reference.

To successfully create a volume, an administrator must specify a capacity that can be allocated from available space within the disk group. In this example, this could be done in two ways:
■■
The number of columns could be held constant and the specified capacity of the volume could be reduced to the 7,542-megabyte capacity shown as
Figure 7.16 Create Volume wizard response to an impossible request.
Maximum in Figure 7.16. This would allow the specified capacity to be met from the specified disks. ■■
The number of columns in the striped volume could be increased by specifying more disks, to contribute capacity to the volume. The maximum size of a striped volume is the smallest number of unallocated blocks on any of the specified disks multiplied by the number of disks specified.
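To illustrate the capacity rule just stated, the short Python sketch below computes the largest striped volume a given set of disks can hold. It is purely illustrative (the disk free-space figures are hypothetical, not taken from the figures in this chapter) and is not part of the Logical Disk Manager:

def max_striped_volume_mb(free_mb_per_disk):
    """Largest striped volume (in megabytes) that fits on the specified disks.

    Every column of a striped volume must be the same size, so the volume is
    limited by the disk with the least unallocated space, multiplied by the
    number of disks (columns) specified.
    """
    return min(free_mb_per_disk) * len(free_mb_per_disk)

# Hypothetical unallocated capacities, in megabytes
print(max_striped_volume_mb([3000, 2500]))        # 5000: two columns of 2,500 MB each
print(max_striped_volume_mb([3000, 2500, 1200]))  # 3600: the smallest disk limits all three columns

Note that, under this rule, adding a disk with little free space can reduce rather than increase the maximum striped capacity.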
The Logical Disk Manager will not allow two columns of the same striped, RAID, or striped-mirrored volume to be allocated on the same disk, because that would defeat the purpose of the volume, whether performance enhancement or failure tolerance. Each column of a striped, striped-mirrored, or RAID volume must be located on a different dynamic disk. Once the disk complement for the striped volume and the volume’s capacity have been specified, the wizard advances to the Assign Drive Letter or Path (as in Figure 7.6) and Format Volume (as in Figure 7.7) panels. These are treated the same as in the simple and spanned volume examples. As Fig-
ure 7.17 indicates, applications will address the striped volume using drive letter J:. An NTFS file system using the default allocation unit size will be initialized. The volume label—4.3GB Strip (occluded in Figure 7.17)—has been specified in the Format Volume panel. As with all Logical Disk Manager wizards, no action is taken until the administrator clicks the Finish button on the summary panel. When the Finish button is clicked, changes to the dynamic disk group metadata to record the composition and layout of the striped volume are made atomically. Figure 7.18 shows the Logical Disk Manager view of the striped volume during formatting. The top of the console window indicates that the volume is striped, but no file system is shown in the File System column because the volume itself is still formatting, and the file system has not yet been initialized. The disk view at the bottom of the window shows that storage capacity for the volume is allocated on Disks 2 and 3. Applications will address the volume using drive letter J:.
Figure 7.17 Specification summary for a striped volume.
Figure 7.18 Logical Disk Manager view of a striped volume during formatting.
When formatting is complete and the volume is usable by applications, three things happen: ■■
The progress indicator disappears.
■■
The volume label (4.3GB Strip) appears in the Volume column of the display.
■■
The file system type (NTFS) is displayed.
These changes are reflected in Figure 7.19.
Creating a Mirrored Volume

The Logical Disk Manager can create mirrored volumes—volumes with two identical copies of application data, stored on different disks. Mirrored volumes keep data accessible to applications even when one of the disks that contributes space to the volume fails. Starting with the disk configuration shown in Chapter 6, Figure 6.12 (dynamic disks with no volumes allocated on them), this example uses the Logical Disk Manager’s Create Volume wizard to create a mirrored volume with a capacity of 1,024 megabytes. The administrator specifies that a mirrored volume should be created by clicking the corresponding radio button
Figure 7.19 The striped volume is ready for application use.
in the wizard’s Select Volume Type panel (refer to Figure 7.3). Clicking the Next button with the Mirrored volume type specified leads to the Select Disks panel (Figure 7.20); there, the administrator specifies the two disks on which the mirrored volume’s subdisks are to be allocated. The subdisks of a mirrored volume must be allocated on different physical disks; otherwise, there is no protection against disk failure. In Windows 2000 servers with two or more host bus adapters connecting to disks, it is usually preferable to allocate the two subdisks of a mirrored volume on disks connected to different host bus adapters. This protects data accessibility against host bus adapter and cable failure as well as against disk failure. The largest possible capacity for a Logical Disk Manager mirrored volume is the capacity of the largest subdisk that can be allocated on either of the specified disks. In Figure 7.20, for example, one of the disks specified has 4,338 megabytes available for allocation; the other has at least that much. This can be inferred from the Maximum field in the lower right corner of the panel. In the figure, the administrator has specified a volume usable capacity of 1,024 megabytes, which is well below the maximum. Once the location of the mirrored volume has been specified, the wizard proceeds to the Assign Drive Letter or Path (Figure 7.6) and Format Volume (Figure 7.7) panels. As with other volume creation examples in this
Figure 7.20 Specifying the disks for a mirrored volume.
chapter, these panels behave identically, no matter what kind of volume is being created. For this example, an NTFS file system with the Default allocation unit size is specified. The volume label is 1 GB Mirror, and applications will address it using drive letter H:, as shown in Figure 7.21. Again, after the administrator clicks the Finish button on the wizard’s summary panel, the Logical Disk Manager begins the actual work of volume and file system creation. For failure-tolerant volumes, this includes: ■■
Writing Logical Disk Manager metadata that describes the volume in the private region on each disk in the disk group.
■■
Formatting (writing and reading each volume block if a full format was requested).
■■
Initializing file system metadata for the designated type of file system into the appropriate volume blocks.
■■
Adjusting Windows operating system data structures so that the volume is recognized by applications and utilities such as Windows Explorer. New
Figure 7.21 Formatting a mirrored volume.
dynamic volumes can be recognized and used by applications without a system reboot in both Windows NT Version 4 and Windows 2000. ■■
Synchronizing volume contents (i.e., making the contents of corresponding blocks of all mirrors in a mirrored volume identical, or making RAID volume user data blocks match corresponding parity block contents).
In this example, quick formatting of the newly created volume is not specified, so the volume must be:
Formatted. Every block must be written and read to ascertain that it is readable.
Synchronized. The contents of every volume block must be made identical on both of the disks comprising the mirror.
Figure 7.21 is the Logical Disk Manager’s console view captured shortly after the mirrored volume was created. In this view, the volume is still formatting, so no file system has been initialized on it at this point and it is not available for application use. Figure 7.22 was captured a few minutes after the volume was created. As with earlier examples, although a drive letter (H:) has been assigned for the volume, no file system is indicated in the volume overview at the top of the console window. The disk view at the bottom of the window indicates that capacity for the mirrored volume has been allocated on Disks 2 and 3, and that the volume is formatting.
Figure 7.22 Attempting to access a mirrored volume during formatting.
Although the volume is visible to applications at this point, it is not usable because it lacks a file system. Application attempts to access the volume return errors of various sorts, as shown in Figure 7.22 where a Windows 2000 Explorer attempt has been made to explore the still-formatting mirrored volume (Local Disk H:) while formatting is still in progress (as indicated by the Logical Disk Manager console view also shown in the figure).
Before they can protect against data loss due to disk failure, failure-tolerant volumes’ contents must be internally consistent. For mirrored volumes, the contents of corresponding blocks in each mirror must be identical. For RAID volumes, parity must be consistent with corresponding user data block contents. The process of making a failure-tolerant volume’s underlying disk blocks consistent is called resynchronization. Resynchronization is required when a volume is created and whenever a failed or removed disk is replaced. For large volumes, resynchronization is a time-consuming process, since it requires reading or writing of every block on every subdisk that is part of the volume.
Figure 7.23 Resynchronizing a mirrored volume.
Figure 7.23 shows the Logical Disk Manager console window view of the mirrored volume created in this example while it is resynchronizing (“Resynching” in the figure). A failure-tolerant volume can be used by applications while it is resynchronizing (for example, it can be populated with data). Until resynchronization is complete, however, the volume’s check data is not guaranteed to be consistent, and user data contained on the volume is at risk if one of the volume’s disks fails. A prudent administrative practice, therefore, is to defer writing data that is not easily recoverable or reproducible on failure-tolerant volumes until resynchronization is complete and the volume’s status is listed as Healthy in the Logical Disk Manager console, as in Figure 7.24.
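For a rough sense of how long resynchronization might take, its duration can be estimated from the volume’s capacity and the sustained rate at which blocks can be copied. The sketch below is only a back-of-the-envelope illustration with hypothetical numbers; the Logical Disk Manager does not report such an estimate:

def resync_minutes(volume_gb, copy_rate_mb_per_sec):
    """Rough resynchronization time: every volume block must be read from one
    mirror and written to the other, so elapsed time is roughly capacity
    divided by the copy rate actually sustained (which drops if applications
    are generating I/O at the same time)."""
    return (volume_gb * 1024) / copy_rate_mb_per_sec / 60

# Hypothetical: a 4-gigabyte mirrored volume copied at a sustained 10 MB/s
print(f"{resync_minutes(4, 10):.0f} minutes")  # about 7 minutes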
Figure 7.24 Mirrored volume redundant and ready for use.
Splitting a Mirror from a Mirrored Volume

It is sometimes useful to “freeze” an image of data by decoupling a mirror from a mirrored volume at a point in time when applications are not using the data. In the terminology of the Windows 2000 Logical Disk Manager, this is called “breaking” the mirror. Splitting a mirror from a mirrored volume enables the data on the split mirror to be used for other purposes—for example, backup or analysis of an image of application data that is frozen at a point in time. Splitting a mirror is an administrator function initiated by invoking the Break Mirror volume object command. This is shown in Figure 7.25 for the volume created in the preceding example. Since the Logical Disk Manager only supports mirrored volumes with two mirrors, splitting a mirror from one of its mirrored volumes necessarily leaves that volume unprotected against disk failure. The first step in executing the Break Mirror command is therefore to verify that this is really the administrator’s intention. This is done by responding to the alert message as shown in Figure 7.26.
Figure 7.25 Invoking the Break Mirror command to split a mirror from a mirrored volume.
Figure 7.26 Verifying the Break Mirror action.
Clicking No to this query aborts execution of the Break Mirror command; clicking Yes splits the mirrored volume into two simple volumes that are made immediately available to applications. Figure 7.27 illustrates the mirrored volume split into two simple ones. The Logical Disk Manager chooses an available drive letter (G: in the example) on which to make the split mirror available to applications. At the instant of splitting, the contents of the two volumes are identical, up to and including the volume labels, which both display as 1 GB Mirror in Figure 7.27. Once the split occurs, however, the two resulting simple volumes bear no relationship to each other, and the data on each can be updated independently of the other.
Figure 7.27 Mirrored volume split into two simple volumes.
A typical usage of this feature might be to use data on the split mirror (drive G:) as the source for a backup while applications resume processing the data on the original volume (H:). Because the Logical Disk Manager has a limit of two mirrors per volume, splitting one in this way necessarily leaves data susceptible to loss due to disk failure. Therefore, it is usually advisable to rejoin the split mirror to its volume as soon as the intended usage of the data on it (e.g., as a backup) has been carried out.5
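The overall split-mirror backup cycle implied by this usage can be summarized as a short script. The sketch below is purely illustrative: the functions are hypothetical placeholders for steps an administrator performs (in the Logical Disk Manager console or through application tools), not Logical Disk Manager APIs:

# Hypothetical placeholder actions; each stands in for an administrative step
# described in the text, not for a programmable Logical Disk Manager call.
def quiesce_application():
    print("application quiesced: no I/O in progress, caches flushed")

def resume_application():
    print("application resumed")

def break_mirror(volume):
    print(f"Break Mirror invoked on volume {volume}")

def run_backup(source):
    print(f"backing up frozen image on {source}")

def delete_volume(volume):
    print(f"volume {volume} deleted")

def add_mirror(volume):
    print(f"Add Mirror invoked on volume {volume}")

def split_mirror_backup_cycle(mirrored_drive="H:", split_drive="G:"):
    """Order of operations for a split-mirror backup, following the text above."""
    quiesce_application()            # ensure the on-disk image is consistent
    break_mirror(mirrored_drive)     # frozen image becomes available as split_drive
    resume_application()             # applications resume on the original volume
    run_backup(source=split_drive)   # back up the frozen image while applications run
    delete_volume(split_drive)       # discard the frozen image when the backup is done...
    add_mirror(mirrored_drive)       # ...and rejoin a mirror so the volume is redundant again

split_mirror_backup_cycle()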
Adding a Mirror to a Logical Disk Manager Volume

Returning a split mirror to its original volume, or indeed adding a mirror to any Logical Disk Manager simple volume, is accomplished by invoking the Add Mirror simple volume object command, as shown in Figure 7.28. This command is directed to the volume to which a mirror is to be added (the volume made available as drive H: in this figure). The Logical Disk Manager can allocate the mirror from available capacity on any disk in the dynamic disk group (except, of course, the disk on which the volume to which the mirror is to be added resides).
5. The Windows 2000 Volume Manager supports mirrored volumes with three or more mirrors, so this limitation does not apply to them. Whenever a mirror is split from a volume for other use, however, it is usually desirable to replace it with another or to rejoin it to its volume eventually.
Figure 7.28 Invoking the Add Mirror… volume object command.
The administrator must specify the disk on which to allocate capacity for the new mirror. This is done using the dialog displayed when the Add Mirror… command is invoked, as shown in Figure 7.29. Here, the Logical Disk Manager displays a list of disks on which sufficient unallocated capacity for the new mirror exists. Disks with insufficient free capacity are not listed; neither is the disk on which the volume to which the mirror is to be added resides. In the example, the administrator has specified Disk 3 (the disk that was part of the 1 GB Mirror volume when it was originally allocated). Figure 7.30 shows the (perhaps surprising) result of adding a mirror to the volume presented as drive letter H:. Disk 3 is contributing capacity to two volumes: ■■
A simple volume presented as drive letter G:.
■■
A mirrored volume presented as drive letter H:. Space for the other mirror of this volume is allocated on Disk 2.
This result occurs because the volume known as drive letter G:, which was created when the Break Mirror command was issued in the preceding example, was never deleted. It continues to occupy capacity on Disk 3. Nevertheless, Disk 3 had sufficient unallocated capacity, even with this volume on it, so it was a candidate to provide the capacity for the added mirror specified in Figures 7.28 and 7.29.
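The selection rule the Logical Disk Manager applies when it builds this list of candidate disks can be expressed compactly. The following sketch is hypothetical (the disk names and free capacities are invented) and simply mimics the logic described above:

def candidate_disks_for_new_mirror(volume_size_mb, volume_disks, free_mb_by_disk):
    """Disks eligible to hold a new mirror: enough unallocated capacity, and not
    already contributing a subdisk to the volume being mirrored."""
    return [disk for disk, free_mb in free_mb_by_disk.items()
            if disk not in volume_disks and free_mb >= volume_size_mb]

# Hypothetical disk group: a 1,024 MB simple volume currently lives on Disk 2.
free_mb = {"Disk 2": 3314, "Disk 3": 2290, "Disk 4": 512}
print(candidate_disks_for_new_mirror(1024, volume_disks={"Disk 2"}, free_mb_by_disk=free_mb))
# ['Disk 3']: Disk 2 already holds the volume, and Disk 4 has too little free space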
Figure 7.29 Specifying the disk when adding a mirror to a mirrored volume.
Figure 7.30 Mirror added to a mirrored volume.
Part of the confusion around this example stems from the fact that both volumes have the same label (1 GB Mirror), which seems to indicate that the volumes are mirrored, even though the volume presented as drive letter G: is never mirrored, and the volume presented as drive letter H: is not mirrored at the start of the example. This suggests the importance of meticulous online storage management practices; specifically, this would include the prompt and accurate relabeling of volumes as their nature and usage change. It also points out the necessity of relying on definitive information such as that provided by the Logical Disk Manager console, rather than suggestive hints (such as volume labels) that may or may not be accurate.
Removing a Mirror from a Mirrored Volume The Logical Disk Manager also makes it possible to completely remove a mirror from a mirrored volume and deallocate the storage capacity it occupies. Removing a mirror from a mirrored volume makes data on the removed mirror inaccessible. In essence, removing a mirror is a shortcut for: ■■
Splitting the mirror using the Break Mirror… command.
■■
Deleting the volume that results from the split.
A mirror is removed from a mirrored volume by invoking the Remove Mirror… volume object command, shown in Figure 7.31.
Figure 7.31 Removing a mirror from a mirrored volume.
Invoking the Remove Mirror… command displays the dialog shown in Figure 7.32, which the administrator uses to specify the mirror to be removed from the mirrored volume. Because the Logical Disk Manager supports only two mirrors per volume, removing one necessarily makes the volume nonredundant, that is, no longer protected against disk failure.
Figure 7.32 Specifying and confirming mirror removal.
Figure 7.33 Mirrored volume after mirror removal.
Moreover, when a mirror is removed, the subdisk it contains is deleted, rendering the data on it inaccessible. For this reason, when the disk to be removed from the mirrored volume is specified, the Logical Disk Manager displays a confirmation message, also shown in Figure 7.32. After the administrator responds Yes to the query in Figure 7.32 and the mirror subdisk on Disk 3 has been removed from the volume, the state of Disks 2 and 3 from this and the preceding examples is as shown in Figure 7.33. In this figure, both disks have 1-gigabyte simple volumes allocated on them. The subdisk on Disk 3 that was allocated in the preceding example, when a mirror was added to the volume presented as drive letter H:, has been deleted. Both G: and H: now represent simple volumes. Although neither is fault-tolerant, both volumes still carry the somewhat confusing and deceptive label 1 GB Mirror, which again suggests the importance of prompt and well-designed storage management procedures, especially for tasks for which there are no checks and balances to prevent errors. A user looking for available storage capacity might easily make false inferences about the properties of these volumes by reading their labels.
CHAPTER 8

Advanced Volumes
The Volume Manager for Windows 2000

The examples in Chapter 7 illustrated the Windows 2000 operating system’s built-in Logical Disk Manager (LDM). Also available, from VERITAS Software Corporation, is the Volume Manager for Windows 2000. This Volume Manager provides a functional superset of the Logical Disk Manager’s capabilities and, when installed, automatically replaces the Logical Disk Manager. This chapter uses the Volume Manager to illustrate a number of advanced volume management functions available in the Windows 2000 environment.
Like the Logical Disk Manager it replaces, the VERITAS Volume Manager is a Microsoft Management Console “snap-in.” When the software is installed, its icon replaces the Disk Management icon in the Computer Management console window, as shown in Figure 8.1. As with the Logical Disk Manager, an administrator starts the Volume Manager console by clicking its icon in the Computer Management Tree panel. Doing so “snaps” the panels of the Volume Manager’s interface into the console’s right panel, replacing the object listing in Figure 8.1.
Figure 8.2 shows the General view of the Volume Manager console (though the Computer Management object tree has been reduced to its minimum width to provide a larger viewing area for the two Volume Manager panels). Here the disks are attached to a server named LEFT. Disks 3 through 10 are used in the examples that follow (the remaining disks comprise the server’s permanent data storage, and are not used in these examples). For each disk,
Figure 8.1 Invoking the VERITAS Volume Manager for Windows 2000.
the operational status, format (Type), disk group membership, total capacity, unallocated capacity, and graphical layout are shown. This view provides significant additional information about the disks connected to a system, information that is available either by enlarging the console window horizontally or by scrolling the panel to the right, as illustrated in
Figure 8.2 General view of a system’s disks.
Figure 8.3 Additional disk information available in the general view.
Figure 8.3. In this figure, the General panel has been scrolled entirely to the right to display additional information about the system’s disks. This view indicates that the eight disks to be used in the following examples have SCSI interfaces and are accessed through Windows SCSI drivers. The Windows 2000 operating system assigns a port number to each I/O interface (ASIC or host bus adapter) it discovers. Figure 8.3 indicates that the eight disks to be used are connected to the computer’s Port (host bus adapter) number 3. The SCSI target ID (Ta…) of each disk is shown, as are the channel (Ch…) for multichannel interfaces and SCSI logical unit number (LUN). For directly attached disks, the logical unit number is always zero. The virtual disks presented by RAID controllers often respond to nonzero logical unit numbers. The Vendor column in Figure 8.3 displays the vendor identification string returned by the disk in response to a SCSI Inquiry command. This information can be helpful in diagnosing and repairing disk problems that become visible through the Volume Manager console interface. Although the Volume Manager is capable of managing basic disks and creating basic partitions, it is primarily used to manage dynamic volumes allocated from storage capacity on dynamic disks. Disks on which dynamic volumes will be allocated must first be upgraded to dynamic format.
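The per-disk addressing information summarized above (port, channel, target ID, logical unit number, and vendor string) can be thought of as a simple record. The sketch below is only a hypothetical representation of one row of the scrolled General view; the field values are invented:

from dataclasses import dataclass

@dataclass
class DiskAddress:
    """How Windows identifies a SCSI-attached disk in the console's General view."""
    port: int       # host bus adapter (or ASIC) number assigned by Windows
    channel: int    # channel, meaningful for multichannel interfaces
    target_id: int  # SCSI target ID of the device
    lun: int        # logical unit number; zero for directly attached disks
    vendor: str     # vendor identification returned by the SCSI Inquiry command

# Hypothetical entry resembling one of the disks used in these examples
disk7 = DiskAddress(port=3, channel=0, target_id=5, lun=0, vendor="SEAGATE")
print(disk7)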
As with the Logical Disk Manager, most Volume Manager commands are invoked by placing the cursor on an icon representing the object to which the command applies and then right-clicking the mouse. This displays a menu of commands that apply to the object. Commands that do not apply in the object’s current context are disabled. For example, as Figure 8.4 illustrates, the Upgrade to Dynamic Disk… command is invoked by right-clicking on the icon of any basic disk. Doing so invokes the Upgrade to Dynamic Disk wizard, which starts by displaying the introductory panel common to all Volume Manager and Logical Disk Manager wizards. Following the introductory panel (not shown in this example) is the Select Disks to Upgrade panel, shown in Figure 8.5. It is in this panel that an administrator specifies the disks to be upgraded. This panel also illustrates one of the significant differences between the Volume Manager and the built-in Logical Disk Manager—the capability of organizing disks into multiple dynamic disk groups. In Figure 8.5 the administrator has elected to create a new dynamic disk group, where the upgraded disks will be placed by clicking the New button on the Select Disks to Upgrade panel. Clicking New displays the Create Dynamic Group subdialog, which requests that the administrator name the new dynamic disk group. The MirrorGroup has been entered as the name for the dynamic disk group to be created.
Figure 8.4 Invoking the Upgrade to Dynamic Disk command.
Figure 8.5 Upgrading disks to dynamic format.
In Figure 8.5 Disks 3, 4, and 5 have been specified for upgrade to dynamic format using the panels Add button. All three will be placed into MirrorGroup when the command executes. The remaining steps of the Upgrade to Dynamic Disk wizard’s specification phase are functionally identical to those of the Logical Disk Manager’s Upgrade to Dynamic Disk wizard, and so are not shown here. To illustrate multiple dynamic disk groups, Figure 8.6 shows that Disks 7, 8, 9, and 10 have been similarly upgraded to dynamic format and placed in a separate dynamic disk group called RAID5Group. When this figure was captured, no volumes had been defined on either of the two dynamic disk groups. Figure 8.6 also indicates that Disk 6 is still a basic disk (with a volume defined on it). The point is that a disk’s format (basic or dynamic) is independent of the type of disk interface, as well as the format of other disks attached to the same interface. Basic Disk 6 is attached to the same SCSI host bus adapter as Disks 3 through 5 and 7 through 10, all of which have dynamic format. In the next example, a mirrored volume is created using two of the disks in the MirrorGroup dynamic disk group. As with the Logical Disk Manager, volumes are created using the Create Volume… command (shown in Figure 8.4). This command invokes the Volume Manager’s Create Volume wizard. After displaying two initial panels that, respectively, describe the wizard’s function and enable the administrator to confirm or modify the disk group specification (not shown in the example), the wizard displays the Select Volume Type panel shown in Figure 8.7. Using this panel the administrator specifies the type and capacity of the volume to be created.
Figure 8.6 Volume Manager console showing MirrorGroup and RAID5Group.
As with the Logical Disk Manager, concatenated, striped, RAID-5, and mirrored volumes can be created. Unlike the Logical Disk Manager, the Volume Manager supports mirrored volumes with more than two mirrors (up to 32).1 The administrator specifies the number of mirrors in a mirrored volume by adjusting the Number of Mirrors parameter in the Select Volume Type panel. For striped and RAID volumes, the Volume Manager also makes it possible to optimize for specific types of I/O load. This is accomplished by adjusting the Stripe Size parameter in the panel.2 For small random I/O requests, a Stripe Size value of between 10 and 20 times the typical I/O request size usually works well. For very large I/O requests, a Stripe Size value equal to the lesser of half the number of bytes on a disk track or the typical I/O request size divided by the number of disks in a stripe that contain data, is usually optimal. These optimizations are usually appropriate for homogeneous I/O loads (those consisting predominantly of requests of the same size). If an administrator is uncertain about the degree to which an I/O load is homogeneous, or if there is uncertainty about the probable effect of tuning, it is usually best to leave the Stripe Size parameter at its default value.
1. RAID volumes with as many as 256 columns are also supported.
2. The parameter referred to as stripe size in this wizard panel is called stripe depth in the architectural descriptions.
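The stripe size rules of thumb quoted above can be reduced to a small calculation. The helper below is only a sketch of those guidelines (the workload figures in the example are hypothetical), not a formula supplied by the Volume Manager:

def suggested_stripe_size_kb(typical_request_kb, data_disks, track_kb,
                             request_intensive=True):
    """Apply the rules of thumb from the text.

    Small random requests: 10 to 20 times the typical request size (the
    midpoint, 15x, is returned here). Very large requests: the lesser of half
    a track or the typical request size divided by the number of data disks
    in a stripe.
    """
    if request_intensive:
        return 15 * typical_request_kb
    return min(track_kb / 2, typical_request_kb / data_disks)

# Hypothetical workloads
print(suggested_stripe_size_kb(4, data_disks=4, track_kb=256))        # 60 KB for 4 KB random requests
print(suggested_stripe_size_kb(1024, data_disks=4, track_kb=256,
                               request_intensive=False))              # 128.0 KB for 1 MB transfers

As the text cautions, these heuristics assume a homogeneous I/O load; when the load is mixed or unknown, the default Stripe Size is the safer choice.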
Figure 8.7 Create Volume wizard layout and capacity specification.
In Figure 8.7, the administrator has specified a concatenated volume (meaning that its volume block addresses are not striped across the disks that contribute capacity to the volume) that is to be mirrored on two mirrors. The capacity specified for this volume is 2000.25 megabytes. This rather odd number was computed and displayed by the Volume Manager when the wizard’s Query Max Size button was clicked. This button simplifies the use of different disk types with different capacities. When the button is clicked, the Volume Manager computes the maximum possible capacity volume of the specified type that can be constructed from capacity available anywhere in the selected disk group. In this case, 2000.25 megabytes is the largest two-mirror volume that can be allocated from available capacity on disks in MirrorGroup. Once the volume type and usable capacity have been specified and the administrator clicks the Next button, the wizard’s Verify Disks panel proposes a layout for the volume by indicating graphically which disks will contribute capacity to it, and how much each will contribute. Figure 8.8 shows the Volume Manager’s layout proposal for the mirrored volume in this example. The mirrored volume to be created will contain a single column (sequence of disk addresses), called Column 0 in Figure 8.8. Since the volume is mirrored,
Figure 8.8 Create Volume wizard layout proposal.
the column will be replicated in each of two plexes (Plex 0 and Plex 1) on separate disks. Every plex of a mirrored volume must contain a copy of every volume block, so the capacities of the two plexes are identical (2000 MB— the number is truncated for the display). When available capacity in the dynamic disk group permits, the administrator may modify the Volume Manager’s disk layout proposal by clicking the Modify button. This displays the Modify Disk Selection subdialog shown in Figure 8.9. This subdialog is a table containing a row for each disk in the disk group for which the Volume Manager has proposed to allocate capacity for the volume. The Disk entry in each row is a drop-down list which, when expanded, contains the names of all other disks in the group on which capacity could alternatively be allocated. In the example, MirrorGroup contains Disks 3, 4, and 5; the Volume Manager has proposed to use Disks 3 and 4. Disk 5 would be an alternative to either (but not both) of the proposed disks. In the figure, the list of alternatives for Disk 3 is expanded, and is shown to contain Disk 5. Each time an administrator modifies the disk configuration for a volume, the Volume Manager recomputes the alternatives for all disks in the proposal. If,
Figure 8.9 Modifying the plex proposal.
for example, the administrator were to substitute Disk 5 for Disk 3 in Figure 8.9 (by selecting Disk 5 in the drop-down list), subsequent expansion of the Disk 4 table entry would show Disk 3—not Disk 5—as an alternative to Disk 4, because Disk 5 would have been selected to contain the volume’s Plex 0. In some cases, no alternatives are possible. For example, if MirrorGroup contained only two disks, there would be no alternatives for the volume in this example. When this is the case, the drop-down lists in the Modify Disk Selection subdialog are empty, indicating no available alternatives. When the administrator indicates satisfaction with the volume’s disk configuration by clicking the Next button, the wizard displays the Assign Drive Letter panel, shown in Figure 8.10. This panel is used to assign a drive letter through which the new volume is to be made available to applications, or, alternatively, to specify a mount point (an NTFS folder) for the same purpose. Drive letter assignment can also be deferred to a later time, in which case the volume will not be available to applications until a drive letter or mount point is assigned. In this figure, the drive letter M: has been specified by expanding the drop-down list and specifying that letter from the options displayed. Clicking the Next button displays the Format Volume panel shown in Figure 8.11. The administrator uses this panel to specify: ■■
Whether to format the volume with a file system (by checking or not checking the Format this volume check box).
■■
The file system (if any) that will be initialized on the volume.
Figure 8.10 Specifying drive letter for new volume.
■■
The user-readable label for the volume.
■■
The allocation unit size for the file system. In general, larger allocation units are more efficient in file systems that hold relatively small numbers of relatively large files; smaller allocation units enable more files to be managed by a file system. If there is doubt about the value to specify for this parameter, it is usually advisable to leave it at its default value. (A rough way to weigh this trade-off is sketched after this list.)
■■
Whether to perform a quick or full format. A full format verifies the readability of all of the volume’s blocks by writing patterns and rereading them. A quick format only writes the necessary metadata structures to create an empty file system.
■■
Whether to enable file and folder compression for the file system on the volume. Using file and folder compression saves space on a volume, at the expense of processing time when a file or folder is opened or data is written.
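One way to weigh the allocation unit size trade-off mentioned in the list above is to estimate slack space: on average, roughly half an allocation unit is wasted at the end of each file. The calculation below is a hypothetical back-of-the-envelope sketch, not a rule enforced by the wizard:

def approx_slack_mb(file_count, allocation_unit_kb):
    """Rough wasted space: about half an allocation unit per file, on average."""
    return file_count * (allocation_unit_kb / 2) / 1024

# Hypothetical file population of 100,000 files
print(f"{approx_slack_mb(100_000, 4):.0f} MB of slack with 4 KB allocation units")    # ~195 MB
print(f"{approx_slack_mb(100_000, 64):.0f} MB of slack with 64 KB allocation units")  # ~3125 MB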
These parameters complete the specification of the volume to be created, and the Create Volume wizard displays a summary of all this information. Fig-
Figure 8.11 Choosing file system and formatting options.
ure 8.12 shows this panel for the mirrored volume in this example. At this point, all specifications for the new volume have been validated, both semantically (i.e., there are no contradictory or invalid specifications) and in terms of the current state of the disks in the disk group (i.e., the specified volume can be created using available capacity on the specified disks). No action has been taken on the disks, however. If the system in this example were to crash at this moment, the Volume Manager console would show the disks in the state shown in Figure 8.6 after a restart. The administrator creates the volume by clicking the Finish button on the summary panel. This causes the Volume Manager to update the MirrorGroup’s on-disk metadata to indicate the presence and space configuration of the new volume, and to begin synchronizing and formatting the volume. Figure 8.13 shows the two halves of the General view of the Volume Manager’s console shortly after the Create Volume wizard has executed. In addition to reporting the new volume’s status as Healthy (meaning all disks are functioning), the left part of this display (8.13a) shows the new volume to be Resynching (the contents of corresponding blocks in both plexes
Figure 8.12 Create Volume wizard summary and confirmation panel.
being made identical by the Volume Manager) and Formatting (volume block readability being verified by the Volume Manager as it formats). The right part of the display (8.13b) shows the volume’s usable capacity (1.95 gigabytes) and the unallocated space (Free…) on each of the volume’s disks. A progress indicator (Progr…) shows how far through the resynchronization and formatting process the Volume Manager has gone. If quick formatting is specified (refer to Figure 8.11), formatting will complete in a few seconds; resynchronization will continue until all volume blocks have been read and written. A mirrored volume can be used by applications as soon as it is formatted, but will not be failure-tolerant until resynchronization is complete, at which point all data blocks are guaranteed to have identical contents in all plexes. A good administrative practice, therefore, is to avoid storing data that is not easily replaceable on a mirrored volume until it has been completely resynchronized. Figure 8.14 illustrates a two-mirror volume for which resynchronization is complete. The volume illustrated in the Volume Manager console Disk view in Figure 8.14 was created after deleting the volume from the foregoing example, using an identical procedure and the same disks, except that the volume’s capacity
Figure 8.13 Newly created mirrored volume undergoing formatting.
was specified as 1,000 megabytes. This leaves unallocated space on the disks, allowing the mirrored volume to be expanded, as the next example illustrates.
Three-Mirror Volumes and Splitting One important use for mirrored volumes with three or more mirrors is to split one mirror from the main volume, thereby effectively “freezing” an image of the data on the volume at the instant of splitting. The frozen image can be mounted as a separate volume and used for other purposes, such as backup, data analysis, or testing, while the still-failure-tolerant volume continues to be read and written by applications.
Figure 8.14 A 1-gigabyte mirrored volume after formatting and resynchronization.
While creating frozen images of application data is very useful, for example, in reducing “backup windows” (the amount of time during which data is unavailable to applications), creating them requires coordination between system administrators and application managers. For a frozen image to be useful, it must represent application data at an instant at which: ■■
There is no application I/O activity in progress.
■■
There is no data in any cache waiting to be written.
In general, achieving this state requires shutting down the application, or at least making it quiescent, so that the disk image of its data is consistent from the application’s standpoint (e.g., there are no debit records without matching credits reflected on the disk image of the application’s data). Once an application is quiescent, an administrator can split one mirror from each of the application’s volumes and mount them as separate volumes for other use. As soon as the mirrors have been split (“broken” in Windows 2000
terminology) from their volumes, applications can be reactivated and can resume use of the original volumes. Splitting a mirror from a mirrored volume is particularly attractive when the mirrored volume has three or more mirrors. With three-mirror volumes, the main volume used for application processing remains failure-tolerant when the third mirror is split. The example in this section demonstrates the two Volume Manager capabilities that implement mirror splitting: ■■
Adding a mirror to a mirrored volume.
■■
Splitting a mirror from a three-mirror volume.
Part I: Adding a Mirror For this demonstration, a third mirror is added to the volume shown in Figure 8.14. Once the mirror has been added and resynchronization is complete, the third mirror is split from the volume and mounted as a separate volume, independently accessible by applications and backup. The first step is to invoke the Add Mirror… command (see Figure 8.15) to add a mirror to a volume. The Add Mirror… command launches the Add Mirror wizard, which begins by displaying the usual introductory panel (not shown). The wizard’s first action panel, the Select Add Mirror Method
Figure 8.15 Invoking the Add Mirror command to add a mirror to a mirrored volume.
Figure 8.16 Selecting an add mirror method.
panel, shown in Figure 8.16, enables the administrator to specify either Express or Custom Mode of operation. In Express Mode, the wizard chooses the disk on which to allocate the new mirror, whereas Custom Mode enables the administrator to specify the disk on which to allocate the new mirror. When the administrator chooses Custom Mode, a Verify Disks panel is displayed; similar to the Create Volume wizard’s Verify Disks panel. Figure 8.17 shows this panel, along with the Modify Disk Selection dialog, which displays when the administrator elects to modify the Volume Manager’s allocation proposal by clicking the Modify button. Figure 8.17 indicates that the Volume Manager has proposed Disk 5 to allocate space for the new mirror. Since MirrorGroup contains only three disks, and Disks 3 and 4 already have mirrors of 1GBMirrorVolume allocated on them, Disk 5 is the only possible choice for this example. Disk 3 and Disk 4 cannot be used because that would allow a single disk failure to obliterate two of the volume’s mirrors. They would therefore not be displayed as options in the Modify Disk Selection dialog’s drop-down list box. Once the administrator has specified the disk on which to add the new mirror, the Add Mirror wizard specification input phase is finished and actual addition of the mirror begins. Figure 8.18 shows the Volume Manager console disk
Figure 8.17 Specifying which disk to add to a mirrored volume.
view shortly after the wizard’s specification input phase is complete. Disks 3, 4, and 5 are now part of the mirrored volume (still labeled 1GBMirrorVolume). The volume status is shown as Resynching because the Volume Manager is copying the contents of blocks from the volume to corresponding blocks on the new mirror. (The Appendix contains a complete list of volume statuses reported through the Volume Manager console and their meanings.) In Figure 8.18, resynchronization is 9 percent complete. While it is resynchronizing, a mirrored volume is available for application and administrative use, even though it is not fully failure-tolerant. For example, Fig-
Figure 8.18 Disk view of three-mirror volume resynchronizing.
ure 8.19 shows a Windows Volume Properties page dialog, which is used in this example to change the volume’s label to 1GB3WayMirror. The dialog is displayed by invoking the Properties… disk object command. It is also possible to read and write application data during mirror resynchronization. In this example, the original volume has two mirrors, so it can tolerate a single disk failure while resynchronization is occurring. If Disk 3 or 4 were to fail, the copying of data from the remaining disk to the new mirror on Disk 5 would continue until complete, at which time the volume would again be tolerant of single disk failures. If Disk 5 were to fail during resynchronization, the volume would simply revert to being a two-mirror volume. Figure 8.20 shows the detail view of the two-mirror volume (renamed as 1GB3WayMirror) after the third mirror has been added. The Healthy, Resynching status indicated for the volume in this figure indicates that resynchronization of the new mirror was still underway when the view was captured. As Figure 8.20 also shows, in the course of adding the third mirror, the Volume Manager created an additional plex structure, called Plex (Volume1-03). Creation of the plex is completely transparent to the administrator manipulating the panels of the Add Mirror wizard. In general, administrators can either
Figure 8.19 Changing a volume’s label during resynchronization.
Figure 8.20 Detail view of three-mirror volume.
treat the plex architectural structure as informational or ignore it entirely. All operations on plexes are performed internally by the Volume Manager.
Part II: Splitting a Mirror from a Mirrored Volume The second part of this example illustrates the splitting of a mirror from the three-mirrored volume just created. In general, mirrors should be split only from Healthy volumes for which resynchronization is complete. The administrator is reminded of this when first invoking the Break Mirror… command, as shown in Figure 8.21. A message box appears to alert the administrator that breaking a mirror may cause data to lose its fault-tolerant status. Specifically, in this example: ■■
If either Disk 3 or 4 were split, the volume would no longer be failure-tolerant (at least until resynchronization of Disk 5 completed).
■■
If Disk 5 were split, it would not be usable because its contents would not have been completely synchronized with those of Disk 3 and Disk 4.
Break Mirror… differs from Remove Mirror…, which permanently deallocates the selected mirror, making data on it unrecoverable. In practice, invocation of this command would be preceded by an application shutdown or quiescing procedure. Such a procedure is completely independent of the Volume Manager, which has no awareness of how applications maintain consistency of the data they process. Without some external assistance (e.g., from a knowledgeable system administrator or from a programmed script), the Vol-
Figure 8.21 Using the break mirror command.
ume Manager cannot determine whether the contents of a volume are internally consistent from an application standpoint. When the Break Mirror… command is invoked, the administrator must specify which mirror to split from the mirrored volume in the resultant dialog (see Figure 8.22). In this case, the Volume Manager has proposed to split the
Figure 8.22 Dialog for specifying which mirror to split.
plex named Volume 1-03 (located on Disk 5) from the volume. The administrator may accept the Volume Manager’s proposal or specify a different mirror to be split by specifying a different plex in the dialog list. The console’s volume details view shown in the background of Figure 8.22, can help the administrator associate a plex with the disk on which it is allocated. This can be useful if, for example, an installation has a policy of rotating the disks split from a mirrored volume for backup, or if a disk suspected to be failing is removed from the mirror for data backup and eventual replacement. After the mirror to be split has been specified, and the administrator clicks OK, the split is made. The Volume Manager automatically chooses an unused Windows drive letter and mounts the split mirror as a separate volume, identified with a label identical to that of the mirrored volume from which it was split. Figure 8.23 shows the Volume Manager console’s General volume view after the mirror in this example has been split. The split mirror’s layout is listed as Dynamic Simple Volume and it has been assigned drive letter K:. Its label, 1GB3WayMirror, is the same as that for the remaining two-mirror volume addressed using drive letter M:. These labels might suggest to a user unaware of administrative events taking place in the system that either or both volumes are redundant, when in fact neither is. This scenario emphasizes the importance of prompt and rigorously implemented system administrative procedures (specifically, relabeling volumes if the labels are used to indicate volume type) so that users are not misled.
Figure 8.23 General view after splitting mirror.
When a mirror is split from a mirrored volume, it automatically becomes a separate volume that is immediately usable by Windows applications. No relationship exists between data on the split mirror and data on the volume from which it was split. Either the original volume or the split mirror can be used by applications, and either can be used for other purposes. The update of one has no effect on the data on the other, which remains as it was at the instant of splitting. Usually, it makes sense to continue to use the original volume with online applications, because it is still failure-tolerant and continues to be available through the same drive letter (M: in this example). The split mirror is typically used for backup, data analysis, or whatever purpose was served by splitting the mirror. Figure 8.24 illustrates two views of the Windows 2000 Explorer main window. A new disk, K:, is represented in the left panel. The data stored on 1GB3WayMirror (drive letter M:) before and after the splitting of the third mirror is shown in the background. The split mirror (drive letter K:) contains identical data, as the foreground window capture suggests.
Figure 8.24 Explorer views of mirrored volume and split mirror.
Figure 8.25 Using a split mirror for file copy.
be observed from the Explorer bar, the original volume (M:) is still available to applications as well. Such a copy might be a backup in its own right, or it might be an effort to remove data from the split mirror so that the original volume can be made redundant again as quickly as possible. After a split mirror has served its backup, data analysis, or other purpose, it can be deleted, and the storage capacity it occupied can be added back to the mirrored volume by using the Add Mirror… command on the context-sensitive menu illustrated in Figure 8.15 and executing the resulting Add Mirror wizard. As with any added mirror, the block contents of the newly added mirror must be resynchronized with those of the main volume. Mirror resynchronization entails copying the contents of every volume block from the main volume to the newly added mirror (plex). Because it is an I/O-intensive process, administrators should take care to add split mirrors back to their volumes during periods of low application and other I/O activity. Figure 8.26 illustrates the deletion of the volume represented by the split mirror in this example, accomplished by invoking the Delete Volume… command. Because no parameters need be specified for volume deletion, there is no wizard associated with this command, only a confirming dialog (not shown). To complete the example, Figure 8.27 illustrates the Add Mirror… command issued to the two-mirror volume addressed as drive letter M:. Invoking this command launches the Add Mirror wizard shown in Figure 8.16. The context-sensitive menu in Figure 8.27 is itself of some interest. The two-mirror volume is in a state in which any Volume Manager command that oper-
Figure 8.26 Deleting split mirror after file copy.
ates on volumes except Reactivate Volume can legitimately be executed on it: ■■
The volume can be explored (via the Explore… command) because it has a file system on it and is mounted.
■■
The volume can be extended (Extend Volume…), or a mirror can be added to it (Add Mirror…) because the disk group contains sufficient unallocated storage capacity in appropriate locations.
■■
A mirror can be removed (Remove Mirror…) or split (Break Mirror…) because the volume still consists of two mirrors.
Figure 8.27 Adding disk back to original volume.
Figure 8.28 Resynchronizing third mirror for reuse.
■■
The drive letter or path to the volume can be changed (Change Drive Letter and Path…), or the volume can be reformatted (Format…), obliterating all data on it.
■■
The entire volume can be deleted, and the storage capacity it occupies deallocated or put to other use (Delete Volume…).
■■
The properties of the volume can be examined and modified, as shown earlier in Figure 8.19.
Finally, the volume’s plex structure, including the newly added third mirror (plex), is shown in the background of Figure 8.28; the progress of resynchronization (Progr…) is visible in the foreground.
CHAPTER 9

More Volume Management
Extending Volume Capacity A major advantage of Windows volume management over previously available technologies is its ability to extend the capacity of a volume while it is online by adding rows to it. More specifically: ■■
A simple or spanned volume is extended through the addition of a subdisk to the end of its volume block address space.
■■
A striped or RAID volume is extended by the addition of equal-size subdisks to each of its columns.
■■
A mirrored volume is extended by the addition of equal-size subdisks to each of the columns in each of its plexes.
In the first example in this chapter, a Logical Disk Manager dynamic spanned volume is extended by the addition of a subdisk located on another disk.
Volume Extension Rules

Figure 9.1 illustrates some options for extending the capacity of a three-mirror volume. To extend the volume in this figure, subdisks of equal size must be appended to each of the volume’s existing subdisks. The new subdisk (SubDisk D) appended to SubDisk A may be located on any disk in the disk group, except Disks B' and C', which contain other columns of the volume. If SubDisk D were allocated on Disk B' or C', failure of one of those disks
Figure 9.1 Extending a spanned volume (Logical Disk Manager).
would disable two of the volume’s mirrors, making it nonfailure-tolerant. The Windows 2000 Volume Manager does not permit this to happen. Similar restrictions apply to SubDisks E and F. A subdisk appended to a column of a mirrored volume may occupy space on the disk that contains the subdisk it is extending, but it may not be located on a disk that contains part of another of the volume’s columns. When requested to extend a mirrored volume, the Volume Manager first determines whether capacity exists within the disk group to allocate subdisks of the requested size without violating failure tolerance or performance constraints. Thus, in the example of Figure 9.4, a subdisk appended to the plex containing SubDisk A may be allocated on Disk A' without changing the volume’s failure-tolerance characteristics. It may not be allocated on Disk B' or Disk C', however, because a failure of either of those disks would result in the failure of two plexes. Thus, the general rules followed by the Volume Manager in extending the capacity of a volume are: ■■
If possible, extend subdisks by adding contiguously addressed disk blocks to them. When this is done, the existing subdisks become larger; no new subdisks are created.
Figure 9.2 Specifying the disk on which to extend a spanned volume.
■■
If contiguous extension of a subdisk is not possible, create another subdisk of the required capacity on the same disk and append it logically to the existing subdisk.
■■
If sufficient space for extension cannot be allocated on the disk containing a given subdisk of the volume to be extended, create a subdisk of the required capacity on another disk and append it logically to the existing subdisk.
■■
If a subdisk of a mirrored volume is extended by logically appending a newly created subdisk on another disk, the new subdisk must not be located on a disk containing subdisks from other plexes of the same volume.
■■
If a subdisk of a striped or RAID volume is extended by logically appending a newly created subdisk on another disk, the new subdisk must not be located on a disk containing a subdisk from any other column of the RAID volume.
Figure 9.4 illustrates noncontiguous extension of a mirrored volume. SubDisks D, E, and F must be created only because the disk blocks immediately
Figure 9.3 Extended spanned volume.
adjacent to SubDisks A, B, and C are not available for allocation. SubDisk D could be allocated on Disk A', or on some other disk in the disk group (not illustrated), but may not be allocated on either Disk B' or Disk C', because that would create a situation in which a single disk failure would result in the failure of two plexes. A volume may be extended any number of times, although, for manageability and performance predictability, volume extensions that result in complex subdisk allocation patterns are not recommended.
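The allocation constraint these rules enforce can be captured in a few lines. The following sketch is illustrative only (the plex and disk names are hypothetical labels matching Figure 9.4); it simply tests whether a disk proposed for extending one plex of a mirrored volume would place two of the volume’s plexes on the same disk:

def valid_extension_disk(candidate_disk, extending_plex, plex_to_disks):
    """A subdisk extending one plex may reuse that plex's own disk (or any other
    disk with space), but must not land on a disk that holds any other plex of
    the same mirrored volume."""
    other_plex_disks = {disk
                        for plex, disks in plex_to_disks.items()
                        if plex != extending_plex
                        for disk in disks}
    return candidate_disk not in other_plex_disks

# Hypothetical three-mirror volume with one plex per disk, as in Figure 9.4
plexes = {"Plex 0": {"Disk A'"}, "Plex 1": {"Disk B'"}, "Plex 2": {"Disk C'"}}
print(valid_extension_disk("Disk A'", "Plex 0", plexes))  # True: extending a plex on its own disk is allowed
print(valid_extension_disk("Disk B'", "Plex 0", plexes))  # False: would put two plexes on Disk B'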
Extension of a Mirrored Volume

This example illustrates the extension of a 1-gigabyte three-mirror volume’s capacity to 2 gigabytes. The starting state of the system is illustrated in Figure 8.20 (except that regeneration of the mirrored volume contents is complete). Figure 9.5 illustrates invocation of the Extend Volume… command used to launch the Volume Manager’s Extend Volume wizard. The Extend Volume wizard begins with the usual introductory panel, followed by a panel in which the administrator specifies the Express or Custom mode of wizard operation. In Express mode, the Volume Manager
Figure 9.4 Extending a three-mirror volume.
chooses the disks on which the expansion will occur. In Custom mode, the Volume Manager makes a proposal for where to locate the volume extensions; space permitting, the administrator may modify this proposal. Assuming Custom mode is specified, the Specify Size and Disks to Extend Volume panel, shown in Figure 9.6, appears. This panel shows the
Figure 9.5 Using the Extend Volume command to increase the capacity of a three-mirror volume.
Figure 9.6 Size and disk specification for extending a volume.
important characteristics of the volume to be extended (capacity, name, and drive letter), and allows the administrator to specify the amount by which the volume is to be extended and the disks on which the extension space is to be allocated. As with other wizards, when the administrator clicks the Query Max Size button, the Volume Manager computes and displays the maximum size extension to the volume that can be allocated from available capacity in the disk group. The extension capacity of 1000.25 megabytes shown in this figure is a result of clicking this button. For this example the custom mode of the extension is specified, so that the Volume Manager’s proposal for allocating additional subdisks can be viewed. This view is shown in the Verify Disks panel, in Figure 9.7. According to the Volume Manager’s proposal, each of the three-mirror volume’s plexes will contain 2000 megabytes after extension. The Volume Manager is proposing contiguous extension, meaning that each disk containing a subdisk of the original volume has sufficient unallocated capacity adjacent to the volume’s original subdisks to simply extend the subdisk rather than allocate a new one. As Figure 9.7 illustrates, the Volume Manager has proposed that Disks 3, 4, and 5 be used to extend the three subdisks already located on Disks 3, 4, and
Figure 9.7 Extension proposal for three-mirror volume.
As Figure 9.7 illustrates, the Volume Manager has proposed that Disks 3, 4, and 5 be used to extend the three subdisks already located on those disks. Because existing subdisks are being extended, no additional subdisks are added to the volume's plexes. If the administrator accepts this proposal, the resulting volume will consist of its three original subdisks, each with its capacity increased by the amount of the extension. When the wizard is executing in Custom mode, the administrator may modify the Volume Manager's proposal by clicking the Modify… button on the Verify Disks panel. Clicking this button displays the now-familiar Modify Disk Selection dialog, also shown in Figure 9.7. Each entry in the Disk column of this dialog is a drop-down list in which valid alternatives for allocating the corresponding subdisk are displayed. In the case of this figure, the disk group contains only three disks, all of which contribute capacity to the volume, so there are no allocation alternatives; all three drop-down lists are therefore empty. If the wizard were operating in Express mode, or if the Volume Manager's allocation proposal had either been accepted or modified to the administrator's satisfaction, the Extend Volume wizard's specification input phase would be complete, and actual volume extension would begin. Figure 9.8 shows the General view of the volume in this example while extension is occurring. The progress bar (the column labeled Progr… in the console's information window) tracks the volume extension process. Extension of nonfailure-tolerant (simple, spanned, and striped) volumes is relatively rapid; extension of mirrored and RAID volumes is more time-consuming, because the new capacity must be resynchronized. For mirrored volumes, the contents of corresponding blocks in each plex must be made identical; for RAID volumes, the parity for each stripe must be made consistent with the data block contents in the stripe.
Figure 9.8 Three-mirror volume during extension.
Figure 9.9 shows another view of the volume for this example during extension—the Disk view; here, resynchronization has advanced from 6 percent complete to 10 percent. It is also clear that the volume labeled 1GB3WayMirror, which has a total of 2000.25 megabytes of capacity, occupies essentially the entire capacity of the three disks on which it resides.
Figure 9.9 Disk view of three-mirror volume undergoing extension.
The status of the dynamic mirrored volume shown in Figure 9.9 is:
Healthy. All six disks are functional.
Resynching. The Volume Manager is in the process of making all corresponding blocks on the three volume copies identical. The "10%" denotes what percentage of the resynchronization is complete.
In the Windows NT Version 4 environment, a volume cannot be used while it is being extended. In Windows 2000, however, volumes can be used while they are undergoing extension, as shown in Figure 9.10, where the extended volume being explored is still resynchronizing. From the volume's Properties page it is clear that the extended capacity is already available to applications. As with newly created failure-tolerant volumes, however, even though the extended volume is usable while resynchronization is still occurring, it is not failure-tolerant until resynchronization is complete. It is therefore good system management practice to avoid storing data that is not easily reproduced on a mirrored volume while it is still resynchronizing due to an extension. In the Windows NT Version 4 operating system environment, a dynamic volume (the only kind of volume that can be extended) must be unmounted after it is extended and remounted in order for the NTFS file system to recognize the additional capacity and begin to use it. The Windows NT Version 4 Volume Manager will automatically unmount and remount a volume after extension as long as no applications are accessing it.
Figure 9.10 Using a volume while it is being extended.
If the volume is in use, however, an administrator must stop all application access to the volume and unmount and remount it manually. The Volume Manager for Windows 2000 forces unmounting and remounting after a volume is extended, momentarily blocking application execution while it makes newly added volume capacity available to the file system. Figure 9.11 shows the Volume Manager's General view of the volume structure that results from the extension of the three-mirror volume in this example. Each of the volume's three plexes has been extended by the addition of 1000 megabytes to its existing subdisk, to bring its total capacity, as well as that of the volume, to 1.95 gigabytes.
Features Unique to Windows 2000 Volumes
As already noted, Windows 2000 volume management adds significant capabilities to those available in the Windows NT Version 4 environment. The most important of these is the more dynamic behavior of Windows 2000 volumes. An example of this is the Windows 2000 Volume Manager's transparent blocking of I/O after volume extension so that the file system can be unmounted and remounted to make use of the extended capacity. The ability to use a volume while it is being extended, as described in the preceding section, is another.
Figure 9.11 Internal structure of extended three-mirror volume.
Figure 9.12 Copying data to a three-mirror volume during resynchronization.
Figure 9.12 expands on the preceding example slightly by using Windows Explorer to copy some data from an existing volume onto the newly extended volume before resynchronization is complete. The progress of the copy operation (in the Windows 2000 Explorer window) can be seen, along with the status of the volume using the Volume Manager console's volume detail view. The progress value (here at 32 percent) indicates that resynchronization of the newly extended volume is not yet complete. Again, that means data on the volume is not failure-tolerant, but that the volume can be used as soon as a file system has been formatted on it.
Mount Points
One behavioral difference between the Windows NT Version 4 and Windows 2000 Volume Managers can be seen in the Assign Drive Letter or Path panel of the Windows 2000 Volume Manager Create Volume wizard, shown in Figure 9.13. In addition to the options of assigning a drive letter to the new volume, or deferring the assignment entirely, the Windows 2000 Volume Manager (and the Logical Disk Manager) offers the option of establishing a mount point that can be used by applications to access the newly created volume. Windows 2000 mount points perform a function similar to the UNIX construct of the same name: They insert a volume at a point in a larger directory hierarchy. This allows path traversal to data to emanate from a single root.
Figure 9.13 Assign Drive Letter or Path panel showing mount point browser.
Mount points serve another function in Windows 2000, as well. The Windows storage device-naming scheme—using a single letter to denote each partition or volume—is a significant constraint for large servers with large numbers of storage devices attached to them. By making volumes available to applications at mount points, drive letters are conserved, thereby supporting larger storage configurations. In Figure 9.13, the administrator has specified that applications will address the volume being created using a mount point rather than a drive letter, and has clicked the Browse button, displaying the Browse for Drive Path dialog. In this example, the administrator had previously created a folder named Mount Point (either using the Windows 2000 Explorer New Folder command or creating it directly from within this dialog by clicking the New Folder button). The effect of using a mount point to expose a volume to applications is that the root directory of the volume is treated as if it were the folder specified as the mount point. Figure 9.14 shows the properties page for the folder Mount Point after volume creation is complete. An administrator or user can discern a mount point from its icon in the Explorer bar: The mount point occupies the location that a folder would occupy, but instead of a folder icon, the mount point displays as a Windows disk icon.
Figure 9.14 Explorer view of volume located at a mount point.
The Properties page for this mount point indicates that the "target" of the mount point is a volume whose label is 1GB Simple.
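For administrators who prefer scripting, a mount point can also be established with the standard Windows mountvol utility rather than through the console. The short sketch below drives mountvol from Python purely for illustration; the folder path and volume GUID are hypothetical placeholders, and a real volume name would be copied from the listing that mountvol prints when run with no arguments.

# Illustrative only: create (and later remove) a mount point with mountvol.
import subprocess

mount_folder = r"C:\Mount Point"   # must be an empty folder on an NTFS volume
volume_name = "\\\\?\\Volume{01234567-89ab-cdef-0123-456789abcdef}\\"  # hypothetical GUID

subprocess.run(["mountvol"])                                          # list volume names and mount points
subprocess.run(["mountvol", mount_folder, volume_name], check=True)   # graft the volume at the folder
# subprocess.run(["mountvol", mount_folder, "/D"], check=True)        # later: remove the mount point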
FAT32 File System
Another difference between volume management in Windows NT Version 4 and Windows 2000 is that Windows 2000 supports the FAT32 file system as well as the FAT and NTFS file systems, as shown in the Format Volume panel in Figure 9.15. FAT32 is a file system originally developed to enable desktop Windows systems to use large disks more efficiently. Support for it in the Windows 2000 environment enables disks initialized on desktop systems to be accessed by Windows 2000 servers. While FAT32 file systems enable compatibility with versions of Windows operating systems that do not use "NT technology," they have certain functional limitations compared to their NTFS counterparts:
■■ FAT32 file systems cannot be extended, so volumes on which they reside cannot be extended meaningfully.
■■ FAT32 file systems cannot be mounted at mount points.
Figure 9.15 File systems supported in Windows 2000.
Moreover, the FAT32 file system is designed for the more constrained environment of desktop computers, and lacks many of the advanced performance features of NTFS. Use of FAT32 is therefore usually best limited to situations in which media compatibility with non-NT technology file systems is required. Figure 9.16 shows the Disk view of a volume formatted with the FAT32 file system, as well as the volume’s Properties page.
Figure 9.16 FAT32 file system on a mirrored volume.
Mirrored-Striped Volumes
When an application requires a storage device that is both very reliable and larger than the largest available disk, mirroring and striping techniques can be combined to provide such a device. In such a combination:
■■ Plexes that stripe data across multiple disks enable creation of volumes many times larger than the largest available disk.
■■ Mirroring two or more such plexes provides very high data reliability by protecting against data loss due to disk and I/O path failure.
There are several ways to combine mirroring and striping techniques to create high-capacity failure-tolerant volumes:
■■ Volume block addresses can be striped across several sets of disks, forming plexes of identical capacity whose contents can be mirrored. These are called mirrored-striped volumes in this book.
■■ Mirrored plexes of two or more disks can be created, and data striped across them. These are called striped-mirrored volumes in this book.
■■ Mirrored volumes can be created within a RAID subsystem, and data can be striped across the resulting virtual disks by a volume manager.
■■ Striped volumes can be created within a RAID subsystem, and data can be mirrored across two or more of them.
From a host-based volume management standpoint, the latter two configurations are precisely the same as striping and mirroring data across disks, respectively, hence they are not discussed further. Of the remaining two (purely host-based) alternatives, striping volume block addresses across two or more mirrored plexes is generally preferred because the amount of data put at risk by a single disk failure is lower than with the other alternative. For example, in a population of eight disks organized as four mirrored plexes with data striped across them, failure of a single disk leaves only the data in the volume blocks represented on the failed disk at risk. This contrasts with the same eight disks organized as two four-disk striped plexes whose contents are mirrored to each other. A failure of a single disk in this configuration makes the disk’s entire plex unusable; in effect, four disks are incapacitated by the failure of one. Not only is more data put at risk by a single disk failure; recovery is necessarily more time-consuming, because resynchronization requires that all data from the plex be copied, rather than just data from the failed disk, as in the striped-mirrored scenario. The VERITAS Volume Manager for Windows 2000 supports both mirroring of data across two or more striped plexes and striping data across as many as 32 mirrored plexes. Because of the lower risk of data loss and shorter recovery time, the latter configuration is generally preferred.
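The difference in exposure between the two organizations can be made concrete with a little arithmetic. The sketch below works through the eight-disk example from the text, assuming a hypothetical disk size of 100 gigabytes; only the relative magnitudes matter.

# Illustrative arithmetic for the eight-disk comparison in the text.
DISK_GB = 100   # hypothetical per-disk capacity

# Striped-mirrored: four 2-disk mirrored plexes with data striped across them.
# A single disk failure exposes only the volume blocks mapped to that mirrored
# pair, and recovery copies one disk's worth of data from its partner.
striped_mirrored_at_risk = DISK_GB
striped_mirrored_recovery_copy = DISK_GB

# Mirrored-striped: two 4-disk striped plexes mirrored to each other.
# A single disk failure disables its entire 4-disk plex, so all of the volume's
# data loses redundancy, and recovery copies the whole surviving plex.
mirrored_striped_at_risk = 4 * DISK_GB
mirrored_striped_recovery_copy = 4 * DISK_GB

print(striped_mirrored_at_risk, striped_mirrored_recovery_copy)   # 100 100
print(mirrored_striped_at_risk, mirrored_striped_recovery_copy)   # 400 400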
The Volume Manager and Mirrored-Striped Volumes
Architecturally, a Volume Manager mirrored-striped volume is a mirrored volume whose individual mirrors are striped plexes. The example in this section uses the Volume Manager for Windows 2000 to create a mirrored-striped volume with 4000 megabytes of capacity. The starting point for this example is the system state shown in Figure 9.17, where all disks have been upgraded to dynamic format, but no volumes have been allocated. Also visible in the figure is the Create Volume… command that invokes the Volume Manager's Create Volume wizard. In the Create Volume wizard's Select Volume Type panel, shown in Figure 9.18, the parameters for this example are given. The example specifies creation of a two-mirror volume of striped plexes with a total usable capacity of 3000 megabytes. The volume is specified with three columns and a stripe unit size of 64 kilobytes. This means that each of the volume's two mirrors (plexes) will have three columns (subdisks). For performance reasons, the Volume Manager allocates all three columns of each striped plex on different disks. For failure tolerance, the volume's two plexes are not permitted to have disks in common. This volume configuration therefore requires subdisks to be allocated on a total of six different disks.
Figure 9.17 Initial disk state and create volume command.
Figure 9.18 Size and layout specification for striped-mirrored volume.
In the next stage, shown in Figure 9.19, the Create Volume wizard displays its storage allocation proposal for the mirrored-striped volume. In this figure, the Verify Disks panel is shown twice, once scrolled up and once scrolled down. The panels illustrate the two plexes proposed by the Volume Manager. None of the subdisks (Columns) of either plex have any disks in common, since each plex is a backup for the other in the event of a disk failure. Similarly, the subdisks within each plex are on separate disks, to maximize I/O performance by spreading data (and therefore accesses to it) across three disks. This effect would not be achieved with two or more subdisks allocated on the same disk. The system shown in Figure 9.17 has seven dynamic disks. Thus, at this point, an administrator can click the Modify button (shown in Figure 9.19) to display the dialog shown in Figure 9.20. Here, the drop-down box shows that Disk 10 is an eligible alternative to the Volume Manager's proposal of Disk 3.
Figure 9.19 Volume Manager’s allocation proposal for mirrored-striped volume.
Figure 9.20 Modify Disk Selection dialog with no options.
As the only dynamic disk in the group called LargeDynamicGroup (Figure 9.17) that is not already part of the Volume Manager's proposal, Disk 10 is the only alternative. In fact, Disk 10 would be listed as the alternative for any of the disks proposed by the Volume Manager in Figure 9.20. When the administrator accepts the Volume Manager's proposal, either as given or after modifying it, he or she clicks OK to continue the Create Volume wizard process. The next step is drive letter assignment, using the dialogs shown in Figure 9.21. A mount point for the new volume (represented by the empty NTFS folder MountPt) has been specified by clicking the Browse button and using the Browse for Drive Path dialog to locate the folder. The remaining steps for creating a mirrored-striped volume are similar to those for previous examples, and so are not shown. Figure 9.22 shows the structure of the resulting volume. Each of the volume's two plexes has three columns, and each column of each plex is allocated on a different disk. A failure-tolerant volume can be quick formatted and used as soon as its file system structures have been written (before resynchronization of volume blocks is complete). No operating system reboot is required. Figure 9.23 shows a Windows 2000 Explorer drag-and-drop copy command with the newly created volume as the target.
Figure 9.21 Specifying a mount point for a new volume.
Figure 9.22 Structure of a mirrored-striped volume.
At the same time, as can be seen in the background of Figure 9.24, the newly created volume is still resynchronizing (indicated by the "25%" in the Progress column). The foreground shows the copy operation in progress. The source folder for the copy is highlighted; the target is the folder MountPt.
Figure 9.23 Drag-and-drop copy to mirrored-striped volume.
Figure 9.24 Using a mirrored-striped volume during resynchronization.
Data copied to the volume is not failure-tolerant until resynchronization is complete. However, as this example shows, it is possible to populate the volume with data while the initial resynchronization is still in progress.
Dynamic Expansion of Mirrored Volumes
When a failure-tolerant volume is extended in the Windows 2000 environment, the Volume Manager forces the unmount and remount required for an NTFS file system to recognize and use the additional capacity.1 (In Windows 2000, the volume can be used immediately after the remount, even though resynchronization of the added capacity is not complete.) Figure 9.25 shows a Windows 2000 Explorer copy operation with a three-mirror volume (addressed by applications using drive letter J:) as the target. The volume has been extended by the addition of a subdisk to each of its plexes and is still resynchronizing, as evidenced by the progress indicator (shown at 43%). This figure also shows the subdisks added to each of the volume's plexes, as well as the total volume capacity (3.90 gigabytes). The Windows Explorer window and progress dialog in the background indicate that the data copy is still in progress.
1 In the Windows NT Version 4 environment, administrator action is required to stop all application access to an extended volume so that it can be unmounted and remounted.
Figure 9.25 Data being copied to a three-mirror volume during extension.
Splitting a Striped Mirror
The value of splitting a mirror from a mirrored volume has already been discussed. A split mirror becomes an image of operational data frozen at a point in time from which a backup can be made or other analysis or testing performed. With Volume Manager mirrored volumes consisting of three or more mirrors, splitting one mirror for these purposes leaves a failure-tolerant two-mirror volume in place; splitting a two-mirror volume results in two nonfailure-tolerant striped volumes. Figure 9.26 shows the structure of the striped-mirrored volume from a preceding example, as well as the Volume Manager's menu of volume operations with the Break Mirror… command selected. Execution of this command results in splitting one of the striped plexes from the volume and automatically mounting it as a nonfailure-tolerant striped volume. The new volume's contents, including its label, are identical to those of the original volume at the instant of splitting, but after splitting, there is no further relationship between the contents of the original volume and the newly created one. Invoking the Break Mirror… command displays the alert shown in the foreground of Figure 9.26.
Figure 9.26 Striped-mirrored volume structure and Break Mirror command.
Before the Volume Manager splits the mirror, the administrator must confirm the action because, as the message reads, breaking the mirror from its volume could make data on the volume non-fault-tolerant. Once the administrator has confirmed that the mirror should indeed be split, the Volume Manager displays the Break Mirror dialog shown in Figure 9.27. The administrator must specify which of the volume's striped plexes is to be split and mounted as a separate volume.
Figure 9.27 Specifying the mirror to be split from a mirrored volume.
Clicking the dialog's Details button displays the Mirror Details dialog, which contains a structural view of the disks on which the two plexes reside. This view can be useful when the Break Mirror… command is issued from a console view that does not show the volume structure. After clicking OK, the state of the volume after splitting can be seen in both the Windows Explorer and Volume Manager console views, as shown in Figure 9.28. The Volume Manager has just mounted the split mirror with drive letter I: (the navigation panel of the display has not yet been updated). Both the original and split mirror volumes are shown as striped volumes of identical capacity. The original volume can be made failure-tolerant again by adding a mirror to it. Split mirrors are usually disposed of after they have served their backup or other purpose; an administrator would do this by deleting volume I: using the Delete command on the volume's context-sensitive menu. Deleting volume I: creates unallocated space in LargeDynamicGroup. The Add Mirror wizard, described earlier, can then be used to add a striped plex to the volume. Splitting a striped mirror from a mirrored-striped volume at an application's quiescent point is a useful way to obtain a consistent backup of a set of data without taking it out of service for a long time. Care must be taken, however, that the volume is reported as having a Healthy status when the mirror is split, and that the mirror is added back to the volume at a time when resynchronization I/O will not result in unacceptable deterioration of application I/O performance.
Creating and Extending a RAID Volume
The Volume Manager also supports RAID volumes, as described starting on page 63.
Figure 9.28 Split mirror mounted as striped volume I:.
RAID technology was developed when disks were considerably more expensive than they are today. Its premise is that the capacity of a single disk, used to hold parity, can protect against data loss due to the failure of any other single disk in the set. While this premise is correct, using RAID technology forgoes two significant advantages of mirrored volumes:
■■ Splitting a copy of data for backup or other analysis is not possible with RAID volumes, because a RAID volume's underlying plexes contain only one complete copy of data.
■■ Unlike a mirrored volume, a RAID volume cannot be made to protect against multiple disk failures. Thus, the quality of protection offered by RAID volumes is limited compared to that of mirrored volumes, whose protective capabilities can be extended through the configuration of additional mirrors.
While RAID is less popular today than it was in the days of more costly physical storage capacity, it is still useful for protecting certain kinds of data against disk failure. A Volume Manager or Logical Disk Manager RAID volume consists of a single plex with parity distributed across its columns in a stripe-by-stripe rotating pattern. RAID volumes are referred to in the console and dialogs as RAID-5 or RAID volumes. The examples that follow illustrate the creation of RAID volumes, as well as Logical Disk Manager behavior when a disk fails. A Volume Manager RAID volume can be extended by the addition of rows to the volume's subdisks. The examples that follow illustrate this as well. As with all Windows 2000 volumes, the administrator specifies the parameters of RAID volumes using the Create Volume wizard, which is invoked via the Create Volume… command. The wizard starts by displaying the usual introductory panels (not shown here). Then, as Figure 9.29 shows, the system administrator supplies the specifications for the RAID volume in the Select Volume Type panel. Here, a four-column RAID volume with 4000 megabytes of usable capacity and a stripe unit size of 16 kilobytes has been specified. As with all volumes, the Total volume size specified by the administrator reflects the usable capacity of the volume, net of any check data such as RAID parity. A four-column RAID volume with a usable capacity of 4000 megabytes results in the allocation of four subdisks, each with 1333 megabytes of capacity. Parity check data is dispersed throughout the volume's subdisks, with the equivalent of one column of capacity reserved for it. The Stripe Unit Size can be adjusted, as for striped and mirrored-striped volumes. Larger stripe unit sizes tend to improve performance with I/O loads consisting predominantly of small, random I/O requests. Smaller stripe unit sizes tend to optimize performance for sequential I/O loads of large requests, as long as the stripe unit size is a significant fraction of a disk track (one-third or more).
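The relationships among usable capacity, column count, and subdisk size, and the stripe-by-stripe rotation of parity, can be sketched in a few lines. This is a conceptual illustration only; the rotation order shown is an assumption for illustration, not the Volume Manager's documented on-disk layout.

import math

def subdisk_size_mb(usable_mb, columns):
    # One column's worth of capacity holds parity, so usable space is spread
    # over (columns - 1) columns: 4000 MB over 3 data columns is about 1333 MB each.
    return math.ceil(usable_mb / (columns - 1))

def parity_column(stripe_index, columns):
    # Assumed rotation: parity moves one column to the left each stripe,
    # so no single disk absorbs all of the parity writes.
    return (columns - 1 - stripe_index) % columns

print(subdisk_size_mb(4000, 4))                   # 1334 (the text rounds this to 1333)
print([parity_column(s, 4) for s in range(6)])    # [3, 2, 1, 0, 3, 2]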
Figure 9.29 Creating a RAID volume.
The Volume Manager’s allocation proposal for this RAID volume is shown in Figure 9.30. To meet the failure-tolerance requirements of RAID, columns on four different disks are proposed. Again, RAID can protect a set of subdisks against data loss due to a failure of any one of them. Thus, if two subdisks were allocated on the same disk, failure of that disk would make the entire RAID volume inaccessible. Figure 9.30 also shows the Modify Disk Selection dialog that is displayed when the Modify button on the Verify Disks panel is clicked. In this dialog the administrator can alter the Volume Manager’s space allocation proposal. Each row of the Disk column contains a drop-down list of disks eligible to substitute for the disk proposed by the Volume Manager. In this example, the administrator accepts the Volume Manager’s allocation proposal. Figure 9.31 shows the console’s Disk view of the volume shortly after creation (before the file system has been completely initialized). In the figure, the status of the RAID volume is shown as: Healthy (all disks are functional), Regenerating, and Formatting. As with mirrored volumes, the contents of the subdisks that comprise a RAID volume must be synchronized before the volume is failure-tolerant. Formatting and resynchronization can be concurrent.
Figure 9.30 Modifying the Volume Manager proposal for RAID volume allocation.
Resynchronization of mirrored volumes is largely I/O-bound. With RAID volumes, however, not only does resynchronization require significant I/O (all blocks on all subdisks must be either read from or written to); there is computational overhead as well (see page 71 for a more complete description of updating data on a RAID volume).
Figure 9.31 RAID volume regenerating shortly after creation.
Figure 9.32 CPU utilization during RAID volume resynchronization.
Figure 9.32 shows a Windows Task Manager trace of CPU utilization during RAID volume regeneration. The example system is an 800-MHz Pentium; lower processor utilization can be expected with higher-speed processors. The point of the example is that some level of processor consumption is to be expected when writing data to host-based RAID volumes.
RAID Volumes and Disk Failure
The whole purpose of RAID technology is to protect against disk failure. When a disk that is part of a RAID volume fails, data on it can be regenerated by reading data and parity from the corresponding blocks on the volume's remaining subdisks and computing the exclusive OR function of them. The example in this section illustrates a Logical Disk Manager RAID volume's behavior when a disk fails.
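The exclusive OR relationship that makes this regeneration possible can be demonstrated in a few lines. The sketch below is conceptual rather than Logical Disk Manager code; it treats each disk block as a short byte string and rebuilds a missing block from the surviving blocks and the parity block.

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (the heart of RAID parity arithmetic)."""
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

# A three-data-column stripe plus its parity block (toy 4-byte blocks).
d0, d1, d2 = b"\x11\x22\x33\x44", b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04"
parity = xor_blocks([d0, d1, d2])

# Suppose the disk holding d1 fails: regenerate it from the survivors and parity.
regenerated = xor_blocks([d0, d2, parity])
assert regenerated == d1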
To begin, Figure 9.33 shows the Logical Disk Manager console's Disk view of a RAID volume; 1000-megabyte subdisks have been allocated from each of Disks 2, 3, 4, and 5. The available capacity of this volume is 3 gigabytes (the volume label 1GB RAID notwithstanding). Figure 9.34 shows the RAID volume in use as the target of a file copy initiated from Windows 2000 Explorer. The Explorer window and the progress dialog show the progress of the file copy. The Windows Task Manager display illustrates the processor utilization that results from writing data to a RAID volume. Copying data to a RAID volume is a performance "worst case": virtually every I/O request to the RAID volume is a write. For each write, not only must data be written, but updated parity must be computed (from data read from the volume's disks) and written. The point is that write-intensive applications that use RAID volumes for data storage result in significant CPU utilization. In this example, the RAID volume is still regenerating. This illustrates the point that, like mirrored volumes, RAID volumes can be used before they are completely synchronized. Data is not protected against disk failure, however, until synchronization is complete. It is prudent, therefore, to avoid writing data that are not easily replaceable to a RAID volume until the volume is completely synchronized.
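The parity work behind each small write can be sketched as the classic RAID read-modify-write sequence. The code below is a conceptual illustration, not the Volume Manager's implementation; the read_block and write_block callables and the toy in-memory store are hypothetical.

def small_write(read_block, write_block, column, stripe, new_data, parity_col):
    """Conceptual RAID small-write sequence: two reads and two writes per update."""
    old_data = read_block(column, stripe)            # 1. read old data
    old_parity = read_block(parity_col, stripe)      # 2. read old parity
    # 3. new parity = old parity XOR old data XOR new data
    new_parity = bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))
    write_block(column, stripe, new_data)            # 4. write new data
    write_block(parity_col, stripe, new_parity)      # 5. write new parity

# Toy in-memory "disks" to exercise the sequence (parity starts consistent: D0 ^ D1 = P).
store = {("D0", 0): b"\x00" * 4, ("D1", 0): b"\x0f" * 4, ("P", 0): b"\x0f" * 4}
small_write(lambda c, s: store[(c, s)],
            lambda c, s, b: store.__setitem__((c, s), b),
            "D0", 0, b"\xff" * 4, "P")
print(store[("P", 0)].hex())   # f0f0f0f0: old parity ^ old data ^ new data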
Figure 9.33 Healthy RAID volume.
Figure 9.34 CPU utilization while copying data to a RAID volume.
Figure 9.35 shows a RAID volume in which a disk (Disk 3) has failed. In this example, the nature of the failure is such that the disk fails to respond to the driver at all, so it has disappeared from the Disk view entirely. The state of the volume is shown as Failed Redundancy. In this state, however, the volume is still operational; any data can be read or written, but data is not protected against additional disk failures. For this reason, it is good system management practice to replace failed disks promptly, and to restore RAID and mirrored volumes to a redundant state. To accomplish this, the administrator uses the Repair Volume… command on the Logical Disk Manager's context-sensitive menu for volume objects, as illustrated in Figure 9.36. In the figure, the Logical Disk Manager is effectively requesting that the administrator specify a disk to replace failed Disk 3 in the RAID volume. Because Disk 3 appears as an eligible choice for replacement, the failed disk has obviously already been removed from the system and replaced with a working one. Any disk in the disk group that does not contain a component of the RAID volume is an eligible replacement. Usually, it makes the most sense to replace a failed disk in a volume with a disk occupying the same logical position in the system as the failed one. When the replacement disk has been specified, the Logical Disk Manager rewrites metadata structures to indicate the addition of the disk to the RAID volume. It also begins the process of regenerating correct contents for the newly configured disk by reading corresponding data and parity blocks from the volume's remaining disks, computing the exclusive OR of their contents, and writing the result to the newly configured disk (Disk 3).
Figure 9.35 RAID volume with failed disk.
Figure 9.36 Specifying a replacement disk for a RAID volume.
Figure 9.37 Regenerating RAID volume contents on replacement disk.
In Figure 9.37, the Logical Disk Manager console Disk view is shown at a point when regeneration is 32 percent complete.
Extending a RAID Volume (Volume Manager Only)
The Windows 2000 Volume Manager (but not the Logical Disk Manager) can extend the capacity of a RAID volume by adding a number of rows to each of the volume's subdisks. These rows may be appended to the volume's original subdisks; or, if physical storage contiguous to the existing subdisks is not available, additional subdisks can be created. The following example extends the capacity of a RAID volume by 2141.95 megabytes. Volume extension is done using the Volume Manager Extend Volume wizard, invoked from the context-sensitive menu shown earlier in Figure 9.5. The Extend Volume wizard starts by displaying a Specify Size and Disks to Extend Volume panel, illustrated in Figure 9.38, in which the administrator is requested to specify the amount of storage capacity to be added to the volume, as well as the disks on which that capacity should be allocated. The latter are specified by checking boxes in a list of eligible disks displayed by the Volume Manager. The listing for each eligible disk indicates the total amount of free space on the disk. If the administrator clicks the Query Max Size button in this display, the Volume Manager will compute the maximum size by which the volume can be extended on the specified disks. In Figure 9.38 this has been done.
Figure 9.38 Size and disk specification for RAID volume extension.
Here, the disk group contains only the four disks on which the RAID volume's existing subdisks are allocated, so all four disks must be used in extending the volume. In larger disk groups, there might be other options. When the extension capacity has been specified, the Volume Manager displays a Verify Disks panel showing the plex and subdisk layout it proposes for the extended volume (Figure 9.39). In this example, disk space contiguous to each of the volume's existing subdisks is available, so the proposal is to extend the original subdisks. If contiguous space were not available on each of the disks, the Volume Manager would propose to add a subdisk to each of the volume's columns. In either case, the administrator may modify the Volume Manager's proposal by clicking the Modify button on the Verify Disks panel and dropping down the list of eligible alternatives for any of the proposed disks. Here, though, all disks in the disk group are used by the volume, so there are no alternatives; the drop-down lists would therefore be empty.
Figure 9.39 Volume Manager Proposal for extending RAID volume.
Figure 9.40 shows the execution of a Windows 2000 Explorer drag-and-drop copy command in progress, for which the target is the recently expanded RAID volume (SmallRAID5Volume, addressed by applications using drive letter R:). The Explorer progress indicator is also shown as an inset in the figure. The copy was initiated while the extended volume was still resynchronizing.
Figure 9.40 Drag-and-drop Copy command in progress.
Figure 9.41 RAID volume being extended during a copy operation.
Figure 9.41 shows the file copy operation farther along, the progress of volume resynchronization (shown in the Progress column, at 42%), and the processor load imposed by the copy (visible in the Windows Task Manager CPU Usage graph).
■■ Progress of the file copy operation is shown by the Explorer progress indicator. The image in Figure 9.41 was obviously captured at a later point in the copy than that in Figure 9.40.
■■ Progress of synchronization of the storage capacity added to the RAID volume is indicated by the Progress column of the Volume Manager console volumes General view.
■■ Processor utilization resulting from writing a steady stream of data to the volume while it is being resynchronized is indicated by the Windows Task Manager CPU Usage graph. The execution of a continuous stream of writes is a worst-case scenario for RAID volumes because every write requires that parity be read, recomputed, and updated in addition to the data write.
Although it is possible to write data to a RAID volume while it is resynchronizing after extension, it is advisable to do so only with data that can easily be reproduced, in case a failure occurs. An extended RAID volume is not failure-tolerant until resynchronization of the extended capacity is complete. A better system management practice would be to restrict updates to resynchronizing volumes to reproducible operations such as initial database population.
Figure 9.42 View of extended RAID volume showing allocated disk capacities.
The plex structure of the extended volume and the amount of allocated and free capacity on each disk on which volume space is allocated are shown in Figure 9.42. This figure also shows one of the minor side effects of using disks of different types to make up volumes. Disk 8 has a slightly smaller capacity than the other disks containing subdisks of this volume. All but 3 kilobytes of Disk 8 have been allocated to the RAID volume. Each of the other disks in the group, which have slightly larger capacities, has 101.97 megabytes unallocated, even though the volume was extended to the maximum possible size using the disks available (Figure 9.38). The unused space occurs because all subdisks of a striped or RAID volume must be of the same size. The largest possible subdisk was allocated on Disk 8; that subdisk limited the size of the subdisks allocated on the other disks, even though slightly more capacity (101.97 megabytes) was available on them.
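The stranded capacity follows directly from the equal-subdisk rule: the smallest amount of free space among the selected disks governs the subdisk size for every column. The sketch below illustrates the arithmetic with hypothetical free-space figures chosen only to mirror the pattern in this example.

# Hypothetical free-space figures (megabytes); the rule is the one stated in the
# text: every column of a striped or RAID volume gets a subdisk of the same size,
# so the smallest disk governs and the larger disks strand the difference.
free_mb = {"Disk 6": 500.0, "Disk 7": 500.0, "Disk 8": 398.03, "Disk 9": 500.0}

subdisk_mb = min(free_mb.values())                                  # largest equal subdisk that fits everywhere
leftover_mb = {d: round(f - subdisk_mb, 2) for d, f in free_mb.items()}

print(subdisk_mb)     # 398.03: the subdisk added to each column
print(leftover_mb)    # the larger disks each strand 101.97 MB, as in Figure 9.42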
Multiple Volumes on the Same Disks
The Logical Disk Manager and Volume Manager both can be used to create and manage multiple volumes of different types allocated on a common set of disks. To demonstrate how to share a set of disks among several volumes, the example in this section creates:
■■ A concatenated volume with a capacity of 1500 megabytes.
■■ A RAID volume with four columns and a capacity of 1500 megabytes.
■■ A mirrored-striped volume with a usable capacity of 1500 megabytes.
This example is offered as a demonstration of the capabilities of the Volume Manager and Logical Disk Manager for Windows 2000, not as an example of good data management practice.
For most applications, a set of disks should be dedicated to a single volume so that known failure-tolerance and performance characteristics can be applied to all of an application's data. Situations in which sharing a set of disks among volumes of different types might be appropriate include:
■■ Data that is kept online for convenience but that is seldom accessed by applications or users can be located on a volume that shares storage capacity with another volume used by more active online applications.
■■ Applications that do not run concurrently and do not share data can have their data allocated on volumes that share a set of disks. In this scenario, only one application runs at a time, so there is no interference from another application's I/O demands. If the applications have different failure-tolerance or performance requirements, these can be accommodated by different volume types.
NOTE: In this example, the steps for creating the three volumes are similar to those in previous examples and so are not repeated. All that is shown are the General and Disk views of the configuration that results from creation of these three volumes.
This example begins with the view of the three volumes (labeled 1GBStripedMirroredVolume, 2GBConcatenatedVolume, and 1500MBRAID5Volume, respectively), shown in Figure 9.43. When this figure was captured, the RAID volume had already completed its initial regeneration, but the mirrored-striped volume was still resynchronizing (75 percent). Figure 9.44 shows the Disk view of the volume configuration in Figure 9.43. Each of the mirrored-striped and RAID volumes has a 500-megabyte subdisk allocated on each of the disk group's four disks. The concatenated volume has subdisks allocated on Disks 9 and 10. Figure 9.44 also shows that a 1000-megabyte striped volume with subdisks allocated on Disks 7 and 8 is being formatted at the instant of capture. Thus, all four major volume types (simple or concatenated, striped, mirrored-striped, and RAID) have been allocated on the same set of disks in this example. This is not necessarily a prudent system administration policy; it is only an illustration of the flexibility possible with the Volume Manager and Logical Disk Manager for Windows 2000.
Monitoring Volume Performance
The Windows 2000 Volume Manager monitors disk activity and can display the relative activity of volumes, disks, and subdisks in the console's Statistics view (shown in Figure 9.45).
Figure 9.43 General view of four volumes allocated on a disk group.
The Statistics view presents a system's volumes as the columns of a matrix, with the rows representing subdisks sorted by disk. This figure represents an essentially idle system, with no activity on any of the five volumes represented (T:—not used in these examples, 1GBStripedMirroredVolume, 1GB3WayMirror, 2GBConcatenatedVolume, and 1500MBRAID5Volume).
Figure 9.44 Disk View of three volumes allocated on a disk group.
Figure 9.45 Expanded Statistics view of volumes that share disks.
The Statistics view is fully expanded, so that any I/O activity is visible at the subdisk level. This view can help the administrator make decisions about moving subdisks or file objects to balance I/O load across disks and volumes. In Figure 9.45, each volume has a column and each disk has a shaded row. If a volume has a subdisk on a given disk, there is also a nonshaded row for that subdisk, marked by an activity icon at the intersection of the subdisk row and volume column. In this figure, for example, 1GB3WayMirror has a subdisk on each of Disks 3, 4, and 5; 1GBStripedMirroredVolume and 1500MBRAIDVolume both have subdisks on each of Disks 7, 8, 9, and 10; 2GBConcatenatedVolume has a subdisk on each of Disks 9 and 10; and volume T: has a subdisk on each of Disks 7 and 8. In the Statistics view, the level of I/O activity is indicated by a number in a disk's or subdisk's rectangle. In addition, four clock icons are used to summarize recent activity levels:
■■ Blue; no fill: Signifies underutilization, or no activity.
■■ Green; one-third filled: Signifies moderate utilization.
■■ Yellow; two-thirds filled: Signifies heavy utilization.
■■ Red; completely filled: Signifies critical utilization.
Though the colors cannot be distinguished in the printed figure, a key to the meanings of these icons is displayed at the bottom of the window.
Figure 9.46 shows a Windows 2000 Explorer view of the progress of a data copy to the RAID volume, as well as the corresponding disk and subdisk activity. The two numbers in each block of the Statistics view matrix represent, respectively, I/O operations per second and kilobytes transferred per second. More specifically, I/O performance numbers are displayed in the shaded horizontal bars for each disk for which there is I/O activity. Identical numbers are displayed in the column for each volume that has a subdisk allocated on the disk represented by the row. In this case, the numbers in the disk rows are all identical, because the only I/O activity on these volumes is due to the copy to the RAID volume. Unshaded rows of the matrix representing subdisks also have numbers in them when there is I/O to the subdisk during the sampling period. In this case, only the subdisks that are part of 1500MBRAID5Volume show activity, because the only I/O to volumes represented in Figure 9.46 is the copy operation whose target is 1500MBRAID5Volume.
Figure 9.46 Statistics view when copying data to RAID volume.
Figure 9.47 represents disk activity during a similar Explorer copy operation to the 1GB3WayMirror volume. Two significant aspects of this I/O load compared to the load in Figure 9.46 are:
■■ The numbers of I/O requests reported for each subdisk (each disk) are identical. This is to be expected, since all disk I/O requests must be complete before completion of a write to a mirrored volume is reported to the application.
■■ The number of requests per disk reported is considerably higher than the corresponding number for the similar copy operation represented in Figure 9.46. This is also to be expected, because in the case of Figure 9.46, every write entails intervening computations required to update the volume's parity to correspond to the newly written data.
Figure 9.48 provides additional insight into the usefulness of the Volume Manager Statistics view, as well as into the nature of concatenated volumes. This figure contains statistics captured while the same data was being copied to 2GBConcatenatedVolume. The significant point here is that while there is I/O activity reported for the subdisk on Disk 9, none is reported for the subdisk on Disk 10. This is consistent with the layout of concatenated volumes, which, as their name implies, concatenate the block addresses of their subdisks to create a volume address space. Because file systems generally attempt to allocate space compactly, it is highly likely that this pattern of unbalanced I/O will be exhibited by a concatenated volume. For optimal performance, therefore, it is usually desirable to use striping whenever a single plex spans multiple disks.
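The unbalanced pattern follows from how the two plex types map volume block addresses to subdisks. The sketch below contrasts the two mappings in simplified form; the subdisk and stripe-unit sizes are arbitrary, and the functions are conceptual illustrations rather than the Volume Manager's actual address arithmetic.

# Simplified block-address mapping for a two-subdisk plex: low volume addresses
# of a concatenated plex all land on the first subdisk, while a striped plex
# spreads them round-robin across its columns.
SUBDISK_BLOCKS = 1000   # blocks per subdisk (arbitrary)
STRIPE_UNIT = 8         # blocks per stripe unit (arbitrary)

def concat_map(volume_block):
    return divmod(volume_block, SUBDISK_BLOCKS)          # (subdisk index, offset)

def stripe_map(volume_block, columns=2):
    unit, offset = divmod(volume_block, STRIPE_UNIT)
    return unit % columns, (unit // columns) * STRIPE_UNIT + offset

blocks = range(0, 64, 8)
print([concat_map(b)[0] for b in blocks])   # [0, 0, 0, 0, 0, 0, 0, 0]  all I/O on one disk
print([stripe_map(b)[0] for b in blocks])   # [0, 1, 0, 1, 0, 1, 0, 1]  spread across both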
Figure 9.47 Statistics view when copying data to three volumes.
Figure 9.48 Statistics view when copying data to a concatenated volume.
Figure 9.49 Statistics view when copying data to two volumes.
Figure 9.49 shows yet another aspect of I/O statistics, the effect of concurrent I/O to two volumes that share the same set of disks. In this example, data is being copied to both 1GBStripedMirroredVolume and 1500MBRAIDVolume at the same time. The numbers in the unshaded blocks of the Statistics view matrix represent I/O load on individual subdisks, while those in the shaded blocks represent the totals for physical disks. Total I/O load for a disk is the sum of the I/O loads on all the subdisks that occupy space on it. The display in Figure 9.49 indicates that the majority of I/O operations are being performed on subdisks of 1GBStripedMirroredVolume, even though the same copy operation is being directed to both volumes. While several factors probably affect this imbalance of I/O, a major contributor is certainly the requirement for computations interspersed between every pair of writes to a RAID volume. During the time required by these computations, disk I/O operations can be scheduled on behalf of the striped-mirrored volume, which further delays those for the RAID volume when they are finally scheduled. This suggests two things: first, that extreme care should be taken to understand I/O load characteristics when allocating two volumes on the same set of disks, and, second, that RAID volumes should be specified with caution to prevent unintended side effects.
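The relationship between the shaded disk rows and the unshaded subdisk rows is a simple summation, which can be reproduced easily if statistics are recorded outside the console. The per-subdisk rates below are hypothetical values used only to illustrate the aggregation.

# Hypothetical per-subdisk I/O rates (requests per second), keyed by (disk, volume);
# each shaded disk row in the Statistics view is the sum of its subdisk rows.
subdisk_iops = {
    ("Disk 7", "1GBStripedMirroredVolume"): 85,
    ("Disk 7", "1500MBRAID5Volume"): 30,
    ("Disk 8", "1GBStripedMirroredVolume"): 82,
    ("Disk 8", "1500MBRAID5Volume"): 28,
}

disk_totals = {}
for (disk, _volume), iops in subdisk_iops.items():
    disk_totals[disk] = disk_totals.get(disk, 0) + iops

print(disk_totals)   # {'Disk 7': 115, 'Disk 8': 110}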
Relocating Subdisks
It is often desirable or necessary to relocate a subdisk from one disk to another, to improve system I/O performance or to forestall possible volume degradation or data loss due to impending disk failure. Perhaps the most common reason for relocating a subdisk is to relieve a single disk that contains subdisks belonging to two heavily loaded volumes that are active at the same time. If unallocated capacity is available on a less heavily loaded disk in the same disk group, one of the busy subdisks can be relocated to it. This spreads the I/O load more evenly across two disks. Subdisks may also be relocated to anticipate and avoid loss of data or exposure to data loss if a disk is thought to be failing. One warning sign that a disk is about to fail is an increased incidence of I/O errors. If this condition is detected early enough (for example, if disk and host are equipped with SMART, discussed in Chapter 1), subdisks allocated on it can be moved to other, healthier disks before hard failure of the suspect disk occurs. A hard disk failure degrades a failure-tolerant volume and results in data loss in nonfailure-tolerant volumes. Movement of a subdisk between two disks is initiated by a system administrator using the Volume Manager wizard for this purpose, which simplifies the subdisk movement to the extent possible. An administrator starts the Subdisk Move wizard by invoking the Move SubDisk… command from the Statistics view (Figure 9.50).
Figure 9.50 Invoking the Move SubDisk command.
To invoke the command, the administrator first expands the view of Disk 7, the disk containing the subdisk to be relocated, making Disk 7's subdisks visible. Figure 9.50 illustrates the invocation of the Move SubDisk… command applied to Subdisk 1-03 (the third subdisk of plex 1) of 2GBStripedVolume. The Subdisk Move wizard begins with the usual informational panel, followed by a panel that allows the administrator to specify either Express or Custom mode for moving the subdisk. When Express mode is selected, the Volume Manager chooses the destination for the move, observing the usual performance and failure-tolerance constraints (e.g., no two subdisks from the same RAID volume located on the same disk). Custom mode allows the administrator to specify the disk to which to move the subdisk, within the same constraints. The Select Disk panel, shown in Figure 9.51, is displayed when the Custom mode of execution is specified. The Select Disk panel lists all disks in the group that meet the Volume Manager's performance and availability constraints and that have sufficient free space to host the moved subdisk. In Figure 9.51, Disks 9 and 10 are listed because they have sufficient unallocated space to accommodate the subdisk from Disk 7; Disk 9 has been specified as the target of the subdisk relocation. As soon as the Subdisk Move wizard finishes processing, the Volume Manager begins to move the subdisk. Two views of the progress of subdisk relocation are shown in Figure 9.52:
■■ The Volume Manager displays an informational dialog (shown in the upper right corner of the figure) to indicate that a subdisk move is in progress.
Figure 9.51 Specifying the target location for a subdisk relocation.
This dialog remains on display until the subdisk has been moved or until it is dismissed by the administrator.
■■ The Explorer view shows the file copy operation in progress. During the move, the file copy is targeted at the volume whose subdisk is being moved.
The second view illustrates the point that moving a subdisk is, from a functional standpoint, largely a transparent operation. Any application operation on a volume can be performed while one of the volume's subdisks is being moved. But, because moving a subdisk from one disk to another is an I/O-intensive operation (every block of the subdisk to be moved must be read and written), it is unlikely that I/O performed during a subdisk move will be transparent from a performance standpoint. It is usually preferable to avoid performance-critical operations if at all possible while subdisks are being moved. The key advantage of the Volume Manager is that this is not a hard restriction. In an emergency situation, or when response time is not critical, application I/O to a volume can be performed while one of the volume's subdisks is being moved. When a subdisk must be moved, the administrator's choice is a "soft" one between possibly impacting application performance and deferring the move, rather than a "hard" one involving application shutdown.
Figure 9.52 Tracking subdisk relocation.
A subdisk move in progress can also be observed in detail from the General view, as shown in Figure 9.53. RAID volume status is listed as Regenerating during subdisk relocation, although the volume can be used by applications during the relocation. After a subdisk has been moved, the space it occupied before the move is deallocated and becomes available to the Volume Manager for allocating other subdisks.
Figure 9.53 Detail view of volume during subdisk relocation.
Disk Failure and Repair
The purpose of a failure-tolerant volume is to provide continuous application I/O services when one or more of the disks comprising the volume fails. To demonstrate this capability, the example in this section illustrates the behavior of the Volume Manager and the Windows 2000 OS when a disk fails. The example uses two volumes, a 500-megabyte two-mirror volume and a 1500-megabyte four-column RAID volume. One subdisk of each volume is allocated on Disk 6, as Figure 9.54 illustrates. The remaining subdisks of both volumes are allocated on separate disks. At the start of the example, both volumes are in a Healthy state (i.e., all disks are functioning, and resynchronization and regeneration are complete). To simulate a disk failure for this example, the disk enclosure where Disk 6 is located is powered down.
Figure 9.54 Volume configuration for disk failure and repair example.
Figure 9.55 Volume Manager and OS disk failure notification.
Disks 2, 3, 4, and 5 are housed in a separately powered enclosure and so are not affected by this action. Power failure results in an immediate and obvious reaction by both the Volume Manager and the Windows 2000 OS itself. Figure 9.55 shows three Volume Manager and OS informational displays that result from powering off Disk 6 (which, again, contains subdisks from both the 1500-megabyte RAID volume and the 500-megabyte two-mirror volume). All three of these windows appear as soon as the disk's failure is detected (by its failure to respond to I/O commands). For the purposes of this example, the two operating system dialogs and the Volume Manager event display shown in this figure can all be dismissed. In actual practice, however, an administrator would determine the reason for the messages, for example by examining event logs or by physically inspecting hardware, and take action to repair the problem. In addition to the attention-getting messages shown in Figure 9.55, both the Volume Manager and the Windows 2000 operating system record the events that result in discovery of a failed disk in their respective event logs. Figure 9.56 shows the Windows 2000 and Volume Manager event logs for the simulated disk failure in this example.2 Finally, Figure 9.57 shows the Volume Manager Disk view of the configuration after the disk failure.
2 The operating system event log shows multiple events because the disk enclosure whose power is cut contains multiple disks. For this example, however, only one of the disks holds subdisks of failure-tolerant volumes, so the Volume Manager event log shows only one event.
Figure 9.56 Volume Manager and Windows 2000 event logs.
Figure 9.57 Disk view of configuration with failed disk.
Because the failed disk, Disk 6 (listed as Missing Disk in the display), contains subdisks belonging to both failure-tolerant volumes, both volumes report a Failed Redundancy status, meaning that the volumes can still perform their respective I/O functions, but are no longer failure-tolerant. In this example, Disk 6 was powered off with no volume metadata activity outstanding, so its volume metadata remains intact. The scenario is equivalent to an actual power failure in a disk enclosure. When power is restored, the Volume Manager can begin restoring failure-tolerant volumes to optimal (Healthy) status as soon as it recognizes that the volumes' disks are again available. But when disk power is restored after a failure, no event indicates to the Volume Manager that disks are again present or that a failed disk has been replaced with a functional one. Therefore, after effecting a repair or replacement, the administrator must click the Rescan command on the Volume Manager toolbar (Figure 9.58) to initiate I/O bus scanning (or disk discovery in the case of Fibre Channel) so that newly added or recovered devices can be recognized by the Volume Manager. Scanning a system's I/O buses takes a few seconds, during which the Volume Manager displays a progress indicator, also illustrated in Figure 9.58. During the rescan, the Volume Manager reads metadata for the newly discovered disks and acts according to it. In the example, Disk 6 has reappeared, with metadata indicating that it should contain subdisks from both the 1500MB RAID and 500MB Mirror volumes. (The Volume Manager checks metadata on Disk 6 for consistency with metadata on the surviving disks of those volumes.)
Figure 9.58 SCSI bus Rescan command and progress indicator.
Figure 9.59 Disk view of volume configuration during repair.
Since both volumes are failure-tolerant, and the configuration is such that they survived the power outage in this example, subdisks on Disk 6 could simply be added back to their respective volumes. But volume contents may have changed while power was off; hence, the contents of the reappearing subdisks cannot be guaranteed to be consistent with the rest of their volumes' contents. The Volume Manager must therefore regenerate block contents for the reappearing subdisks. Figure 9.59 shows the Disk view after Disk 6 has reappeared and while the 500MB Mirror volume was still in the regenerating state (the 1500MB RAID volume had already completed regenerating at the moment of capture, and is Healthy). This example used power failure to simulate the effect of a disk failure. In the case of an actual disk failure, rather than a power outage, an administrator would:
■■ Replace the failed disk with a working one.
■■ Upgrade the replacement disk to dynamic format.
■■ Use the Repair command on the context-sensitive menu for volume objects to allocate a new subdisk on the replacement disk and add it to the volume.
Volume Management Events
Both the VERITAS Volume Manager and the Windows 2000 built-in Logical Disk Manager maintain persistent logs of all significant volume-related events. A sample is shown in Figure 9.60. The log is displayed by clicking the Events tab in the console information window of either the Volume Manager or the Logical Disk Manager. As the event times in this figure suggest, event logs are displayed in old-to-new sequence; that is, with the oldest events listed at the top of the display. The first event listed in Figure 9.60 marks the creation of a volume by the Volume Manager's volume creation service (vmnt). The next event is from the operating system's mount service, invoked by the Volume Manager, which assigns drive letter J: to the newly created volume. Next, the fsys (file system) operating system component formats the volume. In this example, the eight-second duration of the formatting operation makes it evident that the volume was quick formatted. At this point, the volume is ready for application use. The next event shows the addition of a mirror to a volume. The last two events report the splitting of a mirror from the volume and the mounting of the split mirror as a separate volume. The Volume Manager and Logical Disk Manager event logs are conceptually similar to other Windows 2000 event logs; however, they record only events related to volume management. Volume management event logs provide system administrators with focused records that can be useful for troubleshooting. Warning and error events related to volumes are recorded in both the Volume Manager and system event logs.
Figure 9.60 Sample Volume Manager event log.
Using Windows Command-Line Interface to Manage Volumes
The examples in this and the two preceding chapters have used the Volume Manager's graphical console exclusively. Volume Manager functions can also
Figure 9.61 Volume Manager command-line interface examples.
be invoked using a command-line interface (CLI), examples of which are shown in Figure 9.61. Some administrators, particularly those with UNIX backgrounds, find the CLI more familiar or comfortable to work with than the graphical one. A more important use for the CLI is to include Volume Manager commands within larger system management scripts, or stored sequences of commands that are executed periodically to perform recurring management operations. The Volume Manager CLI consists of three commands that operate on disks (vxdisk), disk groups (vxdg), and volumes (vxvol), respectively. Reading from top to bottom, the first three screens in Figure 9.61 show each of these commands used to obtain information about a disk, a disk group, and a volume, respectively. A fourth command, vxassist, shown in the bottom screen, internally combines the more primitive commands as necessary to accomplish nine common administrative tasks, such as creating mirrored volumes and splitting mirrors from them. Here, the vxassist command's help function has been invoked to display these functions. For example, a periodic backup of an application's data could be made from split third mirrors, using a console command script to:
■■ Add third mirrors to the application's volumes.
■■ Monitor volume state to determine when third-mirror contents are consistent (i.e., when resynchronization is complete).
■■ Quiesce the application so that volume contents are internally consistent.
■■ Split the third mirrors from all volumes.
■■ Reactivate the application, using the original volumes.
■■ Run the backup against the split mirrors.3
Such scripts can be prestored and run automatically on either a periodic or an as-needed basis, thereby reducing or eliminating the need for skilled administrator involvement in routine system management tasks (a sketch of such a script appears below).
3 The script description has been simplified to include only elements essential to illustrating the use of Volume Manager console commands.
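As an illustration of the kind of script described above, here is a minimal sketch written in Python. It only reproduces the sequence of steps in the list; the vxassist and vxvol command strings, the state text that indicates resynchronization, the application quiesce/resume commands, and the volume names are all placeholders rather than the actual syntax of any particular Volume Manager release, and should be replaced with the commands reported by the CLI's own help output.

```python
"""Illustrative split-mirror backup driver (a sketch, not production code)."""
import subprocess
import time

VOLUMES = ["J:", "K:"]   # hypothetical volumes used by the application

# Placeholder command templates: substitute the real subcommands listed by your
# Volume Manager version's CLI help and your own quiesce/resume/backup tools.
ADD_MIRROR_CMD   = "vxassist <add-third-mirror-arguments> {vol}"
QUERY_STATE_CMD  = "vxvol <volume-state-query-arguments> {vol}"
SPLIT_MIRROR_CMD = "vxassist <split-mirror-arguments> {vol}"
QUIESCE_APP_CMD  = "app_quiesce.cmd"      # hypothetical application scripts
RESUME_APP_CMD   = "app_resume.cmd"
BACKUP_CMD       = "run_backup.cmd {vol}"

def run(command: str) -> str:
    """Run a console command, raising if it fails, and return its output."""
    result = subprocess.run(command, shell=True, check=True,
                            capture_output=True, text=True)
    return result.stdout

def wait_until_synchronized(volume: str, poll_seconds: int = 60) -> None:
    """Poll volume state until the new mirror has finished resynchronizing.
    The state text looked for ("Resynching" here) is an assumption; use
    whatever your volume state query actually reports."""
    while "Resynching" in run(QUERY_STATE_CMD.format(vol=volume)):
        time.sleep(poll_seconds)

def main() -> None:
    for vol in VOLUMES:                       # 1. add third mirrors
        run(ADD_MIRROR_CMD.format(vol=vol))
    for vol in VOLUMES:                       # 2. wait for mirror contents to be consistent
        wait_until_synchronized(vol)
    run(QUIESCE_APP_CMD)                      # 3. quiesce the application
    try:
        for vol in VOLUMES:                   # 4. split the third mirrors
            run(SPLIT_MIRROR_CMD.format(vol=vol))
    finally:
        run(RESUME_APP_CMD)                   # 5. reactivate the application
    for vol in VOLUMES:                       # 6. back up the split mirrors
        run(BACKUP_CMD.format(vol=vol))

if __name__ == "__main__":
    main()
```

A script along these lines would typically be invoked by a scheduler so that the backup recurs without administrator involvement.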
CHAPTER 10
Multipath Data Access
Physical I/O Paths
As a consequence of Fibre Channel finding common use as an I/O interconnect for Windows servers, more disks and RAID subsystems are being connected to servers on more than one I/O path, or I/O interconnect. For the purposes of this discussion, an I/O path emanates from a host bus adapter or a port on a multiport adapter, and includes the disk or RAID subsystem virtual disk (logical unit, or LUN) to which I/O commands are addressed. Figure 10.1 illustrates two disk drives that are connected to the same host computer on two I/O paths, one emanating from the host bus adapter labeled HBA 1, and the other from HBA 2. Of course, a configuration such as this requires disks or RAID subsystem LUNs that can support multiple paths. Fibre Channel disks are typically equipped with two I/O ports and are capable of existing in configurations such as that shown in the figure. Multiple I/O paths to a single storage device have two basic purposes:
■■ They can improve I/O performance, for example by diverting some I/O requests from a path that is momentarily congested to one that is better able to handle them.
■■ They provide failover capability in the event that a host bus adapter or the cabling of a path should fail. I/O requests can be diverted to the other path connecting the I/O device to the computer.
Figure 10.1 Multipath disk attachment.
The Volume Manager for Windows 2000 supports disks that can be reached from two or more different I/O paths on the same system. This feature of the Volume Manager is called dynamic multipathing, and abbreviated DMP in some of the documentation and console interactions. The Volume Manager supports disks that can be reached on two access paths in one of two configurations: Active/passive. In this configuration, the Volume Manager directs all I/O to the active path. The passive path is not used unless there is a failure of the active path (a host bus adapter or cabling failure) and the active path becomes unusable. Active/active. In the configuration of Figure 10.1, the Volume Manager attempts to balance I/O across the paths to the device. This is accomplished by funneling all I/O requests through a DMP driver, which directs each access to a device to one of its access paths. The DMP driver makes the application I/O stream appear to be using a single path. Figure 10.2 shows a Volume Manager console view of a system in which dynamic multipathing software has been installed during Volume Manager installation. An additional tab, labeled Paths, appears in the main window. In this particular view, the disk called Harddisk2 has been selected, so path information about it is displayed. As the display indicates, there are two paths to Harddisk2, one of which is functional (Path 2), and the other of which is listed as Not Healthy. The Volume Manager does not automatically recognize and enable multiple paths to a storage device. This must be done through a series of administrator actions. Figure 10.3 illustrates the Volume Manager console in a system with dynamic multipathing software installed (the same system as illustrated in Figure 10.2 shown at an earlier point in time), and with some of its disks
Figure 10.2 Volume Manager Path view of a disk.
NOTE: The screen images in this chapter were captured from a newer version of the Volume Manager than most of the others in this book. Moreover, the hardware configuration used in these examples has Fibre Channel-attached disks rather than SCSI ones. These two differences account for the slightly different appearance of the screen captures in this chapter.
accessible on two paths. Path information is shown for the highlighted disk. Only one path is visible to the system at the point of screen capture. Also displayed is the menu of commands that can be executed on path objects in systems like this one with dynamic multipathing installed. The first step in enabling multipath functionality is to use the Array Settings… command to display the Array Settings… dialog (Figure 10.4). An array, in this context, is the entire set of disk devices (disk drives or LUNs presented by RAID subsystems) that share common path connectivity. Thus,
Figure 10.3 Disk with a single path enabled and the Array Settings command invoked.
path-related changes made in the dialog shown in this figure apply to all devices in the array. Other path-related commands include:
Device Settings…. Sets path-related parameters for individual devices on a path.
Enable Path and Disable Path. Enables or disables, respectively, the selected path.
Preferred Path. Declares the selected path to be the preferred one for active/passive configurations.
Properties. Displays the properties of the selected path.
The Array Settings dialog shown in Figure 10.4 is used to alter path characteristics for some or all of the storage devices on a path. Devices can be included in or excluded from the path characteristics set in this dialog by selecting them in the scrolled list and checking or not checking the Exclude check box. Two important path-related parameters can also be set in the Array Settings dialog:
Load Balancing. By specifying Active/Active, the administrator permits the Volume Manager to route requests to the selected devices to either
Figure 10.4 The Array Settings dialog.
path. By specifying Active/Passive, the administrator inhibits this capability and restricts I/O to the preferred path (discussed later).
Monitor Interval (in seconds). The interval at which the Volume Manager checks the specified devices to verify that the paths to them are functional. Proactive monitoring enables the Volume Manager to detect and report on path-related problems as they occur, rather than reacting to them when application I/O requests are made.
The Array Settings dialog applies to any set of devices with common access paths. In the example, the disks named Harddisk0 through Harddisk9 are all connected to the same Fibre Channel Arbitrated Loop (FC_AL). Figure 10.5 shows the Array Settings dialog for another disk connected to the same system, Harddisk10. This is a parallel SCSI disk, and is the only such disk connected to its I/O bus. The salient points made by Figure 10.5 are these:
■■ Only one disk, Harddisk10, is shown in the Devices in this Array section of the dialog, because Harddisk10 is the only disk that can be reached on this particular path.
■■ Either the I/O bus does not support dynamic multipathing or no second path to the Harddisk10 device was found, as evidenced by the disabled (grayed-out) Load Balancing options.
Figure 10.5 The Array Settings dialog for a parallel SCSI disk.
Returning to the Array Settings dialog illustrated in Figure 10.4, by checking the Active/Active Load Balancing option, the administrator implicitly prompts the Volume Manager to verify that both paths to the devices are valid by rescanning the I/O paths affected by the dialog. Figure 10.6 illustrates the visible evidence of the rescan: a progress monitor in the Tasks section of the Volume Manager console main window. In Figure 10.6, the path to Harddisk10 is being rescanned because it was the disk selected when the Array Settings… command was issued. Since the disk is on a parallel SCSI bus with no other disks and no other initiators (host bus adapters) connected, the rescan does not detect any additional paths to the device. This contrasts with a rescan of the Fibre Channel Arbitrated Loops to which devices Harddisk0 through Harddisk9 are connected. Figure 10.7 illustrates the results of this rescan. In this case, Harddisk3 is selected, so path information pertinent to it is shown. Both paths report a Healthy status in the State column, meaning that the Volume Manager is able to access Harddisk3 on both paths. Physical access to Harddisk3 is set to Active/Active in Figure 10.4. This can be changed by selecting one of the paths in Figure 10.7 and double-clicking to display the Device Settings dialog shown in Figure 10.8. Here, the Active/Passive mode of access, in which only one of the paths to the
Figure 10.6 Progress of administrator-initiated path rescan.
Figure 10.7 Fibre Channel loop rescan complete, showing two paths to device.
Figure 10.8 Device Settings dialog for Harddisk3.
Figure 10.9 Two paths in Active/Passive mode.
Figure 10.10 Invoking the Preferred Path command.
device is used, has been specified. The Exclude check box can be checked to indicate that the selected device (Harddisk3 in this example) should be excluded from the change in settings. The point is, the paths used to reach storage devices can be controlled on a per-device basis or for the entire path. When the Active/Passive mode of access is in effect, different icons are displayed in the Volume Manager console to denote the active and passive paths. Figure 10.9 shows the result of the path configuration change from Figure 10.8 as viewed through the Volume Manager console. In Figure 10.9, the icon for Path 1 includes a check mark, indicating both that the device access mode is Active/Passive and that Path 1 is currently the active path. An administrator may change the active path via the Preferred Path command, as Figure 10.10 illustrates. Here, the active path is being changed from Path 1 to Path 2. Using this command, an administrator can specify the path on which I/O commands and data travel to the selected devices. In general, if multipath access to a storage device is available, the best system administration policy is to enable Active/Active mode and allow I/O load to be balanced by the Volume Manager. Active/Passive mode and preferred paths are best reserved for situations in which I/O load must be balanced manually, for example, when one device has time-critical response requirements or when a path is suspect.
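The difference between the two access modes can be pictured with a small model. The Python sketch below is not the DMP driver's actual algorithm; it is only meant to show the behavior the two settings produce: in Active/Active mode, requests rotate across every healthy path, while in Active/Passive mode they stay on the preferred path and move to another path only when the preferred one fails.

```python
from itertools import cycle

class PathSelector:
    """Toy model of multipath request routing (not the actual DMP algorithm)."""

    def __init__(self, paths, mode="active/active", preferred=None):
        self.paths = list(paths)                  # e.g., ["Path 1", "Path 2"]
        self.healthy = set(self.paths)
        self.mode = mode
        self.preferred = preferred or self.paths[0]
        self._rr = cycle(self.paths)              # round-robin iterator

    def mark_failed(self, path):
        self.healthy.discard(path)

    def choose_path(self):
        if self.mode == "active/active":
            # Balance load by rotating across every healthy path.
            for _ in range(len(self.paths)):
                candidate = next(self._rr)
                if candidate in self.healthy:
                    return candidate
        else:
            # Active/Passive: stay on the preferred path until it fails.
            if self.preferred in self.healthy:
                return self.preferred
            for candidate in self.paths:
                if candidate in self.healthy:
                    return candidate              # fail over to a surviving path
        raise IOError("no healthy path to device")

# Example: two paths to one device with Active/Active load balancing.
selector = PathSelector(["Path 1", "Path 2"])
print([selector.choose_path() for _ in range(4)])   # alternates between the paths
selector.mark_failed("Path 1")
print([selector.choose_path() for _ in range(2)])   # all I/O now uses Path 2
```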
CHAPTER 11
Managing Hardware Disk Arrays
RAID Controllers
Chapter 1 discusses aggregating disk controllers. These controllers coordinate the activities of several disks, using concatenation, striping, mirroring, and RAID techniques to represent sets (arrays) of physical disks to host computers as though they were individual disks. Aggregating disk controllers have several advantages:
Connectivity. They typically connect to a single port on a server's internal I/O bus (e.g., its PCI or S-Bus) or to a single port on an external Fibre Channel or SCSI bus, but emit two or more back-end buses to which disks are connected. This often enables the connection of more disk storage capacity to a server than would be possible with directly attached disks.
Processor offloading. All except the smallest aggregating RAID controllers incorporate microprocessors that perform many of the tasks that a host-based volume manager would ordinarily perform. Particularly with RAID arrays, host server offloading can free significant processing capacity for application use.
High performance. Aggregating RAID controllers are designed to perform one task, disk I/O management, very well. To that end, many contain specialized hardware and software components that would be difficult if not impossible to implement in host server environments. Single-address space software environments, specialized RAID computation engines, dual-port
memory controllers, and other unique features all contribute to the highly optimized I/O performance of aggregating RAID controllers.
Chapter 1 also points out that there are two basic RAID controller architectures:
■■ Embedded RAID controllers mount in server frames and connect directly to their hosts' internal I/O buses. Embedded RAID controllers rely on their hosts for operating power and cooling, and are intimately integrated with host I/O bus protocols. They are intrinsically in the availability domains of their host computers. A host computer failure generally makes an embedded RAID controller and the storage devices connected to it inaccessible (although there are some RAID subsystems in which two embedded controllers share disks connected to common I/O buses).
■■ External RAID controllers are packaged and powered separately from their host computers. They typically connect to external I/O buses such as Fibre Channel or SCSI, and emulate disks to their hosts. Because their operating power comes from a source separate from that of their host computers, they can survive host computer failures. External RAID controllers are ideal for cluster storage because they can easily be connected to multiple servers.
Embedded RAID Controllers
At some level, both embedded and external RAID controllers emulate disks. External RAID controllers map sets of physical disks to host I/O bus addresses called logical unit numbers (LUNs). For most purposes, a host computer I/O driver cannot distinguish a LUN represented by an external RAID controller from a physical disk. Embedded RAID controllers are somewhat different. Often, they require specialized drivers, at least for purposes of configuring the disks connected to them into arrays. Figure 11.1 shows a typical embedded RAID controller architecture. The function of an embedded RAID controller is similar to that of a volume manager. It organizes physical disks connected to it and presents them as virtual disks, or volumes, to a host interface. A volume manager is equivalent to an I/O driver in the server I/O stack. A hardware driver manipulates embedded RAID controller hardware registers to communicate with the controller. In both cases, layers of software above the respective drivers interface with disklike abstractions. This is true even when the upper software layer is a host-based volume manager. A host-based volume manager manages the virtualized disks presented by embedded (or external) RAID controllers as if they were physical disks. This is explained more fully in the sections that follow.
Figure 11.1 Embedded RAID controller.
Array Managers
Because the functionality of RAID controllers is so similar in many respects to that of host-based volume managers, it would be ideal if the management interfaces they exposed to system administrators were similar as well. This is the philosophy behind a class of Windows 2000 software components known as array managers. Array managers present the administrator with a management interface for RAID controllers that is almost identical to the Logical Disk Manager interface. The Windows 2000 Logical Disk Manager is specifically designed so that RAID controller management tools can be integrated with it to present a common interface for managing both RAID controllers and host-based volumes.
NOTE: The examples in this chapter use Dell Computer Corporation's OpenManage array manager to illustrate the array manager concept.
Figure 11.2 shows the Windows 2000 Logical Disk Manager’s Disk view of a system that seems to have three disks connected to it. In fact, each of these disks is an array of physical disks represented as a single disk by an embedded RAID controller. The Logical Disk Manager’s only view of disks is through the disks’ I/O driver (in this case, a specialized driver for the RAID controller), so there is no way for it to distinguish between disks, physical or virtual, connected to an external I/O bus and virtual disks represented by an embedded RAID controller. The embedded RAID controller and its I/O driver emulate physical disks so well that the Logical Disk Manager can treat the virtual disks as though they
Figure 11.2 Disk view of a system with two hardware disk arrays.
were physical, including writing signatures on them, formatting them, and creating volumes on them. Figure 11.3a shows the Create Volume wizard's final confirmation panel, as well as some volumes created on the virtual disks shown in Figure 11.2. Figure 11.3b shows the final verification stage in the creation of a simple volume that occupies the entire usable capacity of what the Logical Disk Manager perceives as a 1-gigabyte disk. In many cases, it is appropriate to create simple volumes on RAID controller-based virtual disks, because what the Logical Disk Manager perceives as a disk is actually a failure-tolerant virtual disk created and managed by the RAID controller. Here, Disk 0 is a virtual disk presented by the RAID controller, which stripes and mirrors data across subdisks on two physical disks connected to it. RAID controller-provided disk failure tolerance is indeed a useful capability, but it raises questions as well:
■■ How can an administrator determine whether a given virtual disk presented by a RAID controller is failure-tolerant, or whether it stripes data across two or more physical disks?
■■ Must different user interfaces be used to manage physical and virtual disks?
■■ Must administrators resort to completely external management tracking techniques, such as spreadsheets, to track the status of online storage?
Figure 11.3 (a, b) Logical Disk Manager administration of hardware RAID arrays.
It would be far better to be able to use the Logical Disk Manager console to "look inside" a RAID controller and determine unequivocally which disks were attached to it and how they were being managed. Doing so (looking inside a RAID controller connected to a Windows 2000 system to observe and manage the disks connected to it) requires that an array manager for the particular type of RAID controller be installed (again, in these examples, Dell's OpenManage array manager). In the Windows 2000 operating system, array managers are integrated with the logical disk management function, so that both directly
attached disks and disks connected through RAID controllers are managed using the same interface. When an array manager is installed, it replaces the built-in Logical Disk Manager. The OpenManage Array Manager console can be invoked using any of the usual Windows 2000 methods of starting applications. Figure 11.4 illustrates launching the Array Manager console from the Windows 2000 Start menu. The OpenManage Array Manager console has a nearly identical look and feel to that of the Windows 2000 Logical Disk Manager. In fact, its Disk and Volume views are identical to the corresponding Logical Disk Manager views. But, as mentioned, because the Array Manager can interact with the RAID controller driver, it can also look inside certain RAID controller models and retrieve information about the disks connected to them and about the array configurations in which those disks are being used. This picture is presented through the OpenManage General view, shown in Figure 11.5.
NOTE: Each type of controller is unique, hence presents a slightly different view. The particular controller model used in these examples is called PERC 2/Si.
Two views of the RAID controller and the disk drives connected to it (which it collectively refers to as an array) can be seen in this figure:
■■ The physical view, which depicts the RAID controller's disk buses and the disks connected to them. For example, the RAID controller in Figure 11.5 has a single disk bus with five disks connected to it.
■■ The logical view, which depicts the disk groups and virtual disks being managed by the RAID controller. In Figure 11.5, one virtual disk has been defined within the RAID controller (BOOT 0).
Developers of array manager software for Windows 2000 make every effort to provide a similar interface to the administrator for managing both directly
Figure 11.4 Invoking the Dell OpenManage Array Manager.
Figure 11.5 Array Manager General view.
attached disks (or virtual disks presented by external RAID controllers, which are indistinguishable to both the volume manager and to operating system I/O drivers) and disks connected to an embedded or external RAID controller. Thus, for example, with the OpenManage Array Manager, to create a virtual disk, the administrator clicks on the Logical Array icon for the RAID subsystem on which the virtual disk is to be created and invokes the Create Virtual Disk… command to run the accompanying wizard on the Array (see Figure 11.6). As the rightmost screen in this figure shows, virtual disks
Figure 11.6 Using the Create Virtual Disk command on the OpenManage Array Manager.
are created in Express or Custom mode, as in the Logical Disk Manager. Again, in Express mode, the Array Manager chooses the location(s) for the new virtual disk's storage. In Custom mode, the administrator specifies the disks on which storage is to be allocated. Figure 11.7 shows these two modes in action. As with the Windows 2000 Logical Disk Manager and Volume Manager, the Array Manager requires only that the administrator supply data it cannot infer or derive. In both Custom and Express modes, the administrator must specify the name, type, usable capacity, stripe depth (called Stripe Size in the panels), and read and write cache policies. The left side of Figure 11.7 illustrates Express mode virtual disk creation. In Express mode, the administrator does not specify subdisk locations; instead, when all virtual disk parameters have been specified, the Array Manager displays a panel showing the disks on which it will allocate subdisks. The right side of Figure 11.7 illustrates Custom mode virtual disk creation. In Custom mode, the administrator uses checkboxes to specify the physical disks on which to allocate storage. This panel also shows the virtual disk Type drop-down list box expanded, indicating that the RAID controller model used in this example supports five different virtual disk types. (Simple and concatenated virtual disks are both treated as concatenated by this RAID controller.)
Figure 11.7 Express and Custom modes in action.
Like the host-based Volume Manager and Logical Disk Manager, this RAID controller model supports multiple virtual disks that use storage on the same set of physical disks. Figure 11.8 illustrates the Array Manager view of a RAID controller with these three virtual disks allocated:
■■ A simple volume occupying essentially the entire capacity of the 8.47-gigabyte physical disk on which it is allocated. As the array name (Boot 0) suggests, this is the boot disk for the system, and is not used in these examples.
■■ A 1-gigabyte RAID (RAID5VirtDisk) virtual disk allocated on the other four disks connected to the controller.
■■ A 1-gigabyte striped mirrored virtual disk (RAID10VirtDisk) allocated on the same four disks.
These three virtual disks are represented to the Windows operating system (including the Logical Disk Manager) as physical disks. In order for them to be formatted with file systems and used by applications, basic partitions or dynamic volumes must first be created on them. Figure 11.8 also shows, in the Windows Explorer panel, volumes created on the three virtual disks. Like host-based volumes, newly created failure-tolerant virtual disks must have their media contents synchronized before they are actually failure-tolerant. For mirrored virtual disks, this means that the contents of one mirror must be copied to all other mirrors in the virtual disk. For RAID virtual disks, each parity block's contents must be computed from the contents of corresponding data blocks. Synchronization is performed when virtual disks are
Figure 11.8 RAID Controller with two arrays defined.
created. Until it is complete, virtual disks are not failure-tolerant. Thus, any unrecoverable operations involving critical data, such as copying to a virtual disk, should be deferred until initial synchronization is complete. An administrator can determine whether a virtual disk is failure-tolerant by clicking on the Array Manager General tab. Figure 11.9 shows this view after synchronization of a striped-mirrored virtual disk is complete, but while a RAID virtual disk is still synchronizing. Host-based volumes can be created, and file systems can be formatted, on these virtual disks while they are still synchronizing. Any data written to a failure-tolerant virtual disk during synchronization will be written correctly, but not protected against loss due to disk failure until synchronization is complete.
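For conventional RAID arrays of the kind discussed here, the parity computation is a bitwise exclusive OR of the corresponding data blocks. The short Python sketch below (with tiny, arbitrary blocks chosen purely for illustration) shows both halves of the idea: computing a parity block during synchronization, and regenerating a lost block from the surviving blocks after a disk failure.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length blocks: the conventional RAID parity function."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks striped across three disks (tiny blocks for illustration).
data = [b"\x0f\x10\xaa\x00", b"\xf0\x01\x55\xff", b"\x33\x44\x00\x0f"]

# Synchronization: compute the parity block written to the fourth column.
parity = xor_blocks(data)

# Disk failure: block 1 is lost; regenerate it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]       # XOR parity recovers the missing block exactly
```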
RAID Controllers and the Volume Manager
When the VERITAS Volume Manager for Windows 2000 is installed on a system on which an array manager has already been installed, the Volume Manager integrates itself with the array manager functionality and manages RAID controllers as well as directly attached disks through its console. Figure 11.10 shows the Volume Manager General view for the system used in these examples. The system's storage is in the state illustrated in the Array Manager view shown earlier in Figure 11.9. Here, the Logical Array object in the navigation panel is fully expanded to show the makeup of the three virtual disks managed by the RAID controller (PERC 2/Si Controller 0). The BOOT
Figure 11.9 RAID5 virtual disk synchronization.
Figure 11.10 Volume Manager view of RAID controller with two arrays defined.
0 disk occupies capacity on Array Disk 0:0, meaning channel (device I/O bus) 0, SCSI target ID 0. The RAID5VirtDisk array consists of four columns with subdisks on array Disks 1 through 4. The RAID10VirtDisk array stripes data across two mirrors, one allocated on Disks 1 and 2, and the other allocated on Disks 3 and 4.
Dealing with Disk Failures
The purpose of failure-tolerant volumes and virtual disks is to keep data available to applications if a disk should fail. Figure 11.11 is the Volume Manager's Events view of a disk failure event, simulated here by removing the disk from its enclosure. As the event log indicates, Disk 0:4 has failed (is offline). Because Disk 0:4 contributes storage capacity to two failure-tolerant virtual disks (Figure 11.10), the event log also includes entries indicating that the two virtual disks are no longer redundant (that is, no longer protected against further disk failures) and that Disk 0:4 is offline (has failed). Figure 11.12 is the Volume Manager's General view of this system after Disk 0:4 has failed. Warning symbols are displayed in the navigation panel on:
Figure 11.11 Event log for disk failure.
Figure 11.12 Volume Manager view with failed disk.
■■ All icons representing the failed disk itself
■■ The icon for the mirror of which the failed disk is a part
■■ The icons for the two virtual disks to which the failed disk contributes storage
In the General panel, Disk 0:4 shows an Offline status, meaning that it is no longer accessible by the RAID controller. When one disk in an enclosure containing several disks fails, it is important that the failed disk (and no other) be removed and replaced. Removing the wrong disk could result in failure-tolerant virtual disks becoming non-failure-tolerant. Worse, already-degraded failure-tolerant virtual disks could fail completely, resulting in data loss and application failure. To aid in identifying disks for service purposes, the Volume Manager makes it possible for an administrator to invoke the Blink command, which flashes a light-emitting diode (an LED) on a specific disk connected to a RAID controller so that it can easily be identified for removal and replacement (Figure 11.13). In the current example, the Blink command would have no effect because the failed disk has been completely removed from its enclosure, and so is receiving no electrical power. When a failed disk has been replaced, the replacement disk must be:
■■ Recognized by the RAID controller.
■■ Initialized, or have any "orphan" subdisks on it deleted.
■■ Added to an array group (disk group).
■■ Incorporated into any virtual disks of which the failed disk had been a part.
Figure 11.13 Using the Blink command to identify a failed disk.
Just as with directly attached disks, a RAID controller's SCSI or Fibre Channel buses must be rescanned when disks are added or removed so that disk configuration changes can be recognized. Figure 11.14 shows the Volume Manager Rescan command applied to the RAID controller's single device I/O bus (Channel 0) after the failed disk has been replaced (by reinserting the same disk that was removed back into the enclosure). Rescanning SCSI and Fibre Channel buses typically takes only a second or two. Prior to rescanning, the console reports a newly inserted disk's status as Offline, because the controller has not been requested to discover it. Figure 11.14 illustrates this state because, although the Rescan command has been selected on the menu, it has not yet been invoked. After the controller rescans the disk's channel, its status changes to Degraded, meaning that the disk contains metadata but that it does not correspond to the metadata for other disks attached to the controller (Figure 11.15). The metadata on the disk in this example is there because, formerly, the disk had been part of two volumes. If no data had been written to any of the surviving disks between the removal and reinsertion events, it is possible that the disk's contents might still be valid. The controller has no way of ascertaining that, however, and so it makes the safe assumption that the newly inserted disk's contents are not current. Because the RAID controller cannot guarantee the currency of the subdisks (that is, "segments") on the newly inserted disk, they must be removed before the disk can be used. Figure 11.16 displays the Remove Orphan/Dead Disk Segments command used for this purpose, and the resulting cautionary dialog. Removing the orphaned segments (or initializing and possibly formatting a truly new disk) prepares the disk for use by the RAID controller. In Figure 11.17, Disk 0:4 appears as ready for use. The disk is identified by the controller channel to which it is connected (0) and the SCSI target ID to which it responds (4). RAID subsystems typically include circuits that set a disk's SCSI target ID based on the enclosure slot in which it is mounted. The newly inserted disk in this case would therefore be identified as Disk 0:4, no matter whether it was the same disk or a different one.
Figure 11.14 Invoking the Rescan command.
Figure 11.15 Result of Rescan command.
The next step in recovering from failure of a disk connected to a RAID controller is to add the disk to an array group (disk group). Since the controller in this example has only one array group defined, the disk is automatically added to it. Finally, the newly inserted disk must be merged into any virtual disks of which the failed disk had been a part. This example uses the Configure Dedicated Hot Spare… command to do this, as shown in Figure 11.18.
Figure 11.16 Removing orphan segments from recovered disk.
Figure 11.17 Disk showing Ready status after removal of orphan segments.
The resulting dialog enables the administrator to designate an eligible disk as a hot spare for a failure-tolerant virtual disk. In Figure 11.18, the command is applied to a degraded mirrored pair whose subdisks had occupied space on Disk 0:3 and Disk 0:4. Disk 0:3 is the remaining mirror, and so is not eligible to host a replacement mirror. Disks 1, 2, and 4 are eligible.1 In Figure
1 Disk 0:0 would also be eligible from a failure tolerance standpoint, but it does not have sufficient unallocated capacity to host the replacement mirror.
Figure 11.18 Merging virtual disks using the Configure Dedicated Hot Spare… command.
Figure 11.19 Rebuilding of RAID and striped-mirrored virtual disks.
11.18, Disk 0:4 has been specified. Clicking the OK button results in a subdisk being allocated on the specified disk. In a system with sufficient disks, a hot spare can also be predesignated. If a virtual disk does have a hot spare designated, a disk failure that degrades the virtual disk results in the designated spare disk automatically being moved into service as a replacement. In this example, no spare disks were available, so the hot spare had to be designated after the failed disk was replaced. Figure 11.19 shows the Volume Manager view of the example system after hot spares have been designated for both the RAID10VirtDisk and the RAID5VirtDisk. When a replacement subdisk for a failure-tolerant virtual disk is allocated on a spare disk, the subdisk’s contents must be resynchronized with those of the rest of the volume. In this figure, RAID10VirtDisk 62 (the mirror in which a disk had failed) is reported to be in a Rebuilding status. The inset shows that RAID5VirtDisk is also rebuilding, although the console general view does not indicate this. The significant point about this failure and recovery scenario is that, throughout, any host-based simple volumes on these virtual disks remained available to applications as though there had been no disk failure. Ideally, disk failure and repair in a system with RAID controller-attached storage is completely transparent to applications.
CHAPTER 12
Managing Volumes in Clusters
Clusters of Servers
An increasingly popular means of enhancing application availability and scaling is to organize two or more servers as a cooperative cluster. Very simply, a cluster is a set of complete servers, each capable of functioning usefully by itself. Clustered servers are interconnected and cooperate with each other to appear externally as a single integrated computing resource. Figure 12.1 illustrates a typical cluster configuration. Three key attributes of clustered servers illustrated in Figure 12.1 are:
■■ Connection to the same storage devices. In principle, this allows any server in the cluster to process any data.
■■ Connection to the same clients. In principle, any server in the cluster can perform any available service for any client.
■■ Connection to each other. Each server in the cluster can monitor the "health" of the other servers in the cluster, and take corrective action when something goes wrong.
Clustered servers provide three fundamental benefits:
■■ They enhance application availability. If a server in a cluster fails, its applications can be restarted either automatically or by administrator command on another server, because other servers are connected to the same data and clients.
Figure 12.1 A computer cluster configuration.
■■ They may improve the ability of applications to "scale." The term scaling is used to mean growth to support larger client request loads. Some clusters support application scaling because they enable more computing power to be applied to a set of data. For a cluster to support application scaling, file systems on its commonly accessible storage devices must be shareable; that is, they must be mountable concurrently by all of the cluster's servers without risk to data integrity.1
NOTE: Shareable file systems are not generally available for Windows servers as this book is written. Application scaling in Windows clusters is therefore limited. Cluster scaling technology for Windows 2000 is being developed, however, and Windows 2000 clusters that allow applications to scale may be expected in the future.
■■ They improve system manageability. In several important ways, clusters allow administrators to treat a group of servers as a pool of computing capacity. An administrator can force applications to move from one server to another, to balance load across computing resources, or to allow upgrades or maintenance to be performed on systems while applications continue to run elsewhere in the cluster. The capability to manage computing resources as a single pool becomes increasingly important as more servers are added to a cluster.
1 If an application is read-only, that is, if it does not update its data, a cluster such as that diagrammed in Figure 12.1 can enable the application to scale to multiple servers without the benefit of a shared file system. Web servers often have this attribute. For read-mostly applications, various delayed update techniques permit some scaling in clusters without shared file systems.
Different clustering technologies provide these benefits to varying degrees. Whatever the benefits, though, clustering is implemented through a distributed software component called a cluster manager that runs in each of the servers in a cluster.
Cluster Manager Data Access Architectures
Today, the cluster managers most commonly used in the Windows environment support a shared-nothing architecture for commonly accessible resources. Shared-nothing simply means that, at any moment, storage devices, data objects, and other cluster resources are completely controlled by one of a cluster's servers. In a shared-nothing cluster, only one server at a time is permitted to access any disk or volume and the file systems located on it. Shared-nothing clusters can be contrasted with the shared data clusters available for some other system architectures. In a shared data cluster, a single file system can be mounted on two or more servers simultaneously. In a cluster that uses SCSI or Fibre Channel interconnects, however, storage devices are physically connected to all servers. If one server fails, another server can take logical "ownership" (the right to access) of the failed server's storage and other resources and restart its application(s).
Resources, Resource Groups, and Dependencies
Two key architectural properties of cluster managers that are of importance from a Windows 2000 disk storage management perspective are:
■■ Cluster managers manage abstract resources. Resources may include storage devices, logical volumes, file systems, file shares, network cards, TCP/IP addresses, applications, databases, and others. Resources may be independent of each other or they may have interdependencies (e.g., a database manager depends on data stored in file systems built on volumes made up of disks).
■■ Cluster managers organize their resources into resource groups. A resource group is any collection of resources required for an application to run. Resource groups are significant because they are the atomic unit in which control of resources moves from one server to another (that is, fails over).
An accounting application is organized as a cluster resource group in Figure 12.2, although this example has been simplified to more clearly illustrate the concept of resource groups. In fact, as the examples in later sections will
show, the resource tree for such an application would be larger (i.e., would contain more resources). In the application resource group diagrammed here, the application program uses a database manager (e.g., Oracle or Microsoft SQL Server) to access its data. Clearly, for the application to run, the database manager must first be active. Similarly, for the database manager to run, the file systems containing its database container files must be mounted and accessible to it. Thus, the accounting application program has a dependency on its database manager, which in turn has a dependency on the data storage resources it manages. The accounting application makes its services available to clients over a network. Clients locate the application using its 32-bit Internet Protocol (IP) address (presumably after obtaining the IP address from a name server). The application’s IP address must, therefore, be recognized by the server on which it is running. The IP address is thus a cluster resource as well. For an IP address to be usable, it must be bound to a functioning network interface card (NIC). The IP address thus has a dependency upon a NIC cluster resource.
Resource Dependencies
For the accounting application diagrammed in Figure 12.2 to function, both its database manager and its IP address must be active. The database manager and IP address do not depend on each other, however; they are said to
Figure 12.2 A cluster resource group.
be in different branches of the application’s resource tree. Resource tree structure becomes significant when a cluster manager is starting or stopping an application’s resource group. Resources in different branches of a resource tree can be started or stopped concurrently, minimizing overall application startup or shutdown time. By contrast, resources in the same branch of a tree must start and stop in sequence—for example, an NIC must be operational before an IP address can “start” (be accessed by clients). Similarly, a file system must be mounted before a database manager that uses it can open a database. In the case of Figure 12.2, both database manager and IP address must be operational before the application program image itself can start. Stopping a resource group, as when, for example, an administrator decides to move an application from one server to another, follows exactly the opposite sequence. In the example of Figure 12.2, the application program would stop executing first. Only when the application program has terminated (closing files, disconnecting from databases, and flushing cached data), can the database manager be stopped and the IP address disabled. The general principle is that each of the resources in a group depends on resources below it in its tree branch. Resources lower in a tree branch cannot be stopped safely until the resources above them have been stopped. Thus, another function of cluster managers is to ensure this kind of orderly transition of all the resources constituting an application from one server to another. Just as individual resources may have interdependencies, entire application resource groups may be dependent on each other. An application may depend upon other applications, as for example, an order processing application that requires invoicing and shipping applications to be operating in order to fulfill its function. Conversely, successful operation of an application may require that certain related applications not be operating. For the accounting application illustrated in Figure 12.2, it might well be unfeasible for both production and test versions of the accounting application to run concurrently, due perhaps to database or client access conflicts. Figure 12.3 diagrams application resource group interdependencies. Cluster managers include facilities that allow administrators to specify dependencies among resource groups and to define policies to be invoked automatically when conflicts occur. Here, for example, one policy might be to stop execution of the test version of the accounting application on a server when the resource group for the production version of the application starts up on that server. While they are important to the overall function of clusters, interdependencies of resource groups have little or no effect on volume management.
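A minimal model of this ordering rule is sketched below in Python, using resource names that mirror the accounting example of Figure 12.2. Cluster managers implement this logic internally; the sketch simply computes one valid startup sequence (dependencies first) and derives the shutdown sequence by reversal. A real cluster manager can additionally start independent branches, such as the database branch and the IP address branch, concurrently.

```python
# Dependency map for the resource group of Figure 12.2:
# each resource lists the resources that must be running before it can start.
DEPENDS_ON = {
    "accounting_app": ["database_manager", "ip_address"],
    "database_manager": ["data_volume"],
    "ip_address": ["network_card"],
    "data_volume": [],
    "network_card": [],
}

def start_order(deps):
    """Return resources in a valid startup order (dependencies first).
    No cycle detection: dependencies are assumed to form a tree."""
    started, order = set(), []

    def start(resource):
        if resource in started:
            return
        for prerequisite in deps[resource]:   # same-branch resources start in sequence
            start(prerequisite)
        started.add(resource)
        order.append(resource)

    for resource in deps:
        start(resource)
    return order

startup = start_order(DEPENDS_ON)
shutdown = list(reversed(startup))   # stopping follows exactly the opposite sequence
print(startup)    # e.g. ['data_volume', 'database_manager', 'network_card', 'ip_address', 'accounting_app']
print(shutdown)
```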
Figure 12.3 Cluster resource group interdependencies.
Clusters and Windows Operating Systems
Cluster managers available for the Windows 2000 Advanced Server and Data Center Editions enable the clustering of two or more loosely coupled servers. "Loosely coupled" in this context means that the servers are connected to the same storage devices and to each other through I/O buses and private networks, respectively, but do not share memory or processing resources. The two cluster managers most widely used with the Windows 2000 operating system are:
■■ Microsoft Cluster Service (MSCS; also known informally as Wolfpack), Microsoft Corporation's built-in clustering technology for Windows 2000 systems that consist of two or four servers.2
■■ VERITAS Cluster Server (VCS), developed and sold by VERITAS Software Corporation, which supports clusters of up to 32 servers. At the time of writing, VCS is available only for the Windows NT Version 4 operating system, so the examples later in this chapter use Windows NT systems. VERITAS has announced its intention to make similar clustering technology available for the Windows 2000 operating system.
2 At the time of writing, 4-computer clusters are supported only with the Data Center Edition of the Windows 2000 operating system.
Both of these cluster managers improve Windows application availability and manageability by enabling applications and their resources to restart on secondary servers if a primary server fails. The process of restarting an application on a secondary server is called failover; a restarted application is said to have failed over. Failover may be either automatic or administrator-initiated. Both MSCS and VCS treat program images, network addresses, file systems, disk groups, and other objects used by applications as cluster resources. To both cluster managers, a cluster resource is any physical or logical object plus an associated program library that implements certain common cluster management functions. Both cluster managers organize resources into dependency trees, called cluster resource groups (MSCS) and service groups (VCS), respectively. These groups of resources form the building blocks for highly available computing because they are the units in which applications and other resources fail over from one server to another.
How Clustering Works
Cluster managers perform two basic functions:
■■ They monitor the state of the servers that constitute a cluster and the resources running on them and react to changes according to predefined policies.
■■ They manage resource group ownership transitions by automatically stopping (if possible) and starting the resources in a group in a specific order.
Each server in a cluster uses the cluster interconnect to broadcast periodic "heartbeat" messages to the other servers.3 Each server also monitors heartbeat messages from the other servers in the cluster. When heartbeat messages fail to arrive on time, the servers negotiate to determine which of them are still operational. If the negotiation determines that a server has failed, the surviving servers reconstitute the cluster without the failed one, and execute predefined policies to restart the failed server's applications on alternate servers as appropriate. If an application fails (crashes or freezes), a cluster manager can either attempt to restart it in its original location or it can initiate failover of the application to another server. System administrators can also use a cluster manager's console (user interface) to move applications from one server to another in order to balance processing load or to perform server maintenance with minimal application downtime. Manual application failover is initiated by
3 In general, vendors of cluster server software either require or recommend multiple interconnects among the clustered servers to decrease the chances of a cluster interconnect failure causing a cluster failure. As long as there is one link connecting all the servers in a cluster, the cluster can survive and detect server failures.
specifying the resource group to be moved. The cluster manager instance on the application's server stops the resources in order, as described earlier. When the resource group has been completely stopped, the cluster manager instance on the failover server restarts the resources in the opposite order. When failover occurs because an application server has crashed, there is no opportunity to stop application resources. Once the surviving cluster manager instances have renegotiated membership and determined the application's restart location, cluster manager instances on the failover servers start each application's resources in order. In this case, however, the state of resources is typically unknown (e.g., databases and files may reflect incomplete transactions), so each resource must verify its own integrity before allowing applications and clients to use it. Some cluster resources, such as NICs, are stateless; that is, they do not store any persistent information reflecting their usage while they are being used. Others, such as file systems and databases, are inherently stateful; while they are being used, their persistent data can become momentarily inconsistent. After a crash, the consistency of stateful resources must be verified before they can be used by applications. File systems must be integrity checked, for example, using the Windows CHKDSK program or its equivalent. Similarly, database managers replay activity logs against their on-media images to guarantee database integrity. When a failed server has been repaired and has rejoined its cluster, a cluster manager can perform failback, moving applications back to the repaired server. Stateless applications typically fail over and back transparently. Stateful applications may have to recover prior to restarting after failback. How state is recovered depends on the type of resource and the type of shutdown (crash or orderly). For example, a file system that dismounts cleanly for failback can simply be mounted on the failover server. Its client connections, however, are simply broken by the shutdown, so clients must be responsible for reconnecting after failback. In terms of the actions performed, failback is equivalent to administrator-initiated failover. Some cluster managers that support failback incorporate administrator-defined policies that restrict failback timing, so that, for example, failback does not occur at times of expected peak application load. Both MSCS and the VERITAS Cluster Server have this capability.
Microsoft Cluster Server
The Windows NT Version 4 Volume Manager supports the Microsoft Cluster Server (MSCS), sometimes informally known by its code name Wolfpack. In
the Windows 2000 operating system, MSCS is an integral part of the Advanced Server and Data Center Editions. The Data Center Edition of Windows 2000 also supports clusters of four servers. The servers comprising an MSCS cluster are connected to common storage, common clients, and to each other, as illustrated in Figure 12.1.
MSCS Heartbeats and Cluster Partitioning
MSCS cluster manager instances running in each clustered server use local area networks (LANs) to exchange heartbeat messages. Multiple network connections are recommended for this purpose, at least one of which should be dedicated to MSCS message exchange. In spite of this, instances can arise in which the servers in an MSCS cluster cannot intercommunicate. When this occurs, cluster protocols allow the servers that can intercommunicate to form a new cluster, thereby forcing isolated servers to relinquish control of resources and stop acting as cluster members. Since MSCS clusters typically consist of an even number of servers (two or four), some failure modes can result in one server being unable to communicate with the other, or in two servers being unable to communicate with two others. When this situation, illustrated in Figure 12.4, occurs, there is no majority to determine which servers should survive to form the "real" cluster. Each server or pair of servers could reasonably assume that the other has failed and that it should form a new cluster. A cluster in this state is said to be partitioned. To ascertain survivorship and avoid cluster partitioning in these scenarios, MSCS uses the common storage interconnect.
Figure 12.4 A partitioned cluster.
Determining MSCS Membership: The Challenge/Defense Protocol In a two-server MSCS cluster, each server periodically verifies that the other is operational by receiving heartbeat messages over the network links connecting them. If a server fails to receive a heartbeat message within the allotted time, it must determine whether: ■■
The network links connecting it to the other server have failed, but the other server is still operational. If both servers are operating, but cannot intercommunicate, there must be a way to determine whether applications should continue to run or be failed over and restarted.
■■
The other server has failed, in which case the surviving server may have to restart applications that had been running on the failed one.
MSCS uses SCSI RESERVE and bus RESET commands to determine whether missing cluster heartbeat messages signify the failure of a server or the failure of the links connecting two clustered servers. A SCSI RESERVE command reserves a target device (a physical disk or an external RAID controller’s LUN) for the exclusive use of the initiator (host bus adapter) issuing the command. A reserved device rejects all commands except those issued by the reserving initiator. A SCSI reservation can be cancelled by issuing a SCSI RELEASE command or by resetting the SCSI bus. Any initiator on a SCSI bus can issue a SCSI RESET command. During normal operation, the first of the two clustered servers to start up an MSCS cluster manager instance issues a SCSI RESERVE command to a predesignated SCSI device called the quorum device. It repeats this SCSI RESERVE command every three seconds (a RESERVE command issued to a SCSI device by an initiator that already has the device reserved has no effect). The second server to start a cluster manager instance attempts to reserve the quorum disk and fails. It concludes that its cluster partner already has the disk reserved. Once the cluster is running, ownership of the quorum disk can be given to whichever of the servers is preferred by policy. If a server in an MSCS cluster stops receiving heartbeat messages over the links connecting it to its partner server, it must determine the state of its partner so it can fail applications over as appropriate. To make this determination, MSCS uses a challenge/defense protocol based on SCSI reservations. The server that has reserved the quorum disk (the preferred server for the quorum disk resource) is called the defender. If it is running when heartbeat fails, it assumes that it is the surviving server. The other server is called the challenger. If it is running when heartbeat fails, it must determine whether:
■■ The links connecting it to the defender have failed, but the defender is still operational. If this is the case, the challenger must cease cluster operations (i.e., stop the cluster resource groups that are running on it) and allow the defender to become the entire cluster. (Both servers can no longer operate together as a cluster because they cannot intercommunicate.)
■■ The defender has failed. In this case, the challenger must become the cluster and restart cluster applications that had been running on the failed defender.
When a challenger detects heartbeat failure, it immediately resets the SCSI bus, breaking the defender’s quorum disk reservation. After a pause of 10 seconds, the challenger attempts to reserve the quorum disk for itself. If the defender is operational, it will already have renewed its quorum disk reservation on the three-second cycle, so the challenger’s reservation attempt will fail. When its reservation attempt fails, the challenger concludes that the defender is still operating. The challenger ceases to be a cluster member in order to permit the defender to restart cluster applications as appropriate. This sequence of events, illustrated in Figure 12.5, is called a successful defense. The challenger in this situation may continue to operate as a standalone server, but must fail cluster applications immediately, because the defender will be attempting to restart them and access their data on the commonly accessible storage. If, on the other hand, the reason for heartbeat failure is that the defender has failed and is therefore not issuing any SCSI RESERVE commands, the
Figure 12.5 Microsoft Cluster Server failure detection: successful defense.
challenger’s quorum disk reservation succeeds. When it sees this, the challenger concludes that the defender has failed and it fails over and restarts cluster applications. Figure 12.6 delineates the sequence of events comprising a successful challenge. This simple protocol works in a two-server MSCS cluster. In clusters of four or more servers, similar, but more complex mechanisms for determining server state and avoiding partitioning are used.
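The timing relationships that make the challenge/defense protocol work can be modeled in a few lines. The three-second reservation renewal and the roughly ten-second challenger pause come from the description above; everything else in this Python sketch (the function name, its arguments, and the return strings) is illustrative rather than Microsoft's implementation:

# Simplified time-stepped model of the MSCS challenge/defense protocol.

def challenge(defender_alive, renew_interval=3, challenger_wait=10):
    """Return the outcome of a challenge after a loss of heartbeat."""
    # The challenger resets the SCSI bus, clearing any reservation on the quorum disk.
    reservation_held_by = None

    # During the challenger's pause, a live defender renews its reservation every
    # 'renew_interval' seconds and therefore re-acquires the quorum disk.
    for _tick in range(0, challenger_wait, renew_interval):
        if defender_alive:
            reservation_held_by = "defender"

    # After the pause, the challenger tries to reserve the quorum disk for itself.
    if reservation_held_by is None:
        return "challenger restarts cluster applications"   # successful challenge
    return "challenger shuts down cluster operations"       # successful defense

print(challenge(defender_alive=True))    # heartbeat links failed, defender healthy
print(challenge(defender_alive=False))   # defender actually crashed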
MSCS Clusters and Volumes
For an application to move from one clustered server to another, the alternate server must be able to take control of and access the application's data. For disks that contain only basic partitions, this is not difficult; in fact, MSCS supports basic disks as cluster resources. Either a physical disk or an external RAID subsystem's LUN can be formatted as a basic disk and used as an MSCS cluster resource. Disk write operations in progress when a clustered server fails may or may not complete, so after a server failure, the consistency of on-disk data cannot be assumed. File systems and database managers all log updates in some form so that they can restore data to a consistent state as part of the system restart process. File system and database recovery mechanisms do assume disk semantics, however. In particular, they assume that disk writes are persistent (as described in Chapter 1). This prohibits them from simply restarting immediately after a crash if their data is on failure-tolerant host-based volumes that
Figure 12.6 Microsoft Cluster Server failure detection: successful challenge.
may have nonpersistent internal state associated with write operations in progress. For example, a mirrored-striped volume with two three-disk striped plexes might have as many as six internal write commands outstanding (one to each of its disks) when the server controlling it fails. When an alternate server imports a host-based volume after a crash, it must make sure that the volume’s contents are internally consistent before allowing the file system on it to be mounted for use. Writes left partially complete by the crash must be repaired. The Volume Manager’s import function must therefore be integrated with cluster functionality. Volumes must be organized and metadata structures designed so that when volume failover is required by cluster resource group policy, the servers in the cluster respect disk ownership rules and restore internal consistency of volume contents. The Volume Manager treats disk groups as cluster resources and fails over all disks in a group as a unit.
Volumes as MSCS Quorum Resources Either a basic disk or a dynamic volume can serve as the MSCS quorum device. Using a physical disk as the quorum device introduces a nonredundant component into an otherwise highly available system. A failure-tolerant volume used as a quorum device provides a level of availability consistent with that of the rest of the cluster. A volume to be used as a quorum device should be the only volume in its disk group. Any other volumes in the same group would necessarily fail over whenever the quorum device changed ownership. A disk group containing only a three-mirror volume makes an ideal quorum device. Such a device survives both disk failures (because it is mirrored) and server and interconnect failures (because it can always be imported if the disks and at least one server are running). The challenge-defense protocol just described is more complex when the quorum device is a volume in a disk group. For a server to import a disk group containing the cluster quorum device, its Volume Manager must successfully obtain SCSI reservations on more than half of its disks. This is what makes disk groups containing odd numbers of disks the most appropriate for use as quorum devices.
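The "more than half" import rule can be illustrated with a trivial calculation. The following Python sketch is hypothetical and the disk counts are arbitrary; the point is that an odd-sized disk group always yields exactly one server able to import it, while an even-sized group can split so that neither side obtains a majority:

# Illustrative check of the over-half import rule for a quorum disk group.

def can_import(disks_reserved, disks_in_group):
    """A server may import the group only if it reserves a majority of its disks."""
    return disks_reserved > disks_in_group // 2

# Three-disk group (e.g., one three-mirror quorum volume): 2 reservations win, 1 loses.
print(can_import(2, 3), can_import(1, 3))   # True False  -> exactly one winner

# Four-disk group split evenly between two servers: neither side can import it.
print(can_import(2, 4), can_import(2, 4))   # False False -> no winner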
Volume Management in MSCS Clusters The Volume Manager supports host-based volumes as MSCS cluster resources, allowing cluster manager instances to automatically migrate control of an
application’s storage devices to a failover server when the application itself fails over. Dynamic disk groups (introduced in Chapter 7) are central to volume management in Windows clusters. The disk group is the atomic unit in which ownership of disks and volumes is transferred from one server to another. Windows 2000’s built-in Logical Disk Manager supports only a single dynamic disk group. The Volume Manager for Windows 2000 supports multiple disk groups, enabling more flexible application (and data) failover options.
Preparing Disks for Cluster Use
The Windows 2000 Volume Manager manages clusterable disks as a unique resource type called cluster disk. Since the disk group is the unit in which disks fail over, all the disks in a group must be cluster disks. This restriction is enforced by a third type of disk group called a cluster disk group. The only important distinction between dynamic disk groups and cluster disk groups is that the Volume Manager does not automatically import the latter when a server starts up. This allows a cluster manager instance to import each cluster disk group on the appropriate server, preventing all the servers in a cluster from attempting to import and use the group's disks simultaneously. For a disk to be used as all or part of a cluster resource, it must be upgraded to a cluster disk and placed in a cluster disk group. The Upgrade to Cluster Disk… command, highlighted in Figure 12.7, invokes the Upgrade to Cluster Disk wizard, which begins with the typical introductory panel (not shown). In the wizard's first input panel, the administrator specifies the disks to be upgraded. While this panel is displayed, it is also possible for the administrator to create additional cluster disk groups by clicking the New button, as shown in Figure 12.8. In Figure 12.8, clicking the New button brings up the Create Dynamic Group dialog, in which a new cluster disk group named DiskGroupR is being created. (Cluster disk group DiskGroupM, shown in the wizard's main panel, already exists.) When the Create Dynamic Group dialog is dismissed, the administrator specifies the disks to be upgraded and the cluster disk group to which they will belong. As with all Volume Manager operations, physical disks and the virtual disks (LUNs) of external RAID subsystems are indistinguishable.
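At startup, then, the only behavioral difference between the two kinds of disk group is whether the Volume Manager imports them automatically. The following sketch models that decision in Python; the group names and attribute names are assumptions for illustration only:

# Illustrative startup logic: cluster disk groups are not auto-imported.

disk_groups = [
    {"name": "LocalDynamicGroup", "cluster": False},
    {"name": "DiskGroupM",        "cluster": True},
    {"name": "DiskGroupR",        "cluster": True},
]

def import_at_startup(groups):
    """Volume Manager startup: import only ordinary dynamic groups."""
    for g in groups:
        if g["cluster"]:
            # Left for the cluster manager to import on whichever server
            # owns the corresponding cluster resource.
            print(f"deferring {g['name']} to the cluster manager")
        else:
            print(f"importing {g['name']} locally")

import_at_startup(disk_groups)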
MSCS Resource Types: Resource DLLs and Extension DLLs Both the Microsoft Cluster Server and the VERITAS Cluster Server support the disk group as a cluster resource type. Figure 12.9 illustrates the available resource types in an MSCS cluster in a system with the Volume Manager installed.
Figure 12.7 Upgrading a disk for use as a cluster resource.
Figure 12.8 Creating a cluster disk group.
Figure 12.9 Microsoft Cluster Server cluster disk group resource type.
As a cluster resource, a disk group can fail over between clustered servers, for quick resumption of data access when a server fails. An administrator can also force cluster disk groups to fail over, for example, to balance processing or I/O load. Forced failover is also useful for system maintenance. A server’s applications can be failed over to a secondary server temporarily while maintenance is performed on the primary server, then failed back when maintenance is complete. The Volume Manager Disk Group resource type listed in Figure 12.9 is associated with a resource dynamic link library (Resource DLL) called vxres.dll. The MSCS cluster manager invokes resource DLL APIs to operate on cluster resources in these ways: ■■
Creating and deleting instances of the resource type.
■■
Starting a resource so that it can be used by applications.
■■
Stopping a resource so that it can be failed over to another server.
■■
Periodically verifying that a resource is functioning properly (MSCS supports two levels of verification—a superficial one called Looks Alive and a more thorough one called Is Alive—that can be invoked at different time intervals).
APIs for these functions are required for all cluster resources. The breadth of resource types listed in Figure 12.9, however, suggests that the actions involved in performing these functions might differ substantially for different resource types. For example, starting a print spooler resource might mean
locating and opening print queue files and querying the printer to verify that it is operational. Starting a disk group, on the other hand, means verifying that a critical minimum number of disks in the group are operational and writing disk metadata to indicate current disk group ownership. In addition to its resource DLL, each MSCS resource type may have an extension DLL that extends the MSCS user interface. The extension DLL enables specification of resource type-specific parameters for instances of the resource type.
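The division of labor between the cluster manager and a resource DLL can be pictured with a small illustrative class. The method names echo the operations listed above (starting, stopping, Looks Alive, Is Alive), but the Python code below is a teaching sketch, not the actual resource DLL interface or the behavior of vxres.dll:

# Illustrative stand-in for a disk group resource DLL.

class DiskGroupResource:
    def __init__(self, name, disks):
        self.name, self.disks, self.online_flag = name, disks, False

    def online(self):
        # Start: verify a critical minimum of disks (here, a majority) and
        # record ownership so other servers will not import the group.
        assert len([d for d in self.disks if d["ok"]]) > len(self.disks) // 2
        self.online_flag = True

    def offline(self):
        # Stop: release the group so another server can import it.
        self.online_flag = False

    def looks_alive(self):
        # Cheap check, suitable for frequent polling.
        return self.online_flag

    def is_alive(self):
        # Thorough, less frequent check: confirm the disks are still reachable.
        return self.online_flag and all(d["ok"] for d in self.disks)

dg = DiskGroupResource("DiskGroupM", [{"ok": True}, {"ok": True}, {"ok": True}])
dg.online()
print(dg.looks_alive(), dg.is_alive())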
Using Host-Based Volumes as Cluster Resources
This section and those that follow illustrate the use of host-based volumes as cluster resources. The examples use two servers called NODE-3-1 and NODE-3-2. The two servers share access to six disks on a common SCSI bus. The disks have been initialized as cluster disks using the Upgrade to Cluster Disk wizard (shown in Figure 12.8). To begin, Figure 12.10 shows the Volume Manager console view of a cluster disk group called ClusterGroup that contains the six disks. The small crosses superimposed on the icons indicate that they represent a cluster disk group and cluster disks. Once created, a cluster disk group is managed as any other dynamic disk group. For example, Figure 12.11 displays the Select Volume Type panel
Figure 12.10 The cluster disk group ClusterGroup.
Figure 12.11 Creating a three-mirror volume in a cluster disk group.
from the Create Volume wizard for the cluster disk group. Also shown is the use of the Query Max Size button. When this button is clicked, the Volume Manager displays the highest possible capacity for a volume of the specified type (a three-mirror volume in this example) that can be created from available capacity in the disk group. In this case, which is based on the ClusterGroup shown in Figure 12.10, the largest three-mirror volume that could be created from the six disks would be 8,675.66 megabytes. Clicking the Query Max Size button results in a Volume Manager capacity proposal that the administrator may accept or override with a volume capacity specification.
Multiple Disk Groups A slightly different cluster disk scenario is shown in Figure 12.12: it’s the same system but with two cluster disk groups rather than one. DiskGroupM, containing Disks 2, 4, and 5, has been used to create a mirrored volume. DiskGroupR, containing Disks 3, 6, and 7, will be used to create a RAID volume in this example. While fewer larger disk groups allow more flexible allo-
Figure 12.12 Two cluster disk groups.
cation of storage capacity, multiple disk groups are more natural in clusters, because the disk group is the unit in which disks fail over. All disks in a cluster disk group are “owned” (accessible) by the same server at any point in time. When a cluster disk group fails over, all of its disks fail over together. Creating multiple cluster disk groups enables multiple applications to fail over independently of each other. The 1,000-megabyte three-mirror volume (addressed as drive letter G:) depicted in Figure 12.12 has components on disks in DiskGroupM. This volume and any others created on the disks in DiskGroupM would be accessible by NODE-3-1, but not by NODE-3-2. If an application that uses the volume at drive letter G: fails over, the disk group resource and all the volumes in it fail over to the same secondary server at the same time. Volumes in DiskGroupR are unaffected by DiskGroupM failover. Thus, for example, an administrator could move an application that uses drive letter G: to a secondary server and leave other applications that use volumes in DiskGroupR running on NODE-3-1. This capability could be used, for example, to upgrade a database manager on NODE-3-1, while leaving other applications on NODE-3-1 that do not use the database manager undisturbed.
Cluster Resource Group Creation
Creating a Volume Manager cluster disk group (Figure 12.8) does not automatically make the group into an MSCS cluster resource. An administrator invokes the MSCS Cluster Administrator New→Group command shown in Figure 12.13 to create a resource group of which a cluster volume group may be a part. As with the Volume Manager, most MSCS functions are specified by wizards invoked by console commands. The New→Group command, for example, starts the MSCS Cluster Administrator's New Group wizard. Figure 12.14 contains two of the panels displayed during creation of a disk group cluster resource (some MSCS Cluster Administrator wizard panels are generic, and are not shown here). At the top, in the New Group panel, the administrator has specified the name MSCS Disk Group for the cluster resource group. The Description box is used to include a comment that is echoed for recognition elsewhere in the console displays. At the lower right, the Preferred Owners panel is used to specify the cluster servers (called Available nodes in the dialog) on which the resource group may be started. Preferred owners are listed in the order of preference when the resource group is started. The cluster used in these examples consists of two servers; both are designated here as servers on which MSCS Disk Group can be started, with NODE-3-1 set as the first preference, meaning that the cluster
Figure 12.13 Invoking the MSCS New Cluster Resource Group command.
Figure 12.14 Establishing MSCS functions.
manager would start MSCS Disk Group on server NODE-3-1 if possible. If NODE-3-1 were not running, the cluster manager would start MSCS Disk Group on NODE-3-2. During resource group creation, an administrator may specify that a group is restartable. If the cluster manager detects that a restartable group is not running on some server in the cluster, it automatically restarts the group on its most preferred server.
Making a Cluster Disk Group into a Cluster Resource Before a cluster disk group can fail over, it must be made into an MSCS resource. MSCS cluster resources are created using the New→Resource
command invoked from the Cluster Administrator's File menu (Figure 12.15). Invoking this command starts the New Resource MSCS wizard, which is used to create all types of MSCS cluster resources, including both built-in types and types resulting from the installation of other software such as the Volume Manager. Figure 12.16 shows three of this wizard's panels that are particularly significant for disk group resources. The panels are shown in the order of display during wizard execution, reading clockwise. In the first panel, the administrator names the resource (MSCS Disk Group) and provides a descriptive comment. The Resource Type (Volume Manager Disk Group) and the resource group to which it is to belong (MSCS Disk Group) are specified in this panel from drop-down lists. In the second panel, the administrator designates the cluster servers on which the resource may be started. This is similar to Figure 12.14 for the entire cluster resource group. Note: This list overrides the cluster resource group list if the two differ; thus, the administrator can specify that the group as a whole can fail over but that individual resources within it can only be used on one server or the other. In the lower panel, resource dependencies are declared. Cluster resources specified in this panel must be started before the cluster manager will start the subject resource, and the subject resource must be stopped before the cluster manager will stop any of the resources specified in this panel.
Figure 12.15 Invoking the MSCS New Cluster Resource command.
Figure 12.16 Selected panels from MSCS New Resource wizard.
Controlling Failover: Cluster Resource Properties By clicking on the Properties… command and then on the Advanced tab, the administrator can manipulate advanced properties of a disk group cluster resource to influence failure detection and, therefore, failover behavior (see Figure 12.17). The MSCS Disk Group (the cluster resource name supplied by the administrator in Figure 12.16) is designated for restarting by the cluster manager if it fails. (Giving the cluster resource group and an individual resource in it identical names is permissible, but is not necessarily the best administrative practice.) The resource is set to Restart if it fails (before failing over), and to Affect the group (i.e., to fail the resource group over to the next eligible server) if restart is required three times within a period of 900 seconds (15 minutes). The panel shown in Figure 12.17 is also used to specify the intervals at which the cluster manager invokes the LooksAlive and IsAlive APIs to determine the health of the resource. In the figure, both functions are invoked at 300-millisecond intervals, although this is not a requirement. Each resource type (Figure 12.9) has unique default intervals for calling these APIs. For disk groups, the default intervals are recorded automatically when the Volume Manager is installed. An administrator can override these default intervals by replacing the values in the resource group’s Properties dialog. The Parameters tab in the Properties dialog is used to specify resource type-unique properties. Parameter specification is one of the functions of the extension DLL installed when a new MSCS resource type is added to a system.
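The restart and failover policy shown in Figure 12.17 amounts to a counter with a sliding time window. The sketch below reproduces that logic with the values from the example (three restarts within 900 seconds); the function and variable names are illustrative assumptions, not MSCS internals:

# Illustrative restart-threshold policy: restart locally, then fail the group over.

RESTART_LIMIT = 3        # "Affect the group" after this many restarts...
WINDOW_SECONDS = 900     # ...within this period (15 minutes)

restart_times = []

def on_resource_failure(now):
    """Return the action the cluster manager should take at time 'now' (seconds)."""
    # Forget restarts that fell outside the policy window.
    restart_times[:] = [t for t in restart_times if now - t < WINDOW_SECONDS]
    if len(restart_times) >= RESTART_LIMIT:
        return "fail resource group over to next eligible server"
    restart_times.append(now)
    return "restart resource on current server"

for t in (0, 100, 200, 300):
    print(t, on_resource_failure(t))   # fourth failure within the window forces failover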
Figure 12.17 Setting advanced group resource properties.
Bringing a Resource Group Online Creating a resource group does not automatically make the volumes in it available for use by applications. The resource group must first be brought online to the server that is to control it. Figure 12.18 shows the Volume Manager console view of a cluster disk resource group before it has been brought online to server NODE-3-1. Because the disk group is not online to this server, the Volume Manager cannot access the group’s metadata, so no information about volumes or other disk group state is displayed. In Figure 12.19 the Cluster Administrator Bring Online command has been invoked to make MSCS Disk Group available to NODE-3-1. When an MSCS resource group is brought online, the cluster manager starts its resources in dependency order. As resources start, they become available to applications.
Figure 12.18 Cluster volume group before being brought online.
Figure 12.19 Using Cluster Administrator to bring a cluster volume online.
Figure 12.20 Cluster volume online on NODE-3-1 and offline on NODE-3-2.
To start a Volume Manager disk group resource, the cluster manager imports the disk group and mounts the file systems on its volumes. Figure 12.20 shows the three-mirror volume from Figure 12.12 in use by server NODE-3-1. In the top panel, the volume is presented on server NODE-3-1 as drive letter T:. The bottom panel shows the console view of the cluster disk group from NODE-3-2. From NODE-3-2's viewpoint, the disks in the group are known to exist, but they cannot be accessed, because server NODE-3-1 “owns” them. This is the essence of the aforementioned “shared-nothing” cluster model: only one server can access a given Volume Manager disk group (or any other resource) at any instant.
Administrator-Initiated Failover An administrator can use the MSCS Move Group command to force ownership of a resource group to move from one server in an MSCS cluster to another. This is called administrative failover or forced failover. Administrative failovers differ from failovers initiated by the cluster manager itself, when a server failure results in missing heartbeat messages. With administrative failovers, the cluster manager is able to stop each of a group’s resources in order before restarting the group on the failover server. Thus, for example, file systems on the volumes in a cluster disk group resource can be dismounted “cleanly” (with all on-media metadata consistent). When failover results from a crash, resource group shutdown is not orderly, and resources that have volatile state must be checked for consistency. For example, file systems
mounted on a system that crashes must have the Windows 2000 CHKDSK program run against them before they are mounted on a failover server.
Failback
Whether initiated by an administrator or a server crash, failover places a cluster in what is usually regarded as a temporary state, with applications and other resource groups displaced from their normal locations. When a failed or crashed server is again available for use, it is usually desirable to restore the normal operational configuration by moving failed over resource groups back to their primary servers. This is called failback, and is implemented using the MSCS Move Group command shown in Figure 12.21. The MSCS Disk Group resource group had been failed over from its primary server (NODE-3-1) to NODE-3-2, where it went online. Here, the initiation of MSCS Disk Group's failback to server NODE-3-1 is underway. Since the cluster in this example contains only two servers, there is no need to supply a parameter for this command—the group is moved to the cluster's only other server. The resource group (not the individual resource) is the atomic unit in which MSCS changes ownership of resources between two servers. The Move Group command changes ownership of all the resources in a group. When resource ownership is changed administratively (as opposed to by server failure), the cluster manager shuts down the resources in the group in the order specified by the dependency tree established when the resource group was created. This is important, for example, with volumes whose file systems are mounted and may have metadata and user data cached in server memory. An orderly shutdown of the file system resource flushes the cache, making restart on the destination server faster (because CHKDSK need not be run).
Figure 12.21 Disk group and volume during administrator-initiated move.
The bottom panel in Figure 12.21 illustrates one of several states taken on by MSCS Disk Group during the shutdown process. At the instant of capture, the Shared Mirrored Volume resource in the group was already in the Offline state with respect to server NODE-3-2, but the internal bookkeeping to complete the transfer of MSCS Disk Group had not been completed, so its state is reported as Offline Pending.
Multiple Disk Groups in Clusters Using disk groups to segregate disks by application makes it possible to move applications and other resource groups between servers independently of each other. By aligning Volume Manager cluster disk groups with individual applications, administrators enable an application and its data to fail over to a different server without affecting other applications. An administrator may move an application and the resources on which it depends to an alternate server, so that, for example, software can be upgraded or other maintenance tasks performed on the primary server. Since the atomic unit of failover is the Volume Manager disk group, all of the disks in a group—and hence all the volume, file system, and file share resources that depend on them—fail over together. Thus, it makes administrative sense to place data belonging to different applications on different cluster disk groups, so that failover of one application has no effect on others. Figure 12.22 illustrates disk group independence. Here, two cluster disk groups are shown, one of which (DiskGroupR) is online to the local server (NODE-3-1) and the other of which is not. There is a RAID volume on the disks of DiskGroupR. (Since DiskGroupM is not online to NODE-3-1, NODE-3-1’s Volume Manager cannot access its disks, so no information about their state is displayed) The RAID volume on the disks of DiskGroupR is completely independent of any volumes on the disks of DiskGroupM, which might be online to server NODE-3-2.
Making a Volume Manager Disk Group into an MSCS Cluster Resource This section illustrates the creation of a new MSCS resource in a system with two Volume Manager disk groups. For this example, the two MSCS cluster resource groups, MSCS Disk Group M and MSCS Disk Group R, have already been created. Figure 12.23 shows the initial panel of the MSCS New Resource wizard for making the Volume Manager disk group DiskGroupM (Figure 12.22) into an MSCS resource of the Volume Manager Disk Group type. (The Volume Manager Disk Group resource type becomes available to MSCS when the Volume Manager is installed.)
Figure 12.22 Volume Manager view of online RAID volume.
The MSCS cluster resource name MSCS MirroredVolume is given to DiskGroupM (perhaps not the ideal choice of name from an ongoing administrative standpoint). The MSCS resource name is used in MSCS management operations such as those shown in Figures 12.23 and 12.24. The name DiskGroupM is used by the Volume Manager (Figure 12.22). Figure 12.24 shows, at the top, the initial state for a newly created MSCS cluster resource on the owning server (offline), and, at the bottom, the Bring Online command to make the resource available for application use. The Bring Online command applies to individual resources. Here it has been invoked in the context of the MSCS MirroredVolume cluster resource created during the New Resource wizard execution shown in Figure 12.23.
Figure 12.23 MSCS New Resource wizard panel for MSCS MirroredVolume.
Figure 12.24 Bringing the Resource Group containing MSCS MirroredVolume online.
Making Cluster Resources Usable: A File Share This section illustrates the creation of cluster resource groups for a popular MSCS application—file serving. As with nonclustered servers, a file share to be served to clients is created from a directory tree in a FAT or NTFS file system on a Windows volume. The file share cluster resource to which the file system file share will be bound is created using the MSCS New Resource wizard (Figure 12.25). Here, in the initial parameter specification panel, File Share has been specified as the type of resource to be created.4 Defining a resource type specifies the DLL that the cluster manager calls to perform cluster management functions, as illustrated in Figure 12.9. Using this panel, the administrator names the resource being created (MSCS Mirrored Share) and, optionally, supplies a descriptive comment. 4
File Share is a built-in MSCS resource type. The drop-down list in Figure 12.25 is a partial list of MSCS built-in resource types.
Figure 12.25 Creating a file share cluster resource.
A file share can perform its function only if the volume containing the shared directories is accessible. A file share cluster resource therefore depends on the Volume Manager disk group resource that contains the disk(s) that holds the volume and file system containing the shared directories. The MSCS administrator declares these dependencies when the resources are created. Figure 12.26 shows where dependencies for the MSCS Mirrored Share file share are specified. In the Dependencies panel, all resources in the cluster resource group are listed on the left. To declare that the resource being created depends on one or more existing resources, the administrator highlights the resource (MSCS MirroredVolume, in this example) and clicks the Add→ button to add it to the Resource dependencies list on the right side of the panel. (Figure 12.26 was captured after the MSCS MirroredVolume resource had been selected, but before it had been added to the Resource dependencies list.) An MSCS cluster resource may depend on more than one underlying resource; for example, an application that reads orders from a database and prints shipping invoices might depend on both a database manager resource and a network printer resource. Moreover, a resource may depend on resources that in turn depend on other resources. For example, the database image for a database manager might be located on a volume in a Volume Manager disk group resource. The database manager resource would depend on the Volume Manager disk group resource.
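Declared dependencies define a start order: a resource may be brought online only after every resource it depends on is online, and the stop order is the reverse. The following illustrative Python sketch derives such an order with a topological sort, using the resource names from this example (the code itself is not part of MSCS):

# Illustrative derivation of resource start order from declared dependencies.

dependencies = {
    "MSCS Mirrored Share": ["MSCS MirroredVolume"],   # file share needs the disk group
    "MSCS MirroredVolume": [],                        # disk group depends on nothing
}

def start_order(deps):
    """Topological sort: dependencies start before the resources that need them."""
    order, visited = [], set()
    def visit(res):
        if res in visited:
            return
        visited.add(res)
        for child in deps.get(res, []):
            visit(child)
        order.append(res)
    for res in deps:
        visit(res)
    return order

print(start_order(dependencies))
# ['MSCS MirroredVolume', 'MSCS Mirrored Share'] -- stop order is the reverse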
Figure 12.26 Specifying a resource dependency for a file share.
Resource Parameters Each type of MSCS cluster resource has unique parameters. In the case of a file share resource, the shared directory path and access permissions are parameters whose values must be supplied. This is done in the Parameters panel of the New Resource wizard for the file share resource type (Figure 12.27). This panel is quite similar to the dialogs used to create a file share in a single-server system. The administrator specifies the Share name by which the file share will be known to clients (MirroredShare, in the example), the path to the shared data (G:\MirroredSharePath), and the maximum number of users permitted to access the share concurrently. Access permissions and other advanced parameters may also be supplied, just as with a singleserver file share. The Permissions dialog displays when the Permissions button is clicked. The FileShare Parameters Panel is the final step in creating a file share cluster resource. To access this file share, a client could, for example, invoke the Windows 2000 Explorer Tools→Map Network Drive command, specifying the share name qualified by the cluster name—for example,
Figure 12.27 Supplying parameters for a File Share.
\\CLUSTER-3\MirroredShare.5 If the server on which the file share is online (NODE-3-1, in this example) were to fail, its partner server would notice the failure (signaled by missing heartbeats) and would restart the file share resource. Once restarted, the file share would be made available to clients using the same fully qualified share name—again, \\CLUSTER-3\MirroredShare. Thus, except for a time lag of a few seconds between discovery that server NODE-3-1 had failed and restart of the file share on server NODE-3-2, the failure is transparent to file share clients.
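From the client's perspective, then, failover is simply a brief interruption on an unchanged UNC name, and a hypothetical retry loop like the one sketched below is enough to ride through it (access_share is a placeholder for whatever file operation the client actually performs; it is not a real API):

# Illustrative client-side retry against a clustered file share name.
import time

SHARE = r"\\CLUSTER-3\MirroredShare"   # the name stays the same across failover

def access_share(path):
    """Placeholder for a real file operation; assumed to raise while failover is in progress."""
    raise OSError("share temporarily unavailable")

def read_with_retry(path, attempts=5, delay=2.0):
    for _ in range(attempts):
        try:
            return access_share(path)
        except OSError:
            time.sleep(delay)          # wait out the failover window of a few seconds
    raise OSError(f"{path} still unavailable after {attempts} attempts")

# read_with_retry(SHARE)   # would keep retrying until the share is online again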
MSCS and Host-Based Volumes: A Summary Four architectural properties of host-based volumes make them usable as MSCS cluster resources: Cluster disk groups. Cluster disk groups are identical to dynamic disk groups, except that they are not automatically imported when the servers to which they are connected start up. Deferring importation allows cluster manager instances to import each disk group on only the appropriate server. The disk group MSCS cluster resource type. Disk group cluster resources contain or consist of Volume Manager disk groups. Disk group cluster resources typically belong to larger MSCS resource groups that include applications, common functions like file shares, and network interface cards and IP addresses that bind client access to particular network ports. The ability to host multiple disk groups. Since the disk group is the atomic unit of disk ownership in a cluster, multiple disk groups are required to enable multiple unrelated applications and their data to be moved between clustered servers independently of each other. The disk group resource DLL and extension DLL. The disk group resource DLL implements the functions required of all MSCS resources in a manner specific to disk groups. These functions include creating and deleting cluster resources, starting and stopping these resources, monitoring the resources’ “health” to determine if failover is warranted, and resource setup and configuration. The extension DLL adds disk group-specific management capabilities, such as parameter input and display, to the MSCS console user interface. 5
It is also common practice with MSCS clusters to include an IP address and network interface card in a resource group that presents a file share. This forces the file share to be presented on a particular IP address and network link.
Disk Groups in the MSCS Environment When a server in an MSCS cluster fails, the cluster manager: ■■
Redetermines the state of the cluster so that all surviving servers have the same view of cluster membership.
■■
Restarts cluster resource groups according to cluster administrative policy.
Bringing all servers in the cluster to a common view of membership before any resources are restarted prevents more than one server from attempting to restart a given application and its resources, such as disk groups. In a nonclustered server, the Volume Manager automatically imports disk groups (takes ownership of them and makes the volumes and file systems on them available for application use) when the server starts up. This is obviously inappropriate for clustered servers, since it would mean that two or more servers would attempt to take control of the same disk group(s). Disk groups created as cluster disk groups are therefore not automatically imported at system startup. Instead, the MSCS Cluster Service imports cluster disk groups on the appropriate server as part of its startup procedure using services of the vxres.dll DLL (Figure 12.9). For a disk group to be imported on a server, the Volume Manager instance on that server must be able to reserve more than half of the group’s disks using SCSI device reservations. This algorithm verifies the health of the disk group and, equally important, reserves the group’s disks for the exclusive use of the importing server. SCSI device reservations also provide an additional level of protection against errant software attempting to write to a disk its server does not own.
Disk Groups as MSCS Quorum Resources A volume in a Volume Manager disk group can serve as the MSCS quorum device, a cluster resource that a server must control before it can start or restart a cluster. The MSCS quorum resource prevents more than one server from attempting to take control of a cluster in cases where negotiation among the servers is impossible because network connections are broken or not yet established. A failure-tolerant volume is an ideal quorum device because it makes the quorum device as highly available as the rest of the cluster. The Volume Manager’s disk group “over-half ” importation rule suggests that disk groups containing odd numbers of disks make the most appropriate quorum resources. A disk
group containing only a three-mirror volume makes an ideal quorum device, protecting both against disk failures (because it is mirrored) and server and interconnect failures (because it can always be imported by the Volume Manager if disks and servers are running).
Configuring Volumes for Use with MSCS
The MSCS Server Configuration wizard automatically detects all available clusterable resources. The steps for creating a cluster that includes disk groups as cluster resources are as follows:
1. Install MSCS, followed by Volume Manager. This automatically makes the vxres.dll DLL (Figure 12.9) available to MSCS.
2. Use the Volume Manager console to create cluster disk groups for the storage devices that will be used as cluster resources. It is desirable to include a quorum disk group containing an odd number of disks.
3. Run the MSCS Cluster Administrator to configure cluster resource groups and Volume Manager disk group resources. All cluster disk groups are available for management through the MSCS Cluster Administrator interface.
VERITAS Cluster Server and Volumes As noted at the beginning of the chapter, the second cluster manager in widespread use with the Windows 2000 operating system is the VERITAS Cluster Server (VCS) offered by VERITAS Software Corporation. Conceptually, VCS is similar to MSCS in that it: ■■
Manages cluster membership using heartbeat message exchanges over network links connecting the clustered servers.
■■
Organizes resources into bundles called application service groups or, simply, service groups, which are similar to MSCS resource groups.
■■
Fails service groups over by restarting them on alternative servers when a primary server fails or in response to administrative command.
The primary differences between MSCS and VCS are: ■■
VCS supports clusters of up to 32 interconnected servers.
■■
VCS relies completely on network protocols to ascertain cluster membership. VCS can use, but does not require, a cluster quorum device.
VCS and Cluster Disk Groups With the exception of the MSCS quorum device, VCS use of host-based volumes is very similar to that of MSCS. In order for volumes to be VCS cluster resources, they must belong to cluster disk groups so that they are not automatically imported when the Volume Manager starts up. Figure 12.28 shows the Volume Manager console view of a cluster disk group called VXCLUSTER prior to importation on a server called DMPVCS-1. As the figure suggests, properties of disks in disk group VXCLUSTER cannot be viewed from server DMPVCS-1. The disk group has not been imported; therefore, the Volume Manager instance on DMPVCS-1 does not “own” (have the right to access) the disks in it. A cluster disk group can be imported in either of two ways: ■■
Explicitly, using the Volume Manager Import Disk Group… command (on the disk group context-sensitive menu).
■■
Implicitly, when a VCS cluster resource comprising the disk group is brought online using a similar VCS command.
VCS Service Groups and Volumes Like MSCS, VCS manages online storage resources at the Volume Manager disk group level. Figure 12.29 shows a VCS console view of a service group containing eight file shares (FS_USER1_1…FS_USER2_4) corresponding to directories in the file systems of two volumes. The volumes, which are also
Figure 12.28 A cluster disk group before importation.
Figure 12.29 VCS view of the Volumes Service Group.
encapsulated within cluster resources, are allocated from a Volume Manager disk group cluster resource (VMDg—the VCS resource type name) called DG_VXCLUSTER. Figure 12.30 depicts the VCS cluster configuration used to produce this and subsequent examples. It includes two servers, called DMPVCS-1 and DMPVCS-2, with eight disks attached to a common Fibre Channel hub. The eight disks are organized as two volumes in a Volume Manager disk group. The Volume Manager disk group name (which is not visible in these examples) is VXCLUSTER. The administrator has named the VCS resource that contains the disk group DG_VXCLUSTER. It is a resource of type VMDg. VMDg resources become available when the Volume Manager is installed on a system with VCS already installed. The VCS resources that contain the two volumes are named VM_MOUNT_F and VM_MOUNT_G, respectively. The icons representing them may be observed in Figure 12.29. VM_MOUNT_F and VM_MOUNT_G are resources of type MountV (a VCS built-in resource name short for “mounted volume”). Each of these volumes contains four directories that are shared to clients. The VCS File-
Figure 12.30 VCS configuration used for this example: servers DMPVCS-1 and DMPVCS-2 and a client, an Ethernet cluster interconnect, and a Fibre Channel hub connecting the shared disks that hold Volume F (RAID, VM_MOUNT_F) and Volume G (mirrored, VM_MOUNT_G).
Share resources that contain these shares are named FS_USER1_1, FS_USER2_1, and so forth. Icons representing these resources are visible in Figure 12.29 as resources of type FileShare. Figure 12.29 also shows an icon for a resource named ExportName of type Lanman. This resource is the virtual host name under which the file shares are made available to clients. To a client, this is the name of the host through which the eight file shares are accessed. This name remains constant when ownership of the service group containing the file shares and the other resources upon which they depend passes from one server to another. The entire collection of resources shown in Figure 12.29 is a VCS service group (which is equivalent to a MSCS resource group) called Volumes. The Volumes service group contains: ■■
A Volume Manager disk group resource
■■
Two volume resources
■■
Eight file share resources
■■
The virtual host name resource used to make file shares available to clients
The main (right-hand) panel of Figure 12.29 graphically depicts these resources and their interdependencies. At the bottom of the graph, the DG_VXCLUSTER disk group depends on no other resources. At the next layer, VM_MOUNT_F and VM_MOUNT_G depend on the disk group because the file systems cannot be mounted if the disk group holding them is not online. Similarly, the file shares
depend on the volumes, since they cannot be offered to clients if the volumes containing their data are not mounted. Finally, the virtual host name can only be used to access files if the file share resources are accessible. Like an MSCS resource group, a VCS service group is online to one server at any instant in time. Figure 12.31 shows VCS console views of the Volumes service group from both of the clustered servers. The highlighted icons in the DMPVCS-1 view indicate that the service group is online to that server. The grayed icons in the DMPVCS-2 view indicate that the service group exists, but is not online to server DMPVCS-2. The administrator-assigned name (which is a parameter value for the ExportName resource) through which file shares are accessed by clients is VCSDMPName. Clients use this name to access the file shares as if it were a server name; but VCSDMPName is used no matter which of the clustered servers has the Volumes service group online. Figure 12.32 shows one client view of the file shares in the Volumes service group. On the left is the dialog displayed when the Windows Explorer Tools→Map Network Drive… command is invoked. In this instance, the administrator has selected the share name \\vcsdmpname\USER1_2, which corresponds to the VCS resource named FS_USER1_2 in Figure 12.31. Figure 12.32 also shows a partial Explorer view of files on \\vcsdmpname\USER1_2 after the file share has been mapped and the Map Network Drive dialog has been dismissed.
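The layering just described can be captured as a small dependency table. The Python sketch below encodes part of the Volumes service group (only two of the eight file shares are shown, and the share-to-volume assignment is assumed for illustration); it checks that a proposed online order never starts a resource before its dependencies:

# Illustrative encoding of the Volumes service group dependency graph.

volumes_group = {
    "DG_VXCLUSTER": [],                              # VMDg disk group at the bottom of the graph
    "VM_MOUNT_F":   ["DG_VXCLUSTER"],                # MountV volumes need the disk group
    "VM_MOUNT_G":   ["DG_VXCLUSTER"],
    "FS_USER1_1":   ["VM_MOUNT_F"],                  # FileShare resources need their volume (assumed mapping)
    "FS_USER2_1":   ["VM_MOUNT_G"],
    "ExportName":   ["FS_USER1_1", "FS_USER2_1"],    # Lanman virtual host name needs the shares
}

def valid_online_order(order, deps):
    """True if every resource is started only after all of its dependencies."""
    started = set()
    for res in order:
        if any(d not in started for d in deps[res]):
            return False
        started.add(res)
    return True

bottom_up = ["DG_VXCLUSTER", "VM_MOUNT_F", "VM_MOUNT_G",
             "FS_USER1_1", "FS_USER2_1", "ExportName"]
print(valid_online_order(bottom_up, volumes_group))                   # True: valid start order
print(valid_online_order(list(reversed(bottom_up)), volumes_group))   # False: that is the stop order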
Figure 12.31 VCS Manager views of the Volumes Service Group from two servers.
Figure 12.32 Client Explorer view of vcsdmpname cluster file shares.
Service Group Failover in VCS Clusters As with MSCS resource groups, control of VCS service groups can fail over to alternate servers in case of primary server failure. An administrator can also initiate failover to balance load or for other management purposes. In VCS terminology, this is called switching the service group to another server. Figure 12.33 illustrates the use of the VCS Switch To command to initiate a change in ownership from server DMPVCS-1 to the other eligible server in the cluster. The Switch To command, shown here in the service group menu in the context of the Volumes service group, allows an administrator to switch control of a service group to any server in the cluster eligible by policy to run it. The list of servers eligible to run a service group is created when the service group is created and may be modified by administrative action at any time. In Figure 12.33, the only other server in the cluster, DMPVCS-2, has previously been designated as eligible to run service group Volumes and has been specified by the administrator as the switchover target. The figure also shows the confirming dialog displayed when a target for the Switch To command has been specified. When a service group switchover command is issued, the group’s resources are stopped in sequence from the top of the tree to the bottom. Figure 12.34 shows the Volumes service group in the process of switching from DMPVCS-1 to another server. The graying of icons indicates that the ExportName resource, the FS_USERx_y resources, and the VM_MOUNT_F and VM_MOUNT_G resources have been stopped. The DG_VXCLUSTER Volume Manager disk group resource is in the process of stopping, as indicated by the downward-
Figure 12.33 Switching ownership of Volumes Service Group to the DMPVCS-2 server.
pointing arrow beside its icon. When all resources in the service group have been completely stopped, the VCS cluster manager instance on DMPVCS-2 begins to restart the resources from the bottom of the graph to the top. When the ExportName resource has been restarted, clients can reconnect to file share \\vcsdmpname\USER1_2, and requests can again be serviced. In most
Figure 12.34 Volumes Service Group stopping on server DMPVCS-1.
cases, clients making requests during the switchover will experience delayed responses while resources are restarted and connections are reestablished.
Adding Resources to a VCS Service Group
Thus far, this example has shown a service group that had been created prior to the events described. This part of the example demonstrates the addition of a resource to that service group, to show how VCS resources can be created and managed without repeating the creation of the entire group. The example begins with the creation of a file share that will be made into a VCS cluster resource. The resource is a folder called New Folder in the file system on VCS cluster resource VM_MOUNT_F (presented as drive letter P: in Figure 12.35). This is a Windows Explorer view of the shared folders on the volume, including New Folder, which has not yet been made into a cluster resource. For New Folder to be a cluster resource, it must be described in the VCS configuration file. Since the New Folder resource is similar to other already-existing resources in the Volumes service group, this process can be simplified by starting with a copy of an existing resource in the same service group. Figure 12.36 highlights the VCS Copy command used to create a resource of type FileShare in the Volumes service group. Though the command is shown as being issued from the Resources menu in the context of the FS_USER1_2 resource, any resource of type FileShare would be an equally appropriate context for invoking the command. This figure also illustrates two key points about managing VCS cluster resources from the cluster console: ■■
During operation, VCS cluster configuration files are normally in read-only mode, with no modifications permitted. Upon invoking any VCS command
Figure 12.35 Making a new folder into a file share VCS cluster resource.
that would result in modification of cluster configuration files, the administrator is required to confirm the intention to alter the files. This is primarily a precaution against inadvertent changes to configuration files. ■■
When requesting that a resource be copied, an administrator may also request copies of all the resources below it in the tree (called Child Nodes in Figure 12.36). In this example, the new file share resource will depend on already-existing volume and disk group resources, so there is no need to copy Child Nodes. If a new file share resource were to be based on a different volume and disk group that were not already cluster resources, the Copy→Self and Child Nodes command would be useful.
Once the VCS Copy command has executed, thereby creating the data structures that will describe the new VCS resource, the administrator next sets the resource's parameters so that it can be incorporated into the cluster (Figure 12.37). This figure shows four of the key parameters of a VCS resource of type FileShare:
PathName. Identifies the file system path that will comprise the file share. Its value in this example is \New Folder.
ShareName. The name by which the file share will be known to clients. Its value in Figure 12.37 is NEW1_1.
MountResName. Identifies the mounted volume resource (and therefore the file system) on which the file share resides. Its value in this example is VM_MOUNT_F.
Figure 12.36 Copying an existing cluster resource.
Figure 12.37 Setting parameters for new VCS resource.
MaxUsers. Denotes the maximum allowable number of simultaneous client connections to the file share. When this number of connections is active, further connection requests are rejected. Its value in this example is 200. The next step is to establish the new cluster resource’s dependencies on other resources. In this example, the FS_NEW1_1 file share resource includes a file system folder on a volume that is already a VCS cluster resource. The new resource therefore has a dependency on the VM_MOUNT_F mounted volume cluster resource. A resource dependency may be established by selecting the VCS console icon representing the parent resource, and dragging and dropping it on a child resource (the resource on which the parent will depend). Figure 12.38 illustrates this technique used to specify that FS_NEW1_1 (the parent) depends on VM_MOUNT_F (the child). When a dependency is specified by dropping the parent resource’s icon on the child’s icon, VCS displays the confirming dialog shown in Figure 12.38, that states the nature of the dependency in words. Clicking Yes here establishes the dependency. In Figure 12.38, the administrator has already specified the required dependency of the ExportName resource upon the FS_NEW1_1 resource, completing the configuration of the FS_NEW1_1 resource. The remaining step is to
Figure 12.38 Creating a resource dependency.
invoke the VCS Save Configuration command to save the updated configuration (Figure 12.39). The VCS Close Configuration command is also visible in this figure, though it is grayed out. If a VCS configuration is closed without first saving it, any changes made since the last Save Configuration command are lost. Once cluster configuration is complete and the configuration file has been saved and closed, the resource is part of the cluster configuration and can be brought online. Figure 12.40 illustrates bringing the FS_NEW1_1 file share resource online using the Online command on the resource menu in the context of the FS_NEW1_1 resource. The Online command cascades to a list of clustered servers eligible to run the new resource. In this example, since the VM_MOUNT_F mounted volume resource upon which the FS_NEW1_1 resource depends is online to server DMPVCS-1, an attempt to bring the FS_NEW1_1 resource online to server DMPVCS-2 would fail because its underlying resource cannot be brought online. Once a newly configured cluster resource is online, it may be used by clients. Figure 12.41 illustrates the FS_NEW1_1 file share resource from a client's viewpoint. Here, the Windows 2000 Explorer Map Network Drive dialog can be seen running on a client, with \\vcsdmpname selected for browsing.
Figure 12.39 Saving the VCS configuration after new resource creation.
The Shared Directories box shows that the new resource, with share name NEW1_1, is available. The Explorer window, captured after the share was mapped and the dialog dismissed, indicates that the NEW1_1 file share has been successfully mapped by this client and can be accessed as drive letter L: by client applications.
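The whole sequence (copy an existing FileShare resource, set its attributes, declare its dependency, and save the configuration) can be summarized as a simple data-manipulation sketch. The dictionary keys below follow the attribute names quoted earlier, but the values assumed for the copied FS_USER1_2 resource and the helper code are illustrative only; this is not VCS configuration syntax:

# Illustrative model of adding the FS_NEW1_1 file share resource to a service group.
import copy

fs_user1_2 = {
    "type": "FileShare", "PathName": r"\USER1_2", "ShareName": "USER1_2",   # assumed values
    "MountResName": "VM_MOUNT_F", "MaxUsers": 200, "depends_on": ["VM_MOUNT_F"],
}

# Copy an existing resource of the same type, then set the new resource's attributes.
fs_new1_1 = copy.deepcopy(fs_user1_2)
fs_new1_1.update({"PathName": r"\New Folder", "ShareName": "NEW1_1"})

service_group = {"ExportName": {"type": "Lanman", "depends_on": []}}   # only the new dependency is shown
service_group["FS_NEW1_1"] = fs_new1_1

# Declare that the virtual host name depends on the new file share,
# then "save the configuration" (here, just print the result).
service_group["ExportName"]["depends_on"].append("FS_NEW1_1")
print(service_group["FS_NEW1_1"]["ShareName"], service_group["ExportName"]["depends_on"])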
Figure 12.40 Bringing the new resource online.
Figure 12.41 Client view of new resource.
Troubleshooting: The VCS Event Log
VCS records all cluster events of significance in an event log viewable through the VCS console. This log can be used to trace the history of cluster events for problem analysis. Figure 12.42 shows the VCS log for this example, illustrating some of the cluster events that occur in the course of bringing the Volumes service group online on server DMPVCS-1. Both the initiation and the success of individual resource online events are recorded in the log.
Cluster Resource Functions: VCS Agents VCS resources have functional requirements similar to those of MSCS resources. In the VCS context, resource-specific functions such as starting, stopping, and monitoring are performed by software components called agents that are roughly equivalent to the resource DLLs used to manage
Figure 12.42 VCS event log.
Managing Volumes in Clusters
319
resources in MSCS. Agents for supported resource types, including most of those in this example, are supplied with the VCS software. Other resource agents, such as the Volume Manager agent, are added when the corresponding software components are installed. VERITAS Software Corporation also supplies documentation to enable developers to create agents for new types of VCS resources.
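As a rough illustration of the division of labor, the following Python sketch models the three resource-specific functions the text mentions. It is conceptual only: real VCS agents are written against the agent framework that VERITAS documents, not against this invented interface.

# Conceptual sketch of the three resource-specific functions an agent
# provides; not the actual VCS agent framework interface.

class FileShareAgent:
    """Illustrative agent for a file share resource."""

    def online(self, resource, attributes):
        # Start the resource, e.g., create the share from its attributes.
        print(f"creating share {attributes['ShareName']} "
              f"with MaxUsers={attributes['MaxUsers']}")

    def offline(self, resource, attributes):
        # Stop the resource, e.g., remove the share.
        print(f"removing share {attributes['ShareName']}")

    def monitor(self, resource, attributes):
        # Report whether the resource is healthy; the cluster engine
        # calls this periodically and acts on the result.
        share_exists = True   # a real agent would probe the system here
        return "ONLINE" if share_exists else "OFFLINE"

agent = FileShareAgent()
agent.online("FS_NEW1_1", {"ShareName": "NEW1_1", "MaxUsers": 200})
print(agent.monitor("FS_NEW1_1", {"ShareName": "NEW1_1", "MaxUsers": 200}))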
Volume Manager Summary

The chapters so far have described and illustrated volume management in Windows 2000 server environments. An extensive set of examples has shown how to set up both failure-tolerant and nonfailure-tolerant volumes suitable for a variety of purposes. The examples are by no means exhaustive; they are limited by time and space and by the size and capabilities of the systems used to create them. The main points made by the examples are:

■■ It is possible for administrators to create and manage both simple and complex volume structures in the Windows 2000 environment. The Logical Disk Manager and Volume Manager make as many configuration decisions as possible and supply intelligent default values where administrator input is required.

■■ A complete range of volume types is available, allowing meaningful online storage configuration choices with regard to volume function, failure tolerance, and performance.

■■ Windows 2000 volume management is extremely dynamic. Volumes can be created on disks in use on a running system, extended while being used by applications, and populated while being initialized. Volume configuration changes in Windows 2000 do not require system reboots.

■■ The Windows 2000 Volume Manager supports both the Microsoft Cluster Service and the VERITAS Cluster Server, enabling the use of host-based failure-tolerant volumes as cluster resources. All volumes in a cluster disk group fail over as a unit, allowing cluster applications to use failure-tolerant volumes for data storage. Failure-tolerant volumes can even be used as MSCS quorum devices, providing both robust failover and failure-tolerant storage for any data located on the cluster quorum disk.
In conclusion, host-based managed volumes are a major step forward in the management of disk storage for both individual and clustered Windows 2000 servers.
CHAPTER 13
Data Replication: Managing Storage Over Distance
Data Replication Overview

It is often desirable to maintain identical physical copies, or replicas, of a master set of data at one or more widely separated locations. Examples include:

Publication. Distribution, or publication, of data from a central data center at which it is maintained to geographically distributed data centers where it is used is a common requirement. Perhaps the best examples of publication are the multilocation Web sites maintained by global companies and organizations, but other examples, such as catalog distribution, also exist. From a storage management standpoint, data publication has three important characteristics: (1) data is published from one source location to many target locations, (2) published data is updated seldom, if at all, at the sites to which it is published, and (3) the published data represents a point-in-time, or frozen, image of the master data.

Consolidation. Similarly, many distributed data-processing strategies involve periodic consolidation of data accumulated at many geographically distributed sites (e.g., field offices) to a central data center (e.g., headquarters), where it is "rolled up" into enterprisewide management information. In this case, data produced at many source locations is replicated to a common target location. As with publication, consolidated data is seldom modified at the central data center, and replicas of data images frozen at the source are usually desirable.

Off-host processing. Often, it is desirable to perform some operation on a frozen image of an application's data without impacting the application's execution. Perhaps the most frequently encountered example of this is backup. For more and more applications, business considerations are making it impractical to stop or even impede execution for any appreciable length of time to back data up or to mine or otherwise analyze it. Even when split mirrors (page 189) are used to freeze an image of application data, the processing requirements of backup or analysis on the live system can lead to unacceptable application performance. If data is replicated to another server, however, that server can perform backups or other types of processing on the replicated data. Since they run on a completely separate server, these operations do not impact application performance.

Disaster protection. Perhaps the most prevalent use of data replication is to maintain copies of data that are vital to enterprise operation at locations widely separated from the main data center. These remote copies enable an enterprise to resume data processing operations relatively quickly after a disaster, such as a fire, flood, or power grid outage, that incapacitates an entire data center and makes the failure-protection mechanisms discussed earlier ineffective. Unlike publication and consolidation, disaster protection requires that replicated data be kept in close synchronization with live operational data on an ongoing basis. Remote replicas must be updated immediately each time data is updated at the enterprise's main data center.
Alternative Technologies for Data Replication

Data replication is not the only way to solve these problems. Publication, consolidation, and off-host processing needs can also be met by any of several techniques, including backup and restore and network file copying. In some instances, disaster protection can be provided by mirroring data over extended Fibre Channel links. Each of these techniques has drawbacks, however.

■■ Backup requires relatively expensive tape drives and removable media, as well as error-prone media manipulation. It also requires that applications be quiesced periodically so that consistent backup copies can be made.1 Moreover, the time required to ship and restore backup copies is time during which applications are unavailable after a disaster.

■■ Network file copies must be administered. Even if administrative needs are minimized through the use of scripts, network file copies require that applications be quiescent for relatively long periods while the copies are made. Again, this means relatively long periods when applications cannot be used.

■■ Mirroring is useful only over short distances. Remote mirrors are guaranteed to be up to date and there is no delay until the remote data can be used. However, when the distance between mirrored data copies is so great that transmission time is significant, the time required to copy data remotely results in unacceptably high application response times.

1 See Chapter 4 for a description of one technique for creating consistent backups of application data.
Moreover, mirroring and network file copying both require continuously operational network links in order to work reliably. Longer communication links are physically more complex than the short links found within a data center and are, therefore, inherently more susceptible to failure, particularly transient failures of short duration. Neither mirroring nor network file copying is designed to recover rapidly from transient network outages.
Data Replication Design Assumptions

A set of related technologies under the collective name of data replication has evolved to meet the needs of data publication, consolidation, and disaster recovery. Data replication technologies take three important factors into account:

■■ The communication time (latency) between a system on which applications are processing data and distant systems on which copies of data are being maintained may be significant. It may be unacceptable to burden applications with waiting for remote writes to complete.

■■ The links connecting a system on which applications are processing data to remote systems on which that data is being replicated may not be totally reliable. Transient link outages should be transparent to application processing, and high-overhead recovery procedures should be required only when a link outage is of relatively long duration (tens of hours).

■■ A significant percentage of applications for data replication require one-to-many replication. In other words, data from a single source must be faithfully replicated to several destinations in real or almost real time. Application response time should be independent of the number of replicas.
These assumptions lead to data replication software designs that are considerably more elaborate than those for mirroring, which superficially provides the same result.
Server-Based and RAID Subsystem-Based Replication

Data replication has been implemented in servers, in enterprise RAID subsystems, and, more recently, in storage area network (SAN) infrastructure components. Figure 13.1 illustrates server-based replication. As the figure suggests, with server-based replication, application I/O requests are intercepted somewhere in the primary system's I/O processing software stack before they reach the disk driver. Intercepted requests and their data are sent to secondary systems, where they are written to persistent storage.

Figure 13.1 Server-based replication.

The advantages of server-based replication are:

■■ Replication is independent of the type of storage. The storage devices at source and target locations may be different. This allows, for example, less expensive storage devices to be employed at publication target locations or even at disaster recovery locations.

■■ Different types of data objects can be replicated. Because the entire I/O stack from application to disk driver runs in the server, replication can be done at any of several levels. Databases, files, and volume or disk contents can all be replicated, each with unique properties and semantics. Administrators can specify the best replication option for individual applications and systems.

■■ Replication can typically share enterprise network facilities. While in extreme instances dedicated network links may be required for performance, in most cases server-based replication can share enterprise network facilities with other types of network traffic. This makes replication less costly to implement and less complex to manage.

In general, with today's server-based replication, primary and secondary systems must be of the same architecture and run the same operating system version, because they must run cooperating replication managers.2 Thus, in summary, server-based replication can replicate data between homogeneous computers, but may use heterogeneous storage devices. Server-based replication may be either synchronous, with primary location applications blocked until data has been safely written at secondary locations, or asynchronous, to minimize the effect of replication on application performance. Server-based replication software typically supports dynamic switching between synchronous and asynchronous replication modes.

2 Although some developers are working on replication of data among servers of unlike architecture.

Replication can also be implemented almost entirely within an intelligent enterprise RAID subsystem, as Figure 13.2 illustrates. Some enterprise RAID subsystems can be interconnected directly with peer subsystems of the same architecture, as the figure shows. When a system is configured in this way, data can be replicated over the subsystems' private interconnection without host involvement or overhead. Some coordination among hosts is necessary to start and stop replication and to initiate failover to the recovery site, but in the course of normal operations, data written by applications at the primary location is transparently copied to secondary locations and replicated on equivalent volumes there.

Figure 13.2 Enterprise RAID subsystem-based replication.

There are two principal advantages of enterprise RAID subsystem-based replication:
■■ It uses very little application server resource. While the dedicated communication facilities to connect primary and secondary subsystems can be costly, the impact on application execution that can result from server-based replication overhead can be an offsetting factor. Some RAID subsystem vendors have developed replication interfaces to enterprise TCP/IP networks, eliminating the expense of private network facilities for replication.
■■ It is independent of host computer architecture. A single enterprise RAID subsystem at each location can replicate data for different hosts of different types to one or more remote replication sites.
RAID subsystem-based replication typically requires RAID subsystems of the same type at both primary and secondary locations. Thus, subsystem-based replication can replicate data for heterogeneous computers, but requires homogeneous storage. RAID subsystems emulate disk drives to their host computers. They deal in disk block address spaces and have no information about file system or database context. RAID subsystem-based replication is, therefore, always at the volume level. Like server-based replication, it may be synchronous or asynchronous.
Elements of Data Replication

A given data object is replicated from one source, or primary location, to one or more targets, or secondary locations. In general, replication occurs while data at the primary location is being used by applications (otherwise, it would amount to a network file or volume copy operation).
Initial Synchronization

Before updates to operational data at a primary location can be replicated to secondary locations, the secondary locations' storage contents must first be made identical to those of the primary site storage to be replicated.3 This is called initial synchronization of the replication source with its targets. With some replication technologies, initial synchronization is an integral part of the replication process; with others, specialized initial synchronization techniques, such as backing up data from the primary location and restoring it at the secondary location, are required.

3 Unless the source is newly initialized storage containing no valuable data. In this case, initial synchronization can be bypassed.

Replication for Frozen Image Creation

If data is replicated in order to establish a frozen image at a secondary location, then replication ceases at some point after data at the primary and secondary locations is synchronized. For example, daily sales data might be replicated to a backup server for the purpose of creating a backup at the end of each business day. Replication can begin with synchronization at any time throughout the business day. When primary and secondary locations' data is synchronized, replication simply continues until the business day ends. At that point, replication is stopped by administrative action (which can be automated in most cases), and backup of the replicated data at the secondary location can commence. Compared to a network file copy, this technique allows backup to commence much earlier, since little if any data remains to be copied to the secondary location after the close of the business day.

As with any frozen image technique, application or administrative action is required to ensure that replication is stopped at a point at which replicated data is consistent from an application standpoint. In general, applications and databases must be paused momentarily, and primary storage devices unmounted, to ensure that replicated data at secondary locations reflects the consistent state of primary location data. Once this has happened, replication can be stopped, and applications and database managers can resume operation with nonreplicated data.

Since most data replicators support simultaneous replication to multiple secondary locations, it is possible to prepare data for multiple processing functions at the same time. In the preceding case, for example, sales data could be replicated to two secondary locations simultaneously. At the end of the business day, data from one secondary location could be transformed to a data warehouse format for mining while data from the other was backed up. This style of information management is especially applicable in Windows environments, where the low cost of servers makes it possible to design processing strategies that use several single-purpose servers in preference to one larger multipurpose one.
Continuous Replication

Stopping a replication job at some point after data is synchronized is most useful when the goal of replication is to establish a frozen image of operational data. If, however, data is replicated to enable recovery from a primary location disaster, replication does not end when initial synchronization is achieved. Instead, it continues indefinitely until a disaster actually occurs, at which time the replicated data at a secondary site can, with suitable preparation, become the enterprise's "real" online data.

Replicated data at a secondary location must generally be checked for consistency before it is used, because disasters can leave data inconsistent from an application or data manager standpoint. For example, a disaster may incapacitate a sales application during the processing of several online customer orders from different client computers. Some delivery orders may be entered in the replicated database without corresponding customer account debits (or the reverse). If this condition is not remedied, products could be delivered to customers and the customers never billed; or, perhaps worse, customers could be billed for products that were never scheduled for delivery. To avoid situations like this, replicated data must be made consistent before use.

Making replicated data consistent for application use can take several forms:

■■ If the replicated data is a database, stored either in raw (i.e., with no file system) volumes or in container files within a file system, then the database manager's restart procedure must be executed to validate or restore database integrity.4

■■ If the replicated data is a file system, then the file system must be checked (e.g., using CHKDSK) before it can be mounted for application use. Checking a file system verifies the file system's structural integrity; that is, lost files and multiply allocated space are detected and repaired, but user data in the file system is not repaired.

■■ If the replicated data consists of files within a file system, a primary location disaster may well leave the target file system at the secondary location intact, but the same cannot necessarily be said for data in the replicated files, from an application consistency point of view.

4 In UNIX environments, both raw volumes (volumes with no file system structure on them) and files are used as containers for database data. In Windows environments, the use of files as containers for database data is universal.

In all of these cases, it is usually necessary to run an application-specific procedure to validate the state of replicated data before applications are permitted to restart. With database management systems, this is easy and may even be unnecessary. Database managers typically log all update activity in a way that associates the database updates comprising a transaction with each other. As long as a database's log is replicated, it can be read during a secondary location's database restart procedure. Complete transactions can be applied to the database, and any effects of incomplete transactions can be backed out. File system restart procedures (e.g., CHKDSK) verify the consistency of file system metadata, but do not generally incorporate the transaction concept at the user data level, so it is incumbent upon applications to perform any necessary consistency checking for themselves.
What Gets Replicated?

Restart procedures for replicated data depend upon the nature of the replicated data objects. As implied by the preceding section, technology exists for replicating:

■■ Volume contents, regardless of the file system or database structures built on them.

■■ Files within a file system, including both data and metadata, such as ownership, access permissions, and time stamps, used to manage the files.

■■ Database contents, regardless of the file or volume containers in which the database resides.

Each of these forms of replication has different properties, because the replicated objects themselves have different properties, as the following sections describe.
Volume Replication

The most fundamental form of replication is the replication of the contents of a volume or disk. The logical block model of a disk is described in Chapter 1. From a file system and database manager standpoint, a volume is a disk, consisting of a fixed-size set of sequentially numbered blocks of storage.5 The important semantic of a disk or volume is that data can be read from or written to any consecutively numbered sequence of blocks with a single operation. Disk drivers and volume managers have no information about the higher-level meaning of disk or volume blocks that are read or written by file systems and database managers. For example, they cannot distinguish a user data write request from a file system metadata update.

When a volume is replicated, the replication manager logically "taps in" to the system I/O stack just above the volume manager or disk driver, as Figure 13.3 illustrates. As shown, a source volume replication manager at the primary location intercepts each request to the volume manager (or disk driver, if no volume manager is present) and creates equivalent replication requests to be sent over a network link to target replication managers at one or more secondary locations. It then passes the original request through to the volume manager or disk driver. Logging techniques (discussed later in the chapter) are typically used to enable replication to occur asynchronously with application execution. Asynchronous replication makes application and data manager performance at the primary location nearly independent of the number of secondary replicas.

Figure 13.3 Server-based volume-level replication.

Because a volume manager has no information about the meaning of requests made to it in file or database terms, it has no way to determine the state in which any given write request leaves a volume. For example, when a file system creates a new file, several metadata updates are necessary. A descriptor of the file must be inserted into a directory, the file's characteristics (e.g., access rights, time stamps) must be recorded, and the file system's free space pool must be updated. These operations must occur atomically (i.e., either they must all occur or none of them may occur); however, each consists of one or more block write operations to the volume. No information about the interrelationship of write operations that update file system metadata (or about their lack of interrelationship with other operations) is available to volume managers, hence none can be made available to a volume-level replication manager.

Since a primary location's volume-level replication manager has no knowledge of file system metadata state, it cannot pass any such knowledge on to secondary location replication managers or file systems. In effect, if a file system were mounted at a secondary location, its data and metadata would constantly be changing without its knowledge. File systems cannot tolerate this; therefore, it is not possible to use secondary location data while volume replication is occurring. Finally, file systems and database managers often hold both user data and metadata in cache. No information about this cache is available at the volume level, so there is no way to keep file system or database cache memories at primary and secondary locations consistent with each other.6 The result is that file system structures on a secondary replicated volume cannot be accessed by file system or application code at a secondary location while replication is occurring, because knowledge of metadata consistency is not available at secondary locations.

5 Actually, volumes can grow and shrink, but this is done with management operations that are typically widely spaced in time and can be ignored for purposes of this discussion.

6 Keeping the contents of multiple cache memories consistent with each other is called maintaining cache coherency.
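As a conceptual illustration of the interception just described, the following Python sketch queues an ordered replication request for every write to a replicated volume and then applies the write locally. It is a teaching model, not any vendor's driver code.

# Conceptual model of volume-level write interception.  Writes to a
# replicated volume are queued (in order) for the secondary locations
# and also applied locally; reads pass straight through.

from collections import deque

class ReplicatedVolume:
    def __init__(self, block_count, block_size=512):
        self.blocks = bytearray(block_count * block_size)
        self.block_size = block_size
        self.outbound = deque()          # ordered replication requests

    def write(self, block, data):
        # Queue an equivalent request for the secondary locations...
        self.outbound.append(("write", block, bytes(data)))
        # ...then pass the original request through to local storage.
        start = block * self.block_size
        self.blocks[start:start + len(data)] = data

    def read(self, block, count):
        start = block * self.block_size
        return bytes(self.blocks[start:start + count * self.block_size])

vol = ReplicatedVolume(block_count=1024)
vol.write(10, b"\x01" * 512)
print(len(vol.outbound), "request(s) queued for the secondary locations")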
In spite of the fact that volume-level replication precludes the use of replicated data at secondary sites during replication, volume replication is often favored by system administrators for two reasons:

Low overhead. Because it occurs at a low level in the I/O stack, volume replication creates relatively little overhead in the primary system. Since relatively little context is required between source and targets, overhead message traffic between primary and secondary locations is low.7

Data manager independence. Replicating a volume replicates all of the data in it, no matter which file system or database manager stores the data. This property is particularly important for database applications, which may store control, log, and other ancillary information in files that might not be in obvious locations. Since volume replication copies all block changes at the source to the target, it captures all modifications to all data structures, whether the modifications are to databases, files, file system metadata, or other system data structures.

For these reasons, volume replication is generally preferred to other forms of replication for disaster recovery. Because all volume contents are replicated to the recovery site, administrators need not remember to add newly critical files or directories to replication job lists when the objects are created.

7 Of course, data traffic between primary and secondary locations is proportional to the amount of data updated at the primary location.

The disadvantage of volume replication, as noted, is that replicated data at secondary sites is typically not usable during replication. In order to use replicated data at a secondary site, replication to that site must cease, the secondary replicated volumes must be mounted, and any necessary data recovery actions, such as database journal playback or CHKDSK, must be performed. These recovery mechanisms are designed for recovering from failures within a single system. As long as a volume replication manager preserves write ordering (discussed later in this chapter), however, every state of a replicated volume at a secondary location is identical to a state of the primary volume at some previous time. Thus, if a recovery mechanism designed for recovering from local failures is able to recover from any given primary volume state, the same mechanism will be able to recover data at a secondary replication location to the same degree if a disaster occurs when secondary volumes are in that state.

The key to this recoverability is the preservation of write ordering. Some replication techniques simply track the data addresses of primary location updates using a bit map or similar structure. Such a structure contains information about which data has been modified, but no information about the order in which it was modified. If a disaster occurs while a replication manager is processing updates based on this type of structure, secondary location volumes may reflect some newer updates, while some older ones may not yet have been transmitted from the primary. Such volumes cannot be recovered to a consistent state, because recovery tools are based on assumptions about the order in which file systems and database managers perform their updates.

Volume-level replication is the only form of replication that is possible when the replication engine runs in an enterprise RAID subsystem. Figure 13.4 illustrates volume replication between a pair of enterprise RAID subsystems. As the figure suggests, RAID subsystem-based volume replication is conducted between two RAID subsystems of identical architectures. Write requests to replicated virtual disks at the primary location are transmitted to secondary locations and applied to the replicated volumes there. This form of replication is completely transparent to applications. Starting and stopping replication, as well as changing replication parameters such as the degree to which I/O is throttled, is typically done using server-based administrative tools that issue administrative commands to the RAID subsystems, either in-band (using the I/O path) or out-of-band (using an auxiliary communications mechanism such as an Ethernet or serial port into the RAID subsystem).

Figure 13.4 RAID subsystem-based volume replication.

Like server-based volume managers, RAID subsystems have no access to contextual information about I/O requests that would allow them to identify file system and database metadata updates. Like server-based volume managers, therefore, they cannot communicate the state of the data objects in a replicated volume to secondary locations, and so it is generally not possible to use replicated data while replication is occurring, except in very limited circumstances.
File Replication

It is also possible to replicate data at the file level, as Figure 13.5 illustrates. With file replication, the replication manager intercepts I/O requests before they are processed by the source system's file system. Because file system structure is visible at this level, replication can be selective: replication "jobs" can be configured to replicate only necessary files. This can minimize bandwidth consumption and storage capacity requirements at secondary locations. If a file replication manager allows multiple jobs to be active concurrently, then different files can be replicated to different secondary locations at the same time. As with volume replication, file replication is typically implemented using a log at the primary location in which writes to replicated objects are recorded. As network bandwidth becomes available, the primary location's replication manager sends file data from this log to replication managers at secondary locations, from whence it is written to storage devices.

Figure 13.5 File-level replication.

A big advantage of file-based replication is selectivity. File objects (the objects upon which applications operate) can be specified individually or in groups for replication. This is advantageous for applications whose data is always stored in a fixed set of directories. As pointed out previously, however, selective replication can be more complex to administer for applications that use multiple data managers (e.g., both files and databases) and for applications that store data in unpredictable directory locations. File-based replication requires that every file or directory tree to be replicated be explicitly named when the replication job is defined and whenever the list of critical data is updated. For this reason, file-based replication is usually better suited to Web page publication and similar applications, whereas volume-based replication is more often used for disaster recoverability.

Another advantage of replicating at the file level is that it permits data at secondary locations to be read during replication under some circumstances. As Figure 13.5 suggests, replicated data is written through secondary location file systems. Replication writes are therefore coordinated with writes from applications running on secondary systems. File system access permissions, locking, and simultaneous access capabilities are all in effect. From the point of view of the secondary location file system, the replication manager is just another application. Since file systems are arbiters for all access to file data, including accesses made by target-side replication managers, replicated files at secondary locations can be used by applications as long as they are not being used in conflicting ways by the target-side replication manager. This can be particularly useful, for example, if replication is used to consolidate field office data at corporate headquarters for aggregation and analysis. As soon as any individual file has been completely replicated to the headquarters system, it can be read and processed. Thus, applications that aggregate and analyze field office data can make progress while data is still being transmitted to the headquarters location.

File-based replication is also flexible in that source and target storage locations need not be identical. For example, primary location files can be replicated to a secondary location directory with a different name. This feature is particularly important when replication is used to consolidate data from several field locations to a central location. Each of the primary (field) locations may be replicating identically named data files to the same central file system. The central file system must use a different name for data replicated from each field location if the data is stored within the same file system. This can be accomplished by defining replication jobs in such a way that each field location's data is replicated to a different directory in the central file system.
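The renaming flexibility described above can be illustrated with a few lines of Python. The office names and directory paths below are hypothetical; the sketch simply shows identically named field-office files landing in distinct directories of the central file system.

# Illustrative remapping of identically named field-office files into
# per-office directories at the central (target) location.  Office
# names and directories are hypothetical.

from pathlib import PurePosixPath

# Each job replicates a source directory tree to a distinct target directory.
jobs = {
    "east_office": ("/data/sales", "/central/east/sales"),
    "west_office": ("/data/sales", "/central/west/sales"),
}

def target_path(office, source_file):
    source_root, target_root = jobs[office]
    relative = PurePosixPath(source_file).relative_to(source_root)
    return str(PurePosixPath(target_root) / relative)

# Identically named source files land in different central directories.
print(target_path("east_office", "/data/sales/2001/daily.db"))
print(target_path("west_office", "/data/sales/2001/daily.db"))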
Database Replication

Some database management systems support the replication of data between widely separated databases. Database replication replicates updates to databases, either row by row or, more frequently, in the context of database transactions. Figure 13.6 depicts database replication.

Figure 13.6 Database replication.
Database replication takes several forms, all of which share a common property suggested by Figure 13.6: replication is controlled by the database manager. This is significant because database managers have much more contextual information about user data updates than do volume managers or even file systems. In particular, database managers have information about transactional relationships among data updates. For example, a transaction that transfers money from one bank account to another typically includes two key data updates: a write of the debited account's balance reflecting the debit and a write of the credited account's balance reflecting the credit. The transaction semantics on which database applications are based are predicated on the premise that if both parts of the transaction cannot be reflected in the database (e.g., because of a failure of some kind), then it is better to reflect neither.

The two key writes in this transaction may be to widely separated block addresses, to different files in the same or different file systems, or even to different volumes. Neither a file system nor a volume manager has any way of relating the writes to each other. A database manager can, however, relate corresponding debit and credit updates because they occur in the context of a transaction defined by the application. Replication at the database level can thus result in only complete transactions being replicated. This tends to make database replication a good candidate for distributed applications in which data that is primarily updated at one location must be read at another.

Database replication software is typically capable of delaying the application of updates at secondary locations. This makes it useful for protecting against data corruption due to application or data entry errors. As long as an error is discovered before updates are applied at the secondary location, the replicated database at the secondary location can be copied over the corrupted one at the primary location.

Another advantage of some database replication techniques is the flexibility with which database operations can be expressed. A replicating database manager can bundle the database updates that result from each complete transaction and send the bundles to secondary replication sites, or it can send a description of the transaction itself and allow the secondary locations to execute it against database replicas. A trivial example of this is the "give each employee a 5 percent raise" transaction. This transaction can be expressed in a single SQL procedure of a few hundred bytes. The result, however, is that every employee record in the database is read and rewritten. For large databases, this can mean thousands of writes and millions of bytes transferred.
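The bandwidth trade-off in the "raise" example can be made concrete with a rough calculation. The record count and per-row size below are invented for illustration, and the SQL text is only a stand-in for a transaction description; no particular database product is implied.

# Rough comparison of what crosses the wire: the transaction description
# versus every updated record.  Record count and sizes are made up.

statement = "UPDATE employee SET salary = salary * 1.05"
employee_count = 100_000
bytes_per_updated_row = 64        # assumed size of one replicated row image

statement_bytes = len(statement.encode("ascii"))
row_bytes = employee_count * bytes_per_updated_row

print(f"statement-based payload: {statement_bytes} bytes")
print(f"row-based payload:       {row_bytes:,} bytes "
      f"({row_bytes // statement_bytes:,}x larger)")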
NOTE: Several different database replication techniques are implemented in commercial database managers. Which is more advantageous depends upon the nature of the application and replication requirements. Database replication is beyond the scope of this book; it is mentioned here as an option that should be considered by application designers and administrators as part of an overall disaster recovery strategy.
While database replication techniques are useful, most database replication managers have functional limitations (for example, some do not support the replication of structural changes to the database) that make it difficult to use them for disaster recovery. A disaster recovery strategy should assume that no retrieval of data or state information from the primary location is possible. All information needed to restore data and resume processing must be at the recovery location (the secondary replication location). From a data standpoint, the safest solution is replication of all volumes belonging to an application, since all of the application's data is guaranteed to be stored on them. The only administrative effort required to ensure that all application data is replicated is the exercise of the discipline to use only volumes in the replicated volume group for application data, program images, control files, logs, and so forth. Whether data is part of a database or stored in files, volume replication will transparently copy all of it to the recovery location.
How Replication Works

The replication of data over long distances is different from mirroring in two important ways:

Performance. Mirroring designs are predicated on an assumption that all mirrors can be read and written with roughly equal performance. Replication designs, on the other hand, assume that writes to secondary locations can take significantly longer than the corresponding writes at the primary location and, therefore, include features to minimize impact on application performance. This is particularly important when primary and secondary replication locations are separated by a wide area network (WAN). Not only do WANs support large enough distances between primary and secondary locations that propagation delays can become significant, WANs also typically include routers and other devices that store and forward data, introducing further latency in remote writes.

Connection reliability. Likewise, mirroring designs are based on the assumption that no individual connection to a mirror is any more or less reliable than another. Any I/O path (link) failure in a mirrored volume is typically treated as though it were a device failure. Replication designs, on the other hand, typically assume that the links connecting secondary locations to the primary one will experience occasional failures during the normal course of operations. They treat brief link outages as "normal," and include strategies for making them transparent to applications at the primary location. In addition, they typically include fallback strategies to minimize the impact of resynchronization of data at primary and secondary locations after lengthy link outages.

These two differences in design assumptions lead to significant differences in the roles of primary and secondary locations in replication: whereas the mirrors of a mirrored volume are all treated equally, replicated data is treated very differently at the primary and secondary locations.

Figure 13.7 presents an overview of the factors affecting I/O performance during data replication. When an application makes a write request, the request must be processed and data must be written. At the lowest level, a disk write must ultimately occur,8 which means that an application's execution is delayed for a few milliseconds each time it makes a write request. Application designers are aware of this, and designs are predicated on it. Thus, an application interaction with a user might require 100 I/O requests and still deliver response time of a second or two, whereas an interaction that required thousands of I/O requests would have to be designed so that the resulting response of tens of seconds or more would be perceived as reasonable by the user. The time required to complete a local I/O request is indicated as T1 in Figure 13.7.

8 This sample analysis ignores the effect of nonvolatile write-back cache found in some enterprise RAID subsystems.

Figure 13.7 The timing of replication.

When data is replicated remotely, each I/O request must be intercepted by the replication manager and analyzed to see whether it has implications for replication. (Read requests, for example, do not, nor do write requests to nonreplicated data.) This adds a small amount of processing time to all I/O requests, which is usually a negligible contributor to application response. Write requests and data must be transmitted to secondary locations (T2 in Figure 13.7) and processed and written to persistent media (T3 in Figure 13.7). In order for the replication manager at the primary location to know when data has been safely stored at secondary sites, so that it can signal completion of the I/O request and unblock the application to continue execution, each secondary location must send the primary location an acknowledgment indicating that it has written the data successfully. This is indicated as T4 in Figure 13.7.

Breaking down a replicated I/O operation in this way clarifies the difference between long-distance replication and short-distance mirroring. Operation T1 may overlap with T2, T3, and T4; however, operations T2, T3, and T4 are necessarily sequential: no one of them can begin until the previous one has ended. Figure 13.8 presents a time line that illustrates this: I/O at the primary location can overlap in time with transmission of the data and enough information to allow it to be written at secondary locations. Writing data to persistent storage at secondary locations, however, cannot occur until the data is actually there. In other words, writing data at secondary locations cannot begin until the transmission is complete. Similarly, a secondary location's acknowledgment that it has written data must follow the writing of that data; otherwise it is meaningless.

Figure 13.8 Replication I/O time line.

The net effect is that effective execution time for each write request is elongated by more than 100 percent (assuming approximately equal local I/O performance at primary and secondary locations). For applications that make large numbers of write requests, this can significantly increase user response time, usually an unacceptable outcome. For some applications of replication, such as database logs, absolute synchronization of primary and secondary location data is necessary. For these, replication must be synchronous, using a time line similar to that illustrated in Figure 13.8. For most data, however, a more aggressive approach can be adopted to make application response times less dependent on replication. These approaches basically make some or all of T2, T3, and T4 asynchronous with each other.
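A back-of-the-envelope model shows why. The millisecond values below are arbitrary placeholders; the point is only that in synchronous replication T2, T3, and T4 add sequentially to each write, while an asynchronous design hides them from the application.

# Back-of-the-envelope response time model for a replicated write.
# All times are illustrative placeholders, in milliseconds.

t1_local_write  = 8.0     # T1: process request and write at the primary
t2_transmit     = 20.0    # T2: send data to the secondary (WAN latency)
t3_remote_write = 8.0     # T3: process and write at the secondary
t4_ack          = 20.0    # T4: acknowledgment back to the primary

# Synchronous: T2, T3, and T4 are sequential; T1 may overlap with them.
synchronous = max(t1_local_write, t2_transmit + t3_remote_write + t4_ack)

# Asynchronous: the application waits only for the local write and log append.
asynchronous = t1_local_write

print(f"synchronous write latency:  {synchronous:.1f} ms")
print(f"asynchronous write latency: {asynchronous:.1f} ms")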
Asynchronous Replication

Two basic techniques are used to make replication asynchronous:

■■ Acknowledge receipt of data at the secondary location without waiting for it to be written to disk media.

■■ Allow application execution to continue without waiting even for data to be sent to secondary locations.

Figure 13.9 presents a data replication I/O time line in which the first of these techniques is used. In this case, each secondary location sends an acknowledgment to the primary location as soon as it receives the data corresponding to an application write (without waiting for the data to be written to disk). In effect, this technique changes the meaning of the acknowledgment from "data received and stored" to "data received."

Figure 13.9 Replication I/O time line for asynchronous writes.

This technique also changes the semantics of replication slightly. A recoverable failure of a secondary location and an almost simultaneous unrecoverable disaster at the primary location can result in the loss of replicated data that has been acknowledged by the secondary location but that is still buffered there (not yet written to persistent storage). The chances of simultaneous primary location disaster and secondary location system failure are so remote, however, that developers and users are often willing to accept the risk in return for the response time improvement that results from performing secondary location disk writes asynchronously. By acknowledging data immediately, a secondary location effectively eliminates its I/O time, which is necessarily comparable to T1, from overall application response time.

The second technique for improving replication performance is to decouple all secondary location activity from application response time, as Figure 13.10 illustrates. Here, another step has been added to the data replication process: data is buffered locally at the primary location. Instead of transmitting replicated data updates to secondary locations and waiting for acknowledgment before allowing applications to continue, the replication manager buffers updates locally and transmits them as soon as local processing resources and network bandwidth permit. When this strategy is adopted, each I/O request's contribution to application response time is reduced to the longer of T1 and T1', essentially the same contribution as with unreplicated data.

Figure 13.10 Replication I/O time line for asynchronous transmission.

Of course, adopting this strategy introduces a risk: if the primary system fails while acknowledged data is buffered locally for replication, that data can be lost. To mitigate this risk, unsent updates are usually logged, or buffered, on persistent storage at the primary location. The primary location replication manager adds an entry to the end of its log for each application write request and removes entries from the front of the log when the writes they represent have been sent to all secondary locations. Because the log is persistent (it survives local system failures), replication can resume from where it was interrupted after the primary system recovers from its failure.

When updates to replicated data are logged, there is still some risk that the primary location will suffer an unrecoverable disaster with buffered data that has not been sent to secondary locations. Administrators must take this eventuality into account:

■■ When deciding which replication technology to use in any given situation.

■■ When defining disaster recovery procedures for secondary locations.
Asynchronous replication also makes momentary network overloads, as well as momentary processing overloads at both primary and secondary locations, transparent to applications. With asynchronous replication, applications continue execution immediately after their writes are buffered in the replicated data log, whether or not there is a resource overload. If bandwidth or processing time to build messages is not immediately available, the log may build up, but there is no impact to application execution. Asynchronous replication is a virtual necessity when very long links with associated transmission delays are in use.
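Conceptually, this decoupling is a producer/consumer arrangement around the replicated data log, as the following Python sketch suggests. It is illustrative only: the application's write completes once the local write and log append are done, and a background sender drains the log to the secondary location as resources permit.

# Conceptual producer/consumer sketch of asynchronous replication.
# The application is unblocked after the local write and log append;
# a background sender drains the log to the secondary location.

import queue
import threading

log = queue.Queue()              # stands in for the persistent replication log
secondary = {}                   # stands in for secondary-location storage

def application_write(block, data):
    # T1/T1': local write and log append; the application continues here.
    log.put((block, data))

def sender():
    # T2/T3/T4 happen here, off the application's response-time path.
    while True:
        block, data = log.get()
        if block is None:
            break
        secondary[block] = data  # "transmit" and "write at the secondary"
        log.task_done()

worker = threading.Thread(target=sender, daemon=True)
worker.start()

for i in range(5):
    application_write(i, f"update {i}")    # returns immediately

log.join()                                 # wait only so the demo can print
print("replicated blocks:", sorted(secondary))
log.put((None, None))                      # stop the sender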
Replication and Link Outages

This persistent logging technique also enables replication processes to recover from brief outages of the links that connect a primary replication location to secondary locations, as well as from secondary location outages. When the communications link connecting the primary replication location to a secondary one fails, the primary location's replication manager continues to record application write requests in its log. The amount of data in the log grows, because data cannot be sent to secondary locations and removed from it. As long as the link or secondary system resumes service before the log overflows, replication can "catch up" transparently to applications, with the primary replication manager sending log entries to secondaries with which communication had been interrupted.

There is a risk of lost data if an unrecoverable disaster occurs at the primary location while a link is inoperative: all replicated data updates that are in the primary log but that have not been transmitted to a disaster recovery secondary location are lost. The only alternative, however, is to stop application processing whenever the secondary disaster recovery location cannot be reached. In fact, some replication managers support this behavior as an option.

Replicated data logs are typically designed to streamline I/O so that replication has minimal impact on application performance. In many cases, this means that the logs consist of a single range of consecutively numbered disk or volume blocks, and cannot be expanded. It is therefore possible for a replicated data log to fill if a network or secondary location outage persists for too long. Replication managers use two basic techniques to deal with this: I/O throttling and application blocking. Some replication managers are able to inject short delays into application response when their logs approach an administrator-defined "high water mark." This has the effect of making applications run more slowly, and buys more time in which to repair the network or secondary location outage. When a replicated data log fills completely, there is typically a predefined administrative choice between blocking further application execution, abandoning replication altogether, and dropping into a fallback mode, as described in the next paragraph.

Fallback mode. A replicated data log can grow without bound because there is no bound on the number of writes that applications can request; nor is there any bound on the amount of data they can write. If, instead of keeping track of each individual write, however, a replication manager were simply to keep track of which blocks in a replicated volume group had been written at the primary location, the amount of space required for tracking would be both bounded and small. For example, each megabyte "chunk" of storage in a volume could be represented by a bit. A replication manager could set the bit when any data within that megabyte was modified. When access to secondary locations had been restored, only data within modified "chunks" would need to be transmitted and written to restore synchronization between primary and secondary volumes.

Some replication managers do, in fact, implement this technique to provide the ability to survive network outages of indefinite duration without impeding primary location application execution. This fallback mode of operation is not the same as asynchronous logging, because the order in which application writes occur is lost. After communications between primary and secondary locations are restored, all modified data must be copied atomically, because there is no information about which data was updated in which order. The amount of data modified, however, is typically much less than the full capacity of all replicated volumes, so recovery time is considerably shorter than it would be if the entire primary location volume contents had to be recopied. Once data is resynchronized, replication can resume as normal.
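A minimal sketch of the log-overflow behavior, with invented sizes: while the log is below capacity, writes are recorded in order; once it would overflow, this hypothetical manager falls back to a coarse bitmap that remembers only which regions changed, enough for later resynchronization but with write ordering lost.

# Conceptual sketch of log-overflow handling: ordered logging until the
# log is full, then fallback to a dirty-region map that records *which*
# regions changed but not the order of the changes.  A real manager
# would also fold the already-logged entries into the map on fallback.

REGION_BLOCKS = 2048          # e.g., 1 MB "chunks" of 512-byte blocks
LOG_CAPACITY = 4              # tiny on purpose, to force the fallback

log = []                      # ordered (block, data) entries
dirty_regions = set()         # fallback tracking structure
fallback = False

def record_write(block, data):
    global fallback
    if not fallback and len(log) < LOG_CAPACITY:
        log.append((block, data))             # order preserved
    else:
        fallback = True                       # outage outlasted the log
        dirty_regions.add(block // REGION_BLOCKS)

for block in [10, 4096, 4097, 9000, 9001, 12000]:
    record_write(block, b"...")

print("ordered log entries:", len(log))
print("dirty regions to resynchronize:", sorted(dirty_regions))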
Replication Software Architecture

Figure 13.11 diagrams the functional architecture of a primary location replication manager. Reduced to its essence, a primary location replication manager consists of an I/O request filter, which determines which application I/O requests are of interest to replication, and a log manager, which manages interactions with secondary locations.

Figure 13.11 Timing replication.
The role of the request filter shown in the figure is to intercept all I/O requests from applications, determine which of them affect replication, construct and write log entries for those that do, and pass the original I/O requests on to the local I/O stack. The application I/O requests of interest to the replication manager are any that modify data objects that are being replicated. For volume-level replication, this means writes to replicated volumes. For file-level replication, user data writes; file create, delete, open, and close requests; directory modifications; and all operations that modify file metadata (e.g., changes to access permissions) must be replicated, because all of these operations change the state of replicated data objects at secondary locations.

Replication log entries contain the nature of the operation and the data to be replicated (for write requests). The primary location replication manager must keep track of which secondary locations have received and acknowledged each log entry. Only when data has been transmitted to all secondary locations can an entry be removed from the primary location's replicated data log.
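The division of labor between the request filter and the log manager can be sketched as follows. This is a conceptual model with invented names, not any product's implementation: reads and writes to nonreplicated volumes pass straight through, while writes to replicated volumes get a log entry that is retired only after every secondary location has acknowledged it.

# Conceptual sketch of the request filter and log manager roles.

REPLICATED_VOLUMES = {"F:"}
SECONDARIES = {"site_b", "site_c"}

log = {}          # sequence number -> (request, set of outstanding secondaries)
next_seq = 0

def filter_request(volume, op, block, data=None):
    """Pass every request to the local I/O stack; log only replicated writes."""
    global next_seq
    if op == "write" and volume in REPLICATED_VOLUMES:
        log[next_seq] = ((volume, block, data), set(SECONDARIES))
        next_seq += 1
    # ...the original request is passed on to the local I/O stack here...

def acknowledge(seq, secondary):
    """Retire a log entry only when all secondaries have acknowledged it."""
    entry, outstanding = log[seq]
    outstanding.discard(secondary)
    if not outstanding:
        del log[seq]

filter_request("F:", "write", 100, b"data")
filter_request("F:", "read", 100)              # reads are not logged
acknowledge(0, "site_b")
acknowledge(0, "site_c")
print("entries still in the log:", len(log))   # 0: both secondaries acknowledged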
Replicated Data Write Ordering

In spite of the asynchronous nature of most data replication, it is essential for data integrity that the order of secondary location writes be the same as the order of primary location writes. An example may help make this clear. Suppose a demand deposit application processes three items (e.g., checks deposited to the same account) in succession. Each check should add an amount to an account balance. To do this, the application reads the record containing the current balance, adds an amount to a field, and rewrites the record. From a file system or volume manager point of view, each addition to the balance is seen as a read and a write of data at the same address. The data for each write contains the updated current balance, but no information about the prior balance or about the amount that was added. Each record written by the application has a larger balance than the previous one. If a replication manager writes the records out of order, however, a lesser balance from an earlier update will overwrite a greater balance from a later update, with no indication of error. The effect of an update has been obliterated, even though all writes have been faithfully replicated.

Executing write requests at secondary locations in the same order in which they were executed at the primary location solves this problem. A slightly different solution, called causality, shown in Figure 13.12, allows for more parallelism. The principle behind causality is that if applications allow two write requests to be processed simultaneously by the local I/O system, there is no guarantee of the order in which those requests will complete. The applications must therefore be designed so that they perform correctly regardless of the order in which simultaneously issued write requests are executed. If applications are indifferent to the order in which a set of writes is executed, then a replication manager may be indifferent as well. A replication manager that implements causality in write ordering can use the same parallelism in executing secondary writes that was used at the primary location when applications initially wrote the replicated data.

Figure 13.12 Replication I/O time line illustrating causality.
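To make the ordering requirement concrete, the sketch below applies log entries at a secondary strictly in the sequence assigned at the primary. The entry fields are the assumed ones from the filter sketch earlier; a causality-aware variant could additionally apply, in parallel, entries whose primary executions overlapped.

import heapq

def apply_in_primary_order(entries, volumes):
    """Sketch: apply received log entries at a secondary strictly in primary
    sequence order, even if they arrive out of order. 'volumes' maps a volume
    name to a writable, seekable file-like object."""
    pending = []            # min-heap keyed on primary sequence number
    next_seq = 1
    for entry in entries:
        heapq.heappush(pending, (entry["seq"], entry))
        # Apply every entry whose predecessors have already been applied.
        while pending and pending[0][0] == next_seq:
            _, ready = heapq.heappop(pending)
            target = volumes[ready["volume"]]
            target.seek(ready["offset"])
            target.write(ready["data"])
            next_seq += 1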
Initial Synchronization of Replicated Data

One of the most significant problems in replication technology is simply getting started. In most real-life applications of the technology, large amounts of data are replicated—tens or even hundreds of gigabytes. The update rate at the primary site may be relatively low—a few tens of kilobytes per second or even less. Making secondary location storage contents identical to primary location contents to start with, however, requires copying the entire set of data to be replicated from the primary location to all secondary locations.

With volume-level replication, no assumptions are made about the contents of replicated data blocks. For primary and secondary locations to be in synchronization, therefore, all block contents of all replicated volumes at both locations must be identical, even if some blocks represent unallocated space.9

9 An exception to this is the case in which both primary and secondary volumes are about to be initialized with file systems. In this case, there is no user data or file system metadata on either primary or secondary volumes, so the contents of both are irrelevant. Some replication software packages allow initialization to be bypassed for cases such as this.

Some volume replication managers start by bulk-copying data across the network that will be used for replication. Others permit an administrator to simply declare that volume contents at primary and secondary locations are identical.
This strategy allows an administrator to make image copies of the volumes to be replicated (on tape, for example) and restore those copies to volumes of identical capacity at each secondary location. Image copies are crucial in this case, because with volume replication, the contents of primary and secondary blocks with the same block numbers must be identical. A file-oriented backup and restore would not guarantee this property. When the volume contents at all locations are identical, replication is started by administrative action. Figure 13.13 illustrates initial synchronization for replicated volumes.

Both of these initial synchronization methods share a significant drawback: they require that data be frozen (i.e., unmodified by applications) for the entire time during which initial synchronization is occurring. This is unacceptable for many applications, particularly in cases where replication is used to create periodic frozen images of continuously active operational data. To alleviate this limitation, some replication managers allow replication to start in response to administrator commands issued at an arbitrary time, such as when a primary volume image copy has been completed, with links to secondary locations administratively stopped (as if they had failed). This simulates a link outage and results in the replication log filling as applications write data at the primary location. When the initial data has been restored at a secondary location, the link connecting it to the primary is administratively started and the updates in the log are transmitted to the secondary. This minimizes the time during which primary location applications must be down for replication initialization.
Figure 13.13 Initial synchronization for volume replication: (1) image copy of primary volume(s) to be replicated; (2) restore image to volumes at secondary location; (3) start replication with identical data at both locations; (4) start application.
Initial Synchronization of Replicated Files

The availability of information about individual data objects results in an important difference between volume replication and file replication with respect to how data is initially synchronized. With file replication, only files need to be copied to start replication. There is no need to make the unallocated space contents or metadata at primary and secondary locations identical. If a file system is sparsely populated, the amount of data that must be copied to initiate replication for the first time is minimal. Moreover, target directories at secondary locations and the data in them can be read by applications while files are being copied into them.

On the other hand, a file system with most of its space allocated to a large number of files can take longer to synchronize than a volume of equivalent size. Because initializing a volume for replication requires copying all of the volume's blocks, large sequential I/O requests can be used to maximize physical I/O efficiency. File replication, however, uses file read and write operations. These generally utilize physical I/O resources less efficiently than sequential block operations.
Resynchronization

Brief secondary location outages, or outages of a primary-secondary link, are usually not fatal to replication. As long as the primary site log does not overflow, replication resumes as soon as both the secondary location and the link connecting it to the primary are again operational. When an outage lasts long enough to fill the primary replication log, however, and applications at the primary location are allowed to continue, secondary location replicas become unreconstructable, and replication must be restarted after the outage is repaired. When this occurs, primary and secondary location contents must be resynchronized. Resynchronization is required whenever a link or secondary system outage persists for long enough to exhaust the primary location's log space.10 Resynchronization is also required if replication is stopped for administrative reasons, such as creating a backup using a data image frozen at a secondary location.

10 Unless application execution is blocked so that no more I/O requests are made to the replicated data.

At its simplest level, volume resynchronization can be regarded as equivalent to initial synchronization, and the same techniques can be used. Some volume replication managers go a step further than this, keeping track of changed block regions in primary location volumes when a replicated data log overflows (for example, using a bit map with a bit for each region). When the outage is repaired, only changed block regions must be copied in order to resynchronize.
Even for lengthy outages, this typically results in significantly reduced resynchronization times, because in a large data set it is typical that much of the data changes infrequently or never. This technique permits replication to resume with optimal efficiency after outages of arbitrary duration, but it is important to note that replication with write order preservation semantics cannot resume until all modified block regions have been copied to all secondary locations. This is because a map that contains only information about which block regions have been modified gives no information about the sequence in which modifications occurred. Thus, all modified block region contents must be copied atomically, and only then can write-ordered replication resume.

When file replication restarts after a stoppage, a replication manager can take advantage of contextual information to shorten the time required to resynchronize. For example, if all of a file's metadata is identical at both primary and secondary locations, it is reasonable to conclude that the file's data is also identical and that the file need not be copied to achieve resynchronization. Some file replication managers resynchronize by computing checksums on files that are apparently identical at primary and secondary locations. These checksums are designed so that it is essentially impossible for two files whose contents are different to produce the same checksum value. Before copying a file to a secondary location, its checksum is compared with that of any apparently identical file at the secondary location; if the two checksums match, no copy is made.

In general, file system replication managers are in a better position to avoid needless copying during resynchronization. Large files whose checksums do not match can be broken down into regions and each individual region checksummed. Only regions whose checksums do not match their secondary location counterparts are copied. These techniques further minimize the amount of data that must be copied to resynchronize primary and secondary location file data after a replication stoppage.
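The checksum comparison can be sketched as follows. This is a simplified illustration using SHA-1 region digests; real products use their own checksum algorithms, region sizes, and transfer protocols.

import hashlib

REGION = 4 * 1024 * 1024  # assumed 4 MB comparison regions

def region_digests(path):
    """Return a digest for each fixed-size region of a file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(REGION)
            if not chunk:
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests

def regions_to_copy(primary_path, secondary_path):
    """Identify which regions differ and so must be recopied; identical
    regions are skipped entirely."""
    primary = region_digests(primary_path)
    secondary = region_digests(secondary_path)
    return [i for i, d in enumerate(primary)
            if i >= len(secondary) or d != secondary[i]]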
Using Replication

This section describes some techniques for using data replication technology alone and in combination with other techniques to solve some common information processing and management problems.
Bidirectional Replication

The examples thus far have shown locations that play only one replication role—either primary or secondary. In principle, it is possible for a location to
play both roles for different applications' data, creating a degree of load balancing as well as protection against disaster in either of two locations. Figure 13.14 illustrates how replication can be used to protect two locations against disaster at either of them. Here, Location L is the primary location for Application A. Application A processes data on disks or volumes at Location L. The source replication manager running at Location L intercepts Application A's I/O requests and directs the appropriate ones to its target counterpart at Location M, where they are replicated. Either file or volume replication can be used in this scenario. Similarly, Application B normally runs at Location M, processing data stored on disks at that location. Application B's data is replicated at Location L, which serves as a secondary location for it.

If a disaster incapacitates either of the locations, the surviving location can run its application, because it has an up-to-date copy of the disaster location's data. Of course, servers at both locations must be configured with sufficient resources to run both applications at acceptable levels of performance. In most cases, this is a more effective use of resources than a disaster recovery scenario in which the recovery server (secondary location) is idle except when a disaster actually does occur.
Figure 13.14 Bidirectional replication.
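The arrangement in Figure 13.14 amounts to two replication data sets configured in opposite directions. A tiny sketch of that configuration follows; all names are invented for illustration.

# Each location is primary for one application's data and secondary for the other's.
replication_sets = [
    {"rds": "AppA", "primary": "LocationL", "secondary": "LocationM"},
    {"rds": "AppB", "primary": "LocationM", "secondary": "LocationL"},
]

def takeover_work(failed_location):
    """List the replicated data sets the surviving location must take over."""
    return [r["rds"] for r in replication_sets if r["primary"] == failed_location]

print(takeover_work("LocationL"))  # -> ['AppA']; Location M runs Application A from its replica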
Using Frozen Images with Replication

When replication is used to create frozen images of data for backup, mining, or other analysis at remote locations, two basic techniques can be used to create the frozen data images from which the remote backups or analyses proceed:

Option 1. The application's "live" data can be replicated. At any convenient time after initial synchronization has been achieved, the application can pause (so that all of its data is internally consistent and cache contents are flushed to persistent storage) and replication can be stopped. The replicated image at the secondary location is a consistent frozen image of application data that can be used immediately for backup or analysis.

Option 2. Local frozen image techniques, such as Volume Manager split mirrors (page 189), can be used to freeze an image at the primary location. Whatever technique is used to create the local frozen image, the application must pause at some convenient time so that its data is in a known consistent state for an instant. Once the frozen image has been established, the application resumes processing and the frozen image can be replicated to a secondary location for backup or analysis. In this case, the replica at the secondary location cannot be used until initial synchronization has been achieved.

Figure 13.15 illustrates both of these techniques. Option 1 is particularly applicable with file replication techniques, since initial synchronization can be accomplished while primary location data (and indeed, even secondary location data) are in use by applications.
Figure 13.15 Using replication to make a remote copy of a frozen image. Option 1: file data is replicated as it is updated by the application; the application pauses so that replication can be stopped to freeze the data image at the secondary location. Option 2: the application pauses for creation of a local frozen image; the frozen image is replicated, and replication is stopped when initial synchronization has been achieved.
The critical time point with this technique is the replication stopping point rather than its starting point. Once replication is initiated, it makes no sense to pause the application to establish a consistent data state until initial synchronization between primary and secondary locations has been achieved. Thus, replication must be initiated early enough that initial synchronization can be achieved prior to the target time for quiescing the application so that the data image can be frozen (by stopping replication).

With Option 2, there is better control over the point at which the application is paused to establish the frozen image, because the replica is made after the frozen image is established. Frozen image technologies, such as Volume Manager split mirrors and the copy-on-write snapshots available in some UNIX environments, typically establish frozen images in a few seconds at most.11 Once a frozen image is established, applications can be restarted. The frozen image in this case is literally frozen. It represents application data at the pause point. Creating the replica at the remote secondary is generally not time-critical; initial synchronization can be "throttled," or constrained, to use only limited resources to minimize impact on the application. Whenever the replica is complete, replication is stopped and backup or analysis tools at the secondary location can be used to manipulate the replica.

11 Although split mirrors require that all mirrors be synchronized before the breakaway occurs.
Volume Replication for Windows Servers: An Example

VERITAS Software Corporation offers a volume replication manager called the VERITAS Volume Replicator (VVR) for the Windows 2000 operating system. VVR replicates data stored on sets of volumes called Replication Volume Groups (RVGs) between:
■■ A single source and a single target
■■ A single source and multiple targets
■■ Multiple sources, each with its own unique data to replicate, and a single target
VVR preserves replication write ordering across an entire RVG. Hence it is possible for applications and database managers that store their data on multiple volumes to replicate their data using VVR without fear of data corruption due to out-of-order writes, even if the writes are to different volumes. VVR replicates data either synchronously or asynchronously. When asynchronous replication is used, VVR utilizes a Storage Replication Log at the primary
location to minimize the impact of replication on application performance. This section illustrates a simple case of volume replication in the Windows 2000 environment, using VVR as an example.
Managing Volume Replication

VVR is highly integrated with the companion VERITAS Volume Manager for Windows 2000. The two are managed using the same Microsoft Management Console snap-in. When VVR is installed, additional information and commands are available through the Volume Manager console window. A view of the Volume Manager console for a computer named kilby is shown in Figure 13.16. In addition to the Disk Groups, Disks, and Volumes object groupings of the Volume Manager, the VVR installation displays a Replication Network object grouping, which contains Replication Data Set (RDS) objects. A replication data set is an administrative grouping of replication objects consisting of:
■■ A primary RVG.
■■ All the secondary RVGs to which the primary's data is replicated.
■■ The network links connecting the primary system to each secondary system.

Figure 13.16 Volume Manager console with VVR installed.
In the console view shown in Figure 13.16, the Replication Network contains one RDS called ORCLRDS. The RDS contains a replicated volume group on server kilby, and another on server noyce.veritas.com. Each of the replicated volume groups contains a single data volume (ORCL-VOL-A) and a system replication log volume (ORCL-VOL-LOG). The local data volume (the one on server kilby) is presented as drive letter E:. The log volume does not have a drive letter, as it is only accessed by VVR.
Creating a Replicated Data Set

As with most Windows 2000 online storage management tools, VVR actions are generally accomplished through the invocation of wizards. Figure 13.17 illustrates the first panel displayed when the Create RDS command is issued from the Volume Manager console. As the figure indicates, before a replicated data set can be created, volume groups at both primary and secondary locations must have been created using Volume Manager management commands.

Figure 13.17 Introductory panel of the Create Replicated Data Set Wizard.
Ideally, the primary and secondary host computers should be connected as well; otherwise, VVR cannot verify that primary and secondary volumes are compatible. Like other Windows 2000 storage management tools, VVR automatically makes as many of the decisions required to manage replication as possible. This is called Express Mode of operation. Specifying Express Mode for RDS creation is illustrated in Figure 13.18. Decisions that cannot be made automatically by VVR even when Express mode is specified include:
■■ The name to be given to the RDS (RepRDS, in the example).
■■ The name of the RVG to be replicated (RepRVG). All RVGs in an RDS must have the same name. In Express mode, the wizard assumes that the replicated volume group and volume names are the same on primary and secondary systems. When Custom mode is used, this need not be the case.
■■ The name of the server that will be the replication primary data source (in this case, kilby).
The administrator enters this data as part of the RDS creation process. Once the data have been entered and the administrator clicks the Next button, the volumes that will comprise the primary RVG are defined using the dialog shown in Figure 13.19.
Figure 13.18 Specifying Express mode and naming a VVR replicated data set.
Figure 13.19 Specifying data volume for VVR replication.
All of the volumes in an RVG must be made up of capacity from disks in the same disk group. When the administrator specifies a disk group from the drop-down list shown at the top of Figure 13.19, the list box in the middle of the panel displays all of the volumes made up of storage capacity located on disks in that group. In this figure, only one dynamic disk group is defined for the system (visible in the Disk Groups object listing in the Volume Manager console window shown in the background of the figure). The disk group RepDG provides the storage for two volumes, Oracle_P_Datavol and ORCL_VOL_LOG; Oracle_P_Datavol has been specified as a data volume to be replicated. In this example, the disk group hosts only one data volume. If a disk group contains multiple volumes, any number can be specified for replication with a single invocation of the wizard. This dialog panel is also used to specify whether replication I/O is to be logged at the primary location and, if so, the location of the log. In Figure 13.19, ORCL_VOL_LOG has been specified. This is the only available choice in this example, since the replication log must not reside on a volume whose data is being replicated. A replication volume group has a single replication log, because all writes to volumes in the group are
replicated in sequence to all secondary locations. If logging is not specified for a replication volume group, the group is effectively restricted to synchronous replication. In virtually all cases a replication log should be specified.
VVR Data Change Map Logs

The panel in Figure 13.19 allows the administrator to specify a DCM log, or Data Change Map log, for each volume to be replicated. DCM logs record primary RVG block regions in which changes have been made. They are used during initial synchronization and when a replication log overflows because of a secondary location or network outage. Without a DCM log, initial synchronization must be either forced or check-pointed. Both of these options are described in the paragraphs that follow. Since each replicated volume has its own DCM log, it is possible for an administrator to control the replication log overflow behavior of an RDS on a volume-by-volume basis.

Without a DCM log, only two options are available when the replication log overflows: stop application processing, or stop replication for volumes that do not have DCM logs associated with them. If replication is stopped, it must be restarted at a later time, complete with a possibly time-consuming initial synchronization stage. With a DCM log, resynchronization is accomplished by copying all block regions indicated in the log as having been modified during the outage. In almost all cases, a DCM log is desirable.

Once the volumes to be replicated have been specified, the remaining components of the RDS, the replication targets, are designated in the Select Secondary host(s) for replication panel of the Create Replicated Data Set wizard (Figure 13.20). In the window at the left are displayed the reachable Windows 2000 servers on which VVR is installed (the candidates to become secondary replication locations). The administrator uses the Add, Remove, and other buttons to specify one or more secondary replication locations. In this example, the computer named noyce in the veritas.com domain has been specified as the only secondary replication location. As many as 32 secondary hosts can be specified.

Specification of secondary replication target locations completes the input phase of replicated data set creation. Typical of the other Windows 2000 online storage management wizards described so far in this book, this one takes no irrevocable action at this point. Instead, it displays the panel shown in Figure 13.21, which summarizes the input parameters. The administrator must click the Finish button to effect the configuration displayed; only then is the system set to begin replication—it does not automatically start replication. When the administrator clicks the Finish button, the Success dialog shown in Figure 13.21 is displayed to indicate that an RDS and RVG have been successfully created.
Figure 13.20 Specifying secondary locations for VVR replication.
Figure 13.22 shows detail views of some of the objects that comprise the RDS immediately thereafter. These views, from server noyce, the secondary replication location in this example, illustrate the distributed nature of VVR—that is, any server on which VVR is installed, and for which an administrator has adequate access privileges, could be used to manage any server in the replication network.

The top view in Figure 13.22 gives detailed information about the replication data set: the names of the RDS and the primary location RVG, the size of the replication log, and the names, replication modes (synchronous or asynchronous), and log sizes of all secondary replication volume groups. The middle console view displays information about the secondary RVG selected in the panel at the left of the window. In addition to naming information, this display supplies information about:

RVG State. In this example, the state of the RVG is reported as Empty, meaning that the RVG has not been started (described shortly) and that replication has not begun.
Figure 13.21 Completing VVR replicated data set creation.
Secondary Replicator Log Size. The primary replicator log is used during normal operation. Similar logs at secondary locations are used to aid recovery after a primary server failure. After recovery from a primary server failure, primary and secondary servers negotiate to determine the last updates from the primary server that were persistently recorded at the secondary location. All updates subsequent to that point are recorded in the primary replicator log; they are transmitted to the secondary, logged in the secondary replicator log, and applied atomically before replication is allowed to recommence. This greatly shortens the amount of time until replication recommences after a primary server failure.

Replication Mode. The mode of replication is either synchronous or asynchronous. In synchronous mode, every application write to a volume in the primary RVG is transmitted to all secondaries; receipt is acknowledged before the write is regarded as complete and the application is allowed to continue. In asynchronous mode, primary updates are allowed to "run ahead" of replication at secondary locations by a bounded amount. VVR also supports a "soft-synchronous" mode of replication. In soft-synchronous mode, replication is synchronous until an event occurs (such as a link failure) that would cause application I/O to fail. When this happens, VVR switches to asynchronous mode until the link is again operational, at which time it drains the replicator log and reverts to synchronous mode.

Replication Status. The replication status is the state of correlation between data at the primary and secondary locations. In the example, the status is reported as Stale, meaning that there is no known correlation between data at the primary location and data at this particular secondary location.
Figure 13.22 View of VVR replicated data set from secondary server (noyce).
Replicator Log Protection. A replicator log may be protected against overflow. Application I/O may be forced to fail when the replicator log is full, or VVR may revert to tracking updates in a Data Change Map (if one is available). In the example, Replicator Log Protection is off.

Latency Protection. Filling of a replicator log can be forestalled by throttling, or slowing down, applications when the log comes dangerously close to being full. For link or secondary outages of limited duration, this can delay the moment when the replicator log fills and the specified protection mode must be invoked. In the example, Latency Protection is disabled. If it is enabled, the administrator must supply a High Mark, the maximum number of I/O requests allowed in the replicator log before throttling occurs, and a Low Mark, the number of I/O requests in the replicator log below which throttling is disabled.

The secondary RVG details view also enumerates the replicated volumes in the group and provides information about their size and layout, whether there is an associated Data Change Map log, and the name of the corresponding volume in the primary RVG. In Figure 13.22, primary and secondary volume names are identical (a necessary condition of using Express mode to create the RDS), but this need not be the case.

The bottom view in Figure 13.22 shows the details of a volume object that is part of an RDS. Most of the information in this view is also available in other views, at least in summary form. The one piece of information unique to this view is the amount of free space in the volume, expressed both in megabytes (or gigabytes for large volumes) and as a percentage of total volume capacity.
Initializing Replication

Creating a replicated data set does not automatically start replication. The administrator must next establish the connection between primary and secondary replication volume groups by attaching the latter to the former. Only then, by explicitly invoking a command to do so, will replication begin. Figure 13.23 illustrates the use of the Attach Secondary… command to attach a secondary RVG to its primary. In the course of executing the Attach Secondary… command, the initial synchronization method is specified by responding to the dialog shown in the lower right corner of Figure 13.23. The administrator chooses from:

None. This option bypasses initial synchronization of primary and secondary replication volume groups. If specified, replication begins immediately when the Start Replicated Volume Group command (described shortly) is issued. This option is useful for getting replication started quickly when both primary and secondary RVGs are newly created and no data of value is contained on either.

Force. This option also bypasses initial synchronization of primary and secondary replication volume groups. It is useful when the contents of the primary RVG are known to be identical to those of the secondary RVG through means external to VVR (for example, if block-for-block image backups of primary volumes have been restored to corresponding volumes of the secondary RVG and neither RVG has been modified since the backup was initiated). When this method of synchronization is specified, VVR does not verify that the contents of primary and secondary volumes are identical.
Figure 13.23 Invoking the Attach Secondary command.
Auto Sync. If this option is specified, VVR copies the entire contents of volumes in the primary RVG to corresponding volumes in the secondary RVG, using the Data Change Map (DCM) log to track the progress of the copy. During the initial synchronization, write-ordering fidelity is not preserved. Updates to primary RVG volumes, however, can be replicated while initial synchronization is occurring. Such updates are held in the primary replication log and sent to the secondary when initial synchronization is complete.

Check Point. This option is similar to the Force option, but it allows for application use of the primary data while initial synchronization is occurring. When this option is used, the start of an image (block-level) backup is recorded in the replication log. The backup is then made while data at the primary location is in use by applications. During this time, application updates to the primary location data are recorded in the replication log. When the backup is complete, it is shipped to the secondary location and restored to corresponding secondary location volumes. Once data has been restored to secondary location volumes, the link connecting primary and
secondary is started, and all application updates from the time of the check point are applied to the secondary, with write order fidelity preserved. Once the data on a secondary replication volume group has been synchronized with its primary, or synchronization has been arranged for by one of the methods just described, replication can be started using the Start Replicated Volume Group(s) command shown in Figure 13.24. Invoking the command displays the confirmation dialog shown at the lower right corner of the figure. When the command executes and replication begins, the console display is updated to indicate the actual state of replication for the secondary RVG; this is shown in Figure 13.25. In Figure 13.25, RVG State is shown as Started, meaning that the VVR Start Replicated Volume Group(s) command has executed successfully. Replication Status is shown as Active, meaning that primary and secondary volume contents have completed initial synchronization and normal replication is occurring. If either the Auto Sync or Check Point option had been specified for initial synchronization, Replication Status would show as Synchronizing until initial synchronization has been achieved.
Figure 13.24 Start RVG command and verification dialog.

Figure 13.25 Initial state of volumes.

Sizing the Replication Log

During VVR replication, the primary replication log temporarily holds application updates to replicated data while VVR waits for network bandwidth to
transmit them to secondary locations. The replication log mechanism is also what enables VVR to "ride through" brief failures or overloads of a secondary location or the network connecting the primary to it. For resiliency, VVR requires that the primary replication log be allocated on a separate volume from any replicated volume in the RVG. For efficiency, VVR allocates primary replication log space contiguously, and does not allow the log to grow during replication.

Obviously, the size of the primary replication log is key when VVR is replicating data asynchronously or in soft-synchronous mode. If the replication log fills, then either replication must cease altogether or VVR must adopt a fallback strategy using the DCM log to track changes so that replication can resume after the failure has been recovered. If replication ceases altogether, then a complete reinitialization must be conducted when both the secondary system and the network are again available. If the DCM log fallback strategy is adopted, all changed regions indicated in the DCM must be copied to affected secondary locations, and replication restarted. Unless the outage is very long (days or weeks), the latter is usually the preferable strategy. Obviously, the best strategy is to size the replication log to ensure that these alternate strategies are used rarely.

The replication log can absorb application updates to data until it fills. The length of time to fill the replication log is
therefore the size of the log divided by the rate at which applications update data, or:

T_log_fill (seconds) = S_log (bytes) / R_update (bytes/second)

Thus, for example, if an application updates data at the rate of 3 megabytes per minute (50,000 bytes per second), and the replication log size has been set at 400 megabytes (as in the example of Figure 13.25 and preceding figures), then:

T_log_fill = 400,000,000 bytes / 50,000 bytes/second = 8,000 seconds

or about 2 hours and 13 minutes. This is the longest network or secondary location outage that can be sustained before VVR would be forced to enter one of its fallback modes. Alternatively, and perhaps more usefully in planning, the size of the log required to sustain an outage of a given duration can be computed as:

S_log (bytes) = T_log_fill (seconds) × R_update (bytes/second)

Then, for example, to sustain an outage of eight hours (28,800 seconds) with an application that updates data at a rate of 50,000 bytes per second would require:

S_log = 28,800 seconds × 50,000 bytes/second = 1,440,000,000 bytes

or about 1.44 gigabytes.

This analysis, however, is based on the assumption that application data update rates are constant over relatively long periods of time. This is almost certainly an oversimplification for any real application, which will have both busy and idle periods throughout a business day. The replication log should therefore be sized using worst-case peak loading update rates for the applications whose data is being protected.

In addition, recall that VVR replicates all writes to a volume group, whether they represent data or metadata. If a file or database is extended, all of the metadata updates required to allocate additional space are recorded in the replication log as well as the application data that is written to the extended space. Similarly, if insertions to a database table result in index updates, these are written to the replication log as well as the actual inserted data. Therefore, the replication log must be sized to take these updates into account.
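These calculations are easy to script. The sketch below simply restates the two formulas; the sample rates and sizes are the ones used in the text.

def log_fill_time_seconds(log_size_bytes, update_rate_bytes_per_sec):
    """How long an outage the replication log can absorb: T_log_fill = S_log / R_update."""
    return log_size_bytes / update_rate_bytes_per_sec

def required_log_size_bytes(outage_seconds, update_rate_bytes_per_sec):
    """Log size needed to ride through a given outage: S_log = T_log_fill x R_update."""
    return outage_seconds * update_rate_bytes_per_sec

# Examples from the text: a 400 MB log at 50,000 bytes/second lasts 8,000 seconds
# (about 2 hours 13 minutes); an eight-hour outage requires about 1.44 GB of log.
print(log_fill_time_seconds(400_000_000, 50_000))   # -> 8000.0 seconds
print(required_log_size_bytes(8 * 3600, 50_000))    # -> 1440000000 bytes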
NOTE: Numerous conditions can result in minor variations in the stream of updates to a replicated volume group, so any calculations similar to those given in this section should be regarded as approximations. Log sizes should be adjusted upward accordingly.
In general, it is best to overestimate the replication log size required, since the consequence of filling the log is exposure to data loss if a disaster occurs while replication is being totally or incrementally resynchronized due to the replication log filling or overflowing.
Replication Log Overflow Protection

Replication log overflow is the condition in which an application update cannot be written to a VVR primary replication log without overwriting an existing log entry that has not yet been transmitted to and acknowledged by all secondary locations. When protection against log overflow has not been enabled for a replication data set, log overflow makes further replication invalid and requires a complete resynchronization of the entire RDS whose log overflowed. Because of the window of exposure to disaster it creates, this option should be used very sparingly.

VVR also provides protection against log overflow by stalling application I/O for a preset period when the log is close to overflowing. If log overflow occurs anyway (as, for example, when a network link to a secondary is down), one of these three prespecified means of recovery is invoked:

Override. When the "Override replication in case of log overflow" option has been enabled for a replication data set, the primary log is allowed to overflow, invalidating further replication just as if no protection had been applied. In this case, a complete resynchronization of the RDS is required before replication can resume.

Fail. When the "Fail application I/O in case of log overflow" option has been enabled for a replication data set, any application I/O that would result in log overflow is failed. On the positive side, this pauses replication with both primary and secondary location data intact and valid; on the negative side, it stops the execution of primary location applications as they wait (apparently) forever for I/O to replicated volumes to complete.

DCM. When the "Enable DCM log in case of log overflow" option has been enabled for a replication data set, log overflow enables the DCM log. From that point, changes to data at the primary location are tracked using the DCM log. When the link to a secondary location is again available, data from the replication log is transmitted (with write-order fidelity preserved), followed by regions of the RVG indicated as changed in the DCM log. During the latter transmission, write-order fidelity is not preserved. Replication resumes when all data regions indicated in the DCM log as changed have been transmitted to secondaries and written to volumes there.

In most cases, use of the DCM log for replication log overflow protection is preferable, because it provides both application continuity at the primary location and the
shortest path to resumption of disaster protection after the condition that resulted in the log overflow (most often a secondary system or network link failure) has been corrected.
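The throttling and overflow behavior just described can be summarized in a short sketch. The logic is generic and the option names merely mirror the text; nothing here is VVR code, and all objects and methods are assumptions for the example.

def handle_log_pressure(log, dcm, app_io, policy, high_mark, low_mark):
    """Generic sketch of primary-log pressure handling: throttle near the high
    mark, then apply one of three overflow responses."""
    if log.entries >= high_mark:
        app_io.throttle()            # stall application I/O briefly to buy time
    elif log.entries <= low_mark:
        app_io.unthrottle()

    if log.is_full():
        if policy == "override":
            log.allow_overwrite()    # replication invalidated; full resynchronization later
        elif policy == "fail":
            app_io.fail_writes()     # data stays valid, but applications stall
        elif policy == "dcm":
            dcm.start_tracking()     # bounded region map replaces the per-write log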
Protecting Data at a Secondary Location

A necessary, although possibly non-obvious, feature of a disaster recovery strategy is to protect the copy of data used for recovery at a secondary location against failures at the secondary location while it is, or is becoming, a primary processing location. Thus, data on secondary volumes must be backed up periodically. VVR provides a facility called secondary check point to enable backups at secondary locations with minimal impact on primary location applications.

A secondary check point starts with an administrative request made at a secondary location. The request causes a secondary checkpoint message to be sent to the primary. When the primary receives this message, it pauses replication to the secondary and records the secondary check point in its replication log. With replication paused, the secondary has a stable group of volumes from which an image backup can be made, either by copying the actual data image or by splitting a mirror if the secondary volumes are mirrored. A file-oriented backup is not possible because volume replication provides no context with which to determine the internal consistency of the replicated file system. Once the backup is complete (or started, if split mirrors are used), the secondary location administrator can send a message to the primary location requesting that replication be resumed from the point of the check point.
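The exchange amounts to a simple pause, back up, resume protocol. A schematic sketch follows, with invented object and method names rather than the actual VVR interfaces.

class SecondaryCheckpoint:
    """Schematic pause/backup/resume sequence for a secondary-location backup."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def run(self):
        # Checkpoint is recorded in the primary's replication log when it pauses.
        self.primary.pause_replication_to(self.secondary)
        try:
            self.secondary.image_backup()   # volumes are stable while replication is paused
        finally:
            # Resume from the checkpoint so no updates are lost or reordered.
            self.primary.resume_replication_to(self.secondary)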
Network Outages

One of the primary benefits of replication technology is that it permits replication to continue through brief network outages by recording changes to primary RVGs in the replicator log and then transmitting and applying them at the secondary location when the network outage is repaired.

To illustrate this property, Figure 13.26 shows a simple batch command file that can be used to generate continuous I/O activity on the replicated volume used in these examples. The file is simply an endless loop that continuously creates a file by copying the contents of an existing file and then deletes the newly created file. To demonstrate the continuation of replication during a network outage, replication is paused by an administrative action. The result of this action is shown in Figure 13.27, where the VVR console Secondary RVG view reports the secondary RVG's Replication Status as Pause.
Figure 13.26 Generating file system activity.

Figure 13.27 Monitor View of primary log buildup.
The figure also shows the Replication Monitor View, which displays information about replicated data sets. The most significant piece of information here is the state of the primary replicator log. Due to the continuous file system activity generated by the command file in Figure 13.26, the primary replicator log has filled to 23 percent of its capacity. If replication were paused long enough, or if an actual network outage had persisted long enough to fill the log, replication would cease, because neither Replicator Log Protection nor Latency Protection has been specified for this RDS.
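Figure 13.26 itself is a Windows batch file and is not reproduced in the text. The following is a rough Python equivalent of the same endless copy-and-delete loop, useful only as an illustration; the file names are placeholders.

import os
import shutil
import time

SOURCE = "seed.dat"       # assumed existing file on the replicated volume
SCRATCH = "scratch.dat"   # repeatedly created and deleted to generate writes

def generate_activity():
    """Endless loop that keeps the replicated volume busy, so a paused link
    shows up as steady growth in the primary replicator log."""
    while True:
        shutil.copyfile(SOURCE, SCRATCH)   # write activity: copy an existing file
        os.remove(SCRATCH)                 # delete the newly created copy
        time.sleep(0.1)                    # small pause to keep the load modest

# generate_activity()  # uncomment to run; stop with Ctrl+C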
Using Replicated Data

From the standpoint of file systems and applications, volumes emulate disks. As described in Chapter 1, a disk can write or read an arbitrary string of data to or from an arbitrary sequence of blocks in response to a single request, but it has no information about the content or meaning of the data or about any relationship that might exist between different requests. By contrast, a file system uses complex sequences of read and write operations to maintain the structural integrity of files and directories. For example, if an application appends data to the end of a file, the file may have to be extended. Transparently to the application, the file system allocates additional storage, reading and writing metadata structures to do so, writes the application's data, and records the expanded extent of the file in still other metadata structures. Some of these writes may be cached, or deferred, in order to consolidate them or to optimize I/O scheduling.

At the beginning and end of a file extension (or file or directory creation or any of a number of other operations) the file system's disk or volume image is internally consistent. During the operation, however, there may be moments when the file system's metadata structures on disk are not consistent with each other. Since a volume has no information about the interrelationship of I/O requests, it cannot distinguish between moments when a file system's on-disk structure is internally consistent and those when it is not. All the volume sees is a sequence of read and write requests; it has no way of telling which requests leave file system on-disk structures consistent and which do not.

Volume replication faithfully copies the sequence in which primary RVG write requests occur to secondary RVGs. Since the volumes of the primary RVG cannot determine when the file systems on them are internally consistent, neither can the replicated volumes at the secondary locations. Consequently, the file systems cannot be mounted and used by applications at secondary locations.

It is, however, possible to make some limited use of replicated data at secondary locations, through the use of split mirrors. Mirrors can be added to the
volumes of a secondary RVG, allowed to resynchronize with their volumes, and then split. The frozen images contained on the split mirrors can be backed up, analyzed, or otherwise processed at the secondary location. Because secondary RVGs contain volumes, mirrors can be added to and split from them even though they cannot be mounted. Some application-level or administrative coordination is required to do this. As with all uses of frozen image technology, applications, file systems, and databases at a primary replication location must be made quiescent in order to force their volumes to consistent on-media states before mirrors are split from secondary replicated volumes. Not only must applications and data managers be quiescent, with all cached data flushed to storage, but the primary replicator queue for the application’s data volumes must be empty as well, with all updates transmitted to secondary volumes. Once these conditions are met, mirrors can be split from secondary volumes for other processing, and applications can resume.
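The coordination just described can be summarized as a short sequence. The sketch below is generic; the object and method names are invented for illustration and are not a product API.

def split_secondary_mirror(app, primary_log, secondary_volume):
    """Generic coordination sketch for taking a usable frozen image at a
    secondary location from a replicated volume."""
    app.quiesce()                      # stop new writes; flush cached data to storage
    primary_log.wait_until_empty()     # all updates transmitted and applied at the secondary
    mirror = secondary_volume.split_mirror()   # frozen, consistent image at the secondary
    app.resume()                       # applications continue at the primary
    return mirror                      # back up or analyze the split mirror off-host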
RVG Migration: Converting a Secondary RVG into a Primary

In replication data sets with one secondary location, it is sometimes desirable to reverse the roles of primary and secondary replication locations, making the former secondary location the site at which applications run and data is updated, and the former primary location the replication target. The VERITAS Volume Replicator provides a shortcut for making this task easier, in the form of the Migrate… command, which can be invoked on replication data sets (Figure 13.28). In this example, the RDS contains only one secondary RVG, but there may be as many as 32 volume replication targets. When the Migrate… command is invoked to move the role of primary RVG to another location, the administrator must specify the new secondary location in the Secondary Name box, also shown in Figure 13.28. The Migrate… command encapsulates all the actions required to:
■■ Suspend replication momentarily.
■■ Unmount the volumes of the (old) primary RVG to prevent updates to it.
■■ Reverse the roles of primary and secondary RVGs, including the logs.
■■ Mount the volumes of the (new) primary RVG to make them usable by applications.
The Migrate… command does not, however, encapsulate any logic to quiesce or reenable applications or database managers. These must be executed separately by the administrator.
Figure 13.28 Using the Migrate command to specify a new secondary location.
Figure 13.29 shows two console views of replication objects that illustrate the state of the example system after the Migrate… command has executed. Here:
■■ Server noyce (noyce.veritas.com) has become the primary replication location (top of the figure).
■■ Server kilby has become the secondary replication location (bottom of the figure).
■■ The RVG state is reported as Started at the (new) primary location, indicating that replication is enabled, but as Stopped on the secondary location, while it awaits an administrative command to start it.
Administrative action is required to start replication and any applications that will process data on the replicated volumes at the new primary location.
Figure 13.29 Viewing new primary and secondary RVGs.
File Replication for Windows Servers

Several companies, including VERITAS Software Corporation, offer file-oriented replication managers. The examples in this section illustrate the VERITAS Storage Replicator (VSR)12 for both the Windows NT and Windows 2000 operating systems. VSR can replicate sets of files between:
■■ A single source and a single target (e.g., for off-host processing)
■■ A single source and multiple targets (e.g., for publication)
■■ Multiple sources, each with its own unique data to replicate, and a single target (e.g., for consolidation)
VSR replicates data asynchronously only; it cannot be forced to operate synchronously. It utilizes logs at both source and target to minimize the impact of replication on application performance.

12 Older versions of this software were sold under the name Storage Replicator for [Windows] NT (SRNT).
The first example in this section demonstrates file replication in the Windows NT environment using VSR. Like the other storage management software discussed in this book, the operation of VSR is managed through a console.
Replication Jobs

VSR manages units of work called jobs. A job replicates data between a fixed set of source computers and a fixed set of target computers. Figure 13.30 shows the VSR console's Configure tab, used to configure jobs and additional replication servers, and to install the VSR software remotely. In this scenario, two servers, whose network names are LEFT and RIGHT, are configured for replication (i.e., have VSR installed).

VSR operates within replication neighborhoods, which are similar to network domains in that they are groups of computers interrelated for replication management purposes. Each VSR replication neighborhood contains a single replication management server (RMS). An RMS is the repository of the neighborhood's replication information, such as membership and replication job schedules. All other servers in a replication neighborhood run a VSR component called the Replication Server Agent (RSA). In the simple replication neighborhood in Figure 13.30, server LEFT is the Replication Management Server. Both LEFT and RIGHT have RSA installed, so both can serve either as a source or target for replication. As jobs are defined, VSR checks for circularity. For example, it would not permit server LEFT to replicate a file to server RIGHT if the file were specified in a job that would replicate it back to server LEFT.
Figure 13.30 List of servers configured for replication.
To define a replication job, an administrator clicks the New Job button in the Configure view (Figure 13.30). Like the other Windows storage management software components discussed in earlier chapters, VSR uses wizards to specify and perform management functions. Clicking the New Job button invokes the VSR New Job wizard, whose initial panel is shown in Figure 13.31, where the administrator first specifies the type of replication job. In this example, with only two servers in the replication neighborhood, only a Standard (one-to-one) job can be meaningfully defined.

Figure 13.31 Specifying replication job type.

Next the administrator specifies options for the job, using the Replication Options panel given in Figure 13.32.

Figure 13.32 Replication job options.

Four options can be specified:

Prescan. Prescanning the directories to be replicated allows VSR to determine the number and size of files in the job. This in turn determines the replication algorithms to be used. Prescanning enables VSR to estimate the duration of the synchronization stage of replication and display a progress indicator as it executes. This option is useful if the synchronization stage of a replication job is believed to be particularly time-consuming, or if there are time constraints on the duration of a synchronize and stop job.

No Changes on Target. This option forces all of the job's target directories to be read-only for the duration of the job. This prevents them from being written by local applications during replication. This option is useful for creating off-host frozen images of data, because it prevents inadvertent modifications during image creation.

Exact Replica on Target. This option forces the contents of directories specified in the replication job to become identical at source and target locations. Any files in replicated directories at the target that do not have counterparts at the source are deleted. Like No Changes on Target, this option is also useful when creating a frozen image of source location data.

Continue Replicating After Synchronization. When this option is selected, replication continues after the source and target directories have been synchronized. If it is not specified, replication ceases when all data from replicated directories at the source have been copied to all target locations. This option is useful if the point in time at which replication should cease is not predictable in advance, for example, when replication starts while source data is still evolving. When this option is specified, replication continues until stopped by administrative action. This gives an administrator precise control over the instant at which target data images are frozen.

These four options apply to all replicated objects specified in a job. If different options are appropriate for certain data objects, those objects must be part of a different replication job. An unlimited number of replication jobs can be active simultaneously, so simultaneous replication of different data objects with different options is possible.
Specifying Replication Sources and Targets

Next, in the Replication Pairs panel of the New Job wizard (Figure 13.33), pairs of servers that will be replication sources and targets are specified. Depending on the type of job, one or more pairs may be specified. Clicking the Add button (top left of the figure) displays the Add a Replication Pair dialog (right side of the figure). Here the administrator specifies source and target servers for the job. Note there are two Select buttons: they both display lists of the servers in the replication neighborhood. Again, one or more target servers can be specified, depending on the type of job, and a single job can include both multiple sources and multiple targets.

Figure 13.33 Specifying the servers for a replication job.
Specifying Data to be Replicated The next step in defining a replication job is to specify the data to be replicated. Because the replicated objects are files, complex criteria for replicating or not replicating individual objects can be specified. Figure 13.34 shows the Replication Rules panel used for this purpose. In Figure 13.34, a list of storage devices on the source server is displayed. When a device is selected and the Add Rule button is clicked, the Rule dialog, shown on the right in the figure, is displayed. It is used to define each rule for file or directory inclusion or exclusion, as well as to specify target path(s) (a different path may be specified for each replication target). In Figure 13.34, the source data consists of all files in the directory MirrorQ on drive letter X: on server LEFT. As the bottom panel of the Rule dialog indicates, this data will be replicated to the directory R:\SRNT\Replica\MirrorQ on server RIGHT (named as the target in Figure 13.33). As this example illustrates, the target path for replication is completely arbitrary. In this case, a default path suggested by VSR has been accepted. The default path is a subdirectory of a default replication directory that is set up when the package is installed. By clicking the Edit button, an administrator can change the target path to any valid path on the target server. Care should be taken when using this feature, however, especially if the Exact Replica on Target option has been specified. With this option, VSR will delete any files in the target path that do not have counterparts in the source path and overwrite any identically named files whose contents differ from the version of the file in the source path.
Figure 13.34 Defining replication rules.
Replication Schedules The next task in defining a replication job is to establish the schedule for when the job will run. Clicking the Next button displays the Replication Schedule panel (shown in Figure 13.35) where this task is accomplished. There are two basic interrelated uses for scheduled (as opposed to continuous) replication: periodic copying and network load management. Scheduled replication can be used to create periodic off-host copies of changing sets of data. For example, a point-of-sale application distributed across several time zones might record sales records in a file called Today_Sales at each sales location. One effective use of replication would be to consolidate each location's Today_Sales file at a headquarters data center for corporate rollup and analysis, as well as centralized backup. To accomplish this, replication could be scheduled to begin near the end of each business day. Each time zone might correspond to a separate job, with start times staggered to correspond to business hours in the time zone. Another reason for scheduling replication is to manage network loading. When replication is used to create frozen images of data for backup or analysis, the network load of continuous replication may affect application performance during peak hours. Scheduling replication to start at specific times gives an administrator a measure of control over the timing of the replication network load.
Figure 13.35 Establishing a replication schedule.
Of course, there is a cost for starting and stopping replication. Whenever replication restarts, VSR must analyze all replicated directories in the job to determine which (source or target) files have changed and therefore must be recopied. This creates a burst of processing and network overhead, which in most cases is minor. But if the number of files in a job is large and the rate of change is low, continuous replication may have less overall impact on applications. Each rectangle on the Replication Schedule grid represents an hour of a day of the week. A shaded (selected) rectangle indicates that replication should occur during that hour on that day. In Figure 13.35, the Enable Scheduled Starts check box is not checked, indicating that replication is continuous, meaning that all hours of all days are considered automatically specified and are shaded in the figure. When Enable Scheduled Starts is checked, an administrator can designate hours during which replication may or may not be active. By selecting (clicking on) a specific hour or span of hours, the administrator indicates that replication should be active during those hours. During hours that are not specified, replication is stopped. At each transition between a nonselected hour and a selected one, replication is scheduled to restart and a synchronization stage is implied.
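The schedule is simply a grid of hour-long cells covering a week, and each transition from an unselected hour to a selected one implies a restart and a synchronization stage. The following sketch is an illustrative model of that behavior, not VSR code; all names in it are hypothetical.

```python
def count_scheduled_restarts(grid):
    """grid[day][hour] is True if replication should be active during that hour.

    Each transition from an inactive hour to an active one corresponds to a
    scheduled restart, and therefore to an implied synchronization stage.
    """
    hours = [grid[d][h] for d in range(7) for h in range(24)]  # flatten the week
    restarts = 0
    for i in range(len(hours)):
        if hours[i] and not hours[i - 1]:   # wraps from Saturday 23:00 back to Sunday 00:00
            restarts += 1
    return restarts

# Example: replicate only from 18:00 to midnight on days 1 through 5 of the week.
week = [[False] * 24 for _ in range(7)]
for day in range(1, 6):
    for hour in range(18, 24):
        week[day][hour] = True

print(count_scheduled_restarts(week))   # 5 restarts, one per selected evening
```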
Starting Replication Administratively When scheduled starts are enabled, replication automatically starts and stops according to schedule. Continuous replication jobs must be started by an administrator, using the Start Now command, which has been invoked (Figure 13.36) on the replication job called Copy Software Directory.
Figure 13.36 Starting replication.
The job's prescan stage begins executing immediately if the Prescan option has been specified, as in Figure 13.32. If the Prescan option has not been specified, the synchronization stage of replication begins. By clicking the Monitor tab, an administrator can display an informational dialog that gives a running view of job status. Figure 13.37 shows two snapshots of this dialog for the replication job defined in the preceding section. The snapshot on the left was taken while the job was still in the prescan phase. VSR continually updates the # of Files and Folders count as it prescans directories to estimate how much work is required to complete synchronization. The snapshot on the right side of the figure was taken during the job's synchronization phase. During this phase, the dialog continually updates: the number of bytes that have been sent to targets, the Percent complete progress indicator, and the estimate of the time remaining in the synchronization phase. Because VSR replicates file objects, some files can be read and file system tools can be used at the target while replication is in progress, including during synchronization. Figure 13.38 shows a Windows Explorer view of the replicated directory R:\SRNT\Replica\MirrorP captured while synchronization was occurring, fairly soon after the start of replication, when only two subdirectories in the MirrorP path had been synchronized. As synchronization continues, subsequent Explorer displays would show more directories as they were copied to the target location.
Figure 13.37 Prescan and synchronization phases of replication.
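The Percent complete indicator and the time-remaining estimate in the Monitor dialog behave like ordinary rate-based extrapolations. The sketch below illustrates only that general idea; it is not a description of VSR's internal algorithm.

```python
def synchronization_progress(bytes_total, bytes_sent, elapsed_seconds):
    """Generic rate-based progress estimate (illustrative only, not VSR internals)."""
    percent_complete = 100.0 * bytes_sent / bytes_total
    if bytes_sent == 0 or elapsed_seconds == 0:
        return percent_complete, None               # no transfer rate observed yet
    rate = bytes_sent / elapsed_seconds             # observed average bytes per second
    seconds_remaining = (bytes_total - bytes_sent) / rate
    return percent_complete, seconds_remaining

# 2 GB to synchronize, 500 MB sent during the first 10 minutes:
percent, remaining = synchronization_progress(2048 * 2**20, 500 * 2**20, 600)
print(round(percent, 1))     # 24.4 percent complete
print(round(remaining))      # about 1858 seconds (roughly 31 minutes) remaining
```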
Figure 13.38 Windows 2000 Explorer view of target during synchronization.
As replication proceeds, source and target data are eventually synchronized. From that point on, only changes to replicated files are transmitted from primary location to secondary. Figure 13.39 shows a continuous replication job that has passed from the synchronization stage to the dynamic stage in which only updates are replicated. The left panel in the figure represents an instant at which no data is being transferred from source to target (because no applications are updating replicated data). The right panel captures the point in time during which applications are updating replicated data at the source, so data is being transmitted from source to target. For this example, several large files (100 megabytes or more) were transferred from another location to the job’s source directories. Replication of the new files at the source results in a high rate of data transfer from source to target.
Figure 13.39 Dynamic phase of replication with and without data transfer.
Troubleshooting File Replication VSR includes tools to enable an administrator to analyze the state of replication, both in overview and in “drill-down” detail when problems are identified. Figure 13.40 presents two of these facilities, the Monitor Alerts view and the file in which events significant to replication are logged. The logs shown here are particularly useful in smaller installations with only a few servers. They can also be used to analyze and repair problems in larger replication neighborhoods once a problem has been isolated to a specific replication server. In a large neighborhood with dozens of replication servers, however, an overview of all replication activity is required as a first level of monitoring and problem identification. The Monitor tab’s General view, shown in Figure 13.41, is useful for this purpose. The General view in Figure 13.41 provides an at-a-glance overview of problems in a replication neighborhood. It graphically illustrates the three key aspects of replication that must be monitored on a neighborhoodwide basis—jobs, servers, and alerts. When this view warns that something is amiss in the neighborhood, the administrator can drill down to the detail views presented earlier and discover the exact problem and formulate a solution plan.
Figure 13.40 Replication logs.
Figure 13.41 The General Monitor view.
CHAPTER
14
Windows Online Storage Recommendations
Rules of Thumb for Effective Online Storage Management Throughout this book recommendations have been made regarding good online volume management practices. This chapter is a collection of issues that often arise for system administrators charged with managing online storage and recommendations for dealing with them. Most of the material compiled here is in the context of the Windows server environment discussed in this book, but the principles and recommendations offered here are not specific to Windows; they can be applied to other computing platforms, such as UNIX, that support the concept of online volumes. In many ways, computer system management is application-specific, especially when dealing with the important properties of online storage—capacity cost, data availability, and I/O performance. So though the recommendations in this chapter represent sound administrative practices in general, not all apply equally well in all application contexts. Therefore, administrators should use these guidelines in conjunction with their own knowledge of local application requirements when setting management policies and making decisions.
Choosing an Online Storage Type One major responsibility of a system administrator charged with the management of online storage is choosing the type of storage that is appropriate for
each application. The choice of online storage is challenging because of the broad spectrum of technologies available, each with its own component cost, failure-tolerance, and performance properties. Table 14.1 shows a sampling of common online storage options, along with their cost, failure-tolerance, and I/O performance properties. While the table is by no means complete, it makes the point that there is an extensive range of options with widely varying properties. The right kind of storage for any application depends on the characteristics and priorities of the application. Fortunately, most vendors of hardware RAID subsystems and volume management software offer most or all of the options listed in this table, so choosing an online storage configuration for an application is more often based on design and deployment than on capital purchase or vendor selection. Consequently, a choice that turns out to be suboptimal can usually be rectified by reconfiguration rather than by purchasing additional equipment. But even a reconfiguration of storage or data can be an expensive disruption to operations, so it is important for system administrators to make accurate choices the first time.
NOTE: The acronym JBOD used throughout Table 14.1 stands for "just a bunch of disks," referring to any collection of disks without the coordinated control provided by a volume manager or control software.
Basic Volume Management Choices Table 14.1 is based on the premise that three major decision criteria determine the type of disks or volumes that should be allocated for a given application: capacity cost, failure tolerance, and I/O performance. There is a pretty clear ascending hardware cost progression from JBOD to striping to mirrored volumes replicated across long distances. As the table shows, however, there are functional and performance side effects to consider as well. The purpose of this section is twofold: to point out some of the interrelated factors to consider when configuring online storage for application use, and to give some general recommendations for online storage configuration.
Table 14.1 Characteristics of Common Online Storage Options
For each option, the table lists the component cost for N data disks, failure tolerance (relative to JBOD)*, large read and large write data transfer speed (relative to JBOD), and random read and random write request rate (relative to JBOD).

JBOD. Component cost: N. Failure tolerance: 1 (baseline for comparison). Data transfer speeds and request rates: baseline for comparison.

Striped Volume. Component cost: N, plus volume manager or RAID subsystem cost and host interconnect cost. Failure tolerance: same disk MTBF, but more data exposed; amount depends on volume width. Large read and large write data transfer speed: higher due to request decomposition and parallel execution; depends on volume width. Random read and random write request rate: higher due to request load balancing; depends on volume width.

"Wide" RAID Volume (≥10 disks). Component cost: N+1, plus volume manager or RAID subsystem, incremental packaging and host interconnect cost. Failure tolerance: higher (~1,000× disk MTBF). Large read data transfer speed: as for striped volume. Large write data transfer speed: lower than striped volume due to parity update overhead (CPU and I/O). Random read request rate: as for striped volume. Random write request rate: much lower than striped volume because of parity update overhead (CPU and I/O).

"Narrow" RAID Volume (<10 disks). Component cost: N+1, plus volume manager or RAID subsystem, incremental packaging and host interconnect cost. Failure tolerance: higher (~2,500× disk MTBF). Large read data transfer speed: as for striped volume. Large write data transfer speed: lower than striped volume due to parity update overhead (CPU and I/O). Random read request rate: as for striped volume. Random write request rate: much lower than striped volume because of parity update overhead (CPU and I/O).

Two-Mirror Striped-Mirrored Volume. Component cost: 2N, plus volume manager or RAID subsystem, incremental packaging and host interconnect cost. Failure tolerance: higher (~10,000× disk MTBF). Large read data transfer speed: higher than striped volume due to dual servers for each request. Large write data transfer speed: lower than striped volume due to dual write overhead (CPU and I/O). Random read request rate: higher than striped volume due to load balancing and dual servers. Random write request rate: lower than striped volume because of dual update overhead (CPU and I/O).

Three-Mirror Striped-Mirrored Volume. Component cost: 3N, plus volume manager or RAID subsystem, incremental packaging and host interconnect cost; additional function: split mirror. Failure tolerance: much higher (~100,000,000× disk MTBF). Large read data transfer speed: higher than striped volume due to multiple servers for each request. Large write data transfer speed: lower than striped volume due to multiple write overhead (CPU and I/O). Random read request rate: higher than striped volume due to load balancing and multiple servers. Random write request rate: lower than striped volume because of multiple update overhead (CPU and I/O).

RAID 6. Component cost: N+2, plus RAID subsystem, incremental packaging and host interconnect cost. Failure tolerance: much higher (>100,000,000× disk MTBF). Large read data transfer speed: as for striped volume. Large write data transfer speed: lower than striped volume due to dual parity update overhead (CPU and I/O). Random read request rate: as for striped volume. Random write request rate: much lower than striped volume due to dual parity update overhead (CPU and I/O).

Replicated Failure-Tolerant Volume. Component cost: 4N or 2(N+1), plus volume manager or RAID subsystem, incremental packaging, host interconnect, and long-distance communication cost; additional function: recovery from site disasters. Failure tolerance: much higher (>100,000,000× disk MTBF); provides long-distance disaster protection. Large read data transfer speed: as for two-mirror volume for short bursts; lower for long bursts due to long-distance latency. Large write data transfer speed: lower than comparable failure-tolerant option due to remote update overhead. Random read request rate: as for comparable failure-tolerant volume because reads are done locally. Random write request rate: slightly lower than comparable failure-tolerant volume because writes are done locally.

*The failure tolerance entries consider only susceptibility to disk failure. Buses, host bus adapters, RAID controllers, cache, power and cooling subsystems, and host computers themselves can all fail. All affect overall system availability. Moreover, the values listed are first-order approximations. In most cases, the relative failure tolerance of two disk configurations depends on the numbers of disks being compared. Since this table is a sampling of volume management techniques, number of disks is largely ignored.

Just a Bunch of Disks JBOD, usually pronounced "jay-bod," refers to multiple physical disks, each accessed and managed separately by volume managers and file systems. Since each disk in a JBOD is independent of the others, consequences of disk failure are limited to loss of the data on the failed disk. A JBOD offers no protection
against data loss due to disk failure, beyond that afforded by backup procedures, which are applicable to all volume organizations.
RECOMMENDATION 1 JBOD should not be used to store data that is vital to enterprise operation and that would be expensive or difficult to replace or reproduce.
Striped Volumes The two benefits of using striped volumes are:
■■ Management simplification. More capacity is aggregated into fewer manageable objects (volumes).
■■ I/O performance improvement. Due to load balancing across disks for small I/O requests or parallel execution of large I/O requests, as described in Chapter 3.
These benefits do not come without a data availability cost, however. The cost of striping data across several disks is the increased susceptibility to failure of striped storage over JBOD storage. The MTBF of a population of N disks is the same, whether or not the disks are part of a striped volume. But when a disk in a striped volume fails, all the data in the striped volume becomes inaccessible. There is no reasonable way to salvage files from the surviving disks of a striped volume, so one disk failure results in the loss of all data in the striped volume, not just data on the failed disk.
RECOMMENDATION 2 Striped volumes should not be used to store data that is vital to enterprise operation and that would be expensive or difficult to replace or reproduce.
This is not to say that there are no applications for striped volumes. For data that is temporary in nature, or that is easily reproduced if a failure destroys it, striped volumes can improve performance and manageability. Compiler temporary files, Web pages, intermediate results from complex calculations, and online catalogs or reference lists may all fall into this category.
RECOMMENDATION 3 Striped volumes should be considered for storing large volumes of low-value or easily reproduced data to which high-performance access is required.
An administrator who determines that it is appropriate to use nonfailure-tolerant storage for certain data objects is faced with a choice between spanned volumes, striped volumes, and JBOD. Spanned volumes are primarily useful for rapid, low-impact accommodation of unplanned growth requirements. The I/O performance of striped volumes (discussed in Chapter 3) is superior to that of spanned volumes for most applications, so striping should generally be used if time and resources permit.
RECOMMENDATION 4 For nonfailure-tolerant storage, spanned or striped volumes are preferred over individually managed disks whenever the amount of storage required significantly exceeds the capacity of a single disk because they represent fewer storage objects to manage.
RECOMMENDATION 5 Striped volumes are preferred over spanned volumes because they generally improve I/O performance with no offsetting penalty cost.
Failure-Tolerant Storage: RAID versus Mirrored Volumes If an administrator determines that failure-tolerant volumes are required for a particular application, a choice between mirrored and RAID volumes must be made. Protection against loss of data due to disk failure requires not only additional "overhead" disks, but also enclosure bays, host ports, bus addresses, and power and cooling, all of which must be paid for but cannot be used to store user data. RAID volumes are attractive in this context because of their low overhead hardware cost. Mirrored volumes require one or more disks plus a small amount of overhead per user data disk. RAID volumes require one disk and its associated overhead per N disks of user data, where N can be chosen by the administrator within broad limits. In the extreme, RAID can be implemented for as little as 3 percent disk overhead cost for a 32-disk volume (the widest volume supported by the Disk Management Component of the Windows 2000 operating system; the VERITAS Volume Manager supports wider striped volumes, up to 256 subdisks), as compared to a minimum of 100 percent overhead for mirroring. The low hardware cost of RAID also comes with offsetting penalties: Failure tolerance. A RAID volume survives only one disk failure, no matter how many disks comprise the volume. Mirrored (striped) volumes can survive the failure of as many as half of their disks (although not all failures are survivable). Moreover, the performance impact of disk failure on a mirrored volume is negligible, whereas for a RAID volume it can be significant, particularly for reads, which dominate many application I/O loads. Performance. Writing to RAID volumes is an inherently high-overhead operation because for each application write, parity and data must both be
updated in a coordinated way. Disk subsystems with controller-based RAID implementations can eliminate most of this overhead from application response time by using nonvolatile write-back cache. Host-based volume managers minimize its impact by logging updates on persistent storage. Ultimately, however, most of the RAID overhead reads and writes must occur. The result is that a RAID volume tends to become saturated at lower I/O loads than a mirrored volume of equivalent capacity. Manageability. RAID volumes require more management attention than mirrored volumes. The administrator must carefully match I/O loads to volume capabilities on an ongoing basis. The risks of data loss due to disk failure must be understood, and strategies for dealing with them must be developed (e.g., increased backup frequency). These issues exist with mirrored volumes but generally in much simpler or lower-impact form. Function. Some important extended functions of mirrored volumes, such as mirror addition and splitting (discussed in Chapter 4) are either inherently impossible or rarely implemented for RAID volumes. The lower hardware cost of RAID volumes must be balanced against these considerations. One way to evaluate the incremental value of mirrored volumes is to convert the cost of the additional disks and overhead hardware required into administrator hours. The incremental number of incidents (compared to a mirrored installation) can then be estimated, thence the cost of dealing with them. Estimating the incremental cost required to manage RAID volumes can help determine how long it would take the incremental hardware required for mirroring to pay for itself in administrator time savings. The purchasing authority can make an informed decision about whether the capital cost savings of RAID are in fact savings in total cost of ownership.
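A minimal sketch of that payback calculation; every dollar figure and incident estimate below is purely hypothetical.

```python
def mirroring_payback_months(extra_mirroring_hw_cost,
                             extra_raid_incidents_per_month,
                             admin_hours_per_incident,
                             admin_cost_per_hour):
    """Months until mirroring's extra hardware pays for itself in administrator time (illustrative)."""
    monthly_admin_savings = (extra_raid_incidents_per_month
                             * admin_hours_per_incident
                             * admin_cost_per_hour)
    return extra_mirroring_hw_cost / monthly_admin_savings

# Hypothetical figures: mirroring costs $6,000 more in disks and overhead hardware,
# while RAID is expected to generate two extra incidents a month, each consuming
# 4 administrator hours at $75 per hour.
print(mirroring_payback_months(6000, 2, 4, 75))   # 10.0 months
```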
RECOMMENDATION 6 Host-based RAID volumes should be avoided in applications in which there is a high rate of updates (over about 10 percent of the aggregate I/O request-handling capacity of the disks comprising the volume). Host-based RAID volumes are recommended only for “read-mostly” data.
RECOMMENDATION 7 Disk controller RAID volumes equipped with nonvolatile write-back cache may be used for more write-intensive applications (up to about 40 percent of the aggregate I/O request capacity of the disks comprising the volume).
RECOMMENDATION 8 Mirrored volumes should be used to store all business-critical data and metadata, including dirty region logs for mirrored volumes that use them, RAID volume update logs, database redo logs, and so on.
RAID Volume Width Once it has been determined that disk controller-based or host-based RAID storage meets a particular application requirement, the next question to answer is how large (wide) a RAID volume is optimal. The fundamental reason for choosing RAID over mirrored storage is lower hardware (disk and overhead) cost. The more disks comprising a RAID volume, the higher the susceptibility to data loss due to a second disk failure. In Chapter 4, Figure 4.9 suggests the lower overhead cost associated with larger RAID volumes; and page 77 presents a rough quantification of the diminished chances of data loss due to disk failure when RAID volumes are used. Figure 4.16 suggests that with a five-disk RAID volume, the chances of data loss due to two simultaneous disk failures are 1/2,500th of whatever they would be with an otherwise equivalent JBOD (i.e., using the same number of disks with the same MTBF). Figure 14.1 applies a similar analysis to an 11-disk array (10 disks of user data and one disk of check data). One can calculate that the chances of data loss due to disk failure in this case are 1/1,000th of what they would be with a JBOD—two and one-half times greater than with the five-disk volume. While the protection afforded by RAID is significant, it is clear from this analysis that the chance of data loss due to disk failure rises sharply with increasing numbers of disks in the volume.
RECOMMENDATION 9 RAID volumes with more than 10 disks should not be configured except for the storage of easily replaceable "online archives": data that is online for round-the-clock business convenience but that is never modified or, at most, is modified infrequently and is easily reproducible.
Figure 14.1 Probability of data loss with large RAID volumes. (Population: ~100 disks, configured as nine 11-disk RAID arrays; expected failures in 5,000 hours ≈ 1 disk. Event that can cause data loss: failure of any other disk in the degraded array, a population of 10 disks. Expected failure of 1 of those disks within 50 hours = 10/100 disks × 50/5,000 hours = 1/1,000.)
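The arithmetic behind Figure 4.16 and Figure 14.1 can be reproduced directly. The sketch below assumes, as the figures do, a 100-disk population expected to suffer about one failure per 5,000 hours and a 50-hour window before a failed disk is replaced and its contents regenerated; like the figures, it is only a first-order approximation.

```python
from fractions import Fraction

def relative_data_loss_chance(volume_disks, population=100,
                              failures_per_interval=1, interval_hours=5000,
                              repair_window_hours=50):
    """First-order chance that a degraded RAID volume loses data, relative to JBOD.

    After one member disk fails, data is lost if any of the remaining
    (volume_disks - 1) members also fails before the repair completes.
    """
    per_disk_failure_rate = Fraction(failures_per_interval, population * interval_hours)
    exposed_disks = volume_disks - 1
    return exposed_disks * per_disk_failure_rate * repair_window_hours

print(relative_data_loss_chance(5))    # 1/2500 -- five-disk RAID volume (Figure 4.16)
print(relative_data_loss_chance(11))   # 1/1000 -- 11-disk RAID volume (Figure 14.1)
```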
Number of Mirrors If mirrored volumes meet application requirements, the administrator must decide whether to use two-mirror or three-mirror (or more) mirrored volumes. The parameters for making this choice are:
Hardware cost. Each additional mirror copy represents a complete additional set of disks and overhead hardware.
I/O performance. Each application write request must be written to subdisks in all copies of a mirrored volume. In some implementations, all of these write operations must finish before the application request is complete. In others, writing one mirror and a log entry is sufficient. In either case, impact on application I/O response time is minimal because all data and log writes can be concurrent. Overall I/O load on the system when writing data is, however, proportional to the number of mirrors in the mirrored volume (more if logging is in use). Each mirror increases processor utilization (albeit by a small amount) and, more importantly, consumes I/O bandwidth and disk command execution resources. Thus, applications with high update rates tend to saturate their I/O subsystems at lower application update levels when using mirrored volumes with three or more copies.
Failure tolerance. As Figure 14.2 suggests, using an analysis similar to that shown in Chapter 4, Figure 4.15, the chances of data loss due to disk failure with three-mirror volumes are extremely low.
Figure 14.2 Three-mirror volume failure rate. (Population: ~100 disks, configured as 33 three-way mirrored volumes; expected failures in 5,000 hours = 1 disk. Event that can reduce failure tolerance: failure of one of the failed disk's two mirrors; expected failures of 1 of those 2 disks in 50 hours = 2/100 disks × 50/5,000 hours = 1/5,000. Event that can cause data loss: failure of the remaining mirror; expected failures of that 1 disk in 25 hours = 1/100 disks × 25/5,000 hours = 1/20,000. Chance of data loss (both events) = 1/5,000 × 1/20,000 = 1/100,000,000, relative to JBOD.)
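The same first-order model reproduces the two-step estimate of Figure 14.2 exactly.

```python
from fractions import Fraction

# After one disk of a three-mirror volume fails, data is lost only if BOTH of the
# two remaining mirrors fail before repair completes (Figure 14.2's analysis).
population, interval = 100, 5000   # ~1 expected disk failure per 5,000 hours in 100 disks

second_failure = Fraction(2, population) * Fraction(50, interval)   # one of the 2 surviving mirrors, 50-hour window
third_failure = Fraction(1, population) * Fraction(25, interval)    # the last mirror, 25-hour overlap window

print(second_failure)                    # 1/5000
print(third_failure)                     # 1/20000
print(second_failure * third_failure)    # 1/100000000, relative to JBOD
```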
When a disk containing subdisks that are part of a three-mirror volume fails, not only is the data still intact, but it remains protected against failure of one of the two remaining disks. As Figure 14.2 indicates, a disk failure in a population of one disk during a 25-hour period (assuming a 25-hour overlap in the outage times of the first two disks) preceded by two other specific failures has a minuscule chance of occurring. With reasonable management procedures in place (e.g., prompt repair), the chance of data loss due to disk failure in this configuration is vanishingly small.
RECOMMENDATION 10 Continuous (i.e., not split periodically for backup) three-mirror volumes are recommended for data that is absolutely critical to enterprise operation.
RECOMMENDATION 11 The disks comprising a three-mirror volume should be connected to hosts using independent paths (cables, host bus adapters, connectors) to protect against path failure as well as disk failure.
RECOMMENDATION 12 The disk failure tolerance of three-mirror volumes is so high that four-mirror (or more) volumes should be used only if one or two of the mirrors are located remotely from the others for disaster recoverability (e.g., using optical Fibre Channel or other bus extension technology).
RECOMMENDATION 13 Mirroring should be combined with proactive failed disk discovery and replacement procedures that are automated to the greatest possible extent for maximum failure tolerance. Mean time to repair (MTTR), which includes volume content resynchronization time, is an extremely important contributor to data reliability.
RECOMMENDATION 14 If a mirror is regularly split from a three-mirror volume, any analysis similar to that shown in Figure 14.2 should take this into account. Susceptibility to failure is greater during the interval between splitting the third mirror and the completion of resynchronization after the third mirror storage is returned to the volume.
RECOMMENDATION 15 If possible, restoration of the storage comprising a split mirror to the original volume should be done during periods of low application I/O load, because resynchronization and regeneration are I/O-intensive activities that can adversely affect application performance.
Hybrid Volumes: RAID Controllers and Volume Managers Hybrid volumes, in which host-based striping and mirroring are combined with hardware subsystem-based RAID virtual disks, offer several advantages not available with either technology by itself:
Capacity aggregation. Host-based spanned or striped volumes can aggregate the capacity of the largest virtual disks supported by one or more disk subsystems. In Chapter 3, Figure 3.10 illustrates this usage. If an application requires a larger volume than the largest virtual disk that can be made available by a disk subsystem, using a host-based volume manager to stripe data across hardware-based failure-tolerant virtual disks provides an excellent combination of high capacity and performance with moderate-cost, low-impact failure tolerance.
Capacity partitioning. The opposite is also true. Host-based volume management can subdivide very large virtual disks into units of more convenient capacity. In Chapter 3, Figure 3.2 illustrates the use of host-based volume management to subdivide a very large virtual disk made available by a RAID controller into smaller volumes, which are more convenient units for application use. When this technique is adopted, the hardware-based virtual disk should be failure-tolerant, using either mirroring or RAID technology implemented within the RAID controller.
Performance aggregation. Host-based volume management can be used to aggregate the performance of multiple hardware subsystems by striping data across two or more virtual disks, each presented by a different RAID controller. Figure 3.10 illustrates this usage.
Failure-tolerance enhancement. Host-based volume management can be used to increase failure tolerance by using virtual disks presented by hardware subsystems as the mirrors of a mirrored volume. Some performance benefit may also accrue from this due to load balancing across the virtual disks, as described in Chapter 3. Host-based volume management can increase the overall failure tolerance of hardware subsystem-based storage, as Figure 6.1 in Chapter 6, "Windows Volume Managers," illustrates. Host-based mirroring of the virtual disks presented by two RAID controllers that are connected to separate buses and separate host bus adapters makes the system tolerant of bus and host bus adapter failures as well as disk failures.
Disaster tolerance for data. With long-distance interconnects such as optical Fibre Channel, it is possible to achieve a level of disaster tolerance for data by locating the two mirrors of a volume at a distance of several kilometers from each other. If both mirrors are hardware subsystem virtual disks, failure tolerance is also nominally very high.
The combined use of host-based volume managers and RAID controller virtual disks has been presented in several contexts throughout the book, specifically in these figures: Figure 1.16 on page 21; Figure 1.17 on page 22; Figure 3.2 on page 32; Figure 3.10 on page 43; Figure 4.1 on page 54; Figure 5.8 on page 112; and Figure 5.9 on page 113. The next subsections highlight administrative issues to consider when using such hybrid volumes.
Host-Based and Subsystem-Based RAID Today's RAID controllers generally include some form of nonvolatile write-back cache that significantly reduces the RAID write penalty (described on page 71) without the risk of creating a "write hole," or undetected mismatch between data and parity. This generally makes them a better choice for implementing RAID data protection than host-based volume managers, which must use careful writing or update logging techniques to block the possibility of write holes. Figure 3.10 shows this usage.
RECOMMENDATION 16 The prospective purchaser of a RAID controller with write-back cache should undertake a thorough study of the behavior of write-back cache. In particular, issues such as holdup time or worst-case flush time and failure tolerance of the cache itself should be thoroughly understood.
Host-Based and Subsystem-Based Mirrored Volumes For mirrored volumes, the choice between host-based volume manager and RAID controller implementation is much less clear. The intrinsic write penalty is much less significant for mirrored volumes than for RAID volumes. One
advantage of using RAID subsystem-based virtual disks presented by different controllers as members of host-based mirrored volumes is that the host-based mirrored volume provides protection against I/O bus, HBA, enclosure power and cooling, and RAID controller failure, as well as against disk failure. Comparing Figure 6.1, which depicts this principle, to Figure 6.2 makes it clear that external RAID subsystems are better suited to this purpose.
RECOMMENDATION 17 Host-based mirrored volumes whose mirrors are failure-tolerant virtual disks presented by RAID controllers can increase system failure tolerance if they are configured so that each virtual disk is on a separate path.
Using Host-Based Volume Managers to Manage Capacity From a capacity management standpoint, host-based volume managers can augment RAID controller capabilities either by aggregating or subdividing the virtual disks presented by RAID controllers. Some RAID subsystems have internal architectural limitations that prevent them from aggregating large amounts of storage into a single array and presenting very high-capacity virtual disks. Host-based volume managers are not generally limited in the largest single-volume capacity they support. A host-based volume manager can therefore generally be used to aggregate RAID controller virtual disks into striped or spanned volumes for presentation to file systems and applications. For RAID subsystems that do support very large virtual disks, a host-based volume manager can perform the opposite service—subdividing virtual disk capacity into conveniently sized volumes for application use. Using 40-gigabyte disks, a RAID controller could conceivably present an 11-disk array as a 400-gigabyte volume. This is very efficient from a capacity overhead standpoint (only 10 percent overhead capacity), but may be too large for application convenience. For this situation, the Volume Manager can be used to create subdisks of convenient capacity from which volumes can be created for application use.
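The capacity arithmetic in that example is straightforward; the subdivision into eight 50-gigabyte volumes at the end is an illustrative assumption, not taken from the text.

```python
def raid_capacity(disk_count, disk_gb):
    """Usable capacity of an N+1-style RAID array (one disk's worth of check data)."""
    usable_gb = (disk_count - 1) * disk_gb
    check_gb = disk_gb                       # capacity consumed by check (parity) data
    return usable_gb, check_gb

usable, check = raid_capacity(11, 40)        # eleven 40-gigabyte disks
print(usable)                                # 400 GB presented as one virtual disk
print(check / usable)                        # 0.1 -> about 10 percent overhead capacity

# A host-based volume manager could then subdivide the 400 GB virtual disk into,
# say, eight 50 GB volumes of more convenient size for applications.
print(usable // 50)                          # 8
```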
RECOMMENDATION 18 Use host-based volumes to aggregate RAID controller-based virtual disks. Use hostbased striping of virtual disks to increase capacity and performance; use host-based mirroring of virtual disks to increase overall system failure tolerance.
RECOMMENDATION 19 Use host-based mirroring and/or striping of disk controller-based RAID virtual disks rather than the reverse (e.g., host-based RAID of disk controller-based striped volumes).
Combining Host-Based Volumes and RAID Subsystems for Disaster Recoverability Using long-distance I/O bus technology such as Fibre Channel, RAID subsystems and host-based volume management can be combined to provide a level of disaster recoverability for data without the operational complexity of data replication.1 Standard Fibre Channel supports links up to 10 kilometers in length; some vendors’ implementations extend this even further. Figure 14.3 shows the combination of RAID subsystems and host-based volume management for this purpose. In the configuration shown in this figure, individual disk failures are handled locally (assuming that the RAID subsystem supports sparing and disk hot swapping). A disaster that incapacitates the entire recovery site (at the right of the figure) does not affect the primary processing site; one mirror of its host-based volumes remains available. Similarly, a disaster that incapacitates the entire primary processing site is recoverable as soon as a host computer can be attached to the (intact) data at the recovery site and to clients. A common extension of this technique is to run different applications at the two sites and use the technique depicted in Figure 14.3 at both sites, so that an application can be recovered after a disaster at either of the two sites.
RECOMMENDATION 20 Consider the use of long-distance (metropolitan) mirroring as a possible alternative to data replication for achieving disaster tolerance for data.
1 For a fuller discussion of data replication, contrasting it with mirrored volumes, see the white paper entitled Volume Replication and Oracle located at www.VERITAS.com/whitepapers.
Figure 14.3 Disaster recoverability using hardware and software volumes. (A host computer's volume manager creates a disaster-recoverable volume by mirroring data across two failure-tolerant virtual disks, each presented by a RAID controller fronting a mirrored or RAID array; one RAID subsystem is local and the other is up to 10 kilometers away.)
Unallocated Storage Capacity Policies Maintaining a percentage of unallocated storage capacity in a disk group is a useful means of managing online storage to avoid application failures. When an application requires more storage, its volumes can be extended quickly and easily by an administrator while it is online, using the unallocated capacity. If expanding a volume drops unallocated capacity below a safety threshold, the event is very visible. Additional storage can be installed and added to the disk group to maintain an adequate cushion for anticipated application requirements.
Determination of Unallocated Storage Capacity The amount of unallocated storage capacity to maintain in a given disk group is very specific to application characteristics. Some applications increase storage requirements rarely; others require a steady trickle of additional storage, while still others require large increments at less predictable intervals.
RECOMMENDATION 21 Administrators should understand application storage usage characteristics, so that ad hoc incremental storage requirements can be anticipated and met without disrupting application service.
Distribution of Unallocated Storage Applications may require additional storage for operational reasons; for example, business volume increases require more storage for order-processing applications. Applications may also be cyclic, for example, requiring additional storage for month-end or year-end processing. Such storage can often be released for other use at other times of the month or year. Finally, periodic addition of storage might be required for administrative reasons, for example, to make a third mirror copy of a database so that backup can be performed while the database is operational, as described in Chapter 4.
RECOMMENDATION 22 Whatever the need for additional storage, administrators must ensure that the amount of unallocated storage in each disk group is adequate. Not only must an appropriate level of unallocated storage be maintained, but the distribution of unallocated storage across disks must be such that management operations such as failure-tolerant volume expansion can be carried out without violating volume failure-tolerance and performance restrictions.
For example, if an additional mirror must be added to a mirrored-striped volume, each subdisk of the added mirror must be located either on the same disk as the subdisk it extends or on a separate disk from any of the volume’s existing subdisks. When an administrator makes a request to extend a volume, the Volume Manager checks the unallocated space in the disk group containing the volume to make sure that extension is possible without causing violations of this type. It is up to the administrator, however, to maintain a distribution of unallocated capacity that allows such operations to succeed. One way to maximize allocation flexibility is to manage the disks in a disk group in units of a single capacity or of a small number of discrete capacities. This maximizes the Volume Manager’s flexibility to allocate storage when new subdisks are required for new volumes, for volume extension, or for moving a subdisk from one disk to another. The system used in the figures for this book, for example, has seven 2-gigabyte disks and one 1-gigabyte disk. It might be useful to manage the capacity of this disk group in units of 250 megabytes, 500 megabytes, or 1 gigabyte.
RECOMMENDATION 23 The policy of managing disk group capacity in fixed-size quanta whose size is a submultiple of the smallest disk in the group should be seriously considered for the flexibility advantages it brings. The quantum size should be chosen based on systemwide application characteristics.
Amount of Unallocated Capacity Determining how much unallocated capacity to maintain depends strongly on application characteristics. In most cases, there are lower and upper bounds beyond which less or additional unallocated storage would be of little use. For example, an installation may observe a policy of maintaining a level of 8 to 10 percent of a disk group’s total capacity as unallocated space. But as the capacity of the disk group grows, the amount of unallocated space maintained by this policy might grow beyond any reasonable expectation of exploiting it effectively. If unallocated space is typically used in quantities of around 1 to 10 gigabytes to relocate subdisks, or to accommodate data processing peaks, then growth of the disk group to 1 terabyte of total capacity would mean 100 gigabytes reserved for this purpose. If the typical number of subdisk moves or volume adds is one or two, then a significant amount of storage capacity would never be used.
RECOMMENDATION 24 Any policy for maintaining a minimum percentage of a disk group’s capacity as unallocated space should include a cap to avoid maintaining wastefully large amounts of free space.
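Recommendation 24 amounts to a floor-with-a-cap policy. The sketch below is one way to express it; the 10 percent level and the 50-gigabyte cap are illustrative values only.

```python
def target_unallocated_gb(disk_group_capacity_gb, percent=10, cap_gb=50):
    """Unallocated capacity to maintain: a percentage of the disk group, up to a cap (illustrative)."""
    return min(disk_group_capacity_gb * percent / 100, cap_gb)

print(target_unallocated_gb(200))    # 20.0 -> the percentage governs
print(target_unallocated_gb(1000))   # 50   -> the cap governs (10 percent would reserve 100 GB)
```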
Spare Capacity and Disk Failures While storage capacity is managed in subdisk units, it is entire disks that fail. When a disk fails, all nonfailure-tolerant volumes having subdisks on it fail, and all failure-tolerant volumes having subdisks on it become degraded.
RECOMMENDATION 25 Since it is usually entire disks that fail, spare capacity reserved for recovering from disk failures should generally take the form of entire disks whose capacity is at least as large as that of the largest disk in a failure-tolerant volume in the disk group.
RECOMMENDATION 26 An administrator should reserve one or more spare disks for every 10 disks that are part of failure-tolerant volumes, with a minimum of one spare disk for any disk group that contains failure-tolerant volumes.
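Expressed as arithmetic, one reasonable reading of Recommendation 26 is the following; rounding the ratio up is an assumption.

```python
import math

def recommended_spare_disks(failure_tolerant_disks):
    """At least one spare per 10 disks used in failure-tolerant volumes, minimum of one."""
    if failure_tolerant_disks == 0:
        return 0                                   # no failure-tolerant volumes, no spares required
    return max(1, math.ceil(failure_tolerant_disks / 10))

print(recommended_spare_disks(4))     # 1
print(recommended_spare_disks(25))    # 3
```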
Interdependent sets of data are often found in commercial server environments. For example, the data in a database, its archive logs, and its redo log all depend on each other. If a volume holding database data fails, causing data loss, the ordinary administration practice would be to:
■■ Remedy the root cause of the failure (e.g., replace one or more disks).
■■ Restore the database to some baseline from a backup copy.
■■ Play the archive and redo logs against the restored copy to bring the database state as close to current as possible.
If the database logs reside on the same volume as the data, however, both data and logs will be inaccessible and database recovery will not be possible.
RECOMMENDATION 27 When laying out volumes on disks, an administrator should locate data objects that depend on each other on separate volumes that occupy separate disks, so that a single disk failure does not incapacitate both data and its recovery mechanism.
Disk Groups and Hardware RAID Subsystems As described earlier, the Volume Manager can aggregate virtual disks presented by RAID controllers to increase capacity, performance, or failure tolerance. In contrast, in smaller RAID subsystems with only one or two host bus connections, virtual disks impose a configuration constraint, because when the RAID subsystem is moved from one host to another, all of its virtual disks must move with it.
RECOMMENDATION 28 If all of a RAID subsystem’s host ports are likely to be reconfigured from one host to another, virtual disks presented by that RAID subsystem should be placed in the same disk group so that they can be moved as a unit. (This recommendation obviously does not apply to so-called enterprise RAID subsystems with many host ports, which are likely to be connected to different hosts.)
RECOMMENDATION 29 Unless the Volume Manager is being used to aggregate the capacity or performance of two or more RAID subsystems, volumes should be configured from subdisks within one RAID subsystem wherever possible. (When aggregating the capacity or performance of two or more RAID subsystems, it is necessary to include the virtual disks presented by all of the subsystems in the same disk group.)
Failed Disks, Spare Capacity, and Unrelocation When a disk holding subdisks for one or more failure-tolerant volumes fails, the failure-tolerant volumes are said to be reduced, or degraded. Three-mirror (or more) mirrored striped volumes remain failure-tolerant, while RAID volumes do not; and a two-mirror striped volume will also fail if the disk holding the companion of the failed subdisk fails. When a failure-tolerant volume becomes degraded, the system administrator’s priority should be to restore failure tolerance by substituting a replacement subdisk for the failed one. The first prerequisite is to be aware of disk failures.
RECOMMENDATION 30 Administrators should monitor Volume Manager event logs (Figure 10.3), as well as system event logs on a regular basis, looking for failed disk events. System management consoles can be used to generate active messages based on events reported in system event logs.
A major reason for maintaining free disk capacity is restoration of failure tolerance when a disk fails. An administrator creates a subdisk from unallocated storage capacity on a separate disk and substitutes it for the failed disk. This is one important reason for maintaining quanta of unallocated capacity in a disk group large enough to substitute for the largest subdisk in the disk group.
A second administrator task arises when a failed disk has been replaced with a working replacement. In all probability, a system administrator will have configured volumes to spread disks across buses, to locate the disks in a disk group within a single cabinet, or according to some other performance, availability, or manageability criterion. When a disk fails and is replaced by a substitute, this optimization is often lost for the sake of preserving failure tolerance. Figure 14.4 illustrates a situation in which optimal physical configuration of a mirrored volume has been sacrificed to maintain failure tolerance. The two two-mirror volumes in Figure 14.4 have been configured with their disks on different buses connected to different HBAs. When a disk fails, degrading one of the volumes, the only replacement disk available is used to restore failure tolerance. The replacement disk is on the same bus as the volume's surviving disk. The configuration is suboptimal from an availability standpoint because a bus or HBA failure will incapacitate the entire volume. It is also suboptimal from a performance standpoint, because data for all writes must move on the same bus. By using the spare disk as a replacement, the administrator has decided that restoring data failure tolerance is more important than optimal performance.
When a functional replacement disk is available in the original (optimal) location for a volume, the administrator must decide whether to unrelocate any subdisks relocated after the failure, thereby restoring them to their original locations. Moving a subdisk to its original, optimal, location seems like an obvious choice; but it is an I/O-intensive operation. Every block on the replacement subdisk must be read and rewritten to the unrelocated one. This can place enough of a background I/O load on a system to adversely affect application processing.
RECOMMENDATION 31 While it is usually a good administrative practice to unrelocate subdisks to their original locations, the time at which to perform the unrelocation must be chosen carefully so as to not interfere with application performance during critical periods.
Figure 14.4 Suboptimal volume configuration due to failed disk. (A host computer with a volume manager, disk driver, and two HBAs originally has its two-mirror volumes spread across both buses; after a disk fails, the only available replacement disk, which is used to restore failure tolerance, is on the same bus as the volume's surviving disk, so the volume using the replacement disk no longer spans both paths.)
Using Disk Groups to Manage Storage The Volume Manager for Windows 2000 supports multiple disk groups.2 Disk groups are useful for managing storage in clusters and provide a convenient means for organizing and managing disk storage resources on an application basis.
Using Disk Groups to Manage Storage in Clusters In a Microsoft Cluster Server environment, the disk group is the unit in which storage fails over from one computer to another. Only entire disk groups fail over. Thus, volumes that hold data for applications that are required to fail over should belong to disk groups that only hold data for that application. The disk groups should be part of the application’s resource group, so that failover can occur. This has implications for disk group and volume allocation.
RECOMMENDATION 32 In a cluster, each application that fails over independently of other applications should have its data stored on volumes in disk groups exclusive to that application. This allows an application’s storage to fail over with it yet cause no adverse effects on other applications.
Using Disk Groups to Control Capacity Utilization The subdisks comprising any given volume must be allocated from disks within a single disk group. Thus, creating multiple disk groups effectively results in separate storage capacity pools. Raw physical storage in one of these pools is available exclusively for use within the pool and cannot be used in other disk groups unless a disk is specifically moved from one group to another by an administrator. This feature can be beneficial or detrimental, depending on an organization's application needs:
■■ If a critical application requires frequent volume expansion, allocating its storage in a private disk group can help guarantee that capacity is available when required and that when storage capacity is added to the system, it is not absorbed by other applications.
■■ If a critical application unexpectedly requires additional storage, and none is available in the disk group from which its volumes are allocated, the application will fail, even if the required amount of storage is available in other disk groups.
2 The Volume Manager for Windows NT will also support multiple disk groups. See Table 6.3.
RECOMMENDATION 33 System administrators must decide, based on projected application and administrative needs, whether to use disk groups to create disjoint storage pools or to manage all storage as a common pool. In general, multiple pools give the administrator greater flexibility, while a common pool may be more convenient for applications.
Data Striping and I/O Performance The Volume Manager enables administrators to "tune" any type of striped volume, including RAID and striped mirrored volumes, by adjusting the stripe unit size. The use of stripe unit size to affect I/O performance is described starting on page 43. In summary, most I/O-bound applications can fairly be characterized as either:
■■ I/O request-intensive, making I/O requests faster than the hardware to which they are made can satisfy them.
■■ Data transfer-intensive, moving large single streams of data between memory and storage.
With rare exceptions, transaction-oriented applications (e.g., credit verification, point of sale, order taking) are I/O request-intensive. Scientific, engineering, audio, video, and imaging applications are typically data transferintensive. If it is known at volume creation time that a striped volume will be used predominantly for one or the other of these I/O load types, stripe unit size can be set to optimize I/O performance.
Striping for I/O Request-Intensive Applications I/O request-intensive applications are typically characterized by small (i.e., 2 to 16 kilobytes) data transfers for each request. They are I/O-bound because they make so many I/O requests, not because they transfer much data. For example, an application that makes 1,000 I/O requests per second with an average request size of 2 kilobytes uses at most 2 megabytes per second of data transfer bandwidth. Since each I/O request occupies a disk completely for the duration of its execution, the way to maximize I/O throughput for I/O request-intensive applications is to maximize the number of disks that can be executing requests concurrently. Clearly, the largest number of concurrent I/O
requests that can be executed on a volume is the number of disks that contribute to the volume's storage. Each application I/O request that "splits" across two stripe units occupies two disks for the duration of its execution, reducing the number of requests that can be executed concurrently. Therefore, it is desirable to minimize the probability that I/O requests "split" across stripe units in I/O request-intensive applications. Two factors influence whether an I/O request with a random starting address will split across two stripe units:
■■ The request starting address relative to the starting address of the storage allocation unit (the file extent)
■■ The size of the request relative to the stripe unit size
Figure 14.5 shows how I/O requests can split across a volume’s disks. The volume here is assumed to hold a data file for a database with a page size (the smallest unit in which the database management system issues I/O requests) of two blocks. In almost all cases, the database management system will allocate pages in alignment with the blocks in a file, so the first page occupies blocks 0 and 1, the second page occupies blocks 2 and 3, and so forth.3 Since the database management system always reads and writes pages or multiples of pages, its I/O single-page requests will never split across stripe units, because they will always be two-block requests addressed to even-numbered blocks. All possible requests of this kind can be satisfied by accessing data within a single stripe unit. Requests for two or more consecutive pages, however, may split across stripe units, as Figure 14.5 illustrates. The two-page (four-block) request addressed 3
As usual with the examples in this book the stripe units, page sizes and volume sizes are artificially small.
Figure 14.5 Application I/O requests split across stripe units. (The figure shows a two-column striped volume with an eight-block stripe unit: SubDisk A on Disk A' holds Volume Blocks 0 through 7 and SubDisk B on Disk B' holds Volume Blocks 8 through 15. Two-block requests starting on even-numbered volume blocks, such as a write of Volume Blocks 0-1 or a read of Volume Blocks 2-3, never split across stripe units; four-block requests starting on even-numbered volume blocks, such as a write of Volume Blocks 4-7 or a read of Volume Blocks 6-9, may or may not split.)
The two-page (four-block) request addressed to Volume Block 4 is satisfied entirely by data within SubDisk A, while the request addressed to Volume Block 6 requires data from both SubDisk A (Volume Blocks 6 and 7) and SubDisk B (Volume Blocks 8 and 9). Of the four possible two-page requests that can be made to a stripe unit (starting at blocks 0, 2, 4, and 6 of the stripe unit), only the one addressed to Volume Block 6 requires the Volume Manager to address I/O operations to two disks. Assuming a uniform distribution of I/O request starting page addresses, an average of one request in four (25 percent) would split in this scenario. If the stripe unit were twice as large (16 blocks), an average of one request in eight (12.5 percent) would split. If the stripe unit were four times as large (32 blocks), one request in 16 (6.25 percent) would split. It seems, therefore, as though larger stripe unit sizes reduce the probability of split I/O requests. While this is true, the primary objective of striping data across a volume is to spread I/O requests across the volume's disks. Too large a stripe unit size is likely to reduce this spreading effect.
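These split percentages are easy to verify computationally. The short Python sketch below is my own illustration (not part of the Volume Manager); it simply counts, for each candidate stripe unit size, how many of the equally likely page-aligned starting positions would cause a two-page request to cross a stripe unit boundary.

```python
def split_probability(request_blocks, stripe_unit_blocks, align_blocks=1):
    """Fraction of equally likely, aligned start offsets within a stripe unit
    at which a request of request_blocks crosses into the next stripe unit."""
    starts = range(0, stripe_unit_blocks, align_blocks)
    splits = sum(1 for s in starts if s + request_blocks > stripe_unit_blocks)
    return splits / len(starts)

# Figure 14.5 scenario: two-page (four-block) requests aligned to two-block pages.
for unit_blocks in (8, 16, 32):
    print(unit_blocks, split_probability(4, unit_blocks, align_blocks=2))
# Prints 0.25, 0.125, and 0.0625 -- the 25, 12.5, and 6.25 percent figures above.
```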
RECOMMENDATION 34 A good compromise stripe unit size for I/O request-intensive applications is one that results in about a 3 to 5 percent probability of splitting in a uniform distribution of requests.
Using more realistic values, a 2-kilobyte (four-block) database page size would tend to indicate an ideal stripe unit size of about 100 blocks.4 In practice, this would typically be rounded up to the nearest power of 2 (128 blocks, or 65,536 bytes) for administrative simplicity.

4 That is, if an average of three of every 100 uniformly distributed four-block I/O requests are to split. This is based on the same simplifying assumption of uniformly distributed I/O request starting block numbers used in the example that starts on page 48. The result overstates the actual percentage of split I/O requests for database applications, which tend to make full-page I/O requests.
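Continuing the same back-of-the-envelope model (my own sketch, using the footnote's assumption of uniformly distributed starting block numbers, under which an r-block request splits with probability (r - 1)/s for an s-block stripe unit), the target split probability can be converted directly into a stripe unit size and rounded up to a power of two:

```python
import math

def stripe_unit_for_target(request_blocks, target_split_probability):
    """Stripe unit (in blocks) that holds the split probability of an r-block
    request with uniformly distributed starting blocks to the target, plus the
    same value rounded up to a power of two for administrative convenience."""
    exact = (request_blocks - 1) / target_split_probability  # e.g., 3 / 0.03 = 100
    return exact, 1 << math.ceil(math.log2(exact))

print(stripe_unit_for_target(4, 0.03))  # (100.0, 128) -> 128 blocks = 65,536 bytes
```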
Striping for Data Transfer-Intensive Applications

Data transfer-intensive applications typically request a large amount of data with every request, between 64 kilobytes and a megabyte or more. When a large amount of data is requested, the data transfer phase of the request represents the majority of the request execution time, so improving I/O performance is essentially tantamount to reducing data transfer time. A single disk can only transfer data as fast as the data passes under the disk's read/write head. Thus, a disk that rotates at 10,000 RPM and has 200 blocks on a certain track cannot transfer data to or from that track any faster than about 17.06 megabytes per second (200 blocks × 512 bytes/block ÷ 0.006 seconds/revolution). An application request for 500 kilobytes would require five platter revolutions, or 30 milliseconds, to execute, ignoring initial access time and the settling time incurred each time the disk switches between read/write heads over different surfaces. If the request were addressed to a volume of five identical disks, however, each disk would ideally deliver one-fifth of the data, and the request would complete in a correspondingly shorter time. Thus, if a striped volume is optimized for data transfer-intensive applications, each application I/O request will split evenly across all of the volume's disks (or all but the disk containing check data in the case of a RAID volume). If an application makes 256-kilobyte requests, an ideal stripe unit size for a four-disk striped volume would be 64 kilobytes. If the application request size is 512 kilobytes, then a 128-kilobyte stripe unit size (or a "wider" volume with more columns) would be indicated.
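The transfer-time arithmetic above can be reproduced with a few lines of Python. The sketch below is my own illustration, using the example's disk parameters (10,000 RPM, 200 blocks per track) and, like the text, ignoring seek time, rotational latency, and head-switch time.

```python
BYTES_PER_BLOCK = 512

def media_transfer_seconds(request_bytes, rpm=10_000, blocks_per_track=200,
                           data_disks=1):
    """Media transfer time for a request spread evenly across data_disks disks,
    ignoring seeks, rotational latency, and head switching."""
    revolution_s = 60.0 / rpm                         # 0.006 s at 10,000 RPM
    track_bytes = blocks_per_track * BYTES_PER_BLOCK  # 102,400 bytes per track
    return (request_bytes / data_disks) / track_bytes * revolution_s

print(media_transfer_seconds(500 * 1024))                # ~0.030 s on one disk
print(media_transfer_seconds(500 * 1024, data_disks=5))  # ~0.006 s across five disks
```

The same division underlies Recommendation 35: a 256-kilobyte request striped evenly over four data disks implies a 64-kilobyte stripe unit.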
RECOMMENDATION 35 The ideal stripe unit size for data transfer-intensive applications that use a striped volume is the typical I/O request size of the application divided by the number of data disks in the stripe.
Rules of Thumb for Data Striping

Both of the foregoing examples of optimizing a volume's stripe unit size for particular application I/O patterns are based on a consistent application I/O request size. In practice, not all applications are so consistent. Administrators are advised to attempt to optimize volume stripe unit size for a particular application only if:

■■ The application's I/O request pattern is known to consist of requests of a certain size.

■■ The application is the only client of the volume during periods for which I/O performance is to be optimized.
As demonstrated by the examples, the Volume Manager default stripe unit size is applicable to both typical I/O request-intensive applications and typical data transfer-intensive applications. Administrators should have a good theoretical or empirical basis for changing this value, because optimizing stripe unit size for one type of I/O load makes performance for other types of I/O suboptimal. Unless almost all I/O is of the type for which stripe unit size is optimized, there can be adverse application performance consequences.
Staggered Starts for Striped Volumes

File systems and database management systems manage the space within the volumes they control by subdividing it into volume block ranges in which they store metadata, user data, or other administrative information. This subdivision tends to have a side effect of causing some volume block ranges to be accessed more frequently than others. For example, if a file system uses an internal log to record metadata transactions, the blocks that comprise the log will be accessed every time the file system updates any metadata, whereas only the metadata blocks actually updated will be accessed for each transaction. Similarly, a database management system will access the volume blocks in which an index is rooted, or those holding a database transaction log, much more frequently than the leaf blocks or data pages of the database. Since file systems and database management systems use the same algorithms for subdividing the space in each volume they manage, the same block ranges in each volume they manage tend to have similar access patterns.

Allocating several striped volumes (with or without failure tolerance) on the same disks can therefore negatively impact performance. Suppose, for example, that a database manager always allocates index root blocks at the lowest possible volume block addresses. If a database is allocated on each of a set of volumes striped across one set of disks, the lowest volume block addresses for all volumes will fall into subdisks on the same physical disk. Thus, all of the database management system instances will tend to access the same disk repeatedly, creating a "hot spot": a disk that is saturated with I/O activity to the point that it becomes an upper bound on application performance, even though unused I/O capacity is available on other disks.

Fortunately, this condition can be averted with a little foresight when allocating volumes. If each volume striped across a set of disks is allocated with a different "first" disk (i.e., with its first subdisk allocated on a different disk), then each database or file system's most heavily accessed blocks will reside on a different disk, and I/O load will tend to be balanced across physical resources.
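A minimal sketch of this allocation policy follows; it is my own illustration (the volume and disk names are hypothetical, and actual allocation is performed through the Volume Manager console or command line). It simply rotates the disk chosen to hold each successive volume's first subdisk.

```python
def staggered_first_disks(volume_names, disk_names):
    """Assign each striped volume a different 'first' disk, round-robin, so each
    volume's lowest (typically most heavily accessed) block range lands on a
    different physical disk."""
    return {volume: disk_names[i % len(disk_names)]
            for i, volume in enumerate(volume_names)}

print(staggered_first_disks(["V", "W"], ["Disk A'", "Disk B'", "Disk C'"]))
# {'V': "Disk A'", 'W': "Disk B'"} -- the arrangement shown in Figure 14.6
```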
RECOMMENDATION 36 When multiple striped (failure-tolerant or not) volumes share a set of disks, it is good administrative practice to allocate the volumes in such a way that each volume’s block range starts on a different disk.
Figure 14.6 shows two striped volumes sharing the capacity of three disks. Volume V’s block address space starts (i.e., Plex Block 0) on Disk A', while Volume W’s block address space starts on Disk B'. This staggered start approach distributes metadata for each volume across different subsets of the contributing disks, which both improves system resiliency and balances the I/O load due to metadata accesses.
Figure 14.6 Staggered allocation of striped volumes. (The figure shows two striped volumes, V and W, sharing subdisks on Disks A', B', and C'. The progression of plex block addresses places Volume V's Plex Block 0 on the subdisk on Disk A' and Volume W's Plex Block 0 on the subdisk on Disk B', so the two volumes' lowest block addresses fall on different disks.)

Striped Volume Width and Performance

Windows Volume Managers enable the creation of volumes with data on up to 32 disks. Prudent system management practice often suggests that volumes of lesser width be used, however. For example, if a striped volume with 32 columns is allocated on disks whose MTBF is 500,000 hours, a disk failure, and hence a volume failure, would be expected roughly every two years. More serious than the expected failure frequency, however, is the consequence of failure. If the striped volume were built from 20-gigabyte disks, some 640 gigabytes of data would be made inaccessible by a single disk failure. Thus, striped volumes with significantly fewer than 32 columns are suggested.

The reasons for data striping are to improve data transfer performance or to improve I/O request performance. Data transfer performance improves when multiple disks transfer data in parallel to satisfy a single application request. But spreading the execution of a request across too many disks can result in inefficient disk use. For example, an application request for 256 kilobytes spread across 32 disks would mean that each disk contributes only 8 kilobytes of data, or 16 blocks, to the request's execution. On a typical modern disk, this is only about 5 percent of the data on a track. The aggregate rotational latency of N nonsynchronized disks that are accessed at the same time is N/(N + 1) times the revolution time. The 32-disk volume will therefore have a rotational latency of almost a full disk revolution added to a per-disk transfer time of less than 5 percent of a revolution. About 75 percent of this performance can be obtained from four disks (aggregate rotational latency of 80 percent of a revolution added to 50 percent of a revolution for data transfer). Almost 90 percent can be obtained from eight disks (aggregate rotational latency of 89 percent of a revolution added to 25 percent of a revolution for data transfer).
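These percentages can be reproduced with a simple model. The sketch below is my own; following the text's figures, it assumes the whole request amounts to roughly two revolutions' worth of data, so each of N disks transfers 2/N of a revolution, and it adds the aggregate rotational latency N/(N + 1) of N nonsynchronized disks.

```python
def relative_transfer_performance(columns, total_transfer_revs=2.0,
                                  reference_columns=32):
    """Expected service time (in revolutions) is aggregate rotational latency,
    N/(N + 1), plus per-disk transfer time; performance is reported relative to
    the 32-column configuration."""
    def service_revs(n):
        return n / (n + 1) + total_transfer_revs / n
    return service_revs(reference_columns) / service_revs(columns)

for n in (4, 8, 16, 32):
    print(n, round(relative_transfer_performance(n), 2))
# 4 -> ~0.79 ("about 75 percent"), 8 -> ~0.91 ("almost 90 percent"), 32 -> 1.0
```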
RECOMMENDATION 37 The lower data loss exposure and greater tractability of smaller volumes may be a good trade-off for the slightly diminished data transfer performance of “narrower” striped volumes.
The advantage of more columns in striped volumes is greater for I/O request-intensive loads. Assuming that frequently accessed files are large enough to stripe across 32 disks with a stripe unit size that minimizes the probability of "split" application I/O requests, more disks mean that more concurrent requests can be serviced, leading to shorter I/O queuing times and, therefore, better application performance.
APPENDIX 1
Disk and Volume States
Disk and Volume Status Descriptions

The two tables in this appendix list and define the disk and volume status categories that appear in the Status column when a disk or volume object is identified in the right panel of the Volume Manager or Logical Disk Manager console window. In Table A1.2, the parenthetical phrase "(At Risk)" following some of the status categories indicates that the Volume Manager encountered I/O errors on the volume but did not compromise the volume's data. "At Risk" can be interpreted to mean that volume integrity is deteriorating. An administrator can clear the At Risk state on the volume by using the Reactivate Disk or Reactivate Volume command.
Table A1.1 Volume Manager Disk Status Categories and Descriptions
Online
The disk is accessible and has no known problems. This is the normal disk status. No user action is required. Both dynamic disks and basic disks may display the Online status.
Online (Errors)
This status indicates that the disk is in an error state or that I/O errors have been detected on a region of the disk. All the volumes on the disk display Failed or Failed Redundancy status, meaning that it may not be possible to create new volumes on the disk. Only dynamic disks display this status. To restore the disk to Online status and bring volumes having subdisks allocated on it to Healthy status, right-click on the failed disk and select Reactivate Disk.
Offline
The disk is not accessible. The disk may be corrupted or intermittently unavailable. An error icon appears on the offline disk. Only dynamic disks display Offline status. If a disk’s status is Offline, and the disk name changes to Missing, it means the disk was recently available on the system but can no longer be located or identified by the Volume Manager. The missing disk may be corrupted, powered down, or disconnected.
Unreadable
The disk is not accessible. The disk may have experienced hardware failure, corruption, or I/O errors. The disk’s copy of the Volume Manager’s disk configuration database may be corrupted. An error icon appears on the unreadable disk. Both dynamic and basic disks may display Unreadable status. Disks may display Unreadable status while they are spinning up or when the Volume Manager is rescanning all the disks on a system. In some cases, an unreadable disk has failed and is not recoverable. For dynamic disks, Unreadable status usually results from corruption or I/O errors on part of the disk, rather than failure of the entire disk. Unreadable disks can be rescanned (using the Rescan Disks command); in some cases, a computer reboot may change an unreadable disk’s status.
Unrecognized
The disk has an original equipment manufacturer's (OEM) signature and the Volume Manager will not use the disk. Disks formatted for use on UNIX systems display Unrecognized status. Only Unknown disk types display Unrecognized status.

No Media
No media is inserted in a CD-ROM or removable drive. This status changes to Online when media is loaded into the CD-ROM or removable drive. Only CDROM or Removable disk types display No Media status.
Foreign Disk
The disk has been moved from another Microsoft Windows NT computer and has not been set up for use. Only dynamic disks display this status. To add the disk so that it can be used, right-click on the disk and select Import Foreign Disk. All volumes on the disk become visible and accessible. Because volumes can span more than one disk (e.g., a mirrored volume), all disks containing storage for a volume must remain together. If some of a volume's disks are not moved, the volume displays Failed Redundancy or Failed status.

No Disk Signature
Displays for new disks. Disks with no signature cannot be used. Right-click on the disk and select Write Signature from the menu. When the signature is written, disk status changes to Basic Disk, after which the disk can be accessed or upgraded.
Table A1.2 Volume Manager Volume Status Categories and Descriptions
Healthy
The volume is accessible and has no known problems. This is the normal volume status. No user action is required. Both dynamic volumes and basic volumes display Healthy status.
Healthy (At Risk)
The volume is currently accessible, but I/O errors have been detected on an underlying disk. If an I/O error is detected on any part of a disk, all volumes on the disk display Healthy (At Risk) status. A warning icon appears on the volume. Only dynamic volumes display Healthy (At Risk) status. When a volume’s status is Healthy (At Risk), it usually means an underlying disk’s status is Online (Errors). To return the underlying disk to Online status, reactivate the disk (using the Reactivate Disk command). Once a disk returns to Online status, volumes on it should display Healthy status.
Initializing
The volume is being initialized. Dynamic volumes display the Initializing status. No user action is required. When initialization is complete, a volume’s status changes to Healthy. Initialization normally completes very quickly.
Resynching
The volume’s mirrors are being resynchronized so that all contain identical data. Both dynamic and basic mirrored volumes can display Resynching status. No user action is required. When resynchronization completes, the mirrored volume’s status changes to Healthy. Resynchronization may take some time, depending on the size of the mirrored volume. Although a mirrored volume can be accessed by applications while resynchronization is in progress, configuration changes (such as breaking a mirror) should be avoided during resynchronization.
Regenerating
Data and parity are being regenerated for a RAID volume. Both dynamic and basic RAID volumes may display Regenerating status. No user action is required. When regeneration is complete, RAID volume status changes to Healthy. RAID volumes can be accessed by applications while regeneration is in progress.
Failed Redundancy
The data on the volume is no longer failure-tolerant because one or more underlying disks is not online. A warning icon appears on volumes with Failed Redundancy. Failed Redundancy status applies only to mirrored and RAID volumes. Both dynamic and basic volumes may display Failed Redundancy status. A volume with Failed Redundancy status can be accessed by applications, but if another disk on which the volume has storage fails, the volume and its data may be lost. To minimize the chance of such a loss, the volume should be repaired as soon as possible. Failed Redundancy status also displays for moved disks that contain spanned volume subdisks if not all of the volume’s subdisks are accessible. To avoid this situation, the entire disk set comprising a volume must be moved as a unit.
Failed Redundancy (At Risk)
Data on the volume is no longer failure-tolerant and I/O errors have been detected on an underlying disk. If an I/O error is detected on any part of a disk, all volumes on the disk display the Failed Redundancy (At Risk) status. A warning icon appears on the volume. Only dynamic mirrored or RAID volumes display Failed Redundancy (At Risk) status. When a volume’s status is Failed Redundancy (At Risk), the status of an underlying disk is usually Online (Errors). To return the disk to Online status, reactivate the disk (using the Reactivate Disk command). Once a disk is returned to the Online status, volume status changes to Failed Redundancy.
Failed
The volume cannot be started automatically. An error icon appears on the failed volume. Both dynamic and basic volumes display Failed status.
Formatting
The volume is being formatted using the specifications chosen by the administrator.
APPENDIX 2
Recommendations at a Glance
A summary of recommendations for managing online volumes in Windows operating system environments.
RECOMMENDATION 1 JBOD should not be used to store data that is vital to enterprise operation and that would be expensive or difficult to replace or reproduce.
RECOMMENDATION 2 Striped volumes should not be used to store data that is vital to enterprise operation and that would be expensive or difficult to replace or reproduce.
RECOMMENDATION 3 Striped volumes should be considered for storing large volumes of low-value or easily reproduced data to which high-performance access is required.
RECOMMENDATION 4 For nonfailure-tolerant storage, spanned or striped volumes are preferred over individually managed disks whenever the amount of storage required significantly exceeds the capacity of a single disk because they represent fewer storage objects to manage.
RECOMMENDATION 5 Striped volumes are preferred over spanned volumes because they generally improve I/O performance with no offsetting penalty cost.
RECOMMENDATION 6 Host-based RAID volumes should be avoided in applications in which there is a high rate of updates (over about 10 percent of the aggregate I/O request-handling capacity of the disks comprising the volume). Host-based RAID volumes are recommended only for “read-mostly” data.
RECOMMENDATION 7 Disk controller RAID volumes equipped with nonvolatile write-back cache may be used for more write-intensive applications (up to about 40 percent of the aggregate I/O request capacity of the disks comprising the volume).
RECOMMENDATION 8 Mirrored volumes should be used to store all business-critical data and metadata, including dirty region logs for mirrored volumes that use them, RAID volume update logs, database redo logs, and so on.
RECOMMENDATION 9 RAID volumes with more than 10 disks should not be configured except for the storage of easily replaceable “online archives”—data that is online for round-the-clock business convenience but that is never modified, or at most, modified infrequently and easily reproducible.
RECOMMENDATION 10 Continuous (i.e., not split periodically for backup) three-mirror volumes are recommended for data that is absolutely critical to enterprise operation.
RECOMMENDATION 11 The disks comprising a three-mirror volume should be connected to hosts using independent paths (cables, host bus adapters, connectors) to protect against path failure as well as disk failure.
RECOMMENDATION 12 The disk failure tolerance of three-mirror volumes is so high that four-mirror (or more) volumes should be used only if one or two of the mirrors are located remotely from the others for disaster recoverability (e.g., using optical Fibre Channel or other bus extension technology).
RECOMMENDATION 13 Mirroring should be combined with proactive failed disk discovery and replacement procedures that are automated to the greatest possible extent for maximum failure tolerance. Mean time to repair (MTTR), which includes volume content resynchronization time, is an extremely important contributor to data reliability.
RECOMMENDATION 14 If a mirror is regularly split from a three-mirror volume, any analysis similar to that shown in Figure 14.2 should take this into account. Susceptibility to failure is greater during the interval between splitting the third mirror and the completion of resynchronization after the third mirror storage is returned to the volume.
RECOMMENDATION 15 If possible, restoration of the storage comprising a split mirror to the original volume should be done during periods of low application I/O load, because resynchronization and regeneration are I/O-intensive activities that can adversely affect application performance.
RECOMMENDATION 16 The prospective purchaser of a RAID controller with write-back cache should undertake a thorough study of the behavior of write-back cache. In particular, issues such as holdup time or worst-case flush time and failure tolerance of the cache itself should be thoroughly understood.
RECOMMENDATION 17 Host-based mirrored volumes whose mirrors are failure-tolerant virtual disks presented by RAID controllers can increase system failure tolerance if they are configured so that each virtual disk is on a separate path.
RECOMMENDATION 18 Use host-based volumes to aggregate RAID controller-based virtual disks. Use hostbased striping of virtual disks to increase capacity and performance; use host-based mirroring of virtual disks to increase overall system failure tolerance.
RECOMMENDATION 19 Use host-based mirroring and/or striping of disk controller-based RAID virtual disks rather than the reverse (e.g., host-based RAID of disk controller-based striped volumes).
RECOMMENDATION 20 Consider the use of long-distance (metropolitan) mirroring as a possible alternative to data replication for achieving disaster tolerance for data.
RECOMMENDATION 21 Administrators should understand application storage usage characteristics, so that ad hoc incremental storage requirements can be anticipated and met without disrupting application service.
RECOMMENDATION 22 Whatever the need for additional storage, administrators must ensure that the amount of unallocated storage in each disk group is adequate. Not only must an appropriate level of unallocated storage be maintained, but the distribution of unallocated storage across disks must be such that management operations such as failure-tolerant volume expansion can be carried out without violating volume failure-tolerance and performance restrictions.
RECOMMENDATION 23 The policy of managing disk group capacity in fixed-size quanta whose size is a submultiple of the smallest disk in the group should be seriously considered for the flexibility advantages it brings. The quantum size should be chosen based on systemwide application characteristics.
RECOMMENDATION 24 Any policy for maintaining a minimum percentage of a disk group’s capacity as unallocated space should include a cap to avoid maintaining wastefully large amounts of free space.
RECOMMENDATION 25 Since it is usually entire disks that fail, spare capacity reserved for recovering from disk failures should generally take the form of entire disks whose capacity is at least as large as that of the largest disk in a failure-tolerant volume in the disk group.
RECOMMENDATION 26 An administrator should reserve one or more spare disks for every 10 disks that are part of failure-tolerant volumes, with a minimum of one spare disk for any disk group that contains failure-tolerant volumes.
RECOMMENDATION 27 When laying out volumes on disks, an administrator should locate data objects that depend on each other on separate volumes that occupy separate disks, so that a single disk failure does not incapacitate both data and its recovery mechanism.
RECOMMENDATION 28 If all of a RAID subsystem’s host ports are likely to be reconfigured from one host to another, virtual disks presented by that RAID subsystem should be placed in the same disk group so that they can be moved as a unit. (This recommendation obviously does not apply to so-called enterprise RAID subsystems with many host ports, which are likely to be connected to different hosts.)
RECOMMENDATION 29 Unless the Volume Manager is being used to aggregate the capacity or performance of two or more RAID subsystems, volumes should be configured from subdisks within one RAID subsystem wherever possible. (When aggregating the capacity or performance of two or more RAID subsystems, it is necessary to include the virtual disks presented by all of the subsystems in the same disk group.)
RECOMMENDATION 30 Administrators should monitor Volume Manager event logs (Figure 10.3), as well as system event logs on a regular basis, looking for failed disk events. System management consoles can be used to generate active messages based on events reported in system event logs.
RECOMMENDATION 31 While it is usually a good administrative practice to unrelocate subdisks to their original locations, the time at which to perform the unrelocation must be chosen carefully so as to not interfere with application performance during critical periods.
RECOMMENDATION 32 In a cluster, each application that fails over independently of other applications should have its data stored on volumes in disk groups exclusive to that application. This allows an application’s storage to fail over with it yet cause no adverse effects on other applications.
RECOMMENDATION 33 System administrators must decide, based on projected application and administrative needs, whether to use disk groups to create disjoint storage pools or to manage all storage as a common pool. In general, multiple pools give the administrator greater flexibility, while a common pool may be more convenient for applications.
RECOMMENDATION 34 A good compromise stripe unit size for I/O request-intensive applications is one that results in about a 3 to 5 percent probability of splitting in a uniform distribution of requests.
RECOMMENDATION 35 The ideal stripe unit size for data transfer-intensive applications that use a striped volume is the typical I/O request size of the application divided by the number of data disks in the stripe.
RECOMMENDATION 36 When multiple striped (failure-tolerant or not) volumes share a set of disks, it is good administrative practice to allocate the volumes in such a way that each volume’s block range starts on a different disk.
RECOMMENDATION 37 The lower data loss exposure and greater tractability of smaller volumes may be a good trade-off for the slightly diminished data transfer performance of “narrower” striped volumes.
Glossary of Storage Terminology
This glossary contains definitions of storage-related terms found in this book and in other storage-related documentation. access path The combination of host bus adapter, host I/O bus, bus address and logical unit number used by hosts to communicate with a physical or virtual disk or other storage device. In some configurations, there are multiple access paths for communicating with I/O devices. cf. multipath I/O active-active (controllers) Synonym for dual active controllers. active component A hardware component that requires electrical power to operate. Storage subsystem active components include power supplies, disks, fans, and controllers. By contrast, enclosure housings are not usually active components. active-passive (components) Synonym for hot standby. adapter Synonym for I/O adapter A hardware device that converts between the timing and protocol of two I/O buses. address (1) Any means for uniquely identifying a block of data stored on recording media. Disk block addresses typically decompose into a cylinder, head, and relative sector number at which data may be found. (2) A number that uniquely identifies a location (bit, byte, word, etc.) in a computer memory. addressing An algorithm by which areas of disk media or computer system main memory in which data is stored are uniquely identified. cf. block addressing, C-H-S addressing, explicit addressing, implicit addressing aggregation (1) Synonym for consolidation; the combination of multiple disk data streams into single aggregated volume data stream. (2) The combination of two or more I/O requests for adjacently located data into a single request to minimize processing overhead and I/O latency. American National Standards Institute (ANSI) A standards organization whose working committees are responsible for many of the computer system421
related standards in the United States. The ANSI working committees most closely associated with I/O interests are X3T10 and X3T11, responsible for SCSI and Fibre Channel standards, respectively. ANSI American National Standards Institute. array A disk array. application I/O request An I/O request made by a volume manager’s or disk subsystem’s clients, as distinguished from I/O operations requested by the volume manager or disk subsystem control software. Includes both operating system I/O requests (e.g., paging, swapping and file system metadata operations, etc.) and those made by user applications. application read request Synonym for application I/O request. application write request Synonym for application I/O request. asynchronous I/O requests (1) I/O requests that are unrelated to each other and whose execution may therefore overlap in time. (2) I/O requests that do not block execution of the applications making them. Applications must ascertain that asynchronous read requests have executed to completion before using data from them and that asynchronous write requests have executed to completion before taking other actions predicated on data having been written. asynchronous operations Operations that bear no time relationship to each other. Asynchronous operations may overlap in time. atomic operation An indivisible operation that occurs either in its entirety or not at all. Writes to mirrored volumes and file system metadata updates must be atomic, even if a system failure occurs while they are occurring. auto-swap Abbreviation for automatic swap. cf. cold swap, hot swap, warm swap automatic failover Synonym for failover (q.v). automatic swap The functional substitution of a replacement unit (RU) for a defective one, performed by the system itself while it continues to function normally. Automatic swaps are functional rather than physical substitutions and do not require human intervention. cf. cold swap, hot swap, warm swap
B backing store Any nonvolatile memory. Often used in connection with cache, which is a (usually) volatile random access memory used to speed up I/O operations. Data held in a volatile cache must be replicated in nonvolatile backing store to survive a system crash or power failure. Berkeley RAID Levels A family of disk array data protection and mapping techniques described in papers by researchers at the University of California at Berkeley. There are six Berkeley RAID levels, usually referred to by the names RAID Level 1 through RAID Level 6. block Short for disk block. block addressing An algorithm for referring to data stored on disk or tape media in which fixed-length blocks of data are identified (either explicitly or implicitly) by unique integers in a dense space. cf. addressing, C-H-S addressing booting Synonym for bootstrapping. bootstrapping The loading of operating system code from a disk or other storage device into a computer’s memory. Bootstrapping typically occurs in stages, starting
with a very simple program to read a sequence of blocks from a fixed location on a predetermined disk into a fixed memory location. The data read is the code for the next stage of bootstrapping; it typically causes an operating system to be read into memory, which then begins to execute. bridge controller A disk controller that forms a bridge between two external I/O buses. Bridge controllers connect single-ended SCSI disks to differential SCSI or Fibre Channel host I/O buses, for example. buffer Memory used to hold data momentarily as it moves along an I/O path. Buffers allow devices with different native data transfer speeds to intercommunicate. cf. cache
C cache Any memory used to reduce the time required to respond to an application read or write request. Read cache holds data in anticipation that it will be requested by a client. Write cache holds data written by a client until it can be safely stored on nonvolatile disk storage media. cf. buffer, disk cache, write-back cache channel (1) The electrical circuits that sense or cause the state changes in recording media and convert between those state changes and electrical signals that can be interpreted as data bits. (2) A synonym for I/O bus. The term channel has other meanings in other branches of computer technology. The definitions given here are commonly used when discussing I/O devices and subsystems. cf. device channel, device I/O bus, I/O bus, host I/O bus check data In a failure-tolerant volume or disk array, stored data that can be used to regenerate user data that becomes unreadable. C-H-S addressing Abbreviation for cylinder-head-sector addressing. cluster A collection of interconnected computers with links to common clients and common storage. These computers run cluster management software that coordinates the computers’ activities, including data access and application failover. cold swap The substitution of a replacement unit (RU) in a system for a defective one, where the system must be powered down to perform the substitution. A cold swap is a physical as well as a functional substitution. cf. automatic swap, hot swap, warm swap concurrency The property of overlapping in time. Usually refers to the execution of I/O operations or I/O requests. configuration (1) The physical installation or removal of disks, cables, HBAs, and other system components. (2) Assignment of the operating parameters of a system or subsystem. Disk array configuration, for example, includes designation of member disks or subdisks, as well as parameters such as stripe unit size, RAID type, and so on. cf. physical configuration. consolidation The accumulation of data for a number of sequential write requests in a cache, so that a smaller number of larger write operations can be used for more efficient device utilization. control software A software program that runs in a disk controller and manages one or more disk arrays. Control software presents each disk array to its operating environment as one or more virtual disks. Control software is functionally equivalent to host-based volume management software.
controller (1) The control logic in a disk drive that performs command decoding and execution, host data transfer, serialization and deserialization of data, error detection and correction, and overall device management. (2) The storage subsystem hardware and software that performs command transformation and routing, aggregation, error recovery, and performance optimization for multiple disks. controller-based array Synonym for controller-based disk array. controller-based disk array A disk array. A disk array’s control software executes in a disk subsystem controller. The disks comprising a controller-based array are necessarily part of the disk subsystem that includes the controller. cf. volume controller cache A cache that resides within a disk controller and whose primary purpose is to improve disk or array I/O performance. cf. cache, disk cache, host cache copyback The copying of disk or subdisk contents to one or more replacement disks. Copyback is used to create or restore a particular physical configuration for an array or volume (e.g., a particular arrangement of disks on device I/O buses). CRU Abbreviation for customer replaceable unit. customer replaceable unit A component of a system that is designed to be replaced by “customers,” that is, individuals who may not be trained as computer system service personnel. cf. field-replaceable unit, replaceable unit cylinder-head-sector (C-H-S) addressing A form of addressing data stored on a disk in which the cylinder, head/platter combination, and relative sector number on a track are specified. cf. block addressing
D data availability The expected continuous span of time over which applications can access correct data stored by a population of identical disk subsystems in a timely manner. Expressed as mean time to (loss of) data availability (MTDA). data manager A computer program whose primary purpose is to present applications with a convenient view of data and to map that view to the view presented by disks or volumes. File systems and database management systems are both data managers. data reliability The expected continuous span of time over which data stored by a population of identical disk subsystems can be correctly retrieved. Expressed as mean time to data loss (MTDL). data transfer capacity The maximum amount of data per unit time that can be moved across an I/O bus. For disk subsystem I/O, data transfer capacity is usually expressed in megabytes per second (millions of bytes per second, where 1 million = 10^6). cf. throughput data transfer-intensive (application) An application characterization. A data transfer-intensive application is I/O-intensive and makes large, usually sequential, I/O requests. data transfer rate The amount of data per unit time moved across an I/O bus in the course of executing an I/O load. The data transfer capacity of an I/O subsystem is an upper bound on its data transfer rate. For disk subsystem I/O, data transfer rate is usually expressed in megabytes per second (millions of bytes per second, where 1 million = 10^6). cf. data transfer capacity, throughput
degraded mode Synonym for reduced mode (q.v), which is the preferred term. device A storage device. device bus Synonym for device I/O bus, which is the preferred term. device channel A channel used to connect storage devices to a host I/O bus adapter or disk controller. Device I/O bus is the preferred term. device fanout The ability of a host bus adapter or disk controller to connect to multiple storage devices using a single host I/O bus address. Device fanout increases the amount of storage to which a computer system can be connected. device I/O bus An I/O bus used to connect storage devices to a host bus adapter or disk controller. directory A persistent data structure in a file system that contains information about other files. Directories are usually organized hierarchically; that is, a directory may contain information about files and other directories. Directories are used to organize collections of files for application or user convenience. disk A nonvolatile, randomly addressable, rewriteable physical data storage device. This definition includes both rotating magnetic and optical disks and solidstate disks, or nonvolatile electronic storage elements. It excludes devices such as write-once-read-many (WORM) optical disks and software-based RAM disks that emulate disks using dedicated host computer random access memory for storage. disk array A collection of disks in a disk subsystem, managed by the subsystem controller’s control software. The control software presents the disks’ storage capacity to hosts as one or more virtual disks. cf. volume disk array subsystem A disk subsystem with the capability to organize its disks into disk arrays. disk block The unit of data storage and retrieval in a fixed-block architecture disk. Disk blocks are of fixed size (with the most common being 512 bytes) and are numbered consecutively. The disk block is also the unit of data protection from media errors, for example by error correction code (ECC). cf. sector disk cache (1) A cache that resides within a disk. (2) A cache in a controller or host computer whose primary purpose is to improve disk, volume, or disk array I/O performance. cf. cache, controller cache, host cache disk striping A data mapping technique in which constant-length sequences of virtual disk or volume data addresses are mapped to sequences of disk addresses in a regular rotating pattern. Disk striping is sometimes called RAID Level 0 because it is similar to other RAID data mapping techniques. Disk striping includes no data protection, however. disk subsystem A storage subsystem that supports only disk devices. double buffering An application and data manager technique used to maximize data transfer rate by keeping two concurrent I/O requests for adjacently addressed data outstanding. An application starts a double-buffered I/O stream by making two I/O requests in rapid sequence. Thereafter, each time an I/O request completes, another is immediately made. If a disk or disk subsystem can process requests fast enough, double buffering enables data transfer at a disk or array’s full-volume transfer rate. driver Synonym for driver software. driver software An I/O driver.
dual active (components) A pair of components, such as the disk controllers in a failure-tolerant disk subsystem that share a task or set of tasks when both are functioning normally. When one component in a dual-active pair fails, the other takes on the entire task load. Most often used to describe controllers. Dual active controllers are connected to the same set of devices and provide a combination of higher I/O performance and greater failure tolerance than a single controller. Dual active components are also called active-active components.
E embedded controller Synonym for embedded disk controller. embedded disk controller A disk controller that mounts in a host computer’s housing and attaches directly to the host’s internal I/O bus. Embedded controllers are an alternative to host bus adapters and external host I/O buses. They differ from host bus adapters in that they provide mirroring and RAID functions. emit Synonym for export. explicit addressing A form of addressing used with disks in which data’s address is explicitly specified in I/O requests. cf. implicit addressing export To cause to appear and behave as. For example, volume managers and control software export disk like storage devices to applications. Synonym for present and emit. extent A set of consecutively addressed blocks allocated by a file system to a single file. An extent may be of any size. The data in a file is stored in one or more extents. external controller Synonym for external storage controller. external disk controller Synonym for external storage controller. external storage controller A storage controller that connects to host computers by means of external I/O buses. An external storage controller usually mounts in the enclosure that contains the disks it controls.
F failback The restoration of a failed system component’s share of a load to a replacement component. For example, when a failed disk controller dual active configuration is replaced, the devices that were originally controlled by the failed controller are usually failed back to the replacement controller to rebalance I/O load. failed over An operational mode of failure-tolerant systems in which the function of a component is being performed by a functionally equivalent component. A system with a failed-over component is generally not failure-tolerant, because failure of the redundant component may stop the system from functioning. failover The substitution of one functionally equivalent system component for another. The term is applied to pairs of disk controllers connected to the same storage devices and host computers. If one of the controllers fails, failover occurs and the survivor takes over its I/O load. The term is also used to describe the automatic or administrator-directed movement of an application from one computer in a cluster to another. failure tolerance The capability of a system to continue operation when one or more of its components fails. Disk subsystem failure tolerance is achieved by
including redundant instances of components whose failure would stop the system from operating, coupled with facilities by which redundant components can be made to take over the function of failed ones. fanout Synonym for device fanout fast SCSI A form of SCSI that provides 10 megatransfers per second. Wide-fast SCSI transfers 16 bits concurrently and therefore transfers 20 megabytes per second. Narrow-fast SCSI transfers 8 bits concurrently and therefore transfers 10 megabytes per second. cf. wide SCSI, Ultra SCSI, Ultra2 SCSI fault tolerance Synonym for failure tolerance. FBA Fixed-block architecture. Fibre Channel A serial I/O bus capable of transferring data at 100 megabytes per second. Fibre Channel exists in both arbitrated loop and switched topologies, using either optical or copper media. Fibre Channel was initially developed through industry cooperation, unlike parallel SCSI, which was initially developed by a single vendor and submitted to standardization after the fact. field replaceable unit (FRU) A system component that is designed to be replaced “in the field,” that is, without returning the system to a factory or repair depot. FRUs may either be customer-replaceable or their replacement may require trained service personnel. cf. customer-replaceable unit file An abstract data object made up of (a) an ordered byte vector of data stored on a disk or tape, (b) a unique symbolic name and, (c) a set of properties, such as ownership and access permissions that allow the object to be managed by a file system or backup manager. Unlike the permanent address spaces of disks, files may be created or deleted, and expand or contract in size during their lifetimes. file system A software component that structures the address space of a disk or volume as files so that applications may more conveniently access data. Some file systems are supplied as operating system components; others are available as independent software packages. fixed-block architecture (FBA) An architectural model of disks in which storage space is organized as a linearly addressed set of blocks of a fixed size. FBA is the disk architectural model on which SCSI is predicated. cf. count-key-data formatting The preparation of a disk for file system use by writing required metadata in some or all of its blocks. FRU Abbreviation for field replaceable unit full-volume data transfer rate The average rate at which a disk can transfer a large amount of data (up to its entire contents) in response to one I/O request. Fullvolume data transfer rate accounts for any delays (e.g., intersector gaps, track switching time, and seeks between adjacent cylinders) that may occur during a large data transfer.
G geometry In this book, a mathematical description of the layout of blocks on a disk. The main aspects of a disk’s geometry are (a) number of recording bands and the number of tracks and blocks per track in each, (b) number of data tracks per cylinder, and (c) number and layout of spare blocks reserved to repair media defects.
gigabyte Shorthand for 1,000,000,000 (10^9) bytes. This book uses the 10^9 convention commonly found in I/O-related literature rather than the 1,073,741,824 (2^30) convention sometimes used when discussing computer system random access memory (RAM).
H high availability The ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the combined reliabilities of its individual components would suggest. High availability is most often achieved through failure tolerance. High availability is not an easy term to quantify, as both the bounds of a system that is called highly available and the degree of availability must be clearly understood. host A host computer. host adapter A host bus adapter. Host bus adapter is the preferred term. host-based array Synonym for host-based disk array. host-based disk array A disk array whose control software executes in one or more host computers rather than in a disk controller. The member disks of a hostbased array may be part of different disk subsystems. cf. controller-based array, volume host bus A host I/O bus, which is the preferred term. host bus adapter (HBA) Preferred term for an I/O adapter that connects a host I/O bus to the host’s memory system. cf. host adapter, I/O adapter. host cache Any cache that resides within a host computer. When a host cache is managed by a file system or database, the data items stored in it are file or database objects. When host cache is managed by a volume manager, the cached data items are sequences of volume blocks. cf. cache, controller cache, disk cache host computer Any computer system to which disks or disk subsystems are attached and accessible for data storage and I/O. Mainframes, servers, workstations and personal computers, as well as multiprocessors and clustered computers, are all host computers. host environment Synonym for hosting environment. hosting environment A disk subsystem’s host computers, inclusive of operating system and other required software. The term host environment is used to emphasize that multiple host computers are being discussed or to emphasize the importance of an operating system or other software to the discussion. host I/O bus An I/O bus used to connect a host computer’s host bus adapter to storage subsystems or storage devices. cf. device I/O bus, I/O bus, channel hot disk A disk whose I/O request execution capacity is saturated by its I/O load. A hot disk can be caused by application I/O or by operating environment I/O, such as paging or swapping. hot file A frequently accessed file. Hot files are often the root cause of hot disks. hot spare (disk) A disk being used as a hot standby component. hot standby (component) A redundant component in a failure-tolerant storage subsystem that has power applied and is ready to operate, but that does not operate as long as the primary component for which it is standing by is functioning. Hot standby components increase storage subsystem availability by enabling subsys-
tems to continue functioning if a component fails. When used to refer to a disk, hot standby specifically means a disk that is spinning and writeable, for example, for rebuilding. hot swap The substitution of a replacement unit (RU) for a defective one while the system is operating. Hot swaps are physical operations typically performed by humans. cf. automatic swap, cold swap, warm swap
I implicit addressing A form of addressing used with tapes in which data address is implied by the access request. Tape requests do not include an explicit block number, instead specify the next or previous block from the current tape position. cf. explicit addressing independent access disk array A disk array that can execute multiple application I/O requests concurrently. cf. parallel access disk array inherent cost The cost of a system expressed in terms of the number and type of physical components it contains. Inherent cost enables comparison of disk subsystem technology alternatives by expressing cost in terms of number of disks, ports, modules, fans, power supplies, cabinets, etc. Because it is not physical, software is treated as having negligible inherent cost. initiator The system component that originates an I/O command over an I/O bus or network. I/O adapters, network interface cards (NICs) and disk controller device I/O bus control ASICs (application-specific integrated circuits) are all initiators. cf. LUN, target, target ID. interconnect A set of physical components by which system elements are connected and through which they can communicate. I/O buses and networks are both interconnects. I/O Abbreviation for input/output. The process of moving data between a computer system’s main memory and a nonvolatile storage device or a network connected to other computer systems. I/O may consist of reading, or moving data into the computer system’s memory, and writing, or moving data from the computer system’s memory to another location. I/O adapter Synonym for host bus adapter. An adapter that converts between the timing and protocol of a host’s memory bus and that of an I/O bus. I/O adapters differ from embedded disk controllers, which not only convert between buses, but also perform other functions such as device fanout, data caching, and RAID. I/O bottleneck Any resource in an I/O path (e.g., device driver, host bus adapter, I/O bus, disk controller, cache, or disk) that limits the performance of the path as a whole. I/O bus Any path used to transfer data and control information between components of an I/O subsystem. An I/O bus consists of an interconnect, connectors, and all associated electrical components such as drivers, receivers, and transducers. I/O buses are typically optimized for data transfer and support simpler topologies than networks. An I/O bus that connects a host computer’s host bus adapter to storage controllers or devices is called a host I/O bus. An I/O bus that connects disk controllers or host I/O bus adapters to devices is called a device I/O bus. cf. channel, device channel, device I/O bus, host I/O bus, network
I/O driver A host computer software component (usually part of an operating system) whose function is to control the operation of host bus adapters. I/O drivers communicate between applications and I/O devices, and in smaller systems, participate in data transfer, although this is rare with disk drivers. I/O-intensive An application characterization. An I/O-intensive application is one whose performance depends more strongly on I/O performance than on processor or network performance. I/O load A sequence of I/O requests made to an I/O subsystem. I/O loads include both application I/O and host environment overhead I/O, such as swapping, paging, and file system metadata access. I/O load balancing Synonym for load balancing. I/O operation A read, write, or control operation on a disk. I/O operations are executed by volume managers or control software to satisfy application I/O requests made to volumes or virtual disks. cf. I/O request I/O request A request by an application or data manager to read or write a specified amount of data from a disk or volume. An I/O request specifies the transfer of a number of blocks of data between consecutive block addresses and contiguous memory locations. cf. I/O operation I/O subsystem A collective term for all of the disks and storage subsystems attached to a computer system or cluster.
J JBOD Acronym for “just a bunch of disks”; pronounced “jay-bod.” Used to refer to any collection of disks without the coordinated control provided by a volume manager or control software.
K kilobyte 1,024 (2^10) bytes of data. This book uses the software convention (2^10) for kilobytes and the data transmission conventions for megabytes (10^6) and gigabytes (10^9). This is due primarily to the contexts in which the terms are normally used.
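To make the mixed conventions in the kilobyte entry above concrete, the short Python sketch below converts kilobyte, megabyte, and gigabyte counts to bytes using exactly those conventions. The sketch is illustrative only and is not drawn from the book.

# Illustrative sketch of the unit conventions described above (not from the book):
# kilobytes use the software convention (2**10), while megabytes and gigabytes
# use the data transmission conventions (10**6 and 10**9).

KILOBYTE = 2 ** 10          # 1,024 bytes
MEGABYTE = 10 ** 6          # 1,000,000 bytes
GIGABYTE = 10 ** 9          # 1,000,000,000 bytes

def to_bytes(count, unit):
    """Convert a count of kilobytes, megabytes, or gigabytes to bytes."""
    return count * {"KB": KILOBYTE, "MB": MEGABYTE, "GB": GIGABYTE}[unit]

if __name__ == "__main__":
    print(to_bytes(64, "KB"))   # 65536 -- a typical "large" I/O request
    print(to_bytes(1, "MB"))    # 1000000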
L large I/O request An I/O request that specifies the transfer of a large amount of data. “Large” obviously depends on the context, but typically refers to 64 kilobytes or more. cf. small I/O request large read request Synonym for large I/O request. large write request Synonym for large I/O request. latency (1) The time between the making of an I/O request and completion of the request’s execution. I/O latency includes both request execution time and time spent waiting for resources in the I/O path to be available. (2) Short for rotational latency, the time between the completion of a seek and the instant of arrival of the first block of data to be transferred at the disk’s read/write head. latent fault The failure of a passive component in an active passive pair. Latent faults prevent passive components from operating when required and, therefore, defeat failure tolerance. Well-designed failure-tolerant systems test for latent faults periodically.
LBA Abbreviation for logical block address. load balancing The adjustment of component roles and I/O requests so that I/O demands are spread as evenly as possible across an I/O subsystem’s resources. I/O load balancing may be done manually (by a human) or automatically (without human intervention). cf. I/O load optimization, load sharing load optimization The division of an I/O load or task in such a way that overall performance is optimized. With components of equal performance, load optimization is achieved by load balancing. If individual component performance differs markedly, load optimization may be achieved by directing a disproportionate share of the load to higher-performing components. cf. load balancing, load sharing load sharing The division of an I/O load or task among several disk subsystem components, without any attempt to equalize each component’s share of the load. When a disk subsystem is load sharing, it is possible for some of the sharing components to be operating at full capacity and limiting performance, while others are underutilized. cf. I/O load balancing, load optimization logical block A block of data stored on a disk and associated with an address for purposes of reading or writing. Logical block typically refers to a host or controller view of data on a disk. Within a disk, there is a further conversion between the logical block addresses presented to hosts and the physical media locations at which the corresponding data is stored. cf. physical block, virtual block logical block address (LBA) The address of a logical block. Logical block addresses are used in hosts’ I/O commands. The SCSI disk command protocol, for example, uses logical block addresses. logical unit The entity within a SCSI device that executes I/O commands. SCSI I/O commands are sent to a target and executed by a logical unit within that target. A SCSI disk has a single logical unit. Tape drives and disk array controllers often incorporate multiple logical units to which control and I/O commands are addressed. Each logical unit presented by an array controller corresponds to a virtual disk. cf. LUN, target, target ID logical unit number (LUN) The SCSI identifier of a logical unit within a target. LUN Logical unit number.
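As an illustration of the logical block address entry above, the following Python sketch shows the classic conversion between a cylinder-head-sector position and a logical block address that older disks and BIOSes used. The geometry figures are hypothetical, and modern disks present only logical block addresses, performing any such mapping internally; the sketch is illustrative only and is not drawn from the book.

# Hypothetical illustration of logical block addressing: the classic mapping
# between a cylinder/head/sector (C-H-S) position and a logical block address
# (LBA). The geometry values below are made up for the example.

HEADS_PER_CYLINDER = 16      # hypothetical geometry
SECTORS_PER_TRACK = 63       # hypothetical geometry (sectors are numbered from 1)

def chs_to_lba(cylinder, head, sector):
    """Convert a C-H-S address to a zero-based logical block address."""
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)

def lba_to_chs(lba):
    """Convert a logical block address back to a (cylinder, head, sector) tuple."""
    cylinder, remainder = divmod(lba, HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head, sector_index = divmod(remainder, SECTORS_PER_TRACK)
    return cylinder, head, sector_index + 1

assert lba_to_chs(chs_to_lba(2, 5, 17)) == (2, 5, 17)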
M mapping Conversion between two data addressing spaces. For example, conversion between physical disk block addresses and block addresses of volumes or virtual disks presented to the operating environment by a volume manager or control software. media (1) The material in a storage device on which data is recorded. (2) A physical link on which data is transmitted between two points. megabyte Shorthand for 1,000,000 (10^6) bytes. This book uses the 10^6 convention commonly found in I/O-related literature rather than the 1,048,576 (2^20) convention common in computer system random access memory and software literature. megatransfer The transfer of 1 million data units. Used to express the characteristics of parallel I/O buses like SCSI, for which the data transfer rate (megabytes per second) depends upon both the transfer rate (megatransfers per second) and the amount of data transferred in each data cycle. cf. SCSI, fast SCSI, Ultra SCSI, Ultra2 SCSI, wide SCSI
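To make the megatransfer arithmetic concrete, the short Python sketch below multiplies the transfer rates quoted elsewhere in this glossary (20 megatransfers per second for Ultra SCSI, 40 for Ultra2 SCSI) by a data path width of one byte (narrow) or two bytes (wide). The sketch is illustrative only and is not drawn from the book.

# A small worked example of the megatransfer arithmetic described above, using
# the transfer rates given in the Ultra SCSI, Ultra2 SCSI, and wide SCSI entries
# of this glossary.

def megabytes_per_second(megatransfers_per_second, bytes_per_transfer):
    """Data transfer rate = transfer rate x width of each data cycle."""
    return megatransfers_per_second * bytes_per_transfer

print(megabytes_per_second(20, 1))   # narrow Ultra SCSI:  20 MB/s
print(megabytes_per_second(20, 2))   # wide Ultra SCSI:    40 MB/s
print(megabytes_per_second(40, 2))   # wide Ultra2 SCSI:   80 MB/s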
member (disk) A disk that is part of a disk array or volume. metadata Data that describes other data. For volumes and disk arrays, metadata includes subdisk sizes and locations, stripe unit size, and state information. File system metadata includes file names, file properties, security information, and lists of block addresses at which each file’s data is stored. mirror Synonym for mirrored disk. mirrored array Synonym for mirrored volume. mirrored disk A disk that is a member of a mirrored volume or array and therefore holds one of two or more identical copies of the volume’s data. mirroring A form of volume and disk array failure tolerance in which control software maintains two or more identical copies of data on separate disks. Also known as RAID Level 1. mirrored volume A failure-tolerant volume or disk array that implements mirroring to protect data against loss due to disk or device I/O bus failure. MTBF Abbreviation for mean time between failures; the average time from start of use to first failure in a large population of identical systems, components, or devices. MTDA Abbreviation for mean time until (loss of) data availability; the average time from start of use until a component failure causes loss of user data accessibility in a large population of failure-tolerant volumes or disk arrays. Loss of availability may not imply loss of data. For some types of failure (e.g., failure of a nonredundant disk controller), data remain intact and can be accessed after the failed component is replaced. MTDL Abbreviation for mean time until data loss; the average time from start of use until a component failure causes permanent loss of user data in a large population of volumes or disk arrays. The concept is similar to the MTBF used to describe physical device characteristics, but takes into account the possibility that RAID redundancy can protect against data loss due to single component failures. MTTR Abbreviation for mean time to repair; the time required to repair a fault in a device or system and restore the device or system to service, averaged over a large number of faults. multipath (I/O) A facility for a host to direct a stream of I/O requests to a disk or other device on more than one access path. Multipath I/O requires that devices be uniquely identifiable by some means other than by bus address. multithreaded Having multiple concurrent or pseudo-concurrent execution sequences. Used to describe processes in computer systems. Multithreaded processes are one means by which I/O request-intensive applications can use independent-access volumes and disk arrays to increase I/O performance.
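The MTBF, MTDL, and MTTR entries above are related by a first-order approximation that is widely used in the storage literature, although it is not a formula from this book: for a two-disk mirrored volume, data is lost only if the second disk fails while the first is being repaired, so MTDL is roughly the square of the disk MTBF divided by twice the MTTR. The Python sketch below illustrates the arithmetic with hypothetical figures.

# A commonly cited first-order approximation (not taken from this book) relating
# MTDL to MTBF and MTTR for a two-disk mirrored volume.

def mirrored_pair_mtdl(disk_mtbf_hours, mttr_hours):
    """Approximate mean time until data loss for a two-disk mirrored volume."""
    # Data is lost only if the second disk fails during the repair window of the first.
    return disk_mtbf_hours ** 2 / (2 * mttr_hours)

# Hypothetical figures: 500,000-hour disk MTBF, 24-hour repair window.
print(mirrored_pair_mtdl(500_000, 24))   # roughly 5.2e9 hours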
N naming The mapping of an address space to a set of objects. Naming is typically used either for human convenience (e.g., to attach symbolic names to files or storage devices) or to establish a level of independence between two types of system components (e.g., identification of files by inode names or identification of computers by IP addresses). namespace The set of valid names recognized by a file system. One of the four basic functions of file systems is management of a namespace so that invalid and duplicate names do not occur.
NAS Acronym for network attached storage. network An interconnect that links computers to other computers or to storage subsystems. Networks are typically characterized by flexible interconnection of large numbers of devices rather than by optimal transfer of large streams of data. cf. channel, I/O bus network attached storage (NAS) A class of systems that provide file services to host computers. A host system that uses network attached storage uses a file system device driver to access data using file access protocols such as NFS (network file system) or CIFS (common internet file system). NAS systems interpret these commands and perform the internal file system and device I/O operations necessary to execute them. cf. storage area network nontransparent failover A failover from one component of a disk subsystem to another that is visible to the external environment. Usually refers to paired controllers, one of which presents the other’s virtual disks at different host I/O bus addresses or on a different host I/O bus after a failure. cf. transparent failover nonvolatility A property of data. Nonvolatile data is preserved even when certain environmental conditions are not met. Used to describe data stored on disks and tapes. Data stored on these devices is preserved when electrical power is cut. cf. volatility normal operation A system state in which all components are functioning properly, no recovery actions (e.g., reconstruction) are being performed, environmental conditions are within operational range, and the system is able to perform its intended function.
O open interconnect A standard interconnect. operating environment The host environment within which a storage subsystem operates. The operating environment includes the host computer(s) to which the storage subsystem is connected, host I/O buses and adapters, host operating system instances, and any required software. For host-based volumes, the operating environment includes I/O driver software for member disks, but does not include the volume manager.
P parallel access array A disk array model in which all member disks operate in unison and participate in the execution of every application I/O request. A parallel access array is inherently able to execute one I/O request at a time. True parallel access arrays require physical disk synchronization; much more commonly, independent arrays approximate true parallel access behavior. cf. independent access array parity RAID A collective term used to refer to Berkeley RAID Levels 3, 4, and 5. parity RAID array A RAID array whose data protection mechanism is one of the Berkeley RAID Levels 3, 4, or 5. partition (1) A consecutive range of logical block addresses on a disk presented as an address space. (2) Synonym for subdisk. partitioning (1) Presentation of the usable storage capacity of a physical or virtual disk to an operating environment in the form of several block address spaces
whose aggregate capacity approximates that of the underlying disk or array. (2) The process of subdividing a disk into partitions. path Access path. physical block An area on recording media in which data is stored. Distinguished from the logical and virtual block views presented to host computers by storage devices. physical block address The address of a physical block. A number that can be converted to a physical location on storage media. physical configuration The installation or removal of disks, cables, host bus adapters, or other components required for a system or subsystem to function. Physical configuration typically includes address assignments, such as PCI slot numbers, SCSI target IDs, and LUNs cf. configuration physical disk A disk. Often used to emphasize a contrast with virtual disks and volumes. plex A collection of subdisks managed by a volume manager that provides data mapping and in some cases, failure tolerance. Within a plex, a single data mapping and a single type of data protection are employed. policy processor In a disk controller, host bus adapter, or storage device, the processor that schedules device activities. Policy processors usually direct additional processors or sequencers that perform the lower-level functions required to implement policy. port (1) An I/O adapter used to connect a storage controller to storage devices. (2) A synonym for device I/O bus. present To cause to appear and behave as. Control software presents virtual disks to its host environment. Synonym for export and emit. proprietary interconnect Synonym for proprietary I/O bus. proprietary I/O bus An I/O bus whose transmission characteristics and protocols are the intellectual property of a single vendor and that require the permission of that vendor to be implemented in the products of other vendors. protocol A set of rules or standards for using an interconnect so that information conveyed on the interconnect can be correctly interpreted by all parties to the communication. Protocols include such aspects of communication as data item ordering, message formats, message and response sequencing rules, and block data transmission conventions.
R RAID Acronym for redundant array of independent disks, a family of techniques for managing multiple disks to deliver desirable cost, data availability, and performance characteristics to host environments. RAID array Synonym for a redundant array of independent disks. RAM disk A quantity of host system RAM managed by software and presented to applications as a high-performance disk. RAMdisks emulate disk I/O functional characteristics, but unless they are augmented by special hardware to make their contents nonvolatile, they lack one of the key capabilities of disks and are not treated as disks. cf. solid state disk random I/O Synonym for random reads.
random I/O load Synonym for random reads. random reads, random writes An I/O load whose consecutively issued read and/or write requests do not specify adjacently located data. Random I/O is characteristic of I/O request-intensive applications. cf. sequential I/O rank (1) A set of physical disk positions in an enclosure, usually denoting the disks that are or can be members of a single array. (2) The set of corresponding target identifiers on all of a controller’s device I/O buses. As in the preceding definition, the disks identified as a rank by this definition usually are or can be members of a single array. (3) Synonym for a stripe in a plex. (Because of the diversity of meanings commonly attached to this term, this book does not use it.) raw partition A disk partition managed directly by a database management system. The term raw partition is frequently encountered when discussing database management systems because some database management systems use files to organize their underlying storage, while others make block I/O requests directly to raw partitions. read/write head The magnetic or optical recording device in a disk. Read/write heads are used both to write data by altering the recording media’s state and to read data by sensing the alterations. Disks typically use the same head to read and write data, although some tapes have separate read and write heads. rebuilding The regeneration and writing onto one or more replacement disks of all of the data from a failed disk in a mirrored or RAID volume or array. Rebuilding can occur while applications are accessing data. reconstruction Synonym for rebuilding. reduced mode A mode of failure-tolerant system operation in which not all of the system’s components are functioning, but the system as a whole is operational. A mirrored or RAID volume in which a disk has failed operates in reduced mode. reduction The removal of a redundant component from a failure-tolerant system, placing the system in reduced mode. Volume or disk array reduction most often occurs because of disk failure; however, some implementations allow reduction for system management purposes. redundancy The inclusion of extra components of a given type in a system (beyond those required by the system to carry out its function). redundant (components) Components installed in a system that can be substituted for each other if necessary to enable the system to perform its function. Redundant power distribution units, power supplies, cooling devices, and controllers are often configured in storage subsystems. The disks comprising a mirrored volume are redundant. A RAID volume’s disks are collectively redundant, since surviving disks can perform the function of one failed disk. redundant array of independent disks A volume or disk array in which part of the storage capacity is used to store redundant information about user data stored on the remainder of the storage capacity. Redundant information enables regeneration of user data if a disk or access path to it fails. redundant configuration Synonym for redundant system. redundant system A system or configuration of a system in which failure tolerance is achieved through the presence of redundant instances of all components that are critical to the system’s operation.
regeneration Re-creation of user data from one disk of a RAID volume or array using check data and user data from surviving disks. Regeneration can recover data when a disk fails, or when an unrecoverable media error is encountered. Data are regenerated by executing the parity computation algorithm on the appropriate user and check data. replacement disk A disk available for use as or used to replace a failed disk in a failure-tolerant volume or array. replacement unit (RU) A collection of system components that is always replaced (swapped) as a unit when any part of the collection fails. Replacement units may be field replaceable, or they may require that the system of which they are part be returned to a factory or repair depot for replacement. Field replaceable units may be customer replaceable, or replacement may require trained service personnel. Typical disk subsystem replacement units include disks, controller modules, power supplies, cooling devices and cables. Replacement units may be cold, warm, or hot swapped. request-intensive (application) A characterization of applications. Also known as throughput-intensive. A request-intensive application is an I/O-intensive application whose I/O load consists primarily of large numbers of I/O requests for relatively small amounts of data. Request-intensive applications are typically characterized by random I/O. rotational latency The interval between the end of a disk seek and the time at which a block of data specified in the I/O request first passes the disk head. Rotational latency is difficult to calculate exactly, but an assumption that works well in practice is that, on average, a request waits for half of a disk revolution time. Half of a disk revolution time is therefore defined to be the average rotational latency. row A stripe unit-aligned set of blocks with corresponding subdisk block addresses in each of a volume’s subdisks. RU Abbreviation for replacement unit. cf. CRU, FRU
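As an illustration of the regeneration described above, the following Python sketch shows the parity computation for a parity RAID volume: the check data block is the bitwise exclusive OR (XOR) of the corresponding user data blocks, so the contents of any one failed disk can be regenerated by XOR-ing the surviving disks' blocks with the parity block. The sketch is illustrative only; it is not code from any particular volume manager or controller.

# Illustrative sketch (not from the book) of XOR-based parity regeneration.

from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

disk0 = bytes([0x11, 0x22, 0x33, 0x44])
disk1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk0, disk1, disk2])        # check data written by the array

# Suppose disk1 fails: its contents can be regenerated from the survivors.
regenerated = xor_blocks([disk0, disk2, parity])
assert regenerated == disk1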
S SAN Acronym for storage area network. saturated disk A disk for which the I/O load is at least as great as its capability to satisfy the requests comprising the load. In theory, a saturated disk’s I/O queue eventually becomes infinitely long if the I/O load remains constant. In practice, user reaction and other system factors generally reduce the rate of new I/O request arrival for a saturated disk. scale To grow or support growth of a system in such a way that all of the system’s capabilities remain in constant ratio to each other. For example, a storage subsystem whose data transfer capacity increases by the addition of buses at the same time its storage capacity increases by the addition of disks may be said to scale. script (1) A sequence of operating system commands designed to accomplish a frequently repeated task. (2) A parameterized list of primitive I/O bus operations executed autonomously (without policy processor assistance) by a host bus adapter.
SCSI Acronym for Small Computer System Interface. sector The unit in which data is physically stored and protected against errors on a fixed-block architecture disk. A sector typically consists of a synchronization pattern, a header field containing the block’s address, user data, a checksum or error correcting code, and a trailer. Adjacent sectors on a track are often separated by servo information used for track centering. cf. disk block serial SCSI Any implementation of SCSI that uses a single-signal bus (as opposed to multiconductor parallel cables). Optical and electrical Fibre Channel, SSA (serial storage architecture), and 1394 are examples of serial SCSI implementations. sequential I/O Synonym for sequential writes. sequential I/O load Synonym for sequential writes. sequential reads Synonym for sequential writes. sequential writes An I/O load consisting of consecutively issued read or write requests to consecutively addressed data blocks. Sequential I/O is characteristic of data transfer-intensive applications. cf. random I/O single point of failure A nonredundant component or path in a system whose failure would make the system inoperable. Abbreviated SPOF. Small Computer System Interface (SCSI) A collection of ANSI standards and proposed standards that define I/O buses primarily intended for connecting storage subsystems and devices to hosts through host bus adapters. Originally intended primarily for use with small (desktop and desk-side workstation) computers, SCSI has been extended to serve most computing needs and is the most widely implemented server I/O bus in use today. small I/O request Synonym for small write request. small read request Synonym for small write request. small write request An I/O read or write request that specifies the transfer of a relatively small amount of data. How small obviously depends on the context, but most often refers to 8 kilobytes or fewer. cf. large I/O request solid-state disk A disk whose storage consists of solid-state random access memory rather than magnetic or optical media. A solid-state disk generally offers very fast access compared to rotating magnetic disks, because it eliminates mechanical seek and rotation time. It may also offer very high data transfer capacity. Cost per byte, however, is typically quite high and volumetric density is lower. Solid-state disks include mechanisms such as battery backup or magnetic backing store that make data stored on them nonvolatile. cf. RAMdisk spare Synonym for spare disk. spare disk A disk specifically reserved for the purpose of substituting for a disk of equal or lesser capacity in case of a failure. spiral data transfer rate The full-volume data transfer rate of a disk. split I/O request (1) An I/O request to a volume or virtual disk that requires two or more I/O operations to satisfy, because the volume data addresses map to more than one disk. (2) An application I/O request that is divided into two or more subrequests by a data manager because the amount of data requested is too large for the operating environment to handle. SPOF Acronym for single point of failure.
standard interconnect An I/O interconnect (either a host interconnect or a device interconnect) whose specifications are readily available to the public and that can therefore easily be implemented in a vendor’s products. storage area network (SAN) A network whose primary purpose is the transmission of I/O commands and data between host computers and storage subsystems or devices. Typically, SANs use serial SCSI to transmit data between servers and storage devices. They may also use bridges to connect to parallel SCSI subsystems and devices. cf. network attached storage storage array A collection of disks or tapes that are part of a storage subsystem, managed as a unit by a body of control software. storage device A collective term for disk drive, tape transport, and other mechanisms capable of nonvolatile data storage. storage subsystem One or more storage controllers and/or host bus adapters and the storage devices such as disks, CD-ROMs, tape drives, media loaders, and robots connected to them. stripe The set of stripe units at corresponding locations of each of a plex’s subdisks. stripe size The number of blocks in a stripe. A plex’s or array’s stripe size is its stripe unit size multiplied by the number of subdisks. stripe unit A number of consecutively addressed blocks in a single extent. A volume manager or disk array control software uses stripe units to map virtual disk block addresses to disk block addresses. stripe unit size The number of blocks in a stripe unit in a disk array that uses striped data mapping. Also, the number of consecutively addressed virtual disk blocks mapped to consecutively addressed blocks on a single member extent of a disk array. striped array Synonym for striped volume. striped disk array Synonym for striped volume. striped volume A volume or disk array with striped data mapping but no failure tolerance. Striped volumes and arrays are used to improve I/O performance for data that can easily be replaced if a disk fails. stripeset A synonym for striped array. striping Short for disk striping; also known as RAID Level 0 or RAID 0. A mapping technique in which fixed-size consecutive ranges of virtual disk data addresses are mapped to successive subdisks in a cyclic pattern. subdisk A number of consecutively addressed blocks on a disk. Subdisks are created by volume managers as building blocks from which plexes and volumes are created. subdisk block number The relative position of a block within a subdisk. Subdisk block numbers are used to construct the higher-level plex data-mapping construct, not for application data addressing. swap, swapping The installation of a replacement unit in place of a unit in a system. A unit is any component of a system that may be field replaced by a vendor service representative (FRU) or by a consumer (CRU). A physical swap operation may be cold, warm, or hot, depending on the state the disk subsystem must be in to perform it. A functional swap operation may be an automatic swap or it may be part of a physical swap operation requiring human intervention.
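To illustrate the cyclic mapping described in the stripe unit and striping entries above, the following Python sketch converts a virtual disk block address into a column (subdisk) and an offset within that column. The stripe unit size and column count are hypothetical, and the sketch is illustrative only, not a volume manager's actual algorithm.

# Illustrative sketch (not from the book) of striped data mapping: consecutive
# stripe-unit-sized ranges of virtual block addresses are assigned to
# successive columns in a cyclic pattern.

STRIPE_UNIT = 128        # blocks per stripe unit (hypothetical)
COLUMNS = 4              # number of subdisks in the striped plex (hypothetical)

def map_virtual_block(virtual_block):
    """Return (column index, block offset within that column) for a virtual block."""
    stripe_unit_number, offset_in_unit = divmod(virtual_block, STRIPE_UNIT)
    stripe_number, column = divmod(stripe_unit_number, COLUMNS)
    return column, stripe_number * STRIPE_UNIT + offset_in_unit

print(map_virtual_block(0))      # (0, 0)     -- first stripe unit, first column
print(map_virtual_block(130))    # (1, 2)     -- second stripe unit, second column
print(map_virtual_block(512))    # (0, 128)   -- fifth stripe unit wraps to column 0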
synchronous operations Operations that have a fixed time relationship to each other. Commonly used to denote I/O operations that occur in time sequence; that is, a successor operation does not occur until its predecessor is complete. system disk Synonym for system volume. system partition Synonym for system volume. system volume A disk, partition, or volume containing the programs required to boot an operating system. The system disk is the disk from which the operating system is bootstrapped (initially loaded into memory). System volumes frequently contain operating system program images, paging files, and crash dump space, although this is not always the case. They may also contain software libraries shared among several applications.
T target In SCSI standards, a system component that receives I/O commands. cf. initiator, LUN, target ID target ID The SCSI bus address of a target device or controller. terabyte Shorthand for 1,000,000,000,000 (10^12) bytes. This book uses the 10^12 convention commonly found in I/O literature rather than the 1,099,511,627,776 (2^40) convention sometimes used when discussing random access memory. throughput The number of I/O requests executed per unit time. Expressed in I/O requests per second, where a request is an application request to a storage subsystem to perform a read or write operation. throughput-intensive (application) A request-intensive application. transparent failover A failover that has no functional effect on the external environment. Often used to describe paired disk controllers, one of which presents the other’s virtual disks at the same host bus addresses after a failure. cf. nontransparent failover
U Ultra SCSI A form of SCSI capable of 20 megatransfers per second. Single-ended Ultra SCSI is restricted to shorter cable lengths than versions of SCSI with lower data transfer rates. Ultra2 SCSI A form of SCSI capable of 40 megatransfers per second. In addition to a higher maximum data transfer rate than older forms of SCSI, Ultra2 SCSI includes low-voltage differential (LVD) signaling, which provides reliable data transfer for low-power devices distributed on a bus up to 12 meters long. usable capacity The storage capacity in a volume, disk array, or disk that is available for storing user data. Usable capacity of a disk is total capacity minus any capacity reserved for media defect compensation and disk metadata. Usable capacity of a volume or disk array is the sum of the usable capacities of its subdisks minus capacity required for check data.
V VBA Abbreviation for virtual block address. virtual block Synonym for volume block.
virtual block address The address of a volume block or virtual block. Volume block addresses are used in hosts’ I/O requests addressed to volumes and virtual disks. SCSI disk commands addressed to controller-based RAID arrays actually specify virtual block addresses in their logical block address fields. virtual device Synonym for virtual disk. virtual disk A disklike storage device presented to an operating environment by disk array controller control software. From an application standpoint, virtual devices are equivalent to physical devices. Some low-level operations (e.g., operating system bootstrapping) may not be possible with virtual devices. volatility A property of data. Volatility refers to the property that data will be obliterated if certain environmental conditions are not met. For example, data held in DRAM (Dynamic random access memory) is volatile, since if electrical power to DRAM is cut, data in it is obliterated. cf. nonvolatility volume block address Synonym for virtual block address. volume block A block in the address space presented by a volume or virtual disk. Virtual blocks are the atomic units in which volume managers and control software present storage capacity to their operating environments.
W warm spare (disk) An installed spare disk that is powered on, but is not spinning. warm swap The substitution of a replacement unit (RU) in a system for a functionally identical one, where in order to perform the substitution, the system must be stopped (caused to cease performing its function), but need not be powered down. Warm swaps are physical operations performed by humans. cf. auto-swap, cold swap, hot-swap. wide SCSI Any form of parallel SCSI using a 16-bit data path. In a wide SCSI implementation, the data transfer rate in megabytes per second is twice the number of megatransfers per second because each data cycle transfers two bytes. cf. fast SCSI, Ultra SCSI, Ultra2 SCSI write-back cache A caching technique in which write request completion is signaled as soon as data is in cache and actual writing to disk media occurs at a later time. Write-back caching carries an inherent risk of an application taking action predicated on the write completion signal. A system failure before data is written to disk media may result in media contents that are inconsistent with that action. For this reason, write-back cache implementations include mechanisms that preserve cache contents across system failures (including power failures) and to flush the cache to disk when the system restarts. cf. write-through cache write hole A potential data corruption in failure-tolerant volumes and disk arrays that results from a system failure while application I/O is outstanding, causing partial completion of sequences of I/O operations. Data corruption can occur if on-disk data are left in an inconsistent state. For mirrored arrays, read requests can return different results depending on which disk executes the read operation. For RAID arrays, an unrelated later disk failure can result in data regeneration using incorrect input. Blocking write holes requires that the volume manager or control software keep a nonvolatile log of which volume or array data may be at risk.
write penalty Low apparent application write performance to RAID volumes or RAID array virtual disks. The write penalty is inherent in RAID data protection techniques, which require multiple member I/O operations for each application write request. The write penalty ranges from minimal with controller-based implementations that use write-back cache to substantial for host-based implementations that log all write operations. write-through cache A caching technique in which write request completion is not signaled until data is stored on disk media. Write throughput with write-through caching is approximately the same as that of a non-cached system, but if the written data are held in cache, subsequent read performance may improve dramatically. cf. write-back cache
Z zoning A method of subdividing a storage area network (SAN) into disjoint zones, or subsets of devices attached to the network. SAN-attached devices outside a zone are invisible to devices within the zone. Moreover, with switched SANs, traffic within each zone may be physically isolated from traffic outside the zone.
Index
A Active/active access mode, 244, 246–247, 251 Addressing, 12–14 Administrative failover, 296–297 Anticipatory cache policy, 83 Applications: availability, in server clusters, 271 blocking, 342 cache in, 88 data transfer-intensive (see Data transferintensive applications) and disk failure, 269 failure of, 82 I/O request-intensive (see I/O request-intensive applications) moving, 298 optimal stripe size for, 407 scalability, in server clusters, 272 storage needs of, 398–399, 403–404 transaction-processing, 57–58, 404 Array managers, 106, 110–118, 255–269. See also OpenManage Array Manager disk groups in, 134 functions of, 110–111 Arrays, 54 adding disks to, 267 controller-based, 54 of disks (see Disk arrays) host-based, 54 for multiple I/O paths, 245 Asynchronous replication, 339–341, 358
B Backup: disadvantages of, 322 importance of, 82 Bad block revectoring (BBR), 15–16
Basic disks, 130 Basic Input Output System (BIOS), 91–92 Bidirectional replication, 348–349 Bootable volumes, 97–98 Boot flag, 93
C Cache, 82–88 in applications, 88 in database management systems, 88 for file system metadata, 86–87 functioning of, 82 in magnetic disks, 83–84 in operating systems, 85–86 policies for, 83 in RAID, 84–85 read (see Read cache) write (see Write cache) Causality, 344–345 Challenge/defense protocol, 280–282 Check data, 53, 63–67 function of, 70–71 Checksums: use in error prevention, 10–11 use in replication, 348 CHKDSK, 86 Circular logs, 102 Cluster disk groups, 284, 304, 403 as cluster resources, 288–289, 291–292, 298–299, 304 creating, 284 failover in, 288–289 managing, 287–288 and VCS, 307 Cluster managers, 276–280 functions of, 273, 275, 277–278, 305 for Windows NT, 276
Index Cluster resource groups, 273, 297 bringing online, 294–296, 316 creating, 290–291 disk groups as, 288–289, 291–292, 298–299 starting/stopping, 275, 311–312 Volume Manager disk groups as, 298–299 Cluster resources: adding to VCS service groups, 313–317 dependencies of, 274–275 disk groups as, 288–289, 291–292, 298–299, 304 host-based volumes as, 287–288, 304 monitoring, in VCS, 318–319 MSCS parameters for, 303–304 stateful versus stateless, 278 underlying resources for, 302 VCS parameters for, 314 volumes as, 307 Command-line interface (CLI), 240–242 Computer cables, failure of, 243 Computer cooling systems, failure of, 81 Computer games, transient data in, 3 Concatenated volumes, 168–169, 229 Contiguous extension, 194 Continuous replication, 327–328, 377, 378 Controller-based arrays, 54 Correspondence tables, 15–16 Crashes, 3–4 recovery from, 101–103 Cylinder-head-sector (C-H-S) addressing, 12, 92 disadvantages of, 13–14 Cylinders, on magnetic disks, 12
D Data: availability of, in failure-tolerant volumes, 77–80 backing up (see Backup) check (see Check data) choosing storage for, 383–384 consistency of, in replication, 328 consolidation of, 321 copying to newly created volumes, 198–199, 207–209, 223 cost of storing, 5, 61, 384 criteria for inclusion in cache, 83 deciding on location of, 11–13, 298 encoding on magnetic disks, 8–10 frozen images of, 156, 175–176, 326–327 locating on magnetic disks, 11–15 loss of, 3–4 means of storing, 4 mission-critical, 17, 61, 62 off-host processing of, 322 persistent (see Persistent data) publication of, 321 random access to, 6 replication of (see Replication)
temporary, 41–42 transient (see Transient data) writing to magnetic disks, 16–17 Database management system cache, 88 Databases: logs for, 339 management software for, 335 replication of, 334–336 subdivisions of, 407–408 Data protection, 322 in RAID, 63, 82 at secondary locations, 366 Data transfer-intensive applications, 40, 404 striped volumes and, 47–48, 406–407 Defective blocks, 15 Dirty regions, 101–102 Disasters: in primary location, 341, 342 protecting against, 322, 327–328, 331, 349, 394–395 recovering from, 336, 366, 397, 400 Disk arrays: managing, 253–269 and mission-critical data, 61 volumes made from, 111–118 Disk buses, failure of, 80–81 Disk cache, 83–84 Disk class driver, 105 Disk class layer, 105 Disk controllers, 19–24 advantages of, 253 failure of, 81 functions of, 19–20 Disk failure, 234–236 dealing with, 263–269 predicting, 18–19, 231 in RAID, 216–220, 400–401 in simple volumes, 35 and unallocated storage, 400–402 Disk groups, 110 adding to, 276 in Array/Volume Managers, 134 changing properties of, 293 dynamic (see Dynamic disk groups) managing storage with, 403–404 in MSCS clusters (see Cluster disk groups) as quorum devices, 283, 305–306 and RAID subsystems, 400–401 unallocated storage in, 398–399, 403–404 Disks: basic, 130 dynamic (see Dynamic disks) Fibre Channel (see Fibre Channel disks) hot spare, 268–269 logical, 94–95 magnetic (see Magnetic disks) naming, in Windows, 105–106 preparing for MSCS cluster use, 284 replacing, 78–79, 218–220, 265–266
Index replicating, 329–333 states of, 411–414 supported by Windows 2000 Volume Manager, 108, 134, 244 upgrading, 130–131, 284 virtual (see Virtual disks) Drive letters, 105 assigning, 136, 171 conserving, 200 Drivers: for HBAs, 23 stacked, 103–104 in Windows, 23 Dynamic disk groups: versus cluster disk groups, 284, 304 creating, 166–167 Dynamic disks, 96–99 functions of, 98–99 metadata on, 97, 98–99 partition tables on, 97 subdividing, 96 upgrading to, 130–131 volumes on, 97–99, 197 Dynamic link libraries (DLLs), in server cluster management, 286–287, 304 Dynamic multipathing (DMP), 244 Dynamic random access memory (DRAM), 85 Dynamic volumes, 97–99, 197
E Electrical power, failure of, 81, 84, 235–236 Embedded disk controllers, failure of, 81 Error correction codes (ECCs), 10–11 Exclusive OR function, 63–64, 71–72 Explicit importing, 307 Extended partitions, 94–95 External disk controllers, failure of, 81
F Failback, 278, 297–298 Failover, 277–278 administrative, 296–297 in cluster disk groups, 288–289 controlling, 293 forced, 286, 296 in VCS service groups, 311–313 Failure tolerance: and host-based volume managers, 394 of mirrored-striped volumes, 60, 203, 212 of RAID, 30, 63, 69, 79–80, 389, 390 of simple volumes, 30 of spanned volumes, 30, 35 of striped volumes, 30, 39–41, 60, 388 of three-mirror volumes, 109, 392–393 of two-mirror volumes, 401 of virtual disks, 256 Failure-tolerant volumes, 53–88, 389–393 component failure in, 80–82 crash recovery of, 102–103
creating file systems in, 152–153 and data availability, 77–80 extending, 195 functioning of, 77 as quorum devices, 283, 305 repair of, 110, 234–239, 401–402 resynchronizing, 197 spare disks in, 400 uses of, 17, 263 Fallback mode, 342–343 Fanout, 111 FAT32, 6, 201–202 limitations of, 201–202 Fibre Channel disks: addressing schemes on, 14 and multiple I/O paths, 243 File Allocation Table (FAT), 6 expanding volumes in, 34–35, 109 Files: characteristics of, 4–5 compression of, 137, 172 journaled systems for, 86–87 network copies of, 322–323 replication of, 333–334, 347, 371–382 size allocation for, 136, 172 on spanned volumes, 35 File shares, 301–304, 309–310 naming, 309 File systems: cache for, 86–87 creating, 152–153 handling I/O requests to, 104 journaled, 86–87 specifying, 136 subdivisions of, 407–408 in UNIX, 5 in Windows, 5, 6–7, 93, 201 Forced failover, 286, 296 Formatting: full versus quick, 172 for magnetic disks, 6–7 of mirrored volumes, 153, 171–173 of simple volumes, 138–139 Frozen images, 156, 175–176, 321 and replication, 326–327, 350–351, 369 fsck, 86 FTDISK, 93, 99–101 shortcomings of, 100–101
G Gather writing, 51–52
H Heartbeat messages, 277 responses to, 280 Host-based arrays, 54 Host-based volume managers: aggregation/partitioning with, 394, 396 and failure tolerance, 394
Index Host-based volumes: combining with RAID subsystems, 397 as MSCS cluster resources, 287–288, 304 in Windows servers, 119–131 Host bus adapters (HBAs), 23–24 BIOS in, 91–92 drivers for, 23 failure of, 80–81, 243 Host computers, failure of, 81, 254 Host I/O interfaces, 17–18 Hot spare disks, 268–269 Hybrid volumes, 394–397
I Implicit importing, 307 Installable File System (IFS), 6 Intel Architecture (IA): partitions in, 93–94 startup with, 91–94 Interface ASICs, failure of, 80–81 I/O manager, 104 I/O paths, multiple, 243–251 changing, 251 configurations of, 244 contents of, 243 monitoring, 247 setting functions of, 245–251 uses of, 243 I/O performance: mirrored volumes and, 57–59, 392 multiple I/O paths and, 243 quantifying, 49–51 during replication, 337–339 spanned volumes and, 35–36 striped volumes and, 40–41, 48–51, 147, 388, 404–410 unrelocation and, 402 visual display of, 110 I/O request-intensive applications, 40, 404 mirrored volumes and, 57–58 striped volumes and, 43–47, 48–49, 404–406 I/O requests: filtering, 343–344 split, 405–406 I/O stack, in Windows, 103–106 I/O subsystem cache. see Cache I/O throttling, 342, 359–360
J Jobs, in replication, 372–374 Journaled file systems, 86–87 Just a bunch of disks (JBOD), 384–388
L Least busy algorithm, 55 Link outages, and replication, 342–343 Load balancing, 244, 246–247, 251 in server clusters, 272 Logical block addressing, 13–14
Logical Disk Manager: console of, 120–126 creating volumes with, 133–162 disk groups in, 134 event log of, 239–240 invoking commands in, 122–123 menu choices in, 126 and RAID, 255 simplicity of, 140–141 upgrading disks with, 130–131 wizards in, 126, 128–130 Logical disks, 94–95 Logical units (LUNs), 254 naming, 106 Logs: circular, 102 for databases, 339 for dirty regions, 101–102 for Logical Disk Manager events, 239–240 managing, 344 overflow protection for, 365–366 for replication, 340–341, 342, 362–366 for VCS events, 318 for VVR Data Change Map, 356–360 Loosely coupled servers, 276
M Magnetic disks, 4, 5–19 addressing schemes for, 12–14 advantages of, 5–6 cache in, 83–84 calculating capacity of, 12 controllers for, 19–24 cost of, 5, 61 cylinders on, 12 data encoding on, 8–10 data transfer rates of, 406–407 error correction on, 10–11 failure of (see Disk failure) formats for, 6–7 identifying for service purposes, 235–236, 265 intelligent, 17–19 interchangeability of, 17–18 locating data on, 11–15 maximizing capacity of, 14–15 measuring performance of, 77–78 media defects on, 15–16 multiple volumes on, 224–225 operation of, 7–8 partitions on, 91 as quorum devices, 283 in RAID, 54 reliability of, 6 speed of, 43, 47 subdividing, 31–32 synchronizing, 70, 79–80 tracks on, 11–12 universality of, 6 writing data to, 16–17
Index Magnetic tapes, 54 Management interfaces, 22 Mapping, 27 Master boot record (MBR), 92–94 contents of, 93 function of, 94 Mean time between failures (MTBF), 77–78 Metadata, 7 cache for, 86–87 on dynamic disks, 97, 98–99 Microsoft Cluster Server (MSCS), 276, 278–283 configuring volumes for use in, 306 resource parameters for, 303–304 versus VCS, 306 and volumes, 282–283 Microsoft Management Console (MMC), 107 Mirrored-striped volumes, 59–61, 109, 203–212 adding mirrors to, 399 creating, 204–209 dynamic expansion of, 209 failure tolerance of, 60, 203, 212 mission-critical data and, 61 plexes in, 59–61, 203 splitting, 80, 210–212 Mirrored volumes, 30, 55–62. See also Mirrored–striped volumes adding mirrors to, 158–160, 177–181, 185 allocating subdisks in, 151 capacity of, 151, 169 component failure in, 80–82 cost of, 61, 80, 392 crash recovery of, 102–103 creating, 150–162 design assumptions of, 336–337 disadvantages of, 323 extending, 189, 195 failure tolerance of, 30, 78–79, 156, 158, 174, 389 formatting, 153, 171–173 four-mirror, 109, 393 host/subsystem-based, 395–396 and I/O performance, 57–59, 392 layouts for, 169–171 number of mirrors in, 392–393 plexes in, 169–170, 180–181 versus RAID, 80, 389–390 and read requests, 58–59 rejoining, 158–160, 185, 393 removing mirrors from, 160–162, 181 replacing disks in, 78–79 and replication, 336–337, 368–369 resynchronizing, 154–155, 174, 185 speed of, 58–59 splitting, 61–62, 156–158 three-mirror (see Three-mirror volumes) two-mirror (see Two-mirror volumes) update logging for, 101–102
uses of, 30, 53, 184–185, 390 and write requests, 58–59, 392 Mount points, 199–201
N Neighborhoods, in replication, 372 Network outages, 366–368 NTFS, 6, 86 expansion of, 109
O One-to-many replication, 323 OpenManage Array Manager, 258–262 console of, 258 Operating systems: cache in, 85–86 loading, 95
P Paramagnetism, 7 Parity: interleaving, 75–76 updating, 72–73 Partitions, 91 creating/reconfiguring, 126–130, 133–134 extended, 94–95 in Intel Architecture, 93–94 limitations of, 96 naming, in Windows, 105–106 in server clusters, 279 types of, 93 upgrading, 110 on virtual disks, 394, 396 in Windows, 99, 130 Partition tables, 93 on dynamic disks, 97 validating, 95 Persistent data, 3, 4–5 advantages of, 3 Plexes, 29 in mirrored-striped volumes, 59–61, 203 in mirrored volumes, 169–170, 180–181 preferred, 109 in RAID, 213 Predictive failure response, 109 Preferred algorithm, 55
Q Quorum devices, 280, 283 disk groups as, 283, 305–306 failure-tolerant volumes as, 283, 305 three-mirror volumes as, 283, 306
R RAID subsystem-based replication, 325–326, 332 synchronous/asynchronous, 326 Random access, 6 Read-ahead cache policy, 83
Index Read cache, 82–83 recovery of, 85, 86 Read requests, and mirrored volumes, 58–59 Redundancy, 53 Redundant Array of Independent Disks (RAID), 53–55, 63–77, 121–224, 253–254 cache in, 84–85 classes of, 54–55 combining with host-based volumes, 397 component failure in, 80–82, 84–85 copying data to, 223 cost of, 63, 67–70, 389–390 crash recovery of, 103 creating, 213–216 disk failure in, 216–220, 400–401 and disk groups, 400–401 embedded, 21–22, 112–113, 118, 254 extending, 220–224 external, 20–21, 22, 112–113, 118, 254 failure tolerance of, 30, 63, 69, 79–80, 389, 390 functions of, 63, 110–111 host/subsystem-based, 395 and Logical Disk Manager, 255 manageability of, 390 versus mirroring, 80, 389–390 number of columns in, 109 optimal number of disks in, 67, 69–70, 76, 391 performance of, 389–390 plexes in, 213 pros and cons of, 63, 84, 213, 389–390 replacing disks in, 218–220 replication using, 325–326, 332 resynchronizing, 214–216 speed of, 75–77 striping in, 70–77 update logging for, 102 uses of, 30, 53, 390 views of, 258 and virtual disks, 400–401 volumes in, 63–77, 189, 195 and Windows 2000 Volume Manager, 262–263 writing data to, 69, 70–76 Replicated data sets (RDSs), 353–356 Replication, 321–382 alternatives to, 322–323 asynchronous, 339–341, 358 bidirectional, 348–349 checksums in, 348 continuous, 327–328, 377, 378 versus copying, 326 of databases, 334–336 design assumptions of, 323, 336–337 and disaster protection, 322, 327–328, 331, 349 and disaster recovery, 336 of disks, 329–333 elements of, 326–328 of files, 333–334, 347, 371–382 frozen images and, 326, 350–351, 369
initializing, 360–362, 378–381 I/O performance in, 337–339 jobs in, 372–374 and link outages, 342–343 logs for, 340–341, 342, 362–366 and mirroring, 336–337, 368–369 neighborhoods in, 372 one-to-many, 323 RAID subsystem-based, 325–326, 332 resynchronization in, 347–348 schedules for, 377–378 server-based, 324–325 software architecture for, 343–344 specifying data for, 376 specifying sources and targets in, 374–375 synchronization in, 326, 345–347 synchronous, 358 troubleshooting, 382 uses of, 321–322, 348–351, 368–369, 377 of volumes, 329–333, 334, 345–346, 352–353, 368 and WANs, 337 write ordering for, 344–345 Replication management servers (RMSs), 372 Replication volume groups (RVGs), converting secondary into primary, 369–370 Resource dependencies, 274–275 Resynchronization, 154–155 of mirrored volumes, 154–155, 174, 185 of RAID, 214–216 in replication, 347–348 of three-mirror volumes, 197 Retentive cache policy, 83 Round-robin algorithm, 55 Run-length-limited (RLL) codes, 9–10
S Scatter reading, 51–52 Scheduling algorithms, 55–56 SCSI disks, addressing schemes on, 14 Secondary check points, 366 Self-describing volumes, 99 Self-Monitoring, Analysis, and Reporting Technology (SMART), 18–19 Server-based replication, 324–325 advantages of, 324–325 synchronous/asynchronous, 325 Server clusters, 271–276 architecture of, 273 benefits of, 271–273 disk groups in, 288–289, 291–292, 298–299, 304 failure and recovery of, 282–283, 305 heartbeat messages in, 277, 280 partitioning of, 279 resources for (see Cluster resources) use of DLLs in, 286–287, 304 and Windows operating systems, 276–278
Index Servers: loosely coupled, 276 replication using, 324–325 Servo signals, 12 Shared-nothing architecture, 273 Signatures, 95 Simple volumes, 30, 31–33 advantages of, 31, 32 creating, 133–142 disk failure in, 35 extending, 189, 195 failure tolerance of, 30 formatting, 138–139 maximizing capacity of, 33 uses of, 30, 36–37 Spanned volumes, 30, 33–36 creating, 142–146 extending, 189, 195 failure tolerance of, 30, 35 file storage on, 35 and I/O performance, 35–36 uses of, 30, 36–37, 142, 388–389 Split mirrors, uses of, 184–185 Stacked drivers, 103–104 Storage capacity, unallocated, 398–400 amount of, 399 controlling with disk groups, 403–404 determining, 398 and disk failure, 400–402 distributing, 398–399 uses of, 401 Striped volumes, 30, 37–52, 388–389 allocating, 408 creating, 146–150 and data transfer-intensive applications, 47–48, 406–407 expanding, 42, 189, 195 failure tolerance of, 30, 39–41, 60, 388 and I/O performance, 40–41, 48–51, 147, 388 and I/O request-intensive applications, 43–47, 48–49, 404–406 mapping on, 38–39 number of columns in, 109, 147–148, 410 number of disks in, 146, 408–410 optimizing, 51–52 pros and cons of, 388 size of stripes on, 38, 48–51, 109, 146–147, 168, 404, 406, 407 staggered starts for, 407–408 uses of, 30, 41–43, 388–389 Striping: and I/O performance, 404–410 overwriting data in, 74 in RAID, 70–77 uses of, 229, 409 on virtual disks, 112–113 Subdisks, 28 allocating, in mirrored volumes, 151 choosing locations for, 191
extending, 190 moving, 109, 231–234 unrelocating, 402 Switching, 311 Synchronization, 153, 261–262 in replication, 326, 345–347 Synchronous replication, 358
T Tapes. See Magnetic tapes Temporary data, 41–42 Three-mirror volumes, 62, 109, 175–187 advantages of, 177 creating, 177–181 extending, 189–198 failure tolerance of, 109, 392–393 mission-critical data and, 62 as quorum devices, 283, 306 resynchronizing, 197 splitting, 175–177, 181–187, 210, 393 uses of, 175–176 Throughput, 50 Tracks, on magnetic disks, 11–12 Transaction-processing applications, 57–58, 404 Transient data, 3–4 loss of, 4 Two-mirror volumes: failure tolerance of, 401 splitting, 210 storage patterns of, 26–27
U Universal Naming Convention (UNC), 105–106 UNIX: file system check program of, 86 file system in, 5 Unrelocation, 402
V VERITAS Cluster Server (VCS), 276, 306–319 adding resources in, 313–317 copying resources in, 314 event log in, 318 failover in, 311–313 file modes in, 313–314 functions of, 306 versus MSCS, 306 resource monitoring in, 318–319 resource parameters in, 314 service groups in, 307–310 VERITAS Storage Replicator (VSR), 371–372 VERITAS Volume Replicator (VVR), 351–352 Data Change Map log of, 356–360 Express mode of, 354 versus Windows 2000 Volume Manager, 352 Virtual disks, 111, 255–256 aggregating, 111–113, 118, 394, 396 creating, 259–260
Index Virtual disks (Continued) failure tolerance of, 256 partitioning, 394, 396 and RAID subsystems, 400–401 reasons for use, 263 striping on, 112–113 synchronizing, 261–262 Volume Managers, 23, 319 disk groups in, 134 functions of, 24, 112, 118 host-based, 394, 396 for Windows NT (see Windows NT Volume Manager) for Windows 2000 (see Windows 2000 Volume Manager) Volumes, 25–30 adding mirrors to, 158–160 advantages of, 27–28 bootable, 97–98 characteristics of, 25 combining types of, 125 components of, 26, 28–29 concatenated, 168–169, 229 copying data to, 198–199, 207–209, 223 creating/reconfiguring, 126–130, 133–162 of disk arrays, 111–118 dynamic, 97–99, 197 expanding, 27,34–35, 110, 189–242, 398–399 failure-tolerant (see Failure-tolerant volumes) formatting, 138–139, 153, 171–173 host-based (see Host-based volumes) hybrid, 394–397 importance of maintaining, 160, 162, 183 logs for, 101 managing, 240–242, 384–393 mirrored (see Mirrored volumes) mirrored-striped (see Mirrored-striped volumes) monitoring performance of, 225–231 in MSCS clusters, 271–319 multiple types on same disk, 224–225 naming, in Windows, 105–106 as quorum devices, 283 recommendations about, 383–410, 415–420 recovering from system crashes, 101–103 replication of, 329–333, 334, 345–346, 352–353, 368 self-describing, 99 simple (see Simple volumes) software for managing, 106 spanned (see Spanned volumes) states of, 411–414 striped (see Striped volumes) synchronization of, 153 as VCS resources, 307 and VCS service groups, 307–310 virtualization in, 27
in Windows NT (see Windows NT) in Windows 2000 (see Windows 2000)
W Wide area networks (WANs) and replication, 337 Windows: disk/volume naming in, 105–106, 200 drivers in, 23 file replication in, 371–382 file system check program of, 86 file system in, 5, 6–7, 93 I/O stack in, 103–106 scheduling algorithms in, 56 server clusters and, 276–278 Windows NT: backward compatibility in, 91 cluster managers for, 276 file system of, 6–7 types of partitions on, 99 volumes in, 34, 99–101, 108, 134, 197 Windows NT Volume Manager, 108 disks supported by, 134 extending volumes with, 197–198 Windows 2000: disks in, 91–99 file systems supported by, 201 types of partitions in, 99, 130 volumes in, 99–118, 134, 197–202 Windows 2000 Volume Manager, 106–110, 163–175, 319 and CLI in, 240–242 console of, 163–165, 183, 244, 251 and disk failure, 263–296 disks supported by, 108, 134, 244 event log of, 239–240 extending volumes with, 198, 220–224 forms available, 106 functions of, 106–108, 107–110, 165 invoking commands in, 166 load balancing in, 244, 246–247, 251 menu choices in, 126 and mirrored-striped volumes, 204–209 monitoring performance with, 225–231 monitor interval of, 247 and RAID, 262–263 versus VVR, 352 Wizards, 126 in Logical Disk Manager, 128–130 Wolfpack. See Microsoft Cluster Server Write cache, 84 for file systems, 86–87 risks of, 85 Write requests: and mirrored volumes, 58–59, 392 ordering for replication, 344–345
Z Zoned data recording (ZDR), 14–15